Carnegie Mellon University

Below is a sampling of Computational Biology Department research projects.  Additional research projects are described on individual faculty homepages.

Algorithms in Nature

Recent advances in our ability to generate and analyze biological data, including new genomics and imaging techniques, allow us to design experiments that test in more detail how biological systems operate; in turn, we can use the inferred activity to develop and apply new algorithms for computational problems. Ziv Bar-Joseph’s group has been working on such bi-directional studies for the last six years. We demonstrated the usefulness of such approaches by studying how Drosophila (fruit fly) cells solve the Maximum Independent Set problem, how the structure of yeast molecular interaction networks leads to resilience to noise and attacks, and how the brain uses pruning algorithms during development to construct adaptive and efficient neural networks. Most recently, we have looked at how bacterial cells solve the Distributed Gradient Descent problem when searching for food in a constrained environment. Understanding the computation performed by these processes has led to novel algorithms for distributed computational systems, some of which have already been used by engineers, for example in designing wireless networks.
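The fly-inspired Maximum Independent Set computation can be caricatured in a few lines. Below is a minimal sketch (not the published algorithm) of a randomized, round-based procedure in which each node “fires” stochastically and joins the independent set only if no active neighbor fired in the same round:

```python
import random

def distributed_mis(adj, seed=0):
    """Randomized, round-based Maximal Independent Set (a sketch).

    Each round, every still-active node fires with probability 1/2; a
    node that fires while no active neighbor fires joins the set and
    silences its neighbors.
    """
    rng = random.Random(seed)
    active = set(adj)
    mis = set()
    while active:
        fired = {v for v in active if rng.random() < 0.5}
        for v in list(fired):
            if v not in active:
                continue  # already silenced earlier in this round
            if not any(u in fired and u in active for u in adj[v]):
                mis.add(v)
                active.discard(v)
                active.difference_update(adj[v])
    return mis

# A 4-cycle: the only maximal independent sets are {0, 2} and {1, 3}.
graph = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
result = distributed_mis(graph)
```

Because nodes leave the active set only by joining the MIS or by being a neighbor of a member, the result is always independent and maximal; the stochastic firing is what makes the scheme distributed-friendly.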

      • DECOnvolved Discriminative motif discovery (DECOD) – DECOD is a tool for finding discriminative DNA motifs, i.e. motifs that are over-represented in one set of sequences but are depleted from another.
        DECOD uses a k-mer count table and so its running time is independent of the size of the input set. By deconvolving the k-mers DECOD considers context information without using the sequences directly. DECOD is written in Java and therefore can be run on multiple platforms. It has an easy-to-use GUI interface, and it can also be run in command line mode for batch processing.
        Most recent reference: P. Huggins*, S. Zhong*, I. Shiff*, R. Beckerman, O. Laptenko, C. Prives, M.H. Schulz, I. Simon, Z. Bar-Joseph. DECOD: Fast and Accurate Discriminative DNA Motif Finding. Bioinformatics, 27(17):2361-7, 2011.
      • Dynamic Regulatory Events Miner (DREM 2.0) – Method for integrating time series and static data to reconstruct dynamic regulatory networks.
        The Dynamic Regulatory Events Miner (DREM) allows one to model, analyze, and visualize transcriptional gene regulation dynamics. DREM takes as input time series gene expression data and static or dynamic transcription factor–gene interaction data (e.g., ChIP-chip data), and produces as output a dynamic regulatory map, which highlights major bifurcation events in the time series expression data and the transcription factors potentially responsible for them. See the manual and paper below for more details.
        Most recent reference: M.H. Schulz, W.E. Devanny, A. Gitter, S. Zhong, J. Ernst, Z. Bar-Joseph. DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data. BMC Systems Biology, 6:1, 2012
      • SEquencing Error CorrEction for Rna reads (SEECER) – A method for error correction of de novo RNA-Seq data that does not require a reference genome.
        SEECER is a sequencing error correction algorithm for RNA-seq data sets. It takes the raw reads produced by a next-generation sequencing platform such as Illumina or Roche machines, removes mismatch and indel errors, and significantly improves downstream analysis of the data. Especially when the RNA-seq data are used to produce a de novo transcriptome assembly, running SEECER can have a tremendous impact on the quality of the assembly.
        Most recent reference: H.S. Le*, M.H. Schulz*, B.M. McCauley, V.F. Hinman, Z. Bar-Joseph. Probabilistic error correction for RNA sequencing. Nucleic Acids Research, nar/gkt215, 2013.
      • Short Time-series Expression Miner (STEM) – Cluster and visualize short time series gene expression data.
        The Short Time-series Expression Miner (STEM) is a Java program for clustering, comparing, and visualizing short time series gene expression data from microarray experiments (~8 time points or fewer). STEM allows researchers to identify significant temporal expression profiles and the genes associated with these profiles and to compare the behavior of these genes across multiple conditions. STEM is fully integrated with the Gene Ontology (GO) database supporting GO category gene enrichment analyses for sets of genes having the same temporal expression pattern. STEM also supports the ability to easily determine and visualize the behavior of genes belonging to a given GO category or user defined gene set, identifying which temporal expression profiles were enriched for these genes. (Note: While STEM is designed primarily to analyze data from short time course experiments it can be used to analyze data from any small set of experiments which can naturally be ordered sequentially including dose response experiments.)
        Most recent reference: J. Ernst, Z. Bar-Joseph. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics, 7:191, 2006
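The core STEM idea of matching genes to a small library of template profiles can be sketched simply. The templates and expression series below are invented for illustration (STEM enumerates its model profiles systematically and tests their statistical significance, which this sketch omits):

```python
import math

# Hypothetical template profiles over 4 time points (illustrative only).
profiles = {
    "rising":  [0, 1, 2, 3],
    "falling": [3, 2, 1, 0],
    "pulse":   [0, 3, 0, 0],
}

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def assign_profile(series):
    """Assign a short expression series to its best-correlated template."""
    return max(profiles, key=lambda p: pearson(series, profiles[p]))

assert assign_profile([0.1, 0.9, 2.1, 2.8]) == "rising"
assert assign_profile([5.0, 3.9, 2.2, 1.1]) == "falling"
```

Genes assigned to the same template form a cluster whose enrichment (e.g., against GO categories) can then be assessed.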

Statistical Machine Learning Methods for Genetical Genomics Analysis

The standard approach for identifying gene networks is based on experimental perturbations of gene regulatory systems, such as gene knock-down experiments, followed by genome-wide profiling of differential gene expression. However, this approach is significantly limited in that it is not possible to perturb more than one or two genes simultaneously to discover complex gene interactions or to distinguish between direct and indirect downstream regulation of the differentially-expressed genes. As an alternative, genetical genomics studies have been proposed that treat naturally-occurring genetic variants as potential perturbations of the gene regulatory system and recover gene networks via analysis of population gene-expression and genotype data. Seyoung Kim’s group is developing a statistical framework for learning gene networks that overcomes the limitations of experimental perturbation methods and addresses the computational challenges of genetical genomics analysis. Our computational approach is based on probabilistic graphical models, and we are developing efficient learning and inference algorithms that can be applied to datasets with tens of thousands of gene expressions and millions of genetic variants. We are applying our computational techniques to study gene regulatory networks and their genetic control mechanisms in yeast and humans.

      • GFLasso – Graph-guided fused lasso estimates a sparse multi-response regression model, while leveraging a weighted network structure over response variables to find covariates that are jointly relevant to multiple correlated responses.
      • Tree-Guided Group Lasso – Tree-guided group lasso estimates a sparse multi-response regression model, while leveraging a hierarchical clustering tree structure over response variables.
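Both methods build on sparse multi-response regression. Below is a minimal sketch of that shared core: plain ℓ1-penalized multi-response regression fit by proximal gradient descent (ISTA). The graph- and tree-guided fusion penalties that distinguish GFLasso and tree-guided group lasso are omitted for brevity, and the data are synthetic:

```python
import numpy as np

def multi_response_lasso(X, Y, lam=0.1, lr=0.01, iters=2000):
    """Minimize ||Y - XB||_F^2 / (2n) + lam * ||B||_1 by proximal
    gradient descent (ISTA). GFLasso adds a network-fusion term
    coupling the columns of B; that term is omitted here."""
    n, p = X.shape
    k = Y.shape[1]
    B = np.zeros((p, k))
    for _ in range(iters):
        grad = X.T @ (X @ B - Y) / n          # gradient of the smooth part
        B = B - lr * grad
        B = np.sign(B) * np.maximum(np.abs(B) - lr * lam, 0.0)  # soft-threshold
    return B

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_B = np.zeros((5, 2))
true_B[0] = [2.0, 2.0]                        # one covariate shared by both responses
Y = X @ true_B + 0.1 * rng.normal(size=(100, 2))
B_hat = multi_response_lasso(X, Y)
```

The fitted matrix recovers the single jointly relevant covariate while shrinking irrelevant rows toward zero, which is the behavior the structured penalties then refine using the response network or tree.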

(Illustration: Kingsford group software programs)

Carl Kingsford’s group works on developing novel algorithms for various computational problems that arise when trying to understand how the genome encodes information and how it affects phenotypes. The focus of our research is defining new computational problems and providing rigorous, innovative solutions to them. Occasionally, we encode those rigorous solutions in software so that the approaches can be more widely used. We are primarily interested in developing new computer science to solve important biological problems.

We have developed techniques for estimating isoform expression levels from RNA-seq (our Sailfish and Salmon methods), fast ways to search thousands of short-read sequencing experiments (our Sequence Bloom Tree data structure), accurate approaches for estimating properties of the spatial structure of chromatin (our Armatus algorithm), transcript assembly approaches (Scallop and Scallop-LR), methods for human leukocyte antigen (HLA) typing from whole-genome sequencing (Kourami), formalized methods for finding structural variants from RNA-seq (Squid), and computational techniques for identifying coverage anomalies (SAD), among many other new computational methods.
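As one concrete ingredient, a Sequence Bloom Tree stores a Bloom filter of k-mers at each node. A minimal Bloom filter over k-mers might look like the sketch below; the filter size, hash scheme, and k are illustrative, not those of the published data structure:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash positions
    per item. Membership tests can yield false positives but never
    false negatives."""
    def __init__(self, m=8192, k_hashes=3):
        self.m = m
        self.k_hashes = k_hashes
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k_hashes positions from salted SHA-256 digests.
        for i in range(self.k_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

def kmers(seq, k=21):
    """All overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

bf = BloomFilter()
read = "ACGTACGTACGTACGTACGTACGTA"
for km in kmers(read):
    bf.add(km)
assert "ACGTACGTACGTACGTACGTA" in bf  # a 21-mer from the read
```

A Sequence Bloom Tree arranges such filters hierarchically so that a query k-mer set can prune whole subtrees of sequencing experiments at once.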

Genomic analysis software:

  • Salmon - Quantify transcript expression from RNA-seq data.
  • Scallop - Assemble transcripts from RNA-seq data.
  • Squid - Find structural variants from RNA-seq data.
  • Jellyfish - Count k-mers quickly in DNA and RNA sequencing data.
  • Armatus - Find topologically associating domains (TADs) in Hi-C data.
  • Kourami - Subtype human leukocyte antigen (HLA) from whole-genome sequencing data.
  • Sailfish - Quantify transcript expression from RNA-seq data (deprecated; use Salmon).
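The k-mer counting that Jellyfish performs at scale can be sketched in miniature. Counters of this kind typically canonicalize each k-mer to the lexicographic minimum of itself and its reverse complement, so the two strands of a molecule contribute to a single count:

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)

def count_kmers(reads, k):
    """Count canonical k-mers across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

# "TGTAATC" is the reverse complement of "GATTACA", so the canonical
# 3-mer counts from the two reads coincide (each 3-mer counted twice).
counts = count_kmers(["GATTACA", "TGTAATC"], k=3)
```

Jellyfish achieves its speed with a lock-free hash table and compact k-mer encodings; this sketch shows only the counting semantics.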

Network, trees, and graph software:

  • Ghost – Global network alignment using multiscale spectral signatures.
  • PARANA2 – Parsimonious ancestral network reconstruction.
  • GIRAF – Computational identification of influenza reassortments via graph mining.

Bringing machine learning to the clinic

Chris Langmead’s group is working to bring machine learning to clinical applications. Genomic medicine promises to revolutionize the way patients are treated, enabling quantification of risk that supports proactive, preventative care and creating treatments that are precisely tailored to a particular person. Efforts are now underway to obtain genome sequences not only of thousands of individuals, but also of healthy cells and cancer tumor cells from the same individual. As the cost of sequencing continues to drop, it will become routine to sequence patients in a medical setting. However, a number of computational and practical challenges still block the use of genomic sequencing for clinical decision making.

Due to a lack of good predictive models and of computational tools connecting disease expertise with machine learning, the adoption of genetic predictions is currently slow, except in a few well-studied cases. In this project, we aim to increase the use of predictive systems based on machine learning and genomics. First, in collaboration with UPMC, we are developing models for predicting outcomes in prostate cancer patients. Second, we are creating a user-friendly, automated computational pipeline that will make it easier for clinical researchers to build and refine predictive models for many diseases. This pipeline will also make it simple to share and apply these models at both large and small clinics.

Molecular Design and Simulation

The ability to design novel proteins and small molecules is essential to the pharmaceutical, energy, and other industries. Chris Langmead’s group has developed novel methods for revealing the physical factors that govern the properties of natural molecules, and for efficiently searching through combinatorial spaces to find optimal designs for a given task. Our methods encode the design space using graphical models (including graphical games) and then use message-passing methods to perform key tasks, such as free-energy calculations, resistance modeling, and protein design. We also use Molecular Dynamics simulations to further evaluate and refine our designs.
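A toy version of the message-passing step: exact min-sum (Viterbi-style) inference over a chain-structured design space with invented unary and pairwise energies. Real protein-design models are far larger and not chain-structured, but on tree-structured graphical models this style of message passing recovers the global optimum:

```python
# Toy design space: three positions, two candidate residues each, with
# hypothetical energies (illustrative numbers only).
residues = ["A", "V"]
unary = [{"A": 0.0, "V": 0.5}, {"A": 0.2, "V": 0.0}, {"A": 0.3, "V": 0.1}]
pair = {"AA": 0.0, "AV": 0.4, "VA": 0.4, "VV": 0.1}  # adjacent positions

def energy(seq):
    """Total energy: unary terms plus pairwise terms between neighbors."""
    e = sum(unary[i][r] for i, r in enumerate(seq))
    e += sum(pair[seq[i] + seq[i + 1]] for i in range(len(seq) - 1))
    return e

def min_sum_chain():
    """Exact min-sum message passing along the chain with backtracking."""
    n = len(unary)
    msg = [dict() for _ in range(n)]   # msg[i][r]: best suffix energy
    back = [dict() for _ in range(n)]  # back[i][r]: argmin residue at i+1
    for r in residues:
        msg[n - 1][r] = unary[n - 1][r]
    for i in range(n - 2, -1, -1):
        for r in residues:
            best = min(residues, key=lambda r2: pair[r + r2] + msg[i + 1][r2])
            back[i][r] = best
            msg[i][r] = unary[i][r] + pair[r + best] + msg[i + 1][best]
    r0 = min(residues, key=lambda r: msg[0][r])
    seq = [r0]
    for i in range(n - 1):
        seq.append(back[i][seq[-1]])
    return "".join(seq)

best = min_sum_chain()
```

For these energies the optimum is the all-alanine sequence; the point of the sketch is that messages summarize whole sub-chains, so the search never enumerates the full combinatorial space.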


Computational Medicine

Chris Langmead’s and Wei Wu’s groups are collaborating on the Computational Medicine project. Modern medicine increasingly relies on data science and predictive modeling to enhance our understanding of the basis of human disease and to guide treatment in a patient-specific fashion (personalized and precision medicine). The size, volume, heterogeneity, and complexity of the data available to physicians pose many computational and modeling challenges in pursuit of the overarching goal of improving human health. Their groups develop novel methods and models to address these challenges. Specific examples include: triage algorithms for anticipating complications in acute care settings (e.g., pancreatitis, sepsis); biomarker discovery algorithms for genomic, proteomic, and electronic health record data (e.g., pancreatic cancer, melanoma, lung cancer, prostate cancer); treatment optimization methods that avoid drug resistance; and cohort design methods that reduce the cost of acquiring training data. Algorithmically, the groups use a combination of Machine Learning (including Probabilistic Graphical Models, Active Learning, and Deep Learning), Game Theory, and Dynamical Systems Modeling.


Active learning of cell organization

A team led by Bob Murphy, Head of the Computational Biology Department and a faculty member in the Machine Learning Department, is combining image-derived modeling methods with active learning to build a continuously updating, comprehensive model of protein localization.  The organization of eukaryotic cells into distinct organelles and structures is critical to cell health and function.  Alterations in protein localization have been associated with many diseases, including cancer.  Obtaining a complete picture of the localization of all proteins in cells and how it changes under various conditions is therefore an important but daunting task, given that there are on the order of a hundred cell types in a human body, tens of thousands of proteins expressed in each cell type, and over a million conditions (which include presence of potential drugs or disease-causing mutations).  Automated microscopy can help by allowing large numbers of images to be acquired rapidly, but even with automation it will not be possible to directly determine the localization of all proteins in all cell types under all conditions.  As described by Dr. Murphy in a recent Commentary in Nature Chemical Biology, the alternative is to construct a predictive model of protein localization from all existing information, and to use it iteratively to identify those experiments that are expected to maximally improve the model.  This is an example of Active Learning, a powerful machine learning technique.  The fundamental assumption when applied to biological experiments is that there are correlations between the results of experiments for different variables, and that learning these correlations can allow accurate prediction of results for experiments that have not been (and may never be!) done.

Preliminary results in test systems have been very encouraging.  Using active learning on data from PubChem, a repository of results from high-throughput drug screens, nearly 60% of drug “hits” were found while doing only 3% of all possible experiments.  Encouraging results have also been obtained in extensive computer simulations of learning the effects of many drugs on many proteins.  Based on these results, active learning-driven automated imaging of cell lines expressing different fluorescently-tagged proteins is currently being carried out in Dr. Murphy’s laboratory.  The project draws on a range of cutting-edge methods, and collaborators and consultants include a number of faculty members from the Machine Learning and Computational Biology Departments, including Chris Langmead, Aarti Singh, Gustavo Rohde, Jaime Carbonell, and Jeff Schneider.
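A stripped-down illustration of the active-learning loop: pool-based uncertainty sampling for a hypothetical one-dimensional dose threshold, querying at each round the unlabeled point closest to the current boundary estimate. This is a caricature of the drug-screen setting, not the laboratory’s actual pipeline:

```python
def active_learning(pool, label_fn, rounds=12):
    """Pool-based uncertainty sampling for a 1-D threshold concept.

    Starts from the two pool extremes (assumed to straddle the
    boundary), then repeatedly queries the unlabeled point nearest
    the current boundary estimate -- a bisection-like strategy.
    """
    unlabeled = set(pool)
    labeled = {}
    for x in (min(unlabeled), max(unlabeled)):
        labeled[x] = label_fn(x)
        unlabeled.discard(x)
    for _ in range(rounds):
        if not unlabeled:
            break
        pos = [x for x, y in labeled.items() if y]
        neg = [x for x, y in labeled.items() if not y]
        boundary = (min(pos) + max(neg)) / 2
        query = min(unlabeled, key=lambda x: abs(x - boundary))
        labeled[query] = label_fn(query)
        unlabeled.discard(query)
    pos = [x for x, y in labeled.items() if y]
    neg = [x for x, y in labeled.items() if not y]
    return (min(pos) + max(neg)) / 2

# Hidden concept: a sample is a "hit" iff dose >= 37 (made-up threshold).
estimate = active_learning(range(100), lambda x: x >= 37)
```

Twelve targeted queries pin the threshold down exactly, whereas random sampling would need far more labels; the same economy of experiments is what motivates active learning in the imaging setting.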

      • CellOrganizer – Open source system for learning and using generative models of cells from images.
      • OMERO.searcher – Open source addon to OMERO to enable content-based image searching.
      • PatternUnmixer – Open-source Matlab program for learning mixture models of subcellular patterns.
      • SLIF – Structured Literature Image Finder – open source tools for extracting, analyzing and searching images from primary biomedical literature; additional information about SLIF can be found at http://murphylab.web.cmu.edu/services/SLIF/.

How The Genome Evolved for Vocal Learning and Speech Production

The Neurogenomics laboratory led by Assistant Prof. Andreas Pfenning is studying how the genome evolved for vocal learning, an aspect of speech production that sets us apart from other primates.

Although nothing is more central to our identity as human beings than behavior, we still don’t know how the billions of nucleotides in our genome have evolved to give us complex thoughts or the ability to communicate those thoughts through speech. The component of speech that gives us a window into the genetic basis of behavior is vocal learning, the ability to reproduce complex vocalizations. Although speech itself is uniquely human, vocal learning is an example of convergent evolution, having evolved independently in multiple lineages, including songbirds and parrots, hummingbirds, bats, and elephants, as well as in humans relative to chimpanzees. This allows the Neurogenomics Laboratory to take a comparative genomic approach to understanding how vocal learning evolved: what features do the genes and genomes of vocal learning species have in common relative to those of species without the ability?

We approach this problem using a wide set of techniques. The Neurogenomics laboratory develops and adapts computational methods to identify regulatory elements in the genome associated with vocal learning behavior. We then use genomic techniques, such as high-throughput reporter assays and genome editing, to validate and refine the computational models. The results of these experiments will not only yield insights into how human speech evolved, but also provide a foundation for others who are aiming to connect genome sequence to organism traits using the wealth of new genomic data available.
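The comparative idea can be illustrated with a toy scan: find alignment columns where all vocal learners share a character that no non-learner carries. The sequences, species labels, and scoring below are entirely invented, and real analyses use phylogeny-aware statistical models rather than exact set comparison:

```python
# Toy alignment (hypothetical sequences, 6 columns).
alignment = {
    "human":    "ACGTTA",
    "songbird": "ACGTTA",
    "bat":      "ACCTTA",
    "chimp":    "ACGTAA",
    "mouse":    "ACGTAA",
}
learners = ["human", "songbird", "bat"]
nonlearners = ["chimp", "mouse"]

def convergent_sites(alignment, learners, nonlearners):
    """Columns where learners are uniform and disjoint from non-learners."""
    length = len(next(iter(alignment.values())))
    sites = []
    for i in range(length):
        lchars = {alignment[s][i] for s in learners}
        nchars = {alignment[s][i] for s in nonlearners}
        if len(lchars) == 1 and not (lchars & nchars):
            sites.append(i)
    return sites

sites = convergent_sites(alignment, learners, nonlearners)
```

Here only column 4 separates the two groups; candidate sites like this would then be tested experimentally, e.g. with reporter assays.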

The Immune Basis of Alzheimer’s Disease

The Neurogenomics laboratory led by Assistant Prof. Andreas Pfenning is studying how the immune system influences Alzheimer’s disease predisposition.

Alzheimer’s disease is a complex neurodegenerative disorder that has an enormous impact on society, but it is still poorly understood. Dr. Pfenning’s previous research has shown that sequence mutations associated with Alzheimer’s disease are more likely to be found in regions of the genome that are active in immune cells, not in the brain. The goal of this research is to uncover how the immune system influences Alzheimer’s disease predisposition and progression.

We are tackling this problem by building computational models that combine human clinical data, large-scale genetics, epigenomics, and mouse models of neurodegeneration. These experiments aim to uncover biomarkers for Alzheimer’s disease and provide the foundation for immune-based therapeutics.


Computational Biology Department Head Russell Schwartz is directing a research project to apply methods from phylogenetics (i.e., the inference of evolutionary trees) to identify patterns of evolution of cell lineages within cancers. Cancer is a class of diseases defined by the accumulation of successive mutations that disable normal controls on cell growth, allowing rapid cell proliferation that typically results in tumor growth and eventually metastasis. A single tumor is therefore a kind of evolving population, characterized by rapid mutation and strong selection for cells that can proliferate quickly and evade normal cellular controls on growth or anti-tumor therapies. This observation led to an approach called tumor phylogenetics (Desper et al., 1999) in which one applies phylogenetic algorithms to genetic profiles of tumors to infer evolutionary trees that describe the common features of progression across tumors. These trees then describe the sequences of mutations, also known as progression pathways, that characterize common tumor types. These tumor phylogenies can help researchers identify common features among seemingly different tumors, discover particularly important steps in evolution, and gain insight into the mechanistic reasons why particular patterns of progression seem to recur across many patients. It is hoped that such insights will help us identify new targets for anti-tumor drugs and develop better diagnostic tests to determine which patients are likely to benefit from which therapies.

This work arose from a collaboration between Schwartz and Dr. Stanley Shackney and is now being carried out in collaboration with Alejandro Schäffer, Kerstin Heselmeyer-Haddad, and Thomas Ried of the National Institutes of Health. A particular focus of their work is the question of cell-to-cell heterogeneity in single tumors. The Schwartz lab has pursued this problem from several directions. These include the use of direct single-cell data, with a special focus on fluorescence in situ hybridization (FISH) data, as well as algorithmic work on single-cell sequence data sets. A second direction is the use of deconvolution (unmixing) methods to reconstruct the genomics of cell subpopulations from bulk genomic data. Their work spans modeling and algorithmic directions in better representing the process of tumor evolution and heterogeneity and improving algorithms to fit data to those models, machine learning methods for applying these reconstructions to predicting tumor progression and patient prognosis, and application of these methods with research collaborators.
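A flavor of tree building from single-cell data: a minimal UPGMA-style (average-linkage) clustering of binary mutation profiles under Hamming distance. The profiles are made up, and the lab’s actual tumor-phylogeny algorithms are considerably more sophisticated than this generic clustering:

```python
# Hypothetical single-cell mutation profiles (1 = mutation present).
profiles = {
    "cell1": (1, 0, 0, 0),
    "cell2": (1, 1, 0, 0),
    "cell3": (1, 1, 1, 0),
    "cell4": (1, 0, 0, 1),
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def upgma(profiles):
    """Repeatedly merge the closest pair of clusters (average linkage),
    returning a nested-tuple tree."""
    clusters = {name: [p] for name, p in profiles.items()}
    tree = {name: name for name in profiles}
    while len(clusters) > 1:
        names = sorted(clusters)
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda pr: sum(hamming(p, q)
                               for p in clusters[pr[0]]
                               for q in clusters[pr[1]])
                           / (len(clusters[pr[0]]) * len(clusters[pr[1]])),
        )
        merged = clusters.pop(a) + clusters.pop(b)
        node = (tree.pop(a), tree.pop(b))
        key = a + "+" + b
        clusters[key] = merged
        tree[key] = node
    return next(iter(tree.values()))

tree = upgma(profiles)
```

On these profiles the tree groups the cells that accumulated mutations sequentially (cell1, cell2, cell3) and places the divergent cell4 on its own branch, mirroring how tumor phylogenies separate subclonal lineages.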

A talk by Schwartz at the Simons Institute for the Theory of Computing on some of his directions in FISH-based tumor phylogenetics is shown below.