Carnegie Mellon University

Faculty Research Pages

Active learning of cell organization

A team led by Bob Murphy, a faculty member in the Computational Biology and Machine Learning Departments, is combining image-derived modeling methods with active learning to build a continuously updating, comprehensive model of protein localization. The organization of eukaryotic cells into distinct organelles and structures is critical to cell health and function, and alterations in protein localization have been associated with many diseases, including cancer. Obtaining a complete picture of the localization of all proteins in cells, and of how it changes under various conditions, is therefore an important but daunting task: there are on the order of a hundred cell types in the human body, tens of thousands of proteins expressed in each cell type, and over a million conditions (including the presence of potential drugs or disease-causing mutations). Automated microscopy can help by allowing large numbers of images to be acquired rapidly, but even with automation it will not be possible to directly determine the localization of all proteins in all cell types under all conditions. As described by Dr. Murphy in a Commentary in Nature Chemical Biology, the alternative is to construct a predictive model of protein localization from all existing information and use it iteratively to identify the experiments that are expected to maximally improve the model. This is an example of Active Learning, a powerful machine learning technique. The fundamental assumption, when applied to biological experiments, is that there are correlations between the results of experiments for different variables, and that learning these correlations can allow accurate prediction of results for experiments that have not been (and may never be!) done.

Preliminary results in test systems have been very encouraging. Using active learning on data from PubChem, a repository of results from high-throughput drug screens, nearly 60% of drug “hits” were found while doing only 3% of all possible experiments. Encouraging results have also been obtained in extensive computer simulations of learning the effects of many drugs on many proteins. Based on these results, active learning-driven automated imaging of cell lines expressing different fluorescently-tagged proteins is currently being carried out in Dr. Murphy’s laboratory. The project draws on a range of cutting-edge methods, and collaborators and consultants include a number of faculty members from the Machine Learning and Computational Biology Departments, including Chris Langmead, Aarti Singh, Gustavo Rohde, Jaime Carbonell, and Jeff Schneider.
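The active learning loop itself is simple to sketch. The toy example below is not the Murphy lab's actual system: the threshold "oracle" and one-dimensional experiment pool are invented for illustration. It repeatedly fits a small logistic model to the experiments run so far and always runs the experiment whose outcome the model is least certain about (uncertainty sampling):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_logistic(labeled, epochs=200, lr=0.5):
    """Fit a 1-D logistic model p(hit | x) = sigmoid(w*x + b) by gradient ascent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in labeled:
            err = y - sigmoid(w * x + b)
            w += lr * err * x
            b += lr * err
    return w, b

def active_learning(pool, oracle, budget):
    """Uncertainty sampling: always run the experiment the model is least sure about."""
    # seed with the two extreme candidates
    labeled = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
    unlabeled = list(pool[1:-1])
    for _ in range(budget):
        w, b = train_logistic(labeled)
        # pick the candidate whose predicted probability is closest to 0.5
        x = min(unlabeled, key=lambda u: abs(sigmoid(w * u + b) - 0.5))
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))   # "run" the chosen experiment
    return labeled, train_logistic(labeled)
```

Because uncertainty sampling concentrates queries near the model's decision boundary, the unknown threshold is pinned down with far fewer experiments than exhaustively testing the whole pool.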

Algorithms for Large-scale Genomics

Carl Kingsford’s group works on developing novel algorithms for various computational problems that arise when trying to understand how the genome encodes information and how it affects phenotypes. The focus of our research is defining new computational problems and providing rigorous, innovative solutions to them. Occasionally, we encode those rigorous solutions in software so that the approaches can be more widely used. We are primarily interested in developing new computer science to solve important biological problems.

We have developed techniques for estimating isoform expression levels from RNA-seq (our Sailfish and Salmon methods), fast ways to search thousands of short-read sequencing experiments (our Sequence Bloom Tree data structure), accurate approaches for estimating properties of the spatial structure of chromatin (our Armatus algorithm), transcript assembly approaches (Scallop and Scallop-LR), methods for human leukocyte antigen (HLA) typing via whole-genome sequencing (Kourami), formalized methods for finding structural variants from RNA-seq (Squid), and computational techniques for identifying coverage anomalies (SAD), among many other new computational methods.
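The idea behind searching many sequencing experiments with a Sequence Bloom Tree can be sketched in a few lines. The toy below substitutes exact Python sets for Bloom filters (the real data structure uses compact probabilistic filters), and the experiment names and sequences are invented; the key point is that a subtree is pruned whenever its union filter already fails the query threshold:

```python
def kmers(seq, k=4):
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

class Node:
    def __init__(self, filt, name=None, children=()):
        self.filt, self.name, self.children = filt, name, children

def build_sbt(experiments):
    """Build a binary tree over experiments; each internal node stores the
    union of the k-mer sets below it (a Bloom filter in the real structure)."""
    nodes = [Node(kmers(seq), name=name) for name, seq in experiments.items()]
    while len(nodes) > 1:
        a, b = nodes.pop(0), nodes.pop(0)
        nodes.append(Node(a.filt | b.filt, children=(a, b)))
    return nodes[0]

def query(node, qkmers, theta=1.0):
    """Return experiments containing at least a theta fraction of the query
    k-mers, skipping any subtree whose union cannot meet the threshold."""
    hits = sum(1 for k in qkmers if k in node.filt)
    if hits < theta * len(qkmers):
        return []                      # prune this entire subtree
    if not node.children:
        return [node.name]
    return [n for c in node.children for n in query(c, qkmers, theta)]
```

Because the union filter at an internal node is a superset of every leaf below it, pruning never discards a true hit; with thousands of experiments, most of the tree is never touched.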

Algorithms in Nature

Recent advances in our ability to generate and analyze biological data, including new genomics and imaging techniques, allow us to design experiments that test in more detail how biological systems operate; in turn, we can use the inferred activity to develop and apply new algorithms for computational problems. Ziv Bar-Joseph’s group has been working on such bi-directional studies for the last six years. We have demonstrated the usefulness of such approaches by studying how Drosophila (fruit fly) cells solve the Maximum Independent Set problem, how the structure of yeast molecular interaction networks leads to resilience against noise and attacks, and how the brain uses pruning algorithms during development to construct adaptive and efficient neural networks. Most recently, we have looked at how bacterial cells solve the Distributed Gradient Descent problem when searching for food in a constrained environment. Understanding the computation performed by these processes has led to novel algorithms for distributed computational systems, some of which have already been used by engineers, for example in designing wireless networks.
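A fly-inspired solution to the Maximum Independent Set problem can be simulated in a few lines. The sketch below is a generic randomized election in that spirit, not the group's published algorithm: in each round every remaining node "fires" at random, and a node that fires while none of its neighbors fire joins the set and silences its neighbors, with no node ever needing to know the global graph:

```python
import random

def fly_mis(adj, p=0.2, max_rounds=1000, seed=1):
    """Distributed maximal-independent-set election on a graph given as
    adjacency sets. Each round, active nodes fire with probability p; a node
    that fires while no active neighbor fires joins the MIS, and its
    neighbors drop out of the competition."""
    random.seed(seed)
    active = set(adj)
    mis = set()
    for _ in range(max_rounds):
        if not active:
            break
        fired = {v for v in active if random.random() < p}
        for v in fired:
            if v in active and not (fired & adj[v] & active):
                mis.add(v)          # elected: no competing neighbor fired
                active.discard(v)
                active -= adj[v]    # neighbors are silenced
    return mis
```

The appeal of this style of algorithm is that it needs only local communication and coin flips, which is exactly the regime of wireless networks and of cells signaling their neighbors.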

Bringing machine learning to the clinic

Chris Langmead’s group is working to bring machine learning to clinical applications. Genomic medicine promises to revolutionize the way patients are treated, enabling quantification of risks that leads to proactive preventative care, and creating treatments that are precisely tailored to a particular person. Efforts are now underway to obtain not only the genome sequences of thousands of individuals, but also separate sequences of healthy cells and of cancer tumor cells from the same individual. As the cost of sequencing continues to drop, it will become routine to sequence patients in a medical setting. However, a number of computational and practical challenges still block the use of genomic sequencing for clinical decision making.

Due to a lack of good predictive models and of computational tools connecting disease expertise with machine learning, the adoption of genetic predictions has been slow, except in a few well-studied cases. In this project, we aim to increase the use of predictive systems based on machine learning techniques and genomics. First, in collaboration with UPMC, we are developing models for predicting outcomes in prostate cancer patients. Second, we are creating a user-friendly, automated computational pipeline that will make it easier for clinical researchers to build and refine predictive models for many diseases. This pipeline will also make it simple to share and apply these models at both large and small clinics.

Computational Medicine

Chris Langmead’s and Wei Wu’s groups are collaborating on the Computational Medicine project. Modern medicine increasingly relies on data science and predictive modeling to enhance our understanding of the basis of human disease and to guide treatment in a patient-specific fashion (personalized and precision medicine). The size, volume, heterogeneity, and complexity of the data available to physicians pose many computational and modeling challenges in pursuit of the overarching goal of improving human health. The groups develop novel methods and models to address these challenges. Specific examples include: triage algorithms for anticipating complications in acute care settings (e.g., pancreatitis, sepsis); biomarker discovery algorithms for genomic, proteomic, and electronic health record data (e.g., pancreatic cancer, melanoma, lung cancer, prostate cancer); treatment optimization methods that avoid drug resistance; and cohort design methods that reduce the cost of acquiring training data. Algorithmically, the groups use a combination of Machine Learning (including Probabilistic Graphical Models, Active Learning, and Deep Learning), Game Theory, and Dynamical Systems Modeling.

How The Genome Evolved for Vocal Learning and Speech Production

The Neurogenomics laboratory led by Assistant Prof. Andreas Pfenning is studying how the genome evolved for vocal learning, an aspect of speech production that sets us apart from other primates.

Although nothing is more central to our identity as human beings than behavior, we still don’t know how the billions of nucleotides in our genome have evolved to give us complex thoughts or the ability to communicate those thoughts through speech. The component of speech that gives us a window into the genetic basis of behavior is vocal learning, which is the ability to reproduce complex vocalizations. Although speech itself is uniquely human, vocal learning is an example of convergent evolution, having evolved independently in multiple species, including songbirds/parrots, hummingbirds, bats, elephants, and also in humans relative to chimpanzees. This allows the Neurogenomics Laboratory to take a comparative genomic approach to understanding how vocal learning evolved: what features do the genes and the genomes of vocal learning species have in common relative to those without the ability?

We approach this problem using a wide set of techniques. The Neurogenomics laboratory develops and adapts computational methods to identify regulatory elements in the genome associated with vocal learning behavior. We then use genomic techniques, such as high-throughput reporter assays and genome editing, to validate and refine the computational models. The results of these experiments will not only yield insights into how human speech evolved, but also provide a foundation for others who are aiming to connect genome sequence to organism traits using the wealth of new genomic data available.

Molecular Design and Simulation

The ability to design novel proteins and small molecules is essential to the pharmaceutical, energy, and other industries. Chris Langmead’s group has developed novel methods for revealing the physical factors that govern the properties of natural molecules, and for efficiently searching through combinatorial spaces to find optimal designs for a given task. Our methods encode the design space using graphical models (including graphical games) and then use message-passing methods to perform key tasks, such as free-energy calculations, resistance modeling, and protein design. We also use Molecular Dynamics simulations to further evaluate and refine our designs.
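On a chain-structured design space, the message-passing idea reduces to exact dynamic programming. The sketch below is a generic min-sum pass over a hypothetical two-letter alphabet with made-up energy tables, not the group's actual software: it finds the minimum-energy sequence given per-position energies and an adjacent-pair interaction energy:

```python
def design_chain(alphabet, unary, pairwise):
    """Exact min-energy assignment on a chain-structured model via min-sum
    message passing (dynamic programming) with backpointers for decoding.
    unary[i][a] is the energy of residue a at position i; pairwise(a, b) is
    the interaction energy of adjacent residues a, b."""
    n = len(unary)
    msg = [{} for _ in range(n)]    # msg[i][a]: best energy of positions i..n-1
    back = [{} for _ in range(n)]
    for a in alphabet:
        msg[n - 1][a] = unary[n - 1][a]
    for i in range(n - 2, -1, -1):  # pass messages from the end of the chain
        for a in alphabet:
            b = min(alphabet, key=lambda b: pairwise(a, b) + msg[i + 1][b])
            back[i][a] = b
            msg[i][a] = unary[i][a] + pairwise(a, b) + msg[i + 1][b]
    a = min(alphabet, key=lambda a: msg[0][a])
    energy, seq = msg[0][a], [a]
    for i in range(n - 1):          # follow backpointers to decode the design
        a = back[i][a]
        seq.append(a)
    return "".join(seq), energy
```

On a chain the message-passing solution is exact and linear in sequence length; on graphs with cycles, which real protein design problems have, the same message updates become an approximate but often effective inference scheme.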

Statistical Machine Learning Methods for Genetical Genomics Analysis

The standard approach for identifying gene networks is based on experimental perturbations of gene regulatory systems, such as gene knock-down experiments, followed by genome-wide profiling of differential gene expression. However, this approach is significantly limited: it is not possible to perturb more than one or two genes simultaneously to discover complex gene interactions, or to distinguish between direct and indirect downstream regulation of the differentially expressed genes. As an alternative, genetical genomics studies have been proposed to treat naturally occurring genetic variants as potential perturbations of the gene regulatory system and to recover gene networks via analysis of population gene-expression and genotype data. Seyoung Kim’s group is developing a statistical framework for learning gene networks that overcomes the limitations of experimental perturbation methods and addresses the computational challenges of genetical genomics analysis. Our computational approach is based on probabilistic graphical models, and we are developing efficient learning and inference algorithms that can be applied to datasets with tens of thousands of gene expression measurements and millions of genetic variants. We are applying our computational techniques to study gene regulatory networks and their genetic control mechanisms in yeast and humans.
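A minimal illustration of separating direct from indirect regulation is the partial correlation test that underlies Gaussian graphical models; this is a drastic simplification of the group's actual framework, with simulated genes. Two genes driven by a common regulator are strongly correlated, yet nearly uncorrelated once the regulator is controlled for, so no direct edge is inferred between them:

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z. A value near zero
    suggests the x-y association is indirect (mediated by z)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# simulated toy data: regulator z drives both "genes" x and y
random.seed(0)
z = [random.gauss(0, 1) for _ in range(500)]
x = [v + random.gauss(0, 0.3) for v in z]
y = [v + random.gauss(0, 0.3) for v in z]
```

In a full Gaussian graphical model this computation generalizes to the inverse covariance matrix, whose zero entries correspond to absent edges; the statistical and computational difficulty is doing this at the scale of tens of thousands of genes and millions of variants.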

The Immune Basis of Alzheimer’s Disease

The Neurogenomics laboratory led by Assistant Prof. Andreas Pfenning is studying how the immune system influences Alzheimer’s disease predisposition.

Alzheimer’s disease is a complex neurodegenerative disorder that has an enormous impact on society but is still poorly understood. Dr. Pfenning’s previous research has shown that sequence mutations associated with Alzheimer’s disease are more likely to be found in regions of the genome that are active in immune cells than in regions active within the brain. The goal of this research is to uncover how the immune system influences Alzheimer’s disease predisposition and progression.

We are tackling this problem by building computational models that combine human clinical data, large-scale genetics, epigenomics, and mouse models of neurodegeneration. These experiments aim to uncover biomarkers for Alzheimer’s disease and provide the foundation for immune-based therapeutics.

Tumor Phylogeny

Computational Biology Department Head Russell Schwartz is directing a research project to apply methods from phylogenetics (i.e., the inference of evolutionary trees) to identify patterns of evolution of cell lineages within cancers. Cancer is a class of diseases defined by the accumulation of successive mutations that disable normal controls on cell growth, allowing rapid cell proliferation that typically results in tumor growth and eventually metastasis. A single tumor is therefore a kind of evolving population, characterized by rapid mutation and strong selection for cells that can proliferate quickly and evade normal cellular controls on growth or anti-tumor therapies. This observation led to an approach called tumor phylogenetics (Desper et al., 1999), in which one applies phylogenetic algorithms to genetic profiles of tumors to infer evolutionary trees that describe the common features of progression across tumors. These trees describe the sequences of mutations, also known as progression pathways, that characterize common tumor types. Tumor phylogenies can help researchers identify common features among seemingly different tumors, discover particularly important steps in evolution, and gain insight into the mechanistic reasons why particular patterns of progression recur across many patients. It is hoped that such insights will help identify new targets for anti-tumor drugs and lead to better diagnostic tests that determine which patients are likely to benefit from which therapies.
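As a minimal illustration of building a tree from genetic profiles, the sketch below applies generic average-linkage clustering to Hamming distances between invented binary mutation profiles; real tumor phylogenetics uses specialized evolutionary models and algorithms, but the input-output shape is the same:

```python
def hamming(p, q):
    """Number of mutation sites at which two profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def upgma(profiles):
    """Average-linkage hierarchical clustering of mutation profiles.
    profiles: dict mapping sample name -> tuple of 0/1 mutation calls.
    Returns the tree as nested tuples of sample names."""
    clusters = [(name, [name]) for name in profiles]   # (subtree, leaf names)

    def avg(m1, m2):
        # exact average pairwise distance between two clusters' leaves
        return sum(hamming(profiles[a], profiles[b])
                   for a in m1 for b in m2) / (len(m1) * len(m2))

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: avg(clusters[ij[0]][1], clusters[ij[1]][1]))
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((t1, t2), m1 + m2))
    return clusters[0][0]
```

With profiles that accumulate mutations step by step, the recovered tree groups samples by how far along the progression pathway they are.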

This work arose from a collaboration between Schwartz and Dr. Stanley Shackney and is now being carried out in collaboration with Alejandro Schäffer, Kerstin Heselmeyer-Haddad, and Thomas Ried of the National Institutes of Health. A particular focus of their work is the question of cell-to-cell heterogeneity within single tumors. The Schwartz lab has pursued this problem from several directions. These include the use of direct single-cell data, with a special focus on fluorescence in situ hybridization (FISH) data as well as algorithmic work on single-cell sequence data sets. A second direction is the use of deconvolution (unmixing) methods to reconstruct the genomics of cell subpopulations from bulk genomic data. Their work spans modeling and algorithmic directions in better representing the process of tumor evolution and heterogeneity, improving algorithms that fit data to those models, machine learning methods for applying these reconstructions to predict tumor progression and patient prognosis, and application of these methods with research collaborators.
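The unmixing idea has a simple closed form in the two-subpopulation case. Assuming a bulk profile that is a convex combination of two known subpopulation profiles (the numbers below are invented, and real methods must also infer the profiles themselves and handle noise), recovering the mixing fraction is a one-variable least-squares problem:

```python
def unmix_two(bulk, p1, p2):
    """Least-squares estimate of the fraction f of subpopulation 1, assuming
    bulk[i] = f * p1[i] + (1 - f) * p2[i] at every genomic site i.
    Rearranging: (bulk - p2) = f * (p1 - p2), so f is the projection of
    (bulk - p2) onto (p1 - p2)."""
    num = sum((b - q) * (p - q) for b, p, q in zip(bulk, p1, p2))
    den = sum((p - q) ** 2 for p, q in zip(p1, p2))
    return num / den
```

With more than two subpopulations, or with the profiles unknown, this becomes a constrained matrix factorization, which is where the modeling and algorithmic work lies.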

A recent talk by Schwartz at the Simons Institute for the Theory of Computing on some of his directions in FISH-based tumor phylogenetics is shown below.