Carnegie Mellon University

What is Computational Biology?

Modern biology is in the middle of a paradigm shift…

Robert F. Murphy
Ray and Stephanie Lane Professor of Computational Biology Emeritus

Computational biology is the science that answers the question “How can we learn and use models of biological systems constructed from experimental measurements?” These models may describe what biological tasks are carried out by particular nucleic acid or peptide sequences, which gene (or genes) when expressed produce a particular phenotype or behavior, what sequence of changes in gene or protein expression or localization leads to a particular disease, and how changes in cell organization influence cell behavior. This field is sometimes referred to as bioinformatics, but many scientists use the latter term to describe the field that answers the question “How can I efficiently store, annotate, search and compare information from biological measurements and observations?” (This subject has been discussed previously by an early NIH task force report and by Raul Isea.)

A number of factors contribute to the confusion between the terms, including the fact that one of the top journals in computational biology is entitled “Bioinformatics” and that in German, for example, computer science is referred to as “Informatik” and computational biology is referred to as “Bioinformatik.” Some also feel that bioinformatics emphasizes the information flow in biology. In any case, the two fields are closely linked, since “bioinformatics” systems are typically needed to provide data to “computational biology” systems that create models, and the results of those models are often returned for storage in “bioinformatics” databases.

Computational biology is a very broad discipline, in that it seeks to build models for diverse types of experimental data (e.g., concentrations, sequences, and images) and biological systems (e.g., molecules, cells, tissues, and organs), and in that it uses methods from a wide range of mathematical and computational fields (e.g., complexity theory, algorithmics, machine learning, and robotics).

Perhaps the most important task that computational biologists carry out (and that training in computational biology should equip prospective computational biologists to do) is to frame biomedical problems as computational problems.  This often means looking at a biological system in a new way, challenging current assumptions or theories about the relationships between parts of the system, or integrating different sources of information to make a more comprehensive model than had been attempted before.  In this context, it is worth noting that the primary goal need not be to increase human understanding of the system; even small biological systems can be sufficiently complex that scientists cannot fully comprehend or predict their properties.  Thus the goal can be the creation of the model itself; the model should account for as much currently available experimental data as possible.  Note that this does not mean that the model has been proven, even if the model makes one or more correct predictions about new experiments.  With the exception of very restricted cases, it is not possible to prove that a model is correct, only to disprove it and then improve it by modifying it to incorporate the new results.

This view emphasizes the importance of machine learning for constructing models. In most current machine learning applications, statistical and computational methods are used to construct models from large existing datasets, and those models are then used to process new data. Examples include learning to classify spam emails, to enable fingerprint access to your phone, and to recognize human speech. However, an increasing number of machine learning applications don’t stop learning after their initial training. They can either learn from additional data as it becomes available or even choose what additional data they would like to learn from. This last area is termed active machine learning, and it promises to play a very important role in biomedical research in the coming years.
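To make the idea of a learner that chooses its own data concrete, here is a minimal sketch in Python. It is a toy illustration, not a method taken from this article. It assumes a hypothetical experiment, `run_experiment`, that reports whether a phenotype appears at a given dose, with a made-up hidden threshold of 0.62 on a dose scale from 0 to 1. The active learner always requests the dose it is least certain about, while the passive learner is handed doses picked at random.

```python
import random


def run_experiment(dose, true_threshold=0.62):
    """Hypothetical stand-in for a wet-lab measurement (an assumption):
    returns True if the phenotype appears at this dose."""
    return dose >= true_threshold


def active_learner(oracle, n_queries):
    """Estimate the threshold by always querying the most uncertain dose:
    midway between the highest dose known to give no phenotype and the
    lowest dose known to give one."""
    lo, hi = 0.0, 1.0  # the threshold is assumed to lie in this range
    for _ in range(n_queries):
        dose = (lo + hi) / 2
        if oracle(dose):
            hi = dose  # phenotype seen, so the threshold is at or below dose
        else:
            lo = dose  # no phenotype, so the threshold is above dose
    return (lo + hi) / 2


def passive_learner(oracle, n_queries, rng):
    """Estimate the threshold from doses chosen at random, with no say
    in which experiments are run."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        dose = rng.random()
        if oracle(dose):
            hi = min(hi, dose)
        else:
            lo = max(lo, dose)
    return (lo + hi) / 2


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the comparison is repeatable
    print("active estimate: ", active_learner(run_experiment, 10))
    print("passive estimate:", passive_learner(run_experiment, 10, rng))
```

In this toy setting each well-chosen experiment halves the range in which the threshold can lie, so ten queries pin it down to within about 0.0005. Randomly chosen experiments carry no such guarantee. Real active learning systems work with far richer models and noisy measurements, but the principle of selecting the most informative next experiment is the same.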

Once the problem has been framed, the second major task of computational biologists begins.  This is to borrow, refine, or invent methods to solve the problem.  Current computational biology research can be divided into a number of broad areas, mainly based on the type of experimental data that is analyzed or modeled.  Among these are analysis of protein and nucleic acid structure and function, gene and protein sequence, evolutionary genomics and proteomics, population genomics, regulatory and metabolic networks, biomedical image analysis and modeling, gene-disease associations, and development and spread of disease.