Carnegie Mellon University

What is Automated Science?

Robert F. Murphy
Ray and Stephanie Lane Professor of Computational Biology Emeritus

Automated Science is the practice of scientific research without the need for significant human intervention.  The goal is to develop “self-driving instruments” along similar lines to self-driving cars.  Self-driving cars were enabled by the development of “drive-by-wire” technologies for automobiles combined with advances in machine learning.  Drive-by-wire instruments have been increasingly used in science over the past few decades; readily available laboratory automation systems can carry out large numbers of experiments by themselves and such instruments have been critical in areas like genome sequencing and drug screening.  However, the choice of what experiments to do using such instruments is typically left to a human scientist.  Advances in artificial intelligence and machine learning now make it possible to instead have experiments chosen by an intelligent system in order to improve the accuracy and efficiency with which science is advanced.

Early efforts toward automated science, including pioneering work by Herb Simon and colleagues at Carnegie Mellon, focused on trying to replicate the process by which humans derive scientific theories.  This was largely based on the enormous progress in the 19th century in mathematics and physics, disciplines in which all theories can be proven from an initial set of postulates.

The need for a new type of automated science stems from the realization that for many disciplines, like biology, there is no set of rules or laws that can be learned and from which everything else can be predicted.  This fundamentally changes the paradigm of scientific research from a hypothesis-driven search for such laws to a data-driven construction of empirical models.  Because many systems have too many variables and are too complex for humans to be able to think about, we need automated ways of constructing empirical models.  Active machine learning holds the key.

What is active machine learning and how does it work?

Most machine learning is passive: a large set of data is assembled, a “machine learner” constructs a model from it, and then the model is used.  Active machine learning involves giving the learner the ability to request new data that is not currently available.  This could take the form of answering a question (such as “Is this a picture of a sunset?”) or of making an experimental measurement.  The focus of most passive machine learning is trying to prove that a model is good (by testing its predictions), while the focus of active machine learning is trying to improve a model by getting new data.  This is typically done by using an iterative process until some goal is reached.

Much past and current science involves searching for the answer to a single question, such as which gene causes a disease or what drug can reverse it.  However, scientific problems increasingly end up being a search for a particular combination of characteristics or components.  For example, a set of genes might be involved in a particular process or disease and we want to learn that set, or we might want to find chemical compounds that are effective against a particular disease without causing side effects.  For illustration, imagine that we want to search for a combination of two drugs that blocks a particular disease symptom (because we have found that no single drug can do it).  We can view this search as a game of Battleship.  The “board” consists of a square matrix of size equal to the number of drugs we want to test (with one row and column for each drug) and we play the game by asking what happens when we do an experiment for a particular pair given by a row and column (to see if we get a “hit”).  Just like in the game, we don’t want to do all of the experiments (all of our ships would be sunk well before then!) so we want to find hits with as few experiments as possible.  The active learning approach would be to build a model from the first few experiments and use it to avoid doing experiments for which we think we can predict the answer; we instead do the ones that we can’t predict well.
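The Battleship-style search can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not a method from any particular study: the names predict, active_search and run_experiment are hypothetical, the “model” is a deliberately crude hit-rate estimate based on experiments that share a drug with the candidate pair, and each pair of drugs is treated as unordered.  At every step the loop runs the untested pair whose predicted outcome is closest to a coin flip.

```python
def predict(observed, i, j, prior=0.5):
    """Estimate P(hit) for pair (i, j) from results that share a drug with it.

    Toy model (an assumption for illustration): a smoothed hit rate.
    With no related results the estimate equals the prior.
    """
    related = [hit for (a, b), hit in observed.items() if {a, b} & {i, j}]
    return (sum(related) + prior) / (len(related) + 1)


def active_search(n_drugs, run_experiment, budget):
    """Run up to `budget` experiments, always picking the least predictable pair."""
    untested = {(i, j) for i in range(n_drugs) for j in range(i + 1, n_drugs)}
    observed = {}
    for _ in range(min(budget, len(untested))):
        # Uncertainty is highest when the predicted probability is near 0.5.
        # Sorting first makes tie-breaking deterministic.
        pair = min(sorted(untested),
                   key=lambda p: abs(predict(observed, *p) - 0.5))
        observed[pair] = run_experiment(*pair)
        untested.remove(pair)
    return observed


# Hypothetical ground truth: only drugs 1 and 3 together block the symptom.
hits = {(1, 3)}


def run_experiment(i, j):
    """Stand-in for a real laboratory measurement."""
    return (i, j) in hits


observed = active_search(5, run_experiment, budget=4)
print(observed)
```

A real system would replace predict with a learned model and run_experiment with an instrument call; the structure of the loop (predict, pick the most uncertain case, measure, update) stays the same.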

How can active machine learning help scientists do better science?

What are some examples of Automated Science?

Two studies from biomedical research are illustrative.  The first was a pioneering study that was based on extensive prior knowledge about yeast genetics and biochemistry.  Its goal was to efficiently identify which genes encode particular enzymatic activities required for yeast growth.  A team led by Ross King constructed a “Robot Scientist” named ADAM for this task.  They first constructed specialized hardware that could carry out a very simple type of experiment: measuring the growth of strains of yeast containing deletions in particular genes in the presence or absence of various metabolites.  They combined this with software to select experiments using logic programming.  This program maintains a set of competing hypotheses consistent with previous observations and prior knowledge, and then selects an experiment that is expected to invalidate as many of these hypotheses as possible.  The chosen experiment is then executed in an automated fashion and the new observations are used to select the next experiment.  ADAM correctly determined the functional roles of several genes in an automated fashion using fewer experiments than alternative techniques for selecting experiments (e.g., selecting the least expensive one).  You can watch a video of ADAM in action here.
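The idea of choosing the experiment expected to invalidate the most hypotheses can be illustrated with a small Python sketch.  This is not ADAM's software (which used logic programming); it is a toy version that assumes every surviving hypothesis is equally likely and that each hypothesis simply lists the outcome it predicts for each experiment.  All names and the example data are hypothetical.

```python
from collections import Counter


def expected_eliminated(hypotheses, experiment):
    """Expected number of hypotheses ruled out by running `experiment`.

    Assumes the true hypothesis is equally likely to be any survivor.
    If n of N hypotheses predict a given outcome, that outcome occurs with
    probability n / N and rules out the other N - n.
    """
    counts = Counter(h[experiment] for h in hypotheses.values())
    total = len(hypotheses)
    return sum(n / total * (total - n) for n in counts.values())


def run_loop(hypotheses, experiments, observe):
    """Repeatedly pick the most discriminating experiment and prune hypotheses."""
    remaining = dict(hypotheses)
    pending = list(experiments)
    history = []
    while len(remaining) > 1 and pending:
        exp = max(pending, key=lambda e: expected_eliminated(remaining, e))
        if expected_eliminated(remaining, exp) == 0:
            break  # no pending experiment can tell the survivors apart
        outcome = observe(exp)
        remaining = {k: h for k, h in remaining.items() if h[exp] == outcome}
        pending.remove(exp)
        history.append(exp)
    return remaining, history


# Hypothetical data: each hypothesis predicts growth (True/False) per experiment.
hypotheses = {
    "H_A": {"e1": True,  "e2": True,  "e3": False},
    "H_B": {"e1": True,  "e2": False, "e3": False},
    "H_C": {"e1": False, "e2": True,  "e3": False},
    "H_D": {"e1": False, "e2": False, "e3": True},
}
truth = hypotheses["H_C"]  # stand-in for what the robot would measure

remaining, history = run_loop(hypotheses, ["e1", "e2", "e3"],
                              observe=lambda e: truth[e])
print(sorted(remaining), history)
```

In this toy case, experiments e1 and e2 each split the four hypotheses evenly, so either is preferred over e3, which separates only one hypothesis from the other three.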

The second study was based on essentially no prior knowledge and used commercially available hardware.  The goal was to learn the effects of different drugs upon the distribution of different proteins within mammalian cells.  A combination of liquid handling robots and an automated microscope was used to execute experiments in which one drug was added to one cell line expressing a fluorescently tagged protein.  The output was a set of microscope images, and the choice of experiments was decided on the fly by constructing a predictive model for experiments that had not been done yet and using an active machine learning strategy to choose experiments for which the confidence of predictions was low.  It was the first use of computationally driven experimentation in which the set of possible outcomes was not known in advance.  The results showed that a model that was 92% accurate could be learned while doing only 29% of the possible experiments.  You can watch a video of the algorithm in action here.

What is so exciting about Automated Science?