Carnegie Mellon University

2020 Pre-College Program in Computational Biology Curriculum

Because Carnegie Mellon University held its summer session online in 2020 due to COVID-19, we moved the 2020 offering of the Pre-College Program in Computational Biology to an online format.

Programming bootcamp

Preparatory materials from Professor Compeau's Programming for Lovers project were provided to admitted students in advance of the program to give them fundamental programming skills.  No previous programming experience was needed for students to be successful, and we were amazed at how strong our students became at programming after just a few weeks of preparatory material.

Our commitment to a cutting-edge remote classroom experience

In 2020, we transitioned our approach into an active learning model in which students continually work in teams to solve computational challenges and then apply their code to biological datasets to answer real research questions. We feel that the pedagogical aspects of the program improved significantly as a result, even though we taught the program remotely.

Our data sources

In the 2019 program, our students set sail on Pittsburgh’s three rivers to collect water samples and sequenced the DNA contained in those samples. The resulting data served as the primary dataset for Module 1 and Module 2 of the 2020 program.

We also realized that the COVID-19 pandemic presented us with an unprecedented opportunity to study a global biological event while it was happening. Accordingly, we captured real viral genome datasets from public databases and guided students through studying the novel coronavirus in Modules 3, 4, and 5.

Module 1: Diversity within the Three Rivers’ microbiome

When we capture DNA from an environmental sample, we obtain a "dictionary" of DNA fragments along with the number of times each fragment occurs. This dataset is not unlike cataloging the number of different species that you see in an ecosystem, except that rather than counting species (lions, tigers, bears, etc.), we are counting strings of DNA.
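As a minimal sketch of the "dictionary" described above, the following Python snippet counts how often each DNA string occurs in a list of reads. The function name and the toy reads are hypothetical and are not taken from the river samples.

```python
from collections import Counter


def count_fragments(reads):
    """Map each distinct DNA string to the number of times it occurs."""
    return Counter(reads)


# Hypothetical toy reads, not data from the river samples.
reads = ["ACGT", "TTGA", "ACGT", "GGCC", "ACGT", "TTGA"]
frequency_table = count_fragments(reads)

# List the fragments from most to least frequent.
for fragment, count in frequency_table.most_common():
    print(fragment, count)
```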

Once we have such a dictionary, we ask questions such as “How diverse is a given DNA sample?” and “How distinct are different samples from each other?” Both of these questions require us to think quantitatively to obtain a solution, and we will see that there are different ways of answering each one.  We then will convert our ideas to code and write a program that can determine the diversity present in the samples that we obtained.
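There are several reasonable ways to turn these two questions into numbers. The sketch below shows three common choices, which are assumptions for illustration rather than the specific measures used in the program: richness and Simpson's index for diversity within one sample, and a weighted Jaccard distance for comparing two samples. The sample tables are hypothetical.

```python
def richness(table):
    """Number of distinct fragments that occur at least once."""
    return sum(1 for count in table.values() if count > 0)


def simpsons_index(table):
    """Probability that two random draws (with replacement) are the same fragment.

    Values near 1 mean low diversity; values near 0 mean high diversity.
    """
    total = sum(table.values())
    if total == 0:
        raise ValueError("sample is empty")
    return sum((count / total) ** 2 for count in table.values())


def jaccard_distance(table_a, table_b):
    """Weighted Jaccard distance: 0 for identical samples, 1 for no shared fragments."""
    keys = set(table_a) | set(table_b)
    shared = sum(min(table_a.get(k, 0), table_b.get(k, 0)) for k in keys)
    combined = sum(max(table_a.get(k, 0), table_b.get(k, 0)) for k in keys)
    if combined == 0:
        raise ValueError("both samples are empty")
    return 1 - shared / combined


# Hypothetical frequency tables for two samples.
sample_a = {"ACGT": 3, "TTGA": 2, "GGCC": 1}
sample_b = {"ACGT": 1, "TTGA": 2, "CCAA": 3}

print(richness(sample_a))
print(round(simpsons_index(sample_a), 3))
print(round(jaccard_distance(sample_a, sample_b), 3))
```

The first two functions answer "how diverse is this sample?" in different ways, while the third answers "how distinct are these two samples?".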

Module 2: Mapping DNA to a database

Once we can read this dictionary of DNA strands from the different species present at the sampling location, a natural next step is to identify the species to which each strand of DNA corresponds. We will discuss algorithms for comparing strings and see how to identify bacterial species from only short strands of DNA.
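One simple way to compare strings is to slide a short read along each reference sequence and count mismatches at every position. The sketch below uses that idea to pick the closest reference; it is an illustrative assumption, not necessarily the method taught in the module, and the species names and sequences are made up.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_window_mismatches(read, reference):
    """Fewest mismatches between the read and any same-length window of the reference."""
    n = len(read)
    if n > len(reference):
        return None
    return min(
        hamming_distance(read, reference[i:i + n])
        for i in range(len(reference) - n + 1)
    )


def closest_species(read, database):
    """Return (species, mismatches) for the reference that matches the read best."""
    scores = {}
    for name, reference in database.items():
        mismatches = best_window_mismatches(read, reference)
        if mismatches is not None:
            scores[name] = mismatches
    if not scores:
        return None
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical reference database and read.
database = {
    "species_A": "ACGTACGTTAGC",
    "species_B": "TTGACCGGATAC",
}
print(closest_species("CGTTAGG", database))
```

This brute-force scan is easy to follow but slow for large databases, and it does not handle insertions or deletions.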

Module 3: Reconstructing a genome

To determine the genome of an RNA virus like the novel coronavirus, multiple copies of the virus's RNA are chopped into many short fragments, converted to DNA, and then read using cutting-edge DNA sequencing technologies that are the result of decades of effort and billions of dollars of research investment. Computational biologists must then use the overlaps between fragments to piece together the original genome. Students will learn and implement algorithms for accomplishing this task of genome assembly, and then apply their algorithms to assembling a real coronavirus genome.
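To illustrate how overlaps can be used to put fragments back together, here is a minimal sketch of a greedy approach: repeatedly merge the two fragments with the longest suffix-to-prefix overlap. This is a simplified heuristic chosen for illustration, assuming error-free reads from a single strand; it is not presented as the algorithm used in the module, and the fragments are hypothetical.

```python
def overlap_length(left, right, min_length=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_length - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Merge the best-overlapping pair until no overlap of at least min_overlap remains."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical error-free fragments of a short made-up sequence.
fragments = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(fragments))
```

Greedy merging can go wrong when a genome contains repeated regions, which is one reason practical assemblers rely on graph-based methods instead.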

Module 4: Gene identification and genome annotation

Once we have reconstructed a virus genome, we want to learn how the virus functions. Students will learn how to recognize the patterns corresponding to genes and "annotate" the virus genome with the locations of these genes. The annotation process is a vital first step for researchers to design vaccines to target the virus's proteins.
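One basic pattern associated with genes is an open reading frame: a start codon followed by a run of codons that ends at a stop codon. The sketch below scans the three forward reading frames of a DNA string for such stretches. It is a simplified illustration under stated assumptions (forward strand only, standard start and stop codons, a hypothetical minimum length), not a full gene finder, and the example sequence is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(genome, min_codons=3):
    """Return (start, end, frame) for each open reading frame on the forward strand.

    `start` is the index of the ATG, `end` is one past the stop codon, and
    min_codons is the minimum number of codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(genome) - 2, 3):
            codon = genome[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate gene
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next start codon in this frame
    return sorted(orfs)


# Hypothetical toy sequence containing one short open reading frame.
genome = "CCATGAAACCCGGGTAGTT"
for start, end, frame in find_orfs(genome):
    print(start, end, frame, genome[start:end])
```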

Module 5: Evolutionary tree construction

To understand the evolutionary relationships between organisms, we can generate evolutionary trees using sequencing data from multiple organisms. Students will learn about and implement algorithms for building evolutionary trees and learn how to apply their techniques to real data. They will build both a Tree of Life for bacteria living in Pittsburgh’s three rivers and a tree for coronaviruses, so that we can visualize the mutations acquired by the virus as it spread around the world.
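As one example of a tree-building algorithm, the sketch below uses average-linkage clustering (UPGMA): it repeatedly joins the two groups whose members are closest on average. This choice, the use of mismatch counts between pre-aligned sequences as the distance, and the sequences themselves are assumptions made for illustration. The function returns only the branching order as nested tuples, without branch lengths.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def upgma(sequences):
    """Build a tree (nested tuples of names) from a dict of name -> aligned sequence."""
    # Each cluster is (subtree, list of leaf names in that subtree).
    clusters = [(name, [name]) for name in sequences]

    def cluster_distance(x, y):
        # Average distance over all pairs of leaves, one from each cluster.
        pairs = [(a, b) for a in x[1] for b in y[1]]
        total = sum(hamming_distance(sequences[a], sequences[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        left, right = clusters[i], clusters[j]
        merged = ((left[0], right[0]), left[1] + right[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical aligned sequences labeled A through D.
sequences = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGTTCGG",
    "D": "TTGTTCAG",
}
print(upgma(sequences))
```

In this toy example, A and B differ at a single position and are joined first, while D differs most from the others and joins last.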

Final Project Presentations

The 2020 program culminated in each group producing a video presentation based on their findings on a particular topic in computational biology. View their videos below and on our YouTube playlist!

Finding Transcription Factor Binding Sites

Modeling Spread of Infectious Diseases

Designing Vaccines

Analyzing Gene Evolution in Heterogeneous Tumors

How Do We Identify Someone's Ethnic Background?

DNA Computing

Neural Networks and Tumor Detection

Using CryoEM to Determine 3-D Protein Structure

Game Theory in Evolution: The Optimization of Life


Computational Biology and Gene Regulation

Motifs in Biological Networks: A Brief Dive

Using Computers to Find Organelles