Carnegie Mellon University

Machine Learning for Scientists

Course Number: 02-620

With advances in scientific instruments and high-throughput technology, scientific discoveries are increasingly made by analyzing large-scale data generated from experiments or collected from observational studies. Machine learning methods that have been widely used to extract complex patterns from large speech, text, and image data are now routinely applied to answer scientific questions in biology, bioengineering, and medicine. This course is intended for graduate students interested in learning machine learning methods for scientific data analysis and modeling. It will cover classification and regression techniques such as logistic regression, random forest regression, Gaussian process regression, decision trees, and support vector machines; unsupervised learning methods such as clustering algorithms, mixture models, and hidden Markov models; probabilistic graphical models and deep learning methods; and learning theory topics such as PAC learning and VC dimension. The course will focus on applications of these methods in genomics and medicine. Programming skills and basic knowledge of linear algebra, probability, and statistics are assumed.

Course Relevance: Graduate students in computational biology and graduate students who are interested in machine learning methods for scientific data analysis.

Background Knowledge: Programming skill. Basic knowledge of linear algebra, probability, and statistics.

Key Topics:

  • Supervised learning methods: logistic regression, random forest regression, Gaussian process regression, decision trees, support vector machines, and regularization
  • Unsupervised learning methods: clustering algorithms, mixture models, hidden Markov models, and the EM algorithm
  • Probabilistic graphical models: Bayesian networks and undirected probabilistic graphical models
  • Deep learning methods: convolutional neural nets, graph neural nets, neural nets as Gaussian processes
  • Learning theories: probably approximately correct (PAC) learning and Vapnik-Chervonenkis (VC) dimension

 

Semester(s): Spring
Units: 12
Prerequisite(s): 02-680 or an equivalent class

Learning Objectives

Students who complete this course will be able to:
  • understand the concepts behind different learning strategies and implement the corresponding algorithms
  • understand the strengths and weaknesses of each learning method and apply this knowledge to find the right methods for data analysis tasks at hand
  • identify and apply appropriate algorithms and learning strategies to answer scientific questions from data
  • interpret the results of applying machine learning methods to data

Assessment Structure: 

Coursework will consist of the following components. No late assignments will be accepted.

Homework assignments. (45% of grade) Written homework assignments will test your knowledge of the material covered in class.

Attendance and participation. (10% of grade) Attendance will be taken, and we will have occasional in-class exercises that serve to reinforce the concepts we have covered. These exercises will not be graded, but participation will be expected in order to receive a complete grade for that day. You are allowed three “dropped” attendance grades without penalty. These can be used for any purpose.

Examinations. (45% of grade) The exams will test your knowledge of the material from the class. The two midterms will be held in class, and the final exam will be held during the university’s scheduled time. The exam dates are:

  • Midterm 1 (15% of grade): Feb 21 in class
  • Midterm 2 (15% of grade): April 3 in class
  • Final exam (15% of grade): Time and location TBD (will be posted when set by university)

The midterms will not be cumulative: midterm 2 will cover material encountered after midterm 1. The final exam will cover the material from the entire semester.
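As a rough illustration of how the weights above combine, the following Python sketch computes an overall course grade. Only the percentages and the three dropped attendance grades come from this syllabus. The function names, the equal weighting of homework assignments, the choice to drop the three lowest attendance marks, and the full credit given when no attendance days remain are all assumptions made for the example, not course policy.

```python
# Weights taken from the syllabus; everything else here is an illustrative assumption.
WEIGHTS = {
    "homework": 0.45,
    "attendance": 0.10,
    "midterm1": 0.15,
    "midterm2": 0.15,
    "final": 0.15,
}
DROPPED_ATTENDANCE = 3  # number of attendance grades dropped without penalty


def attendance_score(daily):
    """Average the daily marks (0.0 to 1.0) after dropping the three lowest.

    Assumption: if no days remain after dropping, full credit is given.
    """
    kept = sorted(daily)[DROPPED_ATTENDANCE:]
    return sum(kept) / len(kept) if kept else 1.0


def course_grade(homework, daily_attendance, midterm1, midterm2, final):
    """Return the weighted course grade as a fraction between 0 and 1.

    All scores are fractions between 0 and 1.
    Assumption: every homework assignment counts equally.
    """
    components = {
        "homework": sum(homework) / len(homework),
        "attendance": attendance_score(daily_attendance),
        "midterm1": midterm1,
        "midterm2": midterm2,
        "final": final,
    }
    return sum(WEIGHTS[name] * score for name, score in components.items())
```

For example, `course_grade([0.8, 0.9], [1, 1, 1, 0, 0, 0, 1], 0.7, 0.9, 0.8)` averages the homework to 0.85, drops the three missed days so attendance counts as 1.0, and combines these with the three exam scores using the weights above.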