New AI Tool Could Help Fight Rare Diseases
By Marylee Williams
Identifying the genetic variants associated with certain diseases, such as lung cancer or Crohn's Disease, could lead doctors to new medicines or targeted patient treatments.
To make meaningful connections between variants in the human genome and a specific trait or disease, researchers need access to genetic data from a large number of patients. But if the disease is rare, getting the genetic information that could lead to medical breakthroughs poses a challenge.
To help overcome that challenge and do more with limited data, researchers have developed a new deep learning method, KGWAS, that improves the detection of genetic variants and associated traits for rare diseases, potentially enabling the discovery of new drugs or treatments.
Researchers from Carnegie Mellon University's School of Computer Science, Stanford University, the Broad Institute of MIT and Harvard University, and biomedical research organizations detail their discovery in "Small-Cohort GWAS Discovery With AI Over Massive Functional Genomics Knowledge Graph."
GWAS refers to genome-wide association study, a method of scanning the genome of large groups of people to identify certain genetic variants associated with particular diseases or traits.
"GWAS is vital to the entire drug-discovery ecosystem," said Martin Zhang, an assistant professor in CMU's Ray and Stephanie Lane Computational Biology Department. "By design, it works by collecting the genetic information for a bunch of people, and then correlating the genetic mutations with the disease status. But you need to see a lot of people with the disease to do the correlation. If you only see one person with the disease, then the correlation will be low, and you won't have a lot of statistical power to faithfully detect the associations. For rare diseases that affect only .1% or even .01% of the population, GWAS is fundamentally limited."
When performing GWAS analyses, genetic information from 100,000 to a million people may be available. If about 10,000 people within that sample have a certain trait or disease, a researcher can confidently identify a correlation between a mutation and that disease. But for rarer diseases, that number could be somewhere between 300 and 1,000, making a correlation difficult to detect.
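To make the effect of case counts concrete, the following is a minimal sketch of a textbook power approximation for a single-variant test, not the analysis used in the study. The function name, the allele frequencies (20% in controls, 22% in cases) and the cohort sizes are illustrative assumptions. The significance threshold of 5e-8 is the conventional genome-wide cutoff, which the article does not state.

```python
from statistics import NormalDist


def gwas_power(n_cases, n_controls, freq_controls, freq_cases, alpha=5e-8):
    """Approximate power of a two-sided z-test comparing allele frequencies.

    Hypothetical helper for illustration only. Each person contributes
    two alleles, so the allele counts are 2 * n_cases and 2 * n_controls.
    """
    std_normal = NormalDist()
    # Standard error of the difference in allele frequencies.
    se = (
        freq_cases * (1 - freq_cases) / (2 * n_cases)
        + freq_controls * (1 - freq_controls) / (2 * n_controls)
    ) ** 0.5
    # Expected size of the test statistic under the assumed effect.
    ncp = abs(freq_cases - freq_controls) / se
    # Critical value for the assumed genome-wide significance threshold.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(ncp - z_crit) + std_normal.cdf(-ncp - z_crit)


# Assumed biobank of 500,000 people with the same variant effect in both cases.
power_common = gwas_power(10_000, 490_000, 0.20, 0.22)
power_rare = gwas_power(500, 499_500, 0.20, 0.22)
print(f"10,000 cases: power = {power_common:.3f}")
print(f"   500 cases: power = {power_rare:.6f}")
```

Under these assumed numbers, the same variant that is very likely to be detected with 10,000 cases is almost never detected with 500, even though the total cohort size is unchanged.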
To generate a sample of these rarer groups, researchers can go to hospitals and specifically seek out people with certain diseases. While this technique, called a case-control study, can be a solution for rare diseases, the process requires extensive effort. For a disease like Alzheimer's, people have the resources and drive to gather case-control data. But for something like myasthenia gravis (a rare autoimmune disorder that affects roughly 35,000 people in the U.S.), researchers can't feasibly create a case-control study. Instead, they rely on large-scale biobanks that have genetic information on hundreds of thousands of people.
In their current study, which also included Computational Biology Department faculty member Andreas Pfenning, researchers developed a new method called Knowledge Graph GWAS (KGWAS) that combines a variety of genetic information to make associations between gene variants and specific traits for rarer diseases. The knowledge graph combines information from GWAS with information about a gene's function and interaction, known as comprehensive functional genomics data.
"There are so many different technologies to measure the same thing," said Kexin Huang, a doctoral student at Stanford University's Computer Science Department. "All of these measurements capture some part of the biology of the gene. Since we wanted to improve the power of GWAS, we decided to bring as much information as possible to the process. So we needed a framework to unify these measurement technologies. The knowledge graph is a natural way to bridge everything."
In this work, the knowledge graph links the functions and interactions between genetic variants, genes and gene programs, which are predefined groups of genes with shared functions. The study's KGWAS knowledge graph is one of the largest to date, with 11 million links between genetic variants, genes and gene programs. KGWAS then trains an AI model to use the knowledge graph to predict the likelihood or strength of an association between each genetic variant and a given disease based on aggregate GWAS evidence. Along with predicting associations, the method also cuts through the noise in the data, improving its ability to distinguish true disease-associated variants from false positives.
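As a rough illustration of what such a graph looks like as a data structure, here is a toy sketch with three kinds of nodes (variants, genes and gene programs). It is not the KGWAS model or its data: every identifier, relation name and helper function below is hypothetical, and the plain breadth-first search stands in for the trained AI model only to show how variants become connected through shared genes and programs.

```python
from collections import defaultdict

# Hypothetical typed edges: (source, relation, target).
EDGES = [
    ("variant:rs_A", "maps_to", "gene:G1"),
    ("variant:rs_B", "maps_to", "gene:G1"),
    ("variant:rs_C", "maps_to", "gene:G2"),
    ("gene:G1", "member_of", "program:P1"),
    ("gene:G2", "member_of", "program:P1"),
]


def build_graph(edges):
    """Store the links as an undirected adjacency map."""
    graph = defaultdict(set)
    for source, _relation, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def variants_within(graph, start, max_hops):
    """Breadth-first search for other variants within max_hops links."""
    seen = {start}
    frontier = {start}
    for _ in range(max_hops):
        frontier = {nbr for node in frontier for nbr in graph[node]} - seen
        seen |= frontier
    return sorted(n for n in seen if n.startswith("variant:") and n != start)


graph = build_graph(EDGES)
# Two hops reach variants that map to the same gene.
print(variants_within(graph, "variant:rs_A", 2))
# Four hops also reach variants whose genes share a gene program.
print(variants_within(graph, "variant:rs_A", 4))
```

In this toy graph, a variant with weak evidence on its own is still linked to related variants through a shared gene or gene program, which is the kind of connection a model can draw on when data for a disease is limited.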
When applied to a rare disease with limited data, KGWAS can be used to make better predictions for the genetic variants linked to certain diseases. Researchers found that KGWAS identified up to 100% more statistically significant associations than state-of-the-art GWAS methods. It also achieved the same detection power with about 2.7 times fewer samples.
"KGWAS's applications are pretty diverse, ranging from helping in rare disease diagnosis to drug discovery," said Huang. "On the more technical side, it's also a change to the fundamental algorithm of human genetics. By making a better GWAS, we can unlock a variety of different downstream tasks. For rare diseases, KGWAS has the potential to make real improvements."
When researchers can make stronger connections between genetic variants and certain diseases, scientists can develop more targeted treatments.
"With KGWAS, we are trying to put everything together," Zhang said. "It's like a framework that can automatically transform the functional data we have into discoveries."
Aaron Aupperlee | 412-268-9068 | aaupperlee@cmu.edu