How Can Genomics Benefit From Machine Learning?

February 14, 2022 • April Miller


Genomics is the branch of molecular biology concerned with the structure, function, evolution, and mapping of genomes. The field has advanced significantly in recent years. The Human Genome Project announced a completed map of the human genome in 2003. Despite that monumental achievement, the project left roughly 8% of the genetic information in human DNA entirely unexplored. Scientists declared the human genome fully mapped in 2021, but that is only part of the equation. We still need the resources to understand all of that information, and that’s where machine learning comes in.

Machine learning algorithms improve their performance as they’re exposed to more data. This technology is beginning to make an impact in all sorts of industries. How can the study of genomics benefit from machine learning? Where can we expect the technology to go in the future?

Applying Machine Learning to Genomics

The human genome contains more than 3 billion base pairs. Together, these pairs encode the instructions that control how the human body forms and functions. Even though we can now map the entire human genome, there is still much that we don’t understand. Specifically, we need to know more about how these base pairs interact with one another and about the regions that control how each gene is expressed.

To put it another way, it’s like trying to find patterns by sorting through the information stored in a stack of 200 phone books. It’s not an impossible task, but it is a challenging one. Manual analysis of this kind is possible but can take an incredibly long time, and the human mind has limits on how much data it can hold and compare at once. The potential for human error makes things more difficult still. One error is all it takes to throw off an entire project, and researchers might think that things are going well until the project ends and they discover the mistake.

Machine learning programs can be an invaluable boon for this sort of genetic research. These systems can learn and grow as they consume more data. From there, they can begin to draw conclusions, find patterns, and help to make sense of the massive amounts of data that exist within the human genome. 
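As a rough illustration of what "finding patterns" in sequence data can mean, the sketch below counts short overlapping substrings (often called k-mers) in a DNA string. Counts like these are one common way to turn raw sequence into numbers a model can learn from. The function name kmer_counts and the toy sequence are assumptions made up for this example, not anything from a specific genomics tool.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count every overlapping substring of length k in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence invented for illustration (not real genomic data).
toy_sequence = "ATGCGATGCA"
counts = kmer_counts(toy_sequence, k=3)

# Show the most frequent 3-letter patterns and how often each appears.
print(counts.most_common(2))
```

On a real genome the same idea applies at a vastly larger scale, which is where automated methods earn their keep.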

Applications for Machine Sequenced Genomics

Machine learning has numerous applications in genomics across different therapeutic pipelines. One study broke the potential applications down into four pipelines:

  • Target Discovery: Biomedical research that allows researchers to create therapeutics for novel disease targets.
  • Therapeutic Discovery: Screens for the development and creation of safe and potent therapeutics. 
  • Clinical Study: Evaluating those therapeutics through in vitro and in vivo testing, as well as through clinical trials.
  • Post-Market Study: Studying the long-term efficacy of these treatments once they’re ready for the wide market.

Let’s take a closer look at these four therapeutic pipelines. 

Target Discovery

The target discovery pipeline has two primary sub-pipelines. The first improves our overall understanding of human biology, and the second works to identify druggable biomarkers. Our understanding of human biology has expanded dramatically in the last few decades. Despite this, there is still a lot that we simply do not yet understand. This is especially true where things like gene expression are concerned.

Researching druggable biomarkers gives us a better understanding of how different medications and chemicals impact the human body. This, in turn, can help researchers make the changes necessary to create more effective treatments. These druggable biomarkers are the foundation of pharmacogenomics. This practice allows doctors to prescribe specific medications based on each patient’s sequenced genome.

Therapeutic Discovery

The therapeutic discovery pipeline opens doorways for context-specific drug response, another aspect of the pharmacogenomics field mentioned earlier, and starts to lay the foundation for gene therapy.

In gene therapy, machine learning has the potential to help predict both the on-target and off-target effects of CRISPR therapies, as well as to help design the viral vectors that could eventually be used as a delivery method for those therapies.

Clinical Study

The clinical study pipeline is where things start to get exciting. There are three sub-pipelines here that researchers can work with: animal-to-human translation, cohort curation, and causal effects.

Many early stages of clinical trials and testing are conducted with animal subjects. Researchers have to take the information they’ve collected from those tests and translate it into data that can be applied to human subjects. A machine learning program has the potential to do that for them in a fraction of the time.

Cohort curation makes it easier for researchers to find clinical trial participants that fit a specific demographic. This isn’t always necessary, especially for products that are being designed for the general population, but for treatments that are designed to target a specific genetic demographic, this can save a lot of time and wasted effort.
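To make the cohort idea concrete, here is a minimal sketch that selects trial candidates who carry a given genetic variant and fall within an age range. It is a plain rule-based filter, not a trained model; a machine learning system would aim to learn eligibility from records rather than follow hand-written rules. The Candidate class, the curate_cohort function, and the variant labels VAR_A and VAR_B are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    participant_id: str
    age: int
    variants: frozenset  # hypothetical variant labels carried by this person


def curate_cohort(candidates, required_variant, min_age, max_age):
    """Keep candidates who carry the variant and fall inside the age range."""
    return [
        c for c in candidates
        if required_variant in c.variants and min_age <= c.age <= max_age
    ]


# Made-up participants for illustration only.
candidates = [
    Candidate("P001", 34, frozenset({"VAR_A"})),
    Candidate("P002", 61, frozenset({"VAR_A", "VAR_B"})),
    Candidate("P003", 45, frozenset({"VAR_B"})),
    Candidate("P004", 29, frozenset({"VAR_A"})),
]

cohort = curate_cohort(candidates, "VAR_A", min_age=30, max_age=65)
print([c.participant_id for c in cohort])
```

Here P003 is dropped for lacking the variant and P004 for falling outside the age range, which is the kind of screening that otherwise has to be done by hand.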

Finally, the causal effects sub-pipeline lets us start making the most of one of a machine learning system’s greatest skills: prediction. These systems can’t tell the future, but if they have access to enough information, they can start to identify patterns and predict occurrences based on observations of the past.
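The idea of predicting from past observations can be sketched with one of the simplest methods available, a nearest-neighbor lookup: a new case receives the outcome of the most similar case seen before. The numbers, the outcome labels, and the predict function below are assumptions invented for illustration. Note that this only shows pattern-based prediction; it does not by itself establish that one thing causes another.

```python
import math

# Past observations: (made-up measurements for two genes, observed outcome).
history = [
    ((0.9, 0.2), "responded"),
    ((0.8, 0.3), "responded"),
    ((0.2, 0.7), "no response"),
    ((0.1, 0.9), "no response"),
]


def predict(features, history):
    """Return the outcome of the most similar past observation (1-nearest neighbor)."""
    nearest = min(history, key=lambda item: math.dist(features, item[0]))
    return nearest[1]


# A new, unseen case is matched against what was observed before.
print(predict((0.85, 0.25), history))
```

The more past observations the system has, the more likely a new case is to have a close match, which is why access to enough information matters so much.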

Post-Market Study

The work doesn’t stop once companies release a new therapeutic into the marketplace. Researchers and manufacturers need to continue to study the long-term effects of a new drug or treatment. The post-market study pipeline lets researchers collect and use real-world evidence as well as mine for data in clinical texts and biomedical literature. 

Machine learning, both as a whole and as it relates to genomics, is a fairly new field. The technology is far from perfect, and there is a lot of room for improvement. What are some of the most frequent challenges that researchers and scientists face when applying machine learning to genomics?

The Risk of Human Error

Regardless of advancements in technology, humans are still the minds behind machine learning systems. These systems can only be as accurate as their original programming. If one bit of code is off, a system could spend thousands of hours processing all that important information slightly wrong. It might not seem significant at first, but as time goes on and the database grows, the error will compound.

The Challenge of Teaching a New System

Machine learning systems are not unlike newborns when you first bring them online. They aren’t useful or even functional until you teach them how to think and start feeding them information. Thankfully, it doesn’t take as long to bring a machine learning system online as it does to raise a child, but it does take some effort to get these systems off the ground. 

Supervised vs Unsupervised Learning

Left unchecked, a machine learning system can easily sort through terabytes of data without batting a virtual eye, but we don’t want these systems to just start pulling in random data without any sort of guidance. In supervised learning, people label the training data so the system knows what to look for; in unsupervised learning, the system searches unlabeled data for structure on its own. Unsupervised learning might seem like the better option, but strict controls are necessary to ensure that these systems don’t inadvertently bog themselves down with random data that has nothing to do with genomics.
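The difference between the two approaches can be shown with a small sketch. The supervised function is handed human-provided labels and computes one average per label, while the unsupervised function runs a basic two-group clustering (2-means) with no labels at all. The data values, the label names, and both function names are assumptions made up for this example.

```python
from statistics import mean

# Toy 1-D measurements, invented for illustration.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]  # used only by the supervised path


def fit_supervised(values, labels):
    """Supervised: compute one centroid per human-provided label."""
    return {
        label: mean(v for v, lab in zip(values, labels) if lab == label)
        for label in set(labels)
    }


def fit_unsupervised(values, iterations=10):
    """Unsupervised: 2-means clustering, with no labels at all."""
    centroids = [min(values), max(values)]  # simple deterministic starting points
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            nearest = min((0, 1), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


named_centroids = fit_supervised(values, labels)
anonymous_centroids = fit_unsupervised(values)
print(named_centroids)
print(anonymous_centroids)
```

Both paths find the same two groups in this tidy example, but only the supervised one knows what the groups mean. On messy real-world data, an unsupervised method can just as easily group records by something irrelevant, which is why guidance and review matter.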

Looking Toward the Future of Genomics

Machine learning genomics is a field that is just starting to get off the ground, but it has the potential to vastly improve our understanding of the human genome. Once we can crack that code, both literally and figuratively, the potential applications for human medicine are limitless.