Exploring the “Dark Matter” of Human Chemistry: New Nature Paper Shows How AI Can Reveal New Metabolites

Published

Jan 27, 2026

While the past few decades have seen unparalleled advancements in medical research, much of what goes on in the human body still remains a mystery. Now, a landmark study published in the prestigious scientific journal Nature, co-authored by Amii researchers, is revealing how artificial intelligence is accelerating our understanding of one of the human body’s biggest mysteries: unknown metabolites. It is research that holds significant promise for the future of disease diagnosis and drug discovery.

The paper, Language model-guided anticipation and discovery of mammalian metabolites, was published on January 14 in Nature by an international team including members from the University of Alberta, the University of British Columbia, and Princeton University. The team included U of A computing science student Fei Wang, working alongside his supervisors: Amii Fellow and Canada CIFAR AI Chair Russ Greiner, and David S. Wishart, a metabolomics researcher and Distinguished University Professor at the University of Alberta.

In their paper, the researchers introduce DeepMet, a language model trained specifically on chemical structures, which helped reveal dozens of previously unknown metabolites in human and mouse samples.

“We are trying to solve this thing called the dark matter of chemistry, where there are an estimated hundreds of thousands of unknown molecular structures in a biological system,” Wang says.


Metabolites: a snapshot of the body

Metabolites are tiny molecules, such as sugars and amino acids, that play a vital role in our bodies. They are either produced internally by the body or broken down from whatever it takes in through breathing, drinking, eating, or taking drugs.

Metabolites are a key part of what allows a body to function: some provide the energy an organism uses, while others are used to build the structures within a body. Still others send signals that allow different parts of the body to communicate with one another, and play many other vital roles.

Metabolites are also one of the ways that medical professionals determine what is happening inside a patient’s body. Metabolic tests are an extremely common diagnostic tool: if you’ve ever had a blood or urine test as part of a check-up, there’s a good chance that it was checking the level of some specific metabolites. Blood sugar levels, which measure the levels of the metabolite glucose, are just one such common test.

Understanding the metabolic pathways in the body — the routes that some metabolites take to carry signals between parts of the body — could lead to new pharmaceutical drugs that take advantage of those pathways to work more effectively.

“Genes are like the blueprint for your body, describing what you were born with. Metabolites are what's happening in your body right now, which is clearly important,” Greiner says.


A vast unknown territory

Even though metabolites are important to healthcare, we know very few of them. Scientists estimate that we have identified around 3,000 to 4,000 metabolites in the human body. But there are hundreds of thousands, maybe even millions, of others that have not yet been identified. We can see evidence of chemicals in our blood and tissue that don't match any known metabolite, so we know they are there. But because we don’t know their chemical structures, we do not know their role or use.

Identifying new metabolites is expensive and time-consuming, Wang explains. To do so, you need to know two things: what atoms the metabolite is made of, and how those atoms fit together (that is, the molecule’s structure).

Researchers might take a human blood sample and isolate the various molecules in it. Then, they’ll use methods such as mass spectrometry, which breaks down the molecules into smaller (then even smaller) fragments, producing a spectrum showing the masses of these fragments. They can then compare the resulting spectrum to a database that contains the spectra corresponding to each of a large set of already-known metabolites. 
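The database lookup described above can be sketched in a few lines of Python. This is only an illustration, assuming each spectrum is stored as a mapping from fragment mass to intensity and that similarity is measured with a cosine score; the compound names and numbers below are made up for the example and are not real measurements, and the function names are hypothetical.

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra stored as {fragment_mass: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[m] * spec_b[m] for m in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, library):
    """Return (name, score) of the library spectrum most similar to the query."""
    name = max(library, key=lambda n: cosine_similarity(query, library[n]))
    return name, cosine_similarity(query, library[name])

# Illustrative library: invented compounds with invented fragment intensities.
library = {
    "compound_A": {44: 0.3, 73: 1.0, 89: 0.5},
    "compound_B": {41: 0.8, 56: 1.0, 84: 0.2},
}

# A measured spectrum that resembles compound_A but is not identical to it.
query = {44: 0.25, 73: 1.0, 89: 0.6}

name, score = best_match(query, library)
print(name, round(score, 3))
```

A query that shares no fragments with any library entry scores zero everywhere, which is the situation the next paragraph describes for unknown molecules.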

Unfortunately, this only works for known molecules with known spectra. This is why we have computational tools that go the other way: rather than predict the molecule from the spectrum, these tools instead predict the spectrum of a given molecule. Then we can add that pairing to the database of known [molecule, spectrum] pairs. 

But the challenge is deciding what molecules to investigate. The unexplored chemical space is vast: it is estimated that there are one hundred quinvigintillion (that’s 10^80, or 1 followed by 80 zeros) possible small molecules. Only a tiny fraction of these are metabolites that appear in mammals. How do we know which ones we should be considering?

Discovery with a SMILES

In the Nature paper, the research team introduces a new tool, called DeepMet, to suggest likely candidates.

DeepMet is a chemical language model trained to do just that. Most people are familiar with large language models like ChatGPT, which are trained on immense datasets of natural language text. When you ask ChatGPT to generate a paragraph, it uses what it has learned about sentence structure, grammar, and word meanings to build sentences piece-by-piece, predicting what words are most likely to follow other words.

DeepMet works similarly. Instead of words, the researchers took a database of known metabolites and represented their chemical structures with short sequences of characters, called simplified molecular-input line-entry system (SMILES) strings. DeepMet was then trained on the SMILES strings from about 2,000 known mammal metabolites, which allowed it to learn the logic underlying a metabolite’s chemical structure.
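To make the idea of SMILES strings concrete, here is a small Python illustration using a few well-known small molecules. The character-by-character tokenizer is an assumption made for simplicity; the article does not say how DeepMet itself splits strings into tokens.

```python
# SMILES strings for a few familiar small molecules.
# Letters stand for atoms, "=" for a double bond, and parentheses for branches.
examples = {
    "ethanol": "CCO",
    "glycine": "NCC(=O)O",
    "lactic acid": "CC(O)C(=O)O",
    "urea": "NC(N)=O",
}

def tokenize(smiles):
    """Split a SMILES string into single characters (a simplifying assumption)."""
    return list(smiles)

# The "alphabet" a character-level model would see for these examples.
vocab = sorted({tok for s in examples.values() for tok in tokenize(s)})

for name, smiles in examples.items():
    print(f"{name:12s} {smiles:14s} {tokenize(smiles)}")
print("vocabulary:", vocab)
```

One caveat worth keeping in mind: the same molecule can usually be written as more than one valid SMILES string, so real pipelines typically convert strings to a canonical form before comparing them.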

Using that knowledge, DeepMet is able to reverse the process: generating new potential metabolites by adding characters, one after another, to form a SMILES string, the same way a sentence is built with words. The model produces sequences of characters corresponding to chemical structures that might be new mammal metabolites. The model also gives each predicted structure a score to indicate the likelihood that it corresponds to a metabolite found in mammals.

The research team then used a modified version of some existing spectra-predicting tools, trained on known metabolites, to create the spectrum for each of these proposed metabolites. This produced a new set of millions of [molecule, spectrum] pairs, which can be used to identify a novel metabolite by matching its measured spectrum against this database. While this is a large number, it is much smaller than a hundred quinvigintillion! This means metabolite researchers have a much shorter list of much more likely candidates, greatly reducing the time and resources spent investigating dead ends.
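The generate-and-score idea can be shown with a deliberately tiny stand-in. The sketch below is not DeepMet: it swaps the neural language model for a simple bigram (previous character to next character) count model, trained on five hand-picked SMILES strings, purely to show what "build a string one character at a time" and "score a string by its likelihood" mean. All function names are hypothetical, and the generated strings are not guaranteed to be valid molecules.

```python
import math
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # markers for the beginning and end of a string

def train_bigram(smiles_list):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for s in smiles_list:
        seq = [START] + list(s) + [END]
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def log_likelihood(counts, smiles, vocab_size):
    """Score a string: sum of log-probabilities of each character given the previous one."""
    seq = [START] + list(smiles) + [END]
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        row = counts.get(prev, Counter())
        # Add-one smoothing so unseen transitions get a small, non-zero probability.
        p = (row[nxt] + 1) / (sum(row.values()) + vocab_size)
        total += math.log(p)
    return total

def generate(counts, rng, max_len=30):
    """Sample a new string one character at a time until END or max_len."""
    out, prev = [], START
    while len(out) < max_len:
        row = counts[prev]
        nxt = rng.choices(list(row), weights=list(row.values()))[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

training = ["CCO", "NCC(=O)O", "CC(O)C(=O)O", "NC(N)=O", "CC(=O)O"]
model = train_bigram(training)
vocab_size = len({c for s in training for c in s}) + 1  # characters plus END

rng = random.Random(0)  # fixed seed so runs are repeatable
for _ in range(5):
    candidate = generate(model, rng)
    print(candidate, round(log_likelihood(model, candidate, vocab_size), 2))
```

Strings that resemble the training data receive higher scores than arbitrary character sequences, which is the property that lets a model of this kind rank its own suggestions.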

A lot of work still needs to be done to synthesise and test these DeepMet predictions, to see if they are viable molecules, and yet more work to confirm that they appear in a biological sample.

To test the accuracy of their model, the team held back many known mammal metabolites from DeepMet’s training set to see if the learned model could then accurately predict their existence. They found that the model’s most frequent predictions matched those withheld metabolites about 29% of the time. 
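A held-out check of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's evaluation protocol: it assumes the model's output is a list of sampled structures, ranks them by how often each was generated, and reports the fraction of the top-ranked ones that appear in the withheld set. The single-letter identifiers stand in for molecular structures, and the function name is hypothetical.

```python
from collections import Counter

def top_k_hit_rate(samples, held_out, k):
    """Fraction of the k most frequently generated structures found in the held-out set."""
    most_common = [s for s, _ in Counter(samples).most_common(k)]
    if not most_common:
        return 0.0
    hits = sum(1 for s in most_common if s in held_out)
    return hits / len(most_common)

# Placeholder identifiers; in practice these would be canonical structure strings.
samples = ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E"]
held_out = {"A", "C", "Z"}

print(top_k_hit_rate(samples, held_out, k=4))
```

In a real evaluation the structures would need to be written in a canonical form first, since two different strings can describe the same molecule.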

DeepMet enabled the discovery of 36 previously unrecognized mammalian metabolites across mouse tissues and human biofluids. For context, Wang says that identifying a single unknown metabolite without DeepMet is an arduous task that could take a dedicated researcher several months or even years.

“Before this, you're a single person with one fishing rod, sitting there forever, hopefully one day you get something from the sea of molecules,” Wang says. “Now, it feels like we're building a fishing trolley with a sonar, roaming around autonomously in the oceans of chemicals, right? At some point, it's going to require human intervention, but it's just much more efficient.”

Building the foundations of AI-assisted metabolite discovery

Greiner says that DeepMet is still in the “newborn baby” phase, and we are only now building the foundation for AI-assisted metabolite discovery. But he thinks the potential for the technology could be vast. A better understanding of the metabolites in the human body, and of their roles, could provide answers to questions about our health that we haven’t even thought to ask. Greiner says it could lead to new biomarkers that would aid in diagnosing conditions or help monitor whether a treatment is effective.

He also says it could aid in the discovery of new drugs, as well as a deeper understanding of how they travel through the body. That could lead to safer, more effective treatments, as well as sharply cutting down on the time needed to research new pharmaceutical treatments.

Eventually, Greiner believes this kind of research will get us to a grand goal: a full understanding of all the metabolites in the human body, with the kind of potential and impact that the Human Genome Project had when it was completed two decades ago.

“Before you can get to these results, you need to get the foundations. And this tool provides some of these critical ideas,” he says.

