Viruses are doing mysterious things everywhere – AI can help researchers understand what they’re up to in the oceans and in your gut

Viruses are a mysterious and poorly understood force in microbial ecosystems. Researchers know they can infect, kill and manipulate human and bacterial cells in almost any environment, from the oceans to your intestines. But scientists don’t yet have a complete picture of how viruses affect their environments, largely because viruses are extraordinarily diverse and evolve quickly.

Communities of microbes are difficult to study in a laboratory setting. Many microbes are difficult to grow, and their natural environments have many more characteristics that influence their success or failure than scientists can recreate in a laboratory.

So systems biologists like me often sequence all the DNA present in a sample – for example, a fecal sample from a patient – separate out the viral DNA sequences and then annotate the parts of the viral genome that code for proteins. These notes on the location, structure and other characteristics of genes help researchers understand the functions that viruses can perform in the environment and help identify different types of viruses. Researchers annotate viruses by matching viral sequences in a sample with previously annotated sequences available in public databases of viral genetic sequences.
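The matching step described above can be illustrated with a deliberately simple sketch. It is not the pipeline used in this work: real annotation relies on dedicated sequence-alignment tools, whereas this example only compares shared three-letter fragments (k-mers). The function names, the similarity threshold and the two reference sequences are all assumptions made up for illustration; the sequences are placeholder strings, not real viral proteins with those functions.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-letter fragments of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Fraction of fragments shared by two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate(query, reference, k=3, min_score=0.3):
    """Copy the label of the most similar reference sequence.

    Returns (label, score); label is None when nothing in the
    reference set is similar enough to the query.
    """
    query_kmers = kmers(query, k)
    best_label, best_score = None, 0.0
    for ref_seq, label in reference:
        score = jaccard(query_kmers, kmers(ref_seq, k))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_score:
        return None, best_score
    return best_label, best_score


# Hypothetical "database": placeholder sequences with made-up labels.
REFERENCE = [
    ("MKTAYIAKQRQISFVKSHFSRQ", "capsid"),
    ("MSLLTEVETPIRNEWGCRCNDS", "integrase"),
]

# One letter differs from the first reference entry, so the label transfers.
hit = annotate("MKTAYIAKQRQISFVKAHFSRQ", REFERENCE)
# Nothing in the reference resembles this query, so it stays unannotated.
miss = annotate("GGGGGGGGGGGGGGGG", REFERENCE)
print(hit, miss)
```

The second query shows the limitation of this kind of lookup: a sequence that resembles nothing in the database simply goes unannotated.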

However, scientists are identifying viral sequences in DNA collected from the environment at a rate that far exceeds our ability to annotate those genes. This means that researchers are publishing findings about viruses in microbial ecosystems using unacceptably small fractions of the available data.

To improve researchers’ ability to study viruses around the world, my team and I have developed a new approach to annotating viral sequences using artificial intelligence. Through protein language models that resemble large language models such as ChatGPT but are specific to proteins, we were able to classify previously unseen viral sequences. This opens the door for researchers to not only learn more about viruses, but also to answer biological questions that are difficult to answer with current techniques.

Annotating viruses with AI

Large language models use relationships between words in large data sets of text to provide potential answers to questions they have not explicitly “learned” the answer to. For example, if you ask a chatbot “What is the capital of France?”, the model does not look up the answer in a table of capitals. Instead, it uses its training on large data sets of documents and information to infer the answer: “The capital of France is Paris.”

Similarly, protein language models are AI algorithms trained to recognize relationships between billions of protein sequences from environments around the world. This training may help them deduce something about the essence of viral proteins and their functions.

We wondered whether protein language models could answer this question: “Given all annotated viral genetic sequences, what is the function of this new sequence?”

In our proof of concept, we trained neural networks on previously annotated viral protein sequences, as represented by pre-trained protein language models, and then used them to predict the annotation of new viral protein sequences. Our approach allows us to investigate what the model “sees” in a given viral sequence that leads to a particular annotation. This helps identify candidate proteins of interest, based either on their specific functions or on how their genome is arranged, narrowing the search space in huge data sets.
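To make the idea of classifying sequences from their numerical representations concrete, here is a minimal sketch under loud assumptions. The `embed` function is only a stand-in that counts amino-acid frequencies; in the approach described above, that step would be a pre-trained protein language model. The classifier is a nearest-centroid rule rather than a neural network, chosen so the example runs with the Python standard library alone. The training sequences and the labels `dna_binding` and `membrane` are invented for illustration.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(seq):
    """Stand-in embedding: the frequency of each of the 20 amino acids.

    A real pipeline would obtain this vector from a protein language model.
    """
    n = len(seq) or 1  # avoid dividing by zero for an empty sequence
    return [seq.count(aa) / n for aa in AMINO_ACIDS]


def train_centroids(labeled):
    """Average the embeddings of all training sequences sharing a label."""
    sums = {}
    counts = defaultdict(int)
    for seq, label in labeled:
        vec = embed(seq)
        previous = sums.get(label, [0.0] * len(vec))
        sums[label] = [p + v for p, v in zip(previous, vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(seq, centroids):
    """Assign the label whose centroid is closest to the new sequence."""
    vec = embed(seq)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Invented toy training set: sequences and labels are hypothetical.
TRAIN = [
    ("KKRKRKKAKR", "dna_binding"),
    ("RKKRRAKKKR", "dna_binding"),
    ("LLVIALLVIA", "membrane"),
    ("IVLLAAVLIL", "membrane"),
]

centroids = train_centroids(TRAIN)
print(predict("KRKKAKRRKK", centroids))
print(predict("LLIVVALLAI", centroids))
```

Unlike a database lookup, a classifier of this kind always proposes a label for a new sequence, which is why the quality of the training annotations matters so much.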


By identifying more distantly related viral gene functions, protein language models can complement current methods to provide new insights into microbiology. For example, my team and I were able to use our model to discover a previously unrecognized integrase – a type of protein that can move genetic information in and out of cells – in the globally distributed marine picocyanobacteria Prochlorococcus and Synechococcus. Specifically, this integrase can move genes in and out of these bacterial populations in the oceans and allow these microbes to better adapt to changing environments.

Our language model also identified a novel viral capsid protein that is widespread in the oceans. We produced the first picture of how its genes are arranged, which shows that the virus can carry different sets of genes. We think this indicates that the virus performs different functions in its environment.

These preliminary findings represent just two of the thousands of annotations generated by our approach.

Analyzing the unknown

Most of the hundreds of thousands of newly discovered viruses are still unclassified. Many viral genetic sequences correspond to protein families with no known function or have never been seen before. Our work shows that similar protein language models can help study the threat and promise of our planet’s many uncharacterized viruses.

Although our research focused on viruses in the global oceans, improved annotation of viral proteins is critical to better understanding the role viruses play in health and disease in the human body. We and other researchers have hypothesized that viral activity in the human gut microbiome may change when you are sick. If so, viruses could help identify stress in microbial communities.

However, our approach is also limited because it requires high-quality annotations. Researchers are developing newer protein language models that integrate other ‘tasks’ as part of their training, particularly predicting protein structures to detect similar proteins, to make them more powerful.

Overall, by making AI tools available in line with the FAIR Data Principles – data that is findable, accessible, interoperable and reusable – researchers can realize the potential of these new ways of annotating protein sequences, which could lead to discoveries that benefit human health.

This article is republished from The Conversation, an independent nonprofit organization providing facts and trusted analysis to help you understand our complex world. It was written by: Libusha Kelly, Albert Einstein College of Medicine


Libusha Kelly receives funding from the National Institutes of Health.
