New Search Engine Solves Massive Genetic Data Retrieval Challenge

Nov 07, 2025

The internet has Google, and now the field of biology has MetaGraph. This search engine can rapidly sift through the massive amounts of biological data stored in public databases. The related research was published in Nature on October 8th. “This is a remarkable achievement,” said Rayan Chikhi of the Institut Pasteur in France. “They have set a new standard for analyzing raw biological data.” This data includes DNA, RNA, and protein sequences, sourced from databases potentially containing quadrillions of DNA bases, equivalent to petabytes of information—a volume that even surpasses all the webpages in Google’s vast index.

Although MetaGraph has been called the “DNA Google,” Chikhi prefers to compare it to a “YouTube search engine” because the computational tasks behind it are more challenging. Just as a YouTube search can retrieve every video featuring a “red balloon” even if that keyword doesn’t appear in the title, tags, or description, MetaGraph can find patterns hidden deep within massive sequencing datasets without those genetic patterns having been explicitly annotated beforehand.

MetaGraph was initially developed to address the accessibility issues of sequencing datasets. Over the past few decades, the size of biological databases has exploded, posing challenges for scientists using this data—raw sequencing reads are fragmented, noisy, and too vast to search directly. “Paradoxically, the data volume has become the biggest obstacle to actually using this data,” noted Artem Babaian from the University of Toronto, Canada.

André Kahles of ETH Zurich in Switzerland, a co-corresponding author of the paper, said that MetaGraph can help researchers pose biological questions to databases such as the Sequence Read Archive (SRA). The SRA, a public database, already contains more than 100 quadrillion DNA bases. The research team solved the data retrieval challenge using mathematical “graphs.”

These graphs connect overlapping DNA fragments, much as sentences that share the same words line up together in a book’s index. The researchers integrated data from seven publicly funded databases to build a sequence collection covering all biological groups, including viruses, bacteria, fungi, plants, animals, and humans. This collection comprises 18.8 million unique DNA and RNA sequence sets and 21 billion amino acid sequence sets. They also developed a search engine for these sequences, allowing users to search the integrated raw data archives simply through text prompts.
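To make the graph idea concrete, here is a minimal Python sketch of an annotated k-mer index, a toy stand-in for the kind of graph described above. It is not MetaGraph's actual implementation: the function names, the sample labels, the toy reads, and the choice of k = 5 are all assumptions made for illustration. Each read is cut into overlapping fragments of length k, each fragment remembers which samples it came from, and a query is answered by checking how many of its own fragments each sample contains.

```python
from collections import defaultdict

K = 5  # toy fragment (k-mer) length, chosen only for this illustration


def kmers(seq, k=K):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(samples, k=K):
    """Map each k-mer to the set of sample labels whose reads contain it."""
    index = defaultdict(set)
    for label, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer].add(label)
    return index


def successors(index, kmer):
    """Graph edges: indexed k-mers that overlap kmer by its last k-1 bases."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if (suffix + base) in index]


def search(index, query, k=K):
    """Return, per sample, the fraction of the query's k-mers it contains."""
    query_kmers = kmers(query, k)
    hits = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    return {label: n / len(query_kmers) for label, n in hits.items()}


# Hypothetical toy data: two overlapping reads in sample_A, one in sample_B.
samples = {
    "sample_A": ["ACGTACGGA", "CGGATTC"],
    "sample_B": ["TTGACCA"],
}
index = build_index(samples)

# The query spans both sample_A reads, yet every one of its k-mers is found.
print(search(index, "GTACGGATT"))  # {'sample_A': 1.0}
```

In this toy, the query is not contained in either read on its own. It is found only because the two reads overlap and the index links them; `successors(index, "ACGGA")` returns `["CGGAT"]`, the edge that joins the end of the first read to the start of the second. A production system would also need compression and error handling, which this sketch leaves out.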

“This is a completely new way to interact with this type of data,” Kahles said. “The data is compressed but allows for instant access.” To demonstrate MetaGraph’s practical value, the research team used it to search 241,384 human gut microbiome samples from around the world for genetic markers of antibiotic resistance. This study builds upon previous work in which researchers used an older version of MetaGraph to track resistant bacterial strains in the subway systems of major cities worldwide.

The gut microbiome analysis took only about an hour on a high-performance computer. MetaGraph isn’t the only large-scale sequence retrieval tool available. For instance, Chikhi and Babaian co-developed a platform called Logan, which assembles billions of short sequencing reads into longer, more organized DNA fragments. This design allows it to identify complete genes and their variants within sets of sequencing reads that can be larger than those handled by MetaGraph. “Our tool has fewer features but greater performance,” Chikhi said.

With its broader search capabilities, Logan helped researchers discover over 200 million naturally occurring “plastic-eating enzyme” variants originating from various bacteria, fungi, and insects, some of which are even more active than lab-designed enzymes. This finding was posted on the preprint server bioRxiv in September. Babaian believes that such discoveries rely on open-source retrieval tools and the public sequencing databases they depend on. Currently, some biological databases are facing threats of funding cuts. He stresses that these advances in retrieval technology underscore that “open data sharing is crucial.” “These resources are driving global scientific progress and have opened up a whole new field of ‘petabyte-scale genomics’,” he said.