Search | VHL Regional Portal (Portal Regional da BVS)

Are genomic language models all you need? Exploring genomic language models on protein downstream tasks.

Boshar, Sam; Trop, Evan; de Almeida, Bernardo P; Copoiu, Liviu; Pierrot, Thomas.

Bioinformatics; 40(9), 2024 Sep 02.

Article in English | MEDLINE | ID: mdl-39212609

ABSTRACT

MOTIVATION: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that gLMs can process.

RESULTS: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts. The best performance was achieved using the retrieved CDS rather than sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data, and in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.

AVAILABILITY AND IMPLEMENTATION: We make our inference code, 3mer pre-trained model weights and datasets available.

Subjects

Genomics; Proteomics; Genomics/methods; Proteomics/methods; Proteins/metabolism; Proteins/chemistry; Humans
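
The abstract above contrasts 3mer and 6mer tokenization of coding sequences. The following is a toy sketch of what that difference means at the sequence level, not the authors' tokenizer: the function name `kmer_tokenize`, the example sequence, and the choice to emit leftover bases as single-nucleotide tokens are all assumptions made for illustration. It shows that non-overlapping 3-mers of an in-frame CDS coincide with codons, while 6-mers each span two codons.

```python
from typing import List


def kmer_tokenize(seq: str, k: int) -> List[str]:
    """Split a DNA sequence into non-overlapping k-mers.

    Assumption for this sketch: bases left over at the end
    (when len(seq) is not a multiple of k) become single-base tokens.
    """
    seq = seq.upper()
    n_full = len(seq) // k
    tokens = [seq[i * k:(i + 1) * k] for i in range(n_full)]
    tokens.extend(seq[n_full * k:])  # trailing bases, one token each
    return tokens


# Hypothetical 12-nt in-frame CDS: Met-Ala-Lys-Stop
cds = "ATGGCCAAATAA"
three_mers = kmer_tokenize(cds, 3)  # one token per codon
six_mers = kmer_tokenize(cds, 6)    # one token per codon pair
print(three_mers)
print(six_mers)
```

With 3-mers, each token maps to exactly one amino acid (or stop) when the sequence is in frame, which is one plausible reason such a vocabulary could suit protein tasks; the abstract itself reports only the empirical result.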

A foundational large language model for edible plant genomes.

Mendoza-Revilla, Javier; Trop, Evan; Gonzalez, Liam; Roller, Masa; Dalla-Torre, Hugo; de Almeida, Bernardo P; Richard, Guillaume; Caton, Jonathan; Lopez Carranza, Nicolas; Skwark, Marcin; Laterre, Alex; Beguir, Karim; Pierrot, Thomas; Lopez, Marie.

Commun Biol; 7(1): 835, 2024 Jul 09.

Article in English | MEDLINE | ID: mdl-38982288

ABSTRACT

Significant progress has been made in the field of plant genomics, as demonstrated by the increased use of high-throughput methodologies that enable the characterization of multiple genome-wide molecular phenotypes. These findings have provided valuable insights into plant traits and their underlying genetic mechanisms, particularly in model plant species. Nonetheless, effectively leveraging them to make accurate predictions represents a critical step in crop genomic improvement. We present AgroNT, a foundational large language model trained on genomes from 48 plant species with a predominant focus on crop species. We show that AgroNT can obtain state-of-the-art predictions for regulatory annotations, promoter/terminator strength, and tissue-specific gene expression, and can prioritize functional variants. We conduct a large-scale in silico saturation mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations and provide their predicted effects as a resource for variant characterization. Finally, we propose the use of the diverse datasets compiled here as the Plants Genomic Benchmark (PGB), providing a comprehensive benchmark for deep learning-based methods in plant genomic research. The pre-trained AgroNT model is publicly available on HuggingFace at https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b for future research purposes.

Subjects

Genome, Plant; Plants, Edible/genetics; Genomics/methods; Deep Learning; Manihot/genetics
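
The in silico saturation mutagenesis mentioned in the abstract above can be illustrated in general terms: every position of a sequence is substituted with each alternative base, and a model scores each mutant relative to the reference. The sketch below is a generic illustration under stated assumptions, not the AgroNT pipeline. The names `saturation_mutagenesis` and `gc_fraction` are hypothetical, and GC fraction is only a stand-in for a real model's prediction.

```python
from typing import Callable, Dict, Tuple

BASES = "ACGT"


def saturation_mutagenesis(
    seq: str, score: Callable[[str], float]
) -> Dict[Tuple[int, str, str], float]:
    """Score every single-nucleotide substitution of seq.

    Returns a mapping (position, ref_base, alt_base) -> score(mutant) - score(reference).
    """
    seq = seq.upper()
    ref_score = score(seq)
    effects = {}
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue  # skip the unchanged base
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, ref, alt)] = score(mutant) - ref_score
    return effects


def gc_fraction(seq: str) -> float:
    """Placeholder scorer (assumption): stands in for a model prediction."""
    return sum(base in "GC" for base in seq) / len(seq)


effects = saturation_mutagenesis("ACGT", gc_fraction)
print(len(effects))  # 3 substitutions per position
```

A sequence of length n yields 3n substitutions, so the number of model evaluations grows linearly with the total sequence length analysed.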

Early computational detection of potential high-risk SARS-CoV-2 variants.

Beguir, Karim; Skwark, Marcin J; Fu, Yunguan; Pierrot, Thomas; Carranza, Nicolas Lopez; Laterre, Alexandre; Kadri, Ibtissem; Korched, Abir; Lowegard, Anna U; Lui, Bonny Gaby; Sänger, Bianca; Liu, Yunpeng; Poran, Asaf; Muik, Alexander; Sahin, Ugur.

Comput Biol Med; 155: 106618, 2023 Mar.

Article in English | MEDLINE | ID: mdl-36774893

ABSTRACT

The ongoing COVID-19 pandemic is leading to the discovery of hundreds of novel SARS-CoV-2 variants daily. While most variants do not impact the course of the pandemic, some variants pose an increased risk when the acquired mutations allow better evasion of antibody neutralisation or increased transmissibility. Early detection of such high-risk variants (HRVs) is paramount for the proper management of the pandemic. However, experimental assays to determine immune evasion and transmissibility characteristics of new variants are resource-intensive and time-consuming, potentially leading to delays in appropriate responses by decision makers. Presented herein is a novel in silico approach combining spike (S) protein structure modelling and large protein transformer language models on S protein sequences to accurately rank SARS-CoV-2 variants for immune escape and fitness potential. Both metrics were experimentally validated using in vitro pseudovirus-based neutralisation tests and binding assays and were subsequently combined to explore the changing landscape of the pandemic and to create an automated Early Warning System (EWS) capable of evaluating new variants in minutes and risk-monitoring variant lineages in near real-time. The system accurately pinpoints the putatively dangerous variants by selecting on average less than 0.3% of the novel variants each week. The EWS flagged all 16 variants designated by the World Health Organization (WHO) as variants of interest (VOIs) if applicable or variants of concern (VOCs) otherwise, with an average lead time of more than one and a half months ahead of their designation as such.

Subjects

COVID-19; SARS-CoV-2; Humans; Pandemics; Benchmarking; Mutation
