Pesquisa | Secretaria de Estado da Saúde

Embeddings from protein language models predict conservation and variant effects.

Marquet, Céline; Heinzinger, Michael; Olenyi, Tobias; Dallago, Christian; Erckert, Kyra; Bernhofer, Michael; Nechaev, Dmitrii; Rost, Burkhard.

Hum Genet ; 141(10): 1629-1647, 2022 Oct.

Artigo em Inglês | MEDLINE | ID: mdl-34967936

RESUMO

The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient-MCC-for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA , and PredictProtein.

Assuntos

COVID-19 , SARS-CoV-2 , Algoritmos , Aminoácidos , COVID-19/genética , Humanos , Idioma , Proteoma , SARS-CoV-2/genética

Assessing the role of evolutionary information for enhancing protein language model embeddings.

Erckert, Kyra; Rost, Burkhard.

Sci Rep ; 14(1): 20692, 2024 09 05.

Artigo em Inglês | MEDLINE | ID: mdl-39237735

RESUMO

Embeddings from protein Language Models (pLMs) are replacing evolutionary information from multiple sequence alignments (MSAs) as the most successful input for protein prediction. Is this because embeddings capture evolutionary information? We tested various approaches to explicitly incorporate evolutionary information into embeddings on various protein prediction tasks. While older pLMs (SeqVec, ProtBert) significantly improved through MSAs, the more recent pLM ProtT5 did not benefit. For most tasks, pLM-based outperformed MSA-based methods, and the combination of both even decreased performance for some (intrinsic disorder). We highlight the effectiveness of pLM-based methods and find limited benefits from integrating MSAs.

Assuntos

Evolução Molecular , Proteínas , Alinhamento de Sequência , Proteínas/metabolismo , Proteínas/genética , Proteínas/química , Alinhamento de Sequência/métodos , Biologia Computacional/métodos , Algoritmos , Software , Análise de Sequência de Proteína/métodos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

Detalhe da pesquisa