Search | BVS Infectious and Parasitic Diseases

1.

Novel machine learning approaches revolutionize protein knowledge.

Bordin, Nicola; Dallago, Christian; Heinzinger, Michael; Kim, Stephanie; Littmann, Maria; Rauer, Clemens; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Trends Biochem Sci ; 48(4): 345-359, 2023 04.

Article in English | MEDLINE | ID: mdl-36504138

ABSTRACT

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

Subjects

Machine Learning; Proteins; Proteins/chemistry; Computational Biology/methods; Protein Conformation

2.

ProteomicsDB: toward a FAIR open-source resource for life-science research.

Lautenbacher, Ludwig; Samaras, Patroklos; Muller, Julian; Grafberger, Andreas; Shraideh, Marwin; Rank, Johannes; Fuchs, Simon T; Schmidt, Tobias K; The, Matthew; Dallago, Christian; Wittges, Holger; Rost, Burkhard; Krcmar, Helmut; Kuster, Bernhard; Wilhelm, Mathias.

Nucleic Acids Res ; 50(D1): D1541-D1552, 2022 01 07.

Article in English | MEDLINE | ID: mdl-34791421

ABSTRACT

ProteomicsDB (https://www.ProteomicsDB.org) is a multi-omics and multi-organism resource for life science research. In this update, we present our efforts to continuously develop and expand ProteomicsDB. The major focus over the last two years was improving the findability, accessibility, interoperability and reusability (FAIR) of the data as well as its implementation. For this purpose, we release a new application programming interface (API) that provides systematic access to essentially all data in ProteomicsDB. Second, we release a new open-source user interface (UI) and show the advantages the scientific community gains from such software. With the new interface, two new visualizations of protein primary, secondary and tertiary structure as well as an updated spectrum viewer were added. Furthermore, we integrated ProteomicsDB with our deep neural network Prosit, which can predict the fragmentation characteristics and retention time of peptides. The result is an automatic processing pipeline that can be used to reevaluate database search engine results stored in ProteomicsDB. In addition, we extended the data content with experiments investigating different human biology as well as a newly supported organism.

Subjects

Databases, Protein; Proteins/classification; Proteomics/classification; Software; Biological Science Disciplines; Humans; Neural Networks, Computer; Proteins/chemistry

3.

PredictProtein - Predicting Protein Structure and Function for 29 Years.

Bernhofer, Michael; Dallago, Christian; Karl, Tim; Satagopam, Venkata; Heinzinger, Michael; Littmann, Maria; Olenyi, Tobias; Qiu, Jiajun; Schütze, Konstantin; Yachdav, Guy; Ashkenazy, Haim; Ben-Tal, Nir; Bromberg, Yana; Goldberg, Tatyana; Kajan, Laszlo; O'Donoghue, Sean; Sander, Chris; Schafferhans, Andrea; Schlessinger, Avner; Vriend, Gerrit; Mirdita, Milot; Gawron, Piotr; Gu, Wei; Jarosz, Yohan; Trefois, Christophe; Steinegger, Martin; Schneider, Reinhard; Rost, Burkhard.

Nucleic Acids Res ; 49(W1): W535-W540, 2021 07 02.

Article in English | MEDLINE | ID: mdl-33999203

ABSTRACT

Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis, with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability; and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.

Subjects

Protein Conformation; Software; Binding Sites; Coronavirus Nucleocapsid Proteins/chemistry; DNA-Binding Proteins/chemistry; Phosphoproteins/chemistry; Protein Structure, Secondary; Proteins/chemistry; Proteins/physiology; RNA-Binding Proteins/chemistry; Sequence Alignment; Sequence Analysis, Protein

4.

Embeddings from protein language models predict conservation and variant effects.

Marquet, Céline; Heinzinger, Michael; Olenyi, Tobias; Dallago, Christian; Erckert, Kyra; Bernhofer, Michael; Nechaev, Dmitrii; Rost, Burkhard.

Hum Genet ; 141(10): 1629-1647, 2022 Oct.

Article in English | MEDLINE | ID: mdl-34967936

ABSTRACT

The emergence of SARS-CoV-2 variants stressed the demand for tools that allow interpreting the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, either consistently or statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
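
The two-state Matthews Correlation Coefficient quoted in this abstract can be computed from a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code; the function name and the toy labels (1 = conserved, 0 = variable) are assumptions made for illustration only.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Two-state MCC for binary labels (assumed here: 1 = conserved, 0 = variable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical toy labels, not data from the study.
observed = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1]
print(round(matthews_corrcoef(observed, predicted), 3))  # 0.333
```

An MCC of 1 means perfect agreement, 0 is no better than chance, and -1 is complete disagreement, which is why the reported 0.596 and 0.608 can be compared directly.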

Subjects

COVID-19; SARS-CoV-2; Algorithms; Amino Acids; COVID-19/genetics; Humans; Language; Proteome; SARS-CoV-2/genetics

5.

Clustering FunFams using sequence embeddings improves EC purity.

Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard.

Bioinformatics ; 37(20): 3449-3455, 2021 Oct 25.

Article in English | MEDLINE | ID: mdl-33978744

ABSTRACT

MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations. RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers, doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generating FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
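
The clustering idea in this abstract (DBSCAN over embedding distances, followed by a check of EC purity) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code from the FunFamsClustering repository: the 2-D points stand in for high-dimensional embeddings, Euclidean distance is assumed, and `eps`, `min_samples`, and the EC labels are hypothetical values chosen for the toy data.

```python
import math


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point; -1 marks outliers (noise)."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


def count_pure_clusters(labels, ec_numbers):
    """A cluster counts as pure when all of its members share one EC number."""
    groups = {}
    for label, ec in zip(labels, ec_numbers):
        if label != -1:
            groups.setdefault(label, set()).add(ec)
    return sum(1 for ecs in groups.values() if len(ecs) == 1)


# Hypothetical toy family: two tight groups and one outlier.
embeddings = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
              (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
              (10.0, 10.0)]
ec = ["1.1.1.1"] * 3 + ["2.7.11.1"] * 3 + ["3.4.21.4"]

labels = dbscan(embeddings, eps=0.5, min_samples=2)
print(labels)                           # [0, 0, 0, 1, 1, 1, -1]
print(count_pure_clusters(labels, ec))  # 2
```

In this toy case the unsplit family mixes three EC numbers, while the two DBSCAN clusters are each pure and the inconsistent member is flagged as an outlier.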

6.

SARS-CoV-2 structural coverage map reveals viral protein assembly, mimicry, and hijacking mechanisms.

O'Donoghue, Seán I; Schafferhans, Andrea; Sikta, Neblina; Stolte, Christian; Kaur, Sandeep; Ho, Bosco K; Anderson, Stuart; Procter, James B; Dallago, Christian; Bordin, Nicola; Adcock, Matt; Rost, Burkhard.

Mol Syst Biol ; 17(9): e10079, 2021 09.

Article in English | MEDLINE | ID: mdl-34519429

ABSTRACT

We modeled 3D structures of all SARS-CoV-2 proteins, generating 2,060 models that span 69% of the viral proteome and provide details not available elsewhere. We found that ~6% of the proteome mimicked human proteins, while ~7% was implicated in hijacking mechanisms that reverse post-translational modifications, block host translation, and disable host defenses; a further ~29% self-assembled into heteromeric states that provided insight into how the viral replication and translation complex forms. To make these 3D models more accessible, we devised a structural coverage map, a novel visualization method to show what is, and is not, known about the 3D structure of the viral proteome. We integrated the coverage map into an accompanying online resource (https://aquaria.ws/covid) that can be used to find and explore models corresponding to the 79 structural states identified in this work. The resulting Aquaria-COVID resource helps scientists use emerging structural data to understand the mechanisms underlying coronavirus infection and draws attention to the 31% of the viral proteome that remains structurally unknown or dark.

Subjects

Angiotensin-Converting Enzyme 2/metabolism; Host-Pathogen Interactions/genetics; Protein Processing, Post-Translational; SARS-CoV-2/metabolism; Spike Glycoprotein, Coronavirus/metabolism; Amino Acid Transport Systems, Neutral/chemistry; Amino Acid Transport Systems, Neutral/genetics; Amino Acid Transport Systems, Neutral/metabolism; Angiotensin-Converting Enzyme 2/chemistry; Angiotensin-Converting Enzyme 2/genetics; Binding Sites; COVID-19/genetics; COVID-19/metabolism; COVID-19/virology; Computational Biology/methods; Coronavirus Envelope Proteins/chemistry; Coronavirus Envelope Proteins/genetics; Coronavirus Envelope Proteins/metabolism; Coronavirus Nucleocapsid Proteins/chemistry; Coronavirus Nucleocapsid Proteins/genetics; Coronavirus Nucleocapsid Proteins/metabolism; Humans; Mitochondrial Membrane Transport Proteins/chemistry; Mitochondrial Membrane Transport Proteins/genetics; Mitochondrial Membrane Transport Proteins/metabolism; Mitochondrial Precursor Protein Import Complex Proteins; Models, Molecular; Molecular Mimicry; Neuropilin-1/chemistry; Neuropilin-1/genetics; Neuropilin-1/metabolism; Phosphoproteins/chemistry; Phosphoproteins/genetics; Phosphoproteins/metabolism; Protein Binding; Protein Conformation, alpha-Helical; Protein Conformation, beta-Strand; Protein Interaction Domains and Motifs; Protein Interaction Mapping/methods; Protein Multimerization; SARS-CoV-2/chemistry; SARS-CoV-2/genetics; Spike Glycoprotein, Coronavirus/chemistry; Spike Glycoprotein, Coronavirus/genetics; Viral Matrix Proteins/chemistry; Viral Matrix Proteins/genetics; Viral Matrix Proteins/metabolism; Viroporin Proteins/chemistry; Viroporin Proteins/genetics; Viroporin Proteins/metabolism; Virus Replication

7.

Pathway Commons 2019 Update: integration, analysis and exploration of pathway data.

Rodchenkov, Igor; Babur, Ozgun; Luna, Augustin; Aksoy, Bulent Arman; Wong, Jeffrey V; Fong, Dylan; Franz, Max; Siper, Metin Can; Cheung, Manfred; Wrana, Michael; Mistry, Harsh; Mosier, Logan; Dlin, Jonah; Wen, Qizhi; O'Callaghan, Caitlin; Li, Wanxin; Elder, Geoffrey; Smith, Peter T; Dallago, Christian; Cerami, Ethan; Gross, Benjamin; Dogrusoz, Ugur; Demir, Emek; Bader, Gary D; Sander, Chris.

Nucleic Acids Res ; 48(D1): D489-D497, 2020 01 08.

Article in English | MEDLINE | ID: mdl-31647099

ABSTRACT

Pathway Commons (https://www.pathwaycommons.org) is an integrated resource of publicly available information about biological pathways including biochemical reactions, assembly of biomolecular complexes, transport and catalysis events and physical interactions involving proteins, DNA, RNA, and small molecules (e.g. metabolites and drug compounds). Data is collected from multiple providers in standard formats, including the Biological Pathway Exchange (BioPAX) language and the Proteomics Standards Initiative Molecular Interactions format, and then integrated. Pathway Commons provides biologists with (i) tools to search this comprehensive resource, (ii) a download site offering integrated bulk sets of pathway data (e.g. tables of interactions and gene sets), (iii) reusable software libraries for working with pathway information in several programming languages (Java, R, Python and JavaScript) and (iv) a web service for programmatically querying the entire dataset. Visualization of pathways is supported using the Systems Biology Graphical Notation (SBGN). Pathway Commons currently contains data from 22 databases with 4794 detailed human biochemical processes (i.e. pathways) and ~2.3 million interactions. To enhance the usability of this large resource for end-users, we develop and maintain interactive web applications and training materials that enable pathway exploration and advanced analysis.

Subjects

Databases, Factual; Metabolic Networks and Pathways; Software; Genome, Human; Genomics/methods; Humans; Metabolomics/methods

8.

The EVcouplings Python framework for coevolutionary sequence analysis.

Hopf, Thomas A; Green, Anna G; Schubert, Benjamin; Mersmann, Sophia; Schärfe, Charlotta P I; Ingraham, John B; Toth-Petroczy, Agnes; Brock, Kelly; Riesselman, Adam J; Palmedo, Perry; Kang, Chan; Sheridan, Robert; Draizen, Eli J; Dallago, Christian; Sander, Chris; Marks, Debora S.

Bioinformatics ; 35(9): 1582-1584, 2019 05 01.

Article in English | MEDLINE | ID: mdl-30304492

ABSTRACT

SUMMARY: Coevolutionary sequence analysis has become a commonly used technique for de novo prediction of the structure and function of proteins, RNA, and protein complexes. We present the EVcouplings framework, a fully integrated open-source application and Python package for coevolutionary analysis. The framework enables generation of sequence alignments, calculation and evaluation of evolutionary couplings (ECs), and de novo prediction of structure and mutation effects. The combination of an easy to use, flexible command line interface and an underlying modular Python package makes the full power of coevolutionary analyses available to entry-level and advanced users. AVAILABILITY AND IMPLEMENTATION: https://github.com/debbiemarkslab/evcouplings.

Subjects

Sequence Analysis; Software; Proteins; RNA; Sequence Alignment

9.

Modeling aspects of the language of life through transfer-learning protein sequences.

Heinzinger, Michael; Elnaggar, Ahmed; Wang, Yu; Dallago, Christian; Nechaev, Dmitrii; Matthes, Florian; Rost, Burkhard.

BMC Bioinformatics ; 20(1): 723, 2019 Dec 17.

Article in English | MEDLINE | ID: mdl-31847804

ABSTRACT

BACKGROUND: Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both these problems are addressed by the new methodology introduced here. RESULTS: We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information and for some proteins even beat the best. Thus, they prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, i.e. microbiome or metaproteome analysis. CONCLUSION: Transfer-learning succeeded in extracting information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available on the level of a single sequence.
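
Two of the figures in this abstract are easy to make concrete: the per-residue Q3 accuracy and the speed-up implied by the quoted timings. The sketch below is an illustration only; the function name, the three-state alphabet (H = helix, E = strand, C = other) and the toy strings are assumptions, and the speed-up is plain arithmetic on the abstract's "about two minutes" versus 0.03 s.

```python
def q3_accuracy(observed, predicted):
    """Percentage of residues whose three-state label (H, E or C) is predicted correctly."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty label strings of equal length")
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example with one wrong residue.
print(q3_accuracy("HHHHEEEECC", "HHHCEEEECC"))  # 90.0

# Rough speed-up implied by the abstract: ~120 s (HHblits) vs 0.03 s (SeqVec) per protein.
hhblits_seconds = 2 * 60
seqvec_seconds = 0.03
print(round(hhblits_seconds / seqvec_seconds))  # 4000
```

The roughly 4000-fold figure is only as precise as the "about two minutes" it is derived from.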

Subjects

Machine Learning; Amino Acid Sequence; Computational Biology/methods; Databases, Nucleic Acid; Databases, Protein; Natural Language Processing; Neural Networks, Computer; Proteins/chemistry; Proteomics/methods; Sequence Analysis

10.

From sequence to function through structure: Deep learning for protein design.

Ferruz, Noelia; Heinzinger, Michael; Akdel, Mehmet; Goncearenco, Alexander; Naef, Luca; Dallago, Christian.

Comput Struct Biotechnol J ; 21: 238-250, 2023.

Article in English | MEDLINE | ID: mdl-36544476

ABSTRACT

The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields, to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications for the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that allows going from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.

11.

Structural Analysis of Genomic and Proteomic Signatures Reveal Dynamic Expression of Intrinsically Disordered Regions in Breast Cancer and Tissue.

Zatorski, Nicole; Sun, Yifei; Elmas, Abdulkadir; Dallago, Christian; Karl, Timothy; Stein, David; Rost, Burkhard; Huang, Kuan-Lin; Walsh, Martin; Schlessinger, Avner.

bioRxiv ; 2023 Feb 24.

Article in English | MEDLINE | ID: mdl-36865220

ABSTRACT

Structural features of proteins capture underlying information about protein evolution and function, which enhances the analysis of proteomic and transcriptomic data. Here we develop Structural Analysis of Gene and protein Expression Signatures (SAGES), a method that describes expression data using features calculated from sequence-based prediction methods and 3D structural models. We used SAGES, along with machine learning, to characterize tissues from healthy individuals and those with breast cancer. We analyzed gene expression data from 23 breast cancer patients and genetic mutation data from the COSMIC database as well as 17 breast tumor protein expression profiles. We identified prominent expression of intrinsically disordered regions in breast cancer proteins as well as relationships between drug perturbation signatures and breast cancer disease signatures. Our results suggest that SAGES is generally applicable to describe diverse biological phenomena including disease states and drug effects.

12.

LambdaPP: Fast and accessible protein-specific phenotype predictions.

Olenyi, Tobias; Marquet, Céline; Heinzinger, Michael; Kröger, Benjamin; Nikolova, Tiha; Bernhofer, Michael; Sändig, Philip; Schütze, Konstantin; Littmann, Maria; Mirdita, Milot; Steinegger, Martin; Dallago, Christian; Rost, Burkhard.

Protein Sci ; 32(1): e4524, 2023 01.

Article in English | MEDLINE | ID: mdl-36454227

ABSTRACT

The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins is revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, the first internet server making AI protein predictions available in 1992. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location), and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP (leveraging ColabFold and computed in minutes) is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the pLM ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use under embed.predictprotein.org; the interactive results for the case study can be found under https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the academic free use license (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) Python package, or Docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.

Subjects

Artificial Intelligence; Proteins; Proteins/chemistry; Amino Acid Sequence; Protein Structure, Secondary; Sequence Alignment; Software

13.

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.

Elnaggar, Ahmed; Heinzinger, Michael; Dallago, Christian; Rehawi, Ghalia; Wang, Yu; Jones, Llion; Gibbs, Tom; Feher, Tamas; Angerer, Christoph; Steinegger, Martin; Bhowmik, Debsindhu; Rost, Burkhard.

IEEE Trans Pattern Anal Mach Intell ; 44(10): 7112-7127, 2022 10.

Article in English | MEDLINE | ID: mdl-34232869

ABSTRACT

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
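
The abstract distinguishes per-residue (per-token) predictions from per-protein predictions obtained by pooling. The sketch below shows one common way to pool, averaging the residue vectors into a single fixed-length vector; mean pooling is an assumption here (the abstract says only "pooling"), and the function name and the tiny 2-dimensional vectors are hypothetical stand-ins for real pLM embeddings.

```python
def mean_pool(residue_embeddings):
    """Collapse an L x d list of per-residue vectors into one d-dimensional per-protein vector."""
    if not residue_embeddings:
        raise ValueError("at least one residue embedding is required")
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[k] for vec in residue_embeddings) / length for k in range(dim)]


# Hypothetical embeddings for a 3-residue protein (d = 2).
per_residue = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(per_residue))  # [3.0, 4.0]
```

Because the output length depends only on the embedding dimension, proteins of any length map to vectors of the same size, which is what allows a simple classifier to consume them for tasks such as localization.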

Subjects

Algorithms; Natural Language Processing; Computational Biology/methods; Proteins/chemistry; Supervised Machine Learning

14.

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.

Zvyagin, Maxim; Brace, Alexander; Hippe, Kyle; Deng, Yuntian; Zhang, Bin; Bohorquez, Cindy Orozco; Clyde, Austin; Kale, Bharat; Perez-Rivera, Danilo; Ma, Heng; Mann, Carla M; Irvin, Michael; Pauloski, J Gregory; Ward, Logan; Hayot-Sasson, Valerie; Emani, Murali; Foreman, Sam; Xie, Zhen; Lin, Diangen; Shukla, Maulik; Nie, Weili; Romero, Josh; Dallago, Christian; Vahdat, Arash; Xiao, Chaowei; Gibbs, Thomas; Foster, Ian; Davis, James J; Papka, Michael E; Brettin, Thomas; Stevens, Rick; Anandkumar, Anima; Vishwanath, Venkatram; Ramanathan, Arvind.

bioRxiv ; 2022 Nov 23.

Article in English | MEDLINE | ID: mdl-36451881

ABSTRACT

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.

15.

A roadmap for the functional annotation of protein families: a community perspective.

de Crécy-Lagard, Valérie; Amorin de Hegedus, Rocio; Arighi, Cecilia; Babor, Jill; Bateman, Alex; Blaby, Ian; Blaby-Haas, Crysten; Bridge, Alan J; Burley, Stephen K; Cleveland, Stacey; Colwell, Lucy J; Conesa, Ana; Dallago, Christian; Danchin, Antoine; de Waard, Anita; Deutschbauer, Adam; Dias, Raquel; Ding, Yousong; Fang, Gang; Friedberg, Iddo; Gerlt, John; Goldford, Joshua; Gorelik, Mark; Gyori, Benjamin M; Henry, Christopher; Hutinet, Geoffrey; Jaroch, Marshall; Karp, Peter D; Kondratova, Liudmyla; Lu, Zhiyong; Marchler-Bauer, Aron; Martin, Maria-Jesus; McWhite, Claire; Moghe, Gaurav D; Monaghan, Paul; Morgat, Anne; Mungall, Christopher J; Natale, Darren A; Nelson, William C; O'Donoghue, Seán; Orengo, Christine; O'Toole, Katherine H; Radivojac, Predrag; Reed, Colbie; Roberts, Richard J; Rodionov, Dmitri; Rodionova, Irina A; Rudolf, Jeffrey D; Saleh, Lana; Sheynkman, Gloria.

Database (Oxford) ; 2022, 2022 08 12.

Article in English | MEDLINE | ID: mdl-35961013

ABSTRACT

Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.

Subjects

Genomics , Proteins , Base Sequence , Computational Biology , Genome , Molecular Sequence Annotation

16.

Protein matchmaking through representation learning.

Heinzinger, Michael; Dallago, Christian; Rost, Burkhard.

Cell Syst ; 12(10): 948-950, 2021 10 20.

Article in English | MEDLINE | ID: mdl-34672956

ABSTRACT

Sledzieski, Singh, Cowen, and Berger employ representation learning to predict protein interactions and associations, additionally identifying binding residues between protein pairs. Generalizability is showcased by training on one organism while evaluating on others. The work exemplifies how transfer of AI-learned representations can advance knowledge in molecular biology.

Subjects

Knowledge , Machine Learning

17.

Light attention predicts protein location from the language of life.

Stärk, Hannes; Dallago, Christian; Heinzinger, Michael; Rost, Burkhard.

Bioinform Adv ; 1(1): vbab035, 2021.

Article in English | MEDLINE | ID: mdl-36700108

ABSTRACT

Summary: Although knowing where a protein functions in a cell is important to characterize biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expert-designed input features leveraging information from multiple sequence alignments (MSAs) that is resource expensive to generate. Here, we showcased using embeddings from protein language models for competitive localization prediction without MSAs. Our lightweight deep neural network architecture used a softmax weighted aggregation mechanism with linear complexity in sequence length referred to as light attention. The method significantly outperformed the state-of-the-art (SOTA) for 10 localization classes by about 8 percentage points (Q10). So far, this might be the highest improvement of just embeddings over MSAs. Our new test set highlighted the limits of standard static datasets: while inviting new models, they might not suffice to claim improvements over the SOTA.

Availability and implementation: The novel models are available as a web-service at http://embed.protein.properties. Code needed to reproduce results is provided at https://github.com/HannesStark/protein-localization. Predictions for the human proteome are available at https://zenodo.org/record/5047020.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.
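
The softmax weighted aggregation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the published model derives its scores and values from the embeddings with learned layers, which are omitted here, and the function name and list-of-lists layout are assumptions made for illustration. The sketch only shows how a softmax over the sequence axis pools per-residue features into one fixed-size vector at a cost linear in sequence length.

```python
import math
from typing import List

Matrix = List[List[float]]  # shape: (sequence_length, feature_dim)


def softmax_weighted_pool(scores: Matrix, values: Matrix) -> List[float]:
    """Pool per-residue values into a single fixed-size vector.

    For each feature dimension, a softmax over the sequence axis turns the
    scores into weights that sum to 1; the output is the weighted sum of
    the values. The work grows linearly with sequence length.
    """
    if not scores or len(scores) != len(values):
        raise ValueError("scores and values must be non-empty and equally long")
    dim = len(scores[0])
    pooled = []
    for j in range(dim):
        column = [row[j] for row in scores]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in column]
        total = sum(exps)
        pooled.append(sum(e / total * row[j] for e, row in zip(exps, values)))
    return pooled
```

With equal scores the pooling reduces to a plain average over residues; a residue with a much larger score dominates the pooled value for that feature.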

18.

Embeddings from deep learning transfer GO annotations beyond homology.

Littmann, Maria; Heinzinger, Michael; Dallago, Christian; Olenyi, Tobias; Rost, Burkhard.

Sci Rep ; 11(1): 1160, 2021 01 13.

Article in English | MEDLINE | ID: mdl-33441905

ABSTRACT

Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.

Subjects

Computational Biology/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Amino Acids/chemistry , Deep Learning , Gene Ontology , Humans , Machine Learning , Molecular Sequence Annotation/methods , Proteins/chemistry , Sequence Homology, Amino Acid , Software
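
Annotation transfer by proximity in embedding space, as described in the abstract above, can be sketched as a nearest-neighbour lookup. This is an illustration under stated assumptions, not the published method: the Euclidean distance, the use of a single nearest neighbour, the function name and the dictionary layout are all choices made here, and real embeddings would come from a protein language model such as SeqVec rather than hand-written vectors.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical lookup entry: (embedding vector, set of annotated GO terms)
Entry = Tuple[List[float], Set[str]]


def transfer_go_terms(
    query: List[float], lookup: Dict[str, Entry]
) -> Tuple[str, Set[str], float]:
    """Copy GO terms from the annotated protein closest to the query.

    Returns the identifier of the nearest protein, its GO terms and the
    distance to it, so callers can judge how far the transfer reached.
    """
    if not lookup:
        raise ValueError("lookup must contain at least one annotated protein")
    # Assumption: Euclidean distance measures proximity in embedding space.
    best_id = min(lookup, key=lambda pid: math.dist(query, lookup[pid][0]))
    vector, terms = lookup[best_id]
    return best_id, set(terms), math.dist(query, vector)
```

Returning the distance alongside the terms lets a caller apply a cut-off and reject transfers from neighbours that are too far away.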

19.

Protein embeddings and deep learning predict binding residues for various ligand classes.

Littmann, Maria; Heinzinger, Michael; Dallago, Christian; Weissenow, Konstantin; Rost, Burkhard.

Sci Rep ; 11(1): 23916, 2021 12 13.

Article in English | MEDLINE | ID: mdl-34903827

ABSTRACT

One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress many binding sites remain obscure. Here, we proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model (pLM) ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed MSA-based predictions. Combination with homology-based inference increased performance to F1 = 48 ± 3% (95% CI) and MCC = 0.46 ± 0.04 when merging all three ligand classes into one. All results were confirmed by three independent data sets. Focusing on very reliably predicted residues could complement experimental evidence: For the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when ignoring the problem of missing experimental annotations. The new method bindEmbed21 is fast, simple, and broadly applicable, using neither structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding and predicted about 6% of all residues as binding to metal ions, nucleic acids, or small molecules.

Subjects

Deep Learning , Molecular Docking Simulation/methods , Sequence Analysis, Protein/methods , Binding Sites , Ligands , Metals/chemistry , Nucleic Acids/chemistry , Protein Binding , Protein Conformation , Software
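
The abstract above reports performance as F1 and MCC. For readers unfamiliar with these measures, the sketch below computes both from the four counts of a binary confusion matrix using their standard definitions. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made here, and the counts in the usage note are invented for illustration, not taken from the paper.

```python
import math
from typing import Tuple


def f1_and_mcc(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Return (F1, MCC) for a binary prediction task.

    tp/fp/fn/tn are the counts of true positives, false positives,
    false negatives and true negatives, e.g. per-residue binding calls.
    """
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Assumption: report 0.0 when MCC is undefined (a zero denominator).
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return f1, mcc
```

For example, `f1_and_mcc(5, 5, 5, 85)` gives an F1 of 0.5 and an MCC of about 0.44. Unlike F1, MCC also takes the true negatives into account, which matters when binding residues are a small minority of all residues.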

20.

Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets.

Dallago, Christian; Schütze, Konstantin; Heinzinger, Michael; Olenyi, Tobias; Littmann, Maria; Lu, Amy X; Yang, Kevin K; Min, Seonwoo; Yoon, Sungroh; Morton, James T; Rost, Burkhard.

Curr Protoc ; 1(5): e113, 2021 May.

Article in English | MEDLINE | ID: mdl-33961736

ABSTRACT

Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time with respect to previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.

The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline

Subjects

Artificial Intelligence , Deep Learning , Machine Learning , Natural Language Processing , Proteins
