Search | VHL Regional Portal

1.

Artificial Intelligence Learns Protein Prediction.

Heinzinger, Michael; Rost, Burkhard.

Cold Spring Harb Perspect Biol ; 2024 Jun 10.

Article in English | MEDLINE | ID: mdl-38858069

ABSTRACT

From AlphaGo through Stable Diffusion to ChatGPT, the recent decade of exponential advances in artificial intelligence (AI) has been altering life. In parallel, advances in computational biology are beginning to decode the language of life: AlphaFold2 leaped forward in protein structure prediction, and protein language models (pLMs) replaced expertise and evolutionary information from multiple sequence alignments with information learned from recurring patterns in databases of billions of proteins without experimental annotations other than the amino acid sequences. None of those tools could have been developed 10 years ago; all will increase the wealth of experimental data and speed up the cycle from idea to proof. AI is affecting molecular and medical biology in giant steps, and the most important might be the leap toward more powerful protein design.

2.

Protein embeddings predict binding residues in disordered regions.

Jahn, Laura R; Marquet, Céline; Heinzinger, Michael; Rost, Burkhard.

Sci Rep ; 14(1): 13566, 2024 06 12.

Article in English | MEDLINE | ID: mdl-38866950

ABSTRACT

The identification of protein binding residues helps to understand their biological processes as protein function is often defined through ligand binding, such as to other proteins, small molecules, ions, or nucleotides. Methods predicting binding residues often err for intrinsically disordered proteins or regions (IDPs/IDPRs), often also referred to as molecular recognition features (MoRFs). Here, we presented a novel machine learning (ML) model trained to specifically predict binding regions in IDPRs. The proposed model, IDBindT5, leveraged embeddings from the protein language model (pLM) ProtT5 to reach a balanced accuracy of 57.2 ± 3.6% (95% confidence interval). Assessed on the same data set, this did not differ at the 95% CI from the state-of-the-art (SOTA) methods ANCHOR2 and DeepDISOBind that rely on expert-crafted features and evolutionary information from multiple sequence alignments (MSAs). Assessed on other data, methods such as SPOT-MoRF reached higher MCCs. IDBindT5's SOTA predictions are much faster than other methods, easily enabling full-proteome analyses. Our findings emphasize the potential of pLMs as a promising approach for exploring and predicting features of disordered proteins. The model and a comprehensive manual are publicly available at https://github.com/jahnl/binding_in_disorder .

Subject(s)

Intrinsically Disordered Proteins , Machine Learning , Protein Binding , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/metabolism , Binding Sites , Computational Biology/methods , Databases, Protein , Humans
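
The balanced accuracy reported for IDBindT5 in the abstract above can be illustrated with a short sketch. It assumes the standard binary definition (the mean of sensitivity and specificity); the abstract does not spell out the formula, and the function name and toy labels below are hypothetical, not taken from the IDBindT5 repository.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = binding, 0 = non-binding).

    Assumption: standard definition; the abstract does not state the exact formula used.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of binding residues recovered
    specificity = tn / (tn + fp)  # fraction of non-binding residues recovered
    return (sensitivity + specificity) / 2


# Toy per-residue labels (made up for illustration only).
toy_true = [1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1]
toy_score = balanced_accuracy(toy_true, toy_pred)  # sensitivity 0.5, specificity 0.75
```

Unlike plain accuracy, this measure is not inflated by a large majority class, which is presumably relevant because binding residues are usually much rarer than non-binding ones.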

3.

Critical assessment of missense variant effect predictors on disease-relevant variant data.

Rastogi, Ruchir; Chung, Ryan; Li, Sindy; Li, Chang; Lee, Kyoungyeul; Woo, Junwoo; Kim, Dong-Wook; Keum, Changwon; Babbi, Giulia; Martelli, Pier Luigi; Savojardo, Castrense; Casadio, Rita; Chennen, Kirsley; Weber, Thomas; Poch, Olivier; Ancien, François; Cia, Gabriel; Pucci, Fabrizio; Raimondi, Daniele; Vranken, Wim; Rooman, Marianne; Marquet, Céline; Olenyi, Tobias; Rost, Burkhard; Andreoletti, Gaia; Kamandula, Akash; Peng, Yisu; Bakolitsa, Constantina; Mort, Matthew; Cooper, David N; Bergquist, Timothy; Pejaver, Vikas; Liu, Xiaoming; Radivojac, Predrag; Brenner, Steven E; Ioannidis, Nilah M.

bioRxiv ; 2024 Jun 08.

Article in English | MEDLINE | ID: mdl-38895200

ABSTRACT

Regular, systematic, and independent assessment of computational tools used to predict the pathogenicity of missense variants is necessary to evaluate their clinical and research utility and suggest directions for future improvement. Here, as part of the sixth edition of the Critical Assessment of Genome Interpretation (CAGI) challenge, we assess missense variant effect predictors (or variant impact predictors) on an evaluation dataset of rare missense variants from disease-relevant databases. Our assessment evaluates predictors submitted to the CAGI6 Annotate-All-Missense challenge, predictors commonly used by the clinical genetics community, and recently developed deep learning methods for variant effect prediction. To explore a variety of settings that are relevant for different clinical and research applications, we assess performance within different subsets of the evaluation data and within high-specificity and high-sensitivity regimes. We find strong performance of many predictors across multiple settings. Meta-predictors tend to outperform their constituent individual predictors; however, several individual predictors have performance similar to that of commonly used meta-predictors. The relative performance of predictors differs in high-specificity and high-sensitivity regimes, suggesting that different methods may be best suited to different use cases. We also characterize two potential sources of bias. Predictors that incorporate allele frequency as a predictive feature tend to have reduced performance when distinguishing pathogenic variants from very rare benign variants, and predictors supervised on pathogenicity labels from curated variant databases often learn label imbalances within genes. 
Overall, we find notable advances over the oldest and most cited missense variant effect predictors and continued improvements among the most recently developed tools, and the CAGI Annotate-All-Missense challenge (also termed the Missense Marathon) will continue to assess state-of-the-art methods as the field progresses. Together, our results help illuminate the current clinical and research utility of missense variant effect predictors and identify potential areas for future development.

4.

Rendering protein mutation movies with MutAmore.

Weissenow, Konstantin; Rost, Burkhard.

BMC Bioinformatics ; 24(1): 469, 2023 Dec 12.

Article in English | MEDLINE | ID: mdl-38087198

ABSTRACT

BACKGROUND: The success of AlphaFold2 in reliable protein three-dimensional (3D) structure prediction assists the move of structural biology toward studies of protein dynamics and mutational impact on structure and function. This transition needs tools that qualitatively assess alternative 3D conformations. RESULTS: We introduce MutAmore, a bioinformatics tool that renders individual images of protein 3D structures for, e.g., sequence mutations into a visually intuitive movie format. MutAmore streamlines a pipeline casting single amino-acid variations (SAVs) into a dynamic 3D mutation movie providing a qualitative perspective on the mutational landscape of a protein. By default, the tool first generates all possible variants of the sequence reachable through SAVs (L*19 for proteins with L residues). Next, it predicts the structural conformation for all L*19 variants using state-of-the-art models. Finally, it visualizes the mutation matrix and produces a color-coded 3D animation. Alternatively, users can input other types of variants, e.g., from experimental structures. CONCLUSION: MutAmore samples alternative protein configurations to study the dynamical space accessible from SAVs in the post-AlphaFold2 era of structural biology. As the field shifts towards the exploration of alternative conformations of proteins, MutAmore aids in the understanding of the structural impact of mutations by providing a flexible pipeline for the generation of protein mutation movies using current and future structure prediction models.

Subject(s)

Motion Pictures , Proteins , Proteins/genetics , Mutation , Amino Acids/genetics , Protein Conformation
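
The first step described in the MutAmore abstract, generating all L*19 single amino-acid variants, can be sketched as follows. This is a minimal illustration and not MutAmore's actual code: the function name, the label format (e.g., "M1A"), and the toy sequence are assumptions.

```python
# Hypothetical sketch; names and label format are illustrative, not MutAmore's API.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def enumerate_savs(sequence):
    """Yield (label, variant_sequence) for every single amino-acid variant."""
    for pos, wild_type in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant == wild_type:
                continue  # skip the unchanged residue, leaving 19 substitutions
            label = f"{wild_type}{pos + 1}{mutant}"  # 1-based position
            yield label, sequence[:pos] + mutant + sequence[pos + 1:]


toy_variants = list(enumerate_savs("MKT"))  # 3 residues * 19 substitutions
```

The L*19 count holds only when every residue is one of the 20 standard amino acids; a non-standard character such as X would yield 20 substitutions at that position in this sketch.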

5.

Alignment-based Protein Mutational Landscape Prediction: Doing More with Less.

Abakarova, Marina; Marquet, Céline; Rera, Michael; Rost, Burkhard; Laine, Elodie.

Genome Biol Evol ; 15(11)2023 Nov 01.

Article in English | MEDLINE | ID: mdl-37936309

ABSTRACT

The wealth of genomic data has boosted the development of computational methods predicting the phenotypic outcomes of missense variants. The most accurate ones exploit multiple sequence alignments, which can be costly to generate. Recent efforts for democratizing protein structure prediction have overcome this bottleneck by leveraging the fast homology search of MMseqs2. Here, we show the usefulness of this strategy for mutational outcome prediction through a large-scale assessment of 1.5M missense variants across 72 protein families. Our study demonstrates the feasibility of producing alignment-based mutational landscape predictions that are both high-quality and compute-efficient for entire proteomes. We provide the community with the whole human proteome mutational landscape and simplified access to our predictive pipeline.

Subject(s)

Computational Biology , Proteins , Humans , Computational Biology/methods , Proteins/chemistry , Genomics , Sequence Alignment , Mutation, Missense

6.

Prevalent bee venom genes evolved before the aculeate stinger and eusociality.

Koludarov, Ivan; Velasque, Mariana; Senoner, Tobias; Timm, Thomas; Greve, Carola; Hamadou, Alexander Ben; Gupta, Deepak Kumar; Lochnit, Günter; Heinzinger, Michael; Vilcinskas, Andreas; Gloag, Rosalyn; Harpur, Brock A; Podsiadlowski, Lars; Rost, Burkhard; Jackson, Timothy N W; Dutertre, Sebastien; Stolle, Eckart; von Reumont, Björn M.

BMC Biol ; 21(1): 229, 2023 10 23.

Article in English | MEDLINE | ID: mdl-37867198

ABSTRACT

BACKGROUND: Venoms, which have evolved numerous times in animals, are ideal models of convergent trait evolution. However, detailed genomic studies of toxin-encoding genes exist for only a few animal groups. The hyper-diverse hymenopteran insects are the most speciose venomous clade, but investigation of the origin of their venom genes has been largely neglected. RESULTS: Utilizing a combination of genomic and proteo-transcriptomic data, we investigated the origin of 11 toxin genes in 29 published and 3 new hymenopteran genomes and compiled an up-to-date list of prevalent bee venom proteins. Observed patterns indicate that bee venom genes predominantly originate through single gene co-option with gene duplication contributing to subsequent diversification. CONCLUSIONS: Most Hymenoptera venom genes are shared by all members of the clade and only melittin and the new venom protein family anthophilin1 appear unique to the bee lineage. Most venom proteins thus predate the mega-radiation of hymenopterans and the evolution of the aculeate stinger.

Subject(s)

Bee Venoms , Bees/genetics , Animals , Gene Expression Profiling , Transcriptome , Genomics , Gene Duplication

7.

The importance of planning CT-based imaging features for machine learning-based prediction of pain response.

Llorián-Salvador, Óscar; Akhgar, Joachim; Pigorsch, Steffi; Borm, Kai; Münch, Stefan; Bernhardt, Denise; Rost, Burkhard; Andrade-Navarro, Miguel A; Combs, Stephanie E; Peeken, Jan C.

Sci Rep ; 13(1): 17427, 2023 10 13.

Article in English | MEDLINE | ID: mdl-37833283

ABSTRACT

Patients suffering from painful spinal bone metastases (PSBMs) often undergo palliative radiation therapy (RT), which is effective in approximately two thirds of patients. In this exploratory investigation, we assessed the effectiveness of machine learning (ML) models trained on radiomics, semantic and clinical features to estimate complete pain response. Gross tumour volumes (GTV) and clinical target volumes (CTV) of 261 PSBMs were segmented on planning computed tomography (CT) scans. Radiomics, semantic and clinical features were collected for all patients. Random forest (RFC) and support vector machine (SVM) classifiers were compared using repeated nested cross-validation. The best radiomics classifier was trained on CTV with an area under the receiver-operator curve (AUROC) of 0.62 ± 0.01 (RFC; 95% confidence interval). The semantic model achieved a comparable AUROC of 0.63 ± 0.01 (RFC), significantly below the clinical model (SVM, AUROC: 0.80 ± 0.01) and slightly lower than the spinal instability neoplastic score (SINS; LR, AUROC: 0.65 ± 0.01). A combined model did not improve performance (AUROC: 0.74 ± 0.01). We could demonstrate that radiomics and semantic analyses of planning CTs allowed for limited prediction of therapy response to palliative RT. ML predictions based on established clinical parameters achieved the best results.

Subject(s)

Neoplasms , Tomography, X-Ray Computed , Humans , ROC Curve , Tomography, X-Ray Computed/methods , Neoplasms/radiotherapy , Machine Learning , Pain , Retrospective Studies
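
The AUROC values compared in the abstract above can be read as the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder. The sketch below computes AUROC through that pairwise definition; it is an illustration under that assumption, with made-up scores, and is not the evaluation code of the study.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    which is acceptable for a small illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up classifier scores for illustration only.
toy_pos = [0.9, 0.6]       # patients with complete pain response
toy_neg = [0.7, 0.2, 0.1]  # patients without complete pain response
toy_auroc = auroc(toy_pos, toy_neg)  # 5 of 6 pairs ranked correctly
```

An AUROC of 0.5 corresponds to random ranking, which puts the reported 0.62 to 0.65 for the imaging-based models in context against 0.80 for the clinical model.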

8.

Domain loss enabled evolution of novel functions in the snake three-finger toxin gene superfamily.

Koludarov, Ivan; Senoner, Tobias; Jackson, Timothy N W; Dashevsky, Daniel; Heinzinger, Michael; Aird, Steven D; Rost, Burkhard.

Nat Commun ; 14(1): 4861, 2023 08 11.

Article in English | MEDLINE | ID: mdl-37567881

ABSTRACT

Three-finger toxins (3FTXs) are a functionally diverse family of toxins, apparently unique to venoms of caenophidian snakes. Although the ancestral function of 3FTXs is antagonism of nicotinic acetylcholine receptors, redundancy conferred by the accumulation of duplicate genes has facilitated extensive neofunctionalization, such that derived members of the family interact with a range of targets. 3FTXs are members of the LY6/UPAR family, but their non-toxin ancestor remains unknown. Combining traditional phylogenetic approaches, manual synteny analysis, and machine learning techniques (including AlphaFold2 and ProtT5), we have reconstructed a detailed evolutionary history of 3FTXs. We identify their immediate ancestor as a non-secretory LY6, unique to squamate reptiles, and propose that changes in molecular ecology resulting from loss of a membrane-anchoring domain and changes in gene expression paved the way for the evolution of one of the most important families of snake toxins.

Subject(s)

Three Finger Toxins , Toxins, Biological , Animals , Phylogeny , Snakes/genetics , Toxins, Biological/genetics , Reptiles , Elapid Venoms/genetics , Evolution, Molecular

9.

Development and Evaluation of MR-Based Radiogenomic Models to Differentiate Atypical Lipomatous Tumors from Lipomas.

Foreman, Sarah C; Llorián-Salvador, Oscar; David, Diana E; Rösner, Verena K N; Rischewski, Jon F; Feuerriegel, Georg C; Kramp, Daniel W; Luiken, Ina; Lohse, Ann-Kathrin; Kiefer, Jurij; Mogler, Carolin; Knebel, Carolin; Jung, Matthias; Andrade-Navarro, Miguel A; Rost, Burkhard; Combs, Stephanie E; Makowski, Marcus R; Woertler, Klaus; Peeken, Jan C; Gersing, Alexandra S.

Cancers (Basel) ; 15(7)2023 Apr 05.

Article in English | MEDLINE | ID: mdl-37046811

ABSTRACT

BACKGROUND: The aim of this study was to develop and validate radiogenomic models to predict the MDM2 gene amplification status and differentiate between ALTs and lipomas on preoperative MR images. METHODS: MR images were obtained in 257 patients diagnosed with ALTs (n = 65) or lipomas (n = 192) using histology and the MDM2 gene analysis as a reference standard. The protocols included T2-, T1-, and fat-suppressed contrast-enhanced T1-weighted sequences. Additionally, 50 patients were obtained from a different hospital for external testing. Radiomic features were selected using mRMR. Using repeated nested cross-validation, the machine-learning models were trained on radiomic features and demographic information. For comparison, the external test set was evaluated by three radiology residents and one attending radiologist. RESULTS: A LASSO classifier trained on radiomic features from all sequences performed best, with an AUC of 0.88, 70% sensitivity, 81% specificity, and 76% accuracy. In comparison, the radiology residents achieved 60-70% accuracy, 55-80% sensitivity, and 63-77% specificity, while the attending radiologist achieved 90% accuracy, 96% sensitivity, and 87% specificity. CONCLUSION: A radiogenomic model combining features from multiple MR sequences showed the best performance in predicting the MDM2 gene amplification status. The model showed a higher accuracy compared to the radiology residents, though lower compared to the attending radiologist.

10.

Structural Analysis of Genomic and Proteomic Signatures Reveal Dynamic Expression of Intrinsically Disordered Regions in Breast Cancer and Tissue.

Zatorski, Nicole; Sun, Yifei; Elmas, Abdulkadir; Dallago, Christian; Karl, Timothy; Stein, David; Rost, Burkhard; Huang, Kuan-Lin; Walsh, Martin; Schlessinger, Avner.

bioRxiv ; 2023 Feb 24.

Article in English | MEDLINE | ID: mdl-36865220

ABSTRACT

Structural features of proteins capture underlying information about protein evolution and function, which enhances the analysis of proteomic and transcriptomic data. Here we develop Structural Analysis of Gene and protein Expression Signatures (SAGES), a method that describes expression data using features calculated from sequence-based prediction methods and 3D structural models. We used SAGES, along with machine learning, to characterize tissues from healthy individuals and those with breast cancer. We analyzed gene expression data from 23 breast cancer patients and genetic mutation data from the COSMIC database as well as 17 breast tumor protein expression profiles. We identified prominent expression of intrinsically disordered regions in breast cancer proteins as well as relationships between drug perturbation signatures and breast cancer disease signatures. Our results suggest that SAGES is generally applicable to describe diverse biological phenomena including disease states and drug effects.

11.

AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms.

Bordin, Nicola; Sillitoe, Ian; Nallapareddy, Vamsi; Rauer, Clemens; Lam, Su Datt; Waman, Vaishali P; Sen, Neeladri; Heinzinger, Michael; Littmann, Maria; Kim, Stephanie; Velankar, Sameer; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Commun Biol ; 6(1): 160, 2023 02 08.

Article in English | MEDLINE | ID: mdl-36755055

ABSTRACT

Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of these, having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increase the number of unique 'global' folds by 36% and will provide valuable insights on structure function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.

Subject(s)

Furylfuramide , Proteins , Humans , Databases, Protein , Proteins/chemistry

12.

CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models.

Nallapareddy, Vamsi; Bordin, Nicola; Sillitoe, Ian; Heinzinger, Michael; Littmann, Maria; Waman, Vaishali P; Sen, Neeladri; Rost, Burkhard; Orengo, Christine.

Bioinformatics ; 39(1)2023 01 01.

Article in English | MEDLINE | ID: mdl-36648327

ABSTRACT

MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available on https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed on https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Proteins , Humans , Sequence Homology, Amino Acid , Proteins/chemistry , Databases, Protein

13.

LambdaPP: Fast and accessible protein-specific phenotype predictions.

Olenyi, Tobias; Marquet, Céline; Heinzinger, Michael; Kröger, Benjamin; Nikolova, Tiha; Bernhofer, Michael; Sändig, Philip; Schütze, Konstantin; Littmann, Maria; Mirdita, Milot; Steinegger, Martin; Dallago, Christian; Rost, Burkhard.

Protein Sci ; 32(1): e4524, 2023 01.

Article in English | MEDLINE | ID: mdl-36454227

ABSTRACT

The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins is revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, the first internet server making AI protein predictions available in 1992. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location), and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP (leveraging ColabFold and computed in minutes) is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the pLM ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use under embed.predictprotein.org; the interactive results for the case study can be found under https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the academic free use license (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) Python package, or docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.

Subject(s)

Artificial Intelligence , Proteins , Proteins/chemistry , Amino Acid Sequence , Protein Structure, Secondary , Sequence Alignment , Software

14.

Novel machine learning approaches revolutionize protein knowledge.

Bordin, Nicola; Dallago, Christian; Heinzinger, Michael; Kim, Stephanie; Littmann, Maria; Rauer, Clemens; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Trends Biochem Sci ; 48(4): 345-359, 2023 04.

Article in English | MEDLINE | ID: mdl-36504138

ABSTRACT

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

Subject(s)

Machine Learning , Proteins , Proteins/chemistry , Computational Biology/methods , Protein Conformation

15.

Nearest neighbor search on embeddings rapidly identifies distant protein relations.

Schütze, Konstantin; Heinzinger, Michael; Steinegger, Martin; Rost, Burkhard.

Front Bioinform ; 2: 1033775, 2022.

Article in English | MEDLINE | ID: mdl-36466147

ABSTRACT

Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as "homology detection") use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges of applying this to multi-domain proteins and presented some ideas for how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others.
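
The idea of a nearest neighbor search on embeddings can be sketched in a few lines. The example below is an illustration only: it assumes one fixed-length vector per protein and cosine similarity as the measure, neither of which is specified in the abstract, and the protein names and three-dimensional toy vectors are made up (real pLM embeddings have far more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, database, k=1):
    """Return the k database entries most similar to `query` as (name, score)."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in database.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:k]]


# Hypothetical per-protein embeddings (toy values).
toy_database = {
    "protA": [1.0, 0.0, 0.0],
    "protB": [0.0, 1.0, 0.0],
    "protC": [0.7, 0.7, 0.0],
}
toy_hits = nearest_neighbors([0.9, 0.1, 0.0], toy_database, k=2)
```

This exhaustive scan compares the query against every database entry; searching databases with millions of proteins would typically require an indexed or approximate search instead.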

16.

Engineering indel and substitution variants of diverse and ancient enzymes using Graphical Representation of Ancestral Sequence Predictions (GRASP).

Foley, Gabriel; Mora, Ariane; Ross, Connie M; Bottoms, Scott; Sützl, Leander; Lamprecht, Marnie L; Zaugg, Julian; Essebier, Alexandra; Balderson, Brad; Newell, Rhys; Thomson, Raine E S; Kobe, Bostjan; Barnard, Ross T; Guddat, Luke; Schenk, Gerhard; Carsten, Jörg; Gumulya, Yosephine; Rost, Burkhard; Haltrich, Dietmar; Sieber, Volker; Gillam, Elizabeth M J; Bodén, Mikael.

PLoS Comput Biol ; 18(10): e1010633, 2022 10.

Article in English | MEDLINE | ID: mdl-36279274

ABSTRACT

Ancestral sequence reconstruction is a technique that is gaining widespread use in molecular evolution studies and protein engineering. Accurate reconstruction requires the ability to handle appropriately large numbers of sequences, as well as insertion and deletion (indel) events, but available approaches exhibit limitations. To address these limitations, we developed Graphical Representation of Ancestral Sequence Predictions (GRASP), which efficiently implements maximum likelihood methods to enable the inference of ancestors of families with more than 10,000 members. GRASP implements partial order graphs (POGs) to represent and infer insertion and deletion events across ancestors, enabling the identification of building blocks for protein engineering. To validate the capacity to engineer novel proteins from realistic data, we predicted ancestor sequences across three distinct enzyme families: glucose-methanol-choline (GMC) oxidoreductases, cytochromes P450, and dihydroxy/sugar acid dehydratases (DHAD). All tested ancestors demonstrated enzymatic activity. Our study demonstrates the ability of GRASP (1) to support large data sets over 10,000 sequences and (2) to employ insertions and deletions to identify building blocks for engineering biologically active ancestors, by exploring variation over evolutionary time.

Subject(s)

Evolution, Molecular , INDEL Mutation , INDEL Mutation/genetics , Proteins/genetics , Biological Evolution , Phylogeny

17.

SETH predicts nuances of residue disorder from protein embeddings.

Ilzhöfer, Dagmar; Heinzinger, Michael; Rost, Burkhard.

Front Bioinform ; 2: 1019597, 2022.

Article in English | MEDLINE | ID: mdl-36304335

ABSTRACT

Predictions for millions of protein three-dimensional structures are only a few clicks away since the release of AlphaFold2 results for UniProt. However, many proteins have so-called intrinsically disordered regions (IDRs) that do not adopt unique structures in isolation. These IDRs are associated with several diseases, including Alzheimer's Disease. We showed that three recent disorder measures of AlphaFold2 predictions (pLDDT, "experimentally resolved" prediction and "relative solvent accessibility") correlated to some extent with IDRs. However, expert methods predict IDRs more reliably by combining complex machine learning models with expert-crafted input features and evolutionary information from multiple sequence alignments (MSAs). MSAs are not always available, especially for IDRs, and are computationally expensive to generate, limiting the scalability of the associated tools. Here, we present the novel method SETH that predicts residue disorder from embeddings generated by the protein Language Model ProtT5, which explicitly only uses single sequences as input. Thereby, our method, relying on a relatively shallow convolutional neural network, outperformed much more complex solutions while being much faster, making it possible to create predictions for the human proteome in about 1 hour on a consumer-grade PC with one NVIDIA GeForce RTX 3060. Trained on a continuous disorder scale (CheZOD scores), our method captured subtle variations in disorder, thereby providing important information beyond the binary classification of most methods. High performance paired with speed revealed that SETH's nuanced disorder predictions for entire proteomes capture aspects of the evolution of organisms. Additionally, SETH could also be used to filter out regions or proteins with probable low-quality AlphaFold2 3D structures to prioritize running the compute-intensive predictions for large data sets. SETH is freely and publicly available at: https://github.com/Rostlab/SETH.

18.

TMbed: transmembrane proteins predicted through language model embeddings.

Bernhofer, Michael; Rost, Burkhard.

BMC Bioinformatics ; 23(1): 326, 2022 Aug 08.

Article in English | MEDLINE | ID: mdl-35941534

ABSTRACT

BACKGROUND: Despite the immense importance of transmembrane proteins (TMP) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4-5 times underrepresented compared to non-TMPs. Today's top methods such as AlphaFold2 accurately predict 3D structures for many TMPs, but annotating transmembrane regions remains a limiting step for proteome-wide predictions. RESULTS: Here, we present TMbed, a novel method inputting embeddings from protein Language Models (pLMs, here ProtT5), to predict for each residue one of four classes: transmembrane helix (TMH), transmembrane strand (TMB), signal peptide, or other. TMbed completes predictions for entire proteomes within hours on a single consumer-grade desktop machine at performance levels similar to or better than those of methods that use evolutionary information from multiple sequence alignments (MSAs) of protein families. On the per-protein level, TMbed correctly identified 94 ± 8% of the beta barrel TMPs (53 of 57) and 98 ± 1% of the alpha helical TMPs (557 of 571) in a non-redundant data set, at false positive rates well below 1% (erred on 30 of 5654 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Our method can handle sequences of up to 4200 residues on standard graphics cards used in desktop PCs (e.g., NVIDIA GeForce RTX 3060). CONCLUSIONS: Based on embeddings from pLMs and two novel filters (Gaussian and Viterbi), TMbed predicts alpha helical and beta barrel TMPs at least as accurately as any other method but at lower false positive rates. Given the few false positives and its outstanding speed, TMbed might be ideal to sieve through millions of 3D structures soon to be predicted, e.g., by AlphaFold2.

Subject(s)

Language , Membrane Proteins , Databases, Protein , Membrane Proteins/chemistry , Protein Conformation, alpha-Helical

19.

Contrastive learning on protein embeddings enlightens midnight zone.

Heinzinger, Michael; Littmann, Maria; Sillitoe, Ian; Bordin, Nicola; Orengo, Christine; Rost, Burkhard.

NAR Genom Bioinform ; 4(2): lqac043, 2022 Jun.

Article in English | MEDLINE | ID: mdl-35702380

ABSTRACT

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
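
The idea of embedding-based annotation transfer (EAT) described in the abstract above can be sketched as a nearest-neighbor lookup. This is a minimal illustration, not the code from the linked repository. It assumes Euclidean distance and uses made-up three-dimensional vectors with placeholder labels; real pLM embeddings have far more dimensions.

```python
import math

# Illustrative sketch only. The distance measure (Euclidean), the toy vectors,
# and the labels are assumptions made for this example.

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotation(query, lookup):
    """Return (label, distance) of the lookup embedding closest to `query`."""
    best_label, best_dist = None, math.inf
    for embedding, label in lookup:
        d = euclidean(query, embedding)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist


# Annotated proteins: (embedding, label). Values are invented.
lookup = [
    ([0.9, 0.1, 0.0], "class A"),
    ([0.0, 1.0, 0.2], "class B"),
    ([0.1, 0.0, 1.0], "class C"),
]
query = [0.1, 0.9, 0.3]  # embedding of an unannotated protein

label, dist = transfer_annotation(query, lookup)
print(label, round(dist, 3))
```

The returned distance could serve as a rough reliability cue: the larger it is, the less the transferred label should be trusted.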

20.

Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction.

Weissenow, Konstantin; Heinzinger, Michael; Rost, Burkhard.

Structure ; 30(8): 1169-1177.e4, 2022 08 04.

Article in English | MEDLINE | ID: mdl-35609601

ABSTRACT

Advanced protein structure prediction requires evolutionary information from multiple sequence alignments (MSAs) from evolutionary couplings that are not always available. Artificial intelligence (AI)-based predictions inputting only single sequences are faster but so inaccurate as to render speed irrelevant. Here, we described a competitive prediction of inter-residue distances (2D structure) exclusively inputting embeddings from pre-trained protein language models (pLMs), namely ProtT5, from single sequences into a convolutional neural network (CNN) with relatively few layers. The major advance used the ProtT5 attention heads. Our new method, EMBER2, which never requires any MSAs, performed similarly to other methods that fully rely on co-evolution. Although clearly not reaching AlphaFold2, our leaner solution came somewhat close at substantially lower costs. By generating protein-specific rather than family-averaged predictions, EMBER2 might better capture some features of particular protein structures. Results from protein engineering and deep mutational scanning (DMS) experiments provided at least a proof of principle for such a speculation.
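
The "inter-residue distances (2D structure)" mentioned in the abstract above refer to a residue-by-residue distance matrix. The sketch below shows how such a matrix, and a binary contact map derived from it, can be computed from known 3D coordinates. It is not the EMBER2 model, which predicts these distances from embeddings. The coordinates are invented, and the 8 Å contact threshold is an assumption chosen for illustration.

```python
import math

# Illustrative sketch only: builds the kind of 2D distance representation a
# predictor would target. Coordinates and the 8.0 threshold are assumptions.

def distance_map(coords):
    """Pairwise Euclidean distances between residue coordinates (n x n list)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(dmap, threshold=8.0):
    """Boolean matrix: True where two residues are closer than `threshold`."""
    return [[d < threshold for d in row] for row in dmap]


# Four toy residues placed 3.8 units apart along a straight line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

dmap = distance_map(coords)
cmap = contact_map(dmap)
for row in dmap:
    print([round(d, 1) for d in row])
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle carries independent information.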

Subject(s)

Computational Biology , Language , Artificial Intelligence , Computational Biology/methods , Proteins/chemistry , Sequence Alignment
