RESUMO
Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.
Assuntos
Aprendizado de Máquina , Proteínas , Proteínas/química , Biologia Computacional/métodos , Conformação ProteicaRESUMO
We show that amino acid covariation in proteins, extracted from the evolutionary sequence record, can be used to fold transmembrane proteins. We use this technique to predict previously unknown 3D structures for 11 transmembrane proteins (with up to 14 helices) from their sequences alone. The prediction method (EVfold_membrane) applies a maximum entropy approach to infer evolutionary covariation in pairs of sequence positions within a protein family and then generates all-atom models with the derived pairwise distance constraints. We benchmark the approach with blinded de novo computation of known transmembrane protein structures from 23 families, demonstrating unprecedented accuracy of the method for large transmembrane proteins. We show how the method can predict oligomerization, functional sites, and conformational changes in transmembrane proteins. With the rapid rise in large-scale sequencing, more accurate and more comprehensive information on evolutionary constraints can be decoded from genetic variation, greatly expanding the repertoire of transmembrane proteins amenable to modeling by this method.
Assuntos
Algoritmos , Proteínas de Membrana/química , Proteínas de Membrana/genética , Sequência de Aminoácidos , Animais , Sequência Conservada , Evolução Molecular , Humanos , Modelos Moleculares , Conformação Proteica , Estrutura Secundária de Proteína , Alinhamento de Sequência , Homologia Estrutural de ProteínaRESUMO
MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available on https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed on https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Proteínas , Humanos , Homologia de Sequência de Aminoácidos , Proteínas/química , Bases de Dados de ProteínasRESUMO
ProteomicsDB (https://www.ProteomicsDB.org) is a multi-omics and multi-organism resource for life science research. In this update, we present our efforts to continuously develop and expand ProteomicsDB. The major focus over the last two years was improving the findability, accessibility, interoperability and reusability (FAIR) of the data as well as its implementation. For this purpose, we release a new application programming interface (API) that provides systematic access to essentially all data in ProteomicsDB. Second, we release a new open-source user interface (UI) and show the advantages the scientific community gains from such software. With the new interface, two new visualizations of protein primary, secondary and tertiary structure as well an updated spectrum viewer were added. Furthermore, we integrated ProteomicsDB with our deep-neural-network Prosit that can predict the fragmentation characteristics and retention time of peptides. The result is an automatic processing pipeline that can be used to reevaluate database search engine results stored in ProteomicsDB. In addition, we extended the data content with experiments investigating different human biology as well as a newly supported organism.
Assuntos
Bases de Dados de Proteínas , Proteínas/classificação , Proteômica/classificação , Software , Disciplinas das Ciências Biológicas , Humanos , Redes Neurais de Computação , Proteínas/químicaRESUMO
BACKGROUND: Venoms, which have evolved numerous times in animals, are ideal models of convergent trait evolution. However, detailed genomic studies of toxin-encoding genes exist for only a few animal groups. The hyper-diverse hymenopteran insects are the most speciose venomous clade, but investigation of the origin of their venom genes has been largely neglected. RESULTS: Utilizing a combination of genomic and proteo-transcriptomic data, we investigated the origin of 11 toxin genes in 29 published and 3 new hymenopteran genomes and compiled an up-to-date list of prevalent bee venom proteins. Observed patterns indicate that bee venom genes predominantly originate through single gene co-option with gene duplication contributing to subsequent diversification. CONCLUSIONS: Most Hymenoptera venom genes are shared by all members of the clade and only melittin and the new venom protein family anthophilin1 appear unique to the bee lineage. Most venom proteins thus predate the mega-radiation of hymenopterans and the evolution of the aculeate stinger.
Assuntos
Venenos de Abelha , Abelhas/genética , Animais , Perfilação da Expressão Gênica , Transcriptoma , Genômica , Duplicação GênicaRESUMO
BACKGROUND: The success of AlphaFold2 in reliable protein three-dimensional (3D) structure prediction, assists the move of structural biology toward studies of protein dynamics and mutational impact on structure and function. This transition needs tools that qualitatively assess alternative 3D conformations. RESULTS: We introduce MutAmore, a bioinformatics tool that renders individual images of protein 3D structures for, e.g., sequence mutations into a visually intuitive movie format. MutAmore streamlines a pipeline casting single amino-acid variations (SAVs) into a dynamic 3D mutation movie providing a qualitative perspective on the mutational landscape of a protein. By default, the tool first generates all possible variants of the sequence reachable through SAVs (L*19 for proteins with L residues). Next, it predicts the structural conformation for all L*19 variants using state-of-the-art models. Finally, it visualizes the mutation matrix and produces a color-coded 3D animation. Alternatively, users can input other types of variants, e.g., from experimental structures. CONCLUSION: MutAmore samples alternative protein configurations to study the dynamical space accessible from SAVs in the post-AlphaFold2 era of structural biology. As the field shifts towards the exploration of alternative conformations of proteins, MutAmore aids in the understanding of the structural impact of mutations by providing a flexible pipeline for the generation of protein mutation movies using current and future structure prediction models.
Assuntos
Filmes Cinematográficos , Proteínas , Proteínas/genética , Mutação , Aminoácidos/genética , Conformação ProteicaRESUMO
Membrane proteins are unique in that they interact with lipid bilayers, making them indispensable for transporting molecules and relaying signals between and across cells. Due to the significance of the protein's functions, mutations often have profound effects on the fitness of the host. This is apparent both from experimental studies, which implicated numerous missense variants in diseases, as well as from evolutionary signals that allow elucidating the physicochemical constraints that intermembrane and aqueous environments bring. In this review, we report on the current state of knowledge acquired on missense variants (referred to as to single amino acid variants) affecting membrane proteins as well as the insights that can be extrapolated from data already available. This includes an overview of the annotations for membrane protein variants that have been collated within databases dedicated to the topic, bioinformatics approaches that leverage evolutionary information in order to shed light on previously uncharacterized membrane protein structures or interaction interfaces, tools for predicting the effects of mutations tailored specifically towards the characteristics of membrane proteins as well as two clinically relevant case studies explaining the implications of mutated membrane proteins in cancer and cardiomyopathy.
Assuntos
Cardiomiopatias/genética , Evolução Molecular , Proteínas de Membrana , Mutação de Sentido Incorreto , Proteínas de Neoplasias , Neoplasias/genética , Substituição de Aminoácidos , Biologia Computacional , Humanos , Proteínas de Membrana/química , Proteínas de Membrana/genética , Proteínas de Neoplasias/química , Proteínas de Neoplasias/genética , Conformação ProteicaRESUMO
Ancestral sequence reconstruction is a technique that is gaining widespread use in molecular evolution studies and protein engineering. Accurate reconstruction requires the ability to handle appropriately large numbers of sequences, as well as insertion and deletion (indel) events, but available approaches exhibit limitations. To address these limitations, we developed Graphical Representation of Ancestral Sequence Predictions (GRASP), which efficiently implements maximum likelihood methods to enable the inference of ancestors of families with more than 10,000 members. GRASP implements partial order graphs (POGs) to represent and infer insertion and deletion events across ancestors, enabling the identification of building blocks for protein engineering. To validate the capacity to engineer novel proteins from realistic data, we predicted ancestor sequences across three distinct enzyme families: glucose-methanol-choline (GMC) oxidoreductases, cytochromes P450, and dihydroxy/sugar acid dehydratases (DHAD). All tested ancestors demonstrated enzymatic activity. Our study demonstrates the ability of GRASP (1) to support large data sets over 10,000 sequences and (2) to employ insertions and deletions to identify building blocks for engineering biologically active ancestors, by exploring variation over evolutionary time.
Assuntos
Evolução Molecular , Mutação INDEL , Mutação INDEL/genética , Proteínas/genética , Evolução Biológica , FilogeniaRESUMO
Since 1992 PredictProtein (https://predictprotein.org) is a one-stop online resource for protein sequence analysis with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability, and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.
Assuntos
Conformação Proteica , Software , Sítios de Ligação , Proteínas do Nucleocapsídeo de Coronavírus/química , Proteínas de Ligação a DNA/química , Fosfoproteínas/química , Estrutura Secundária de Proteína , Proteínas/química , Proteínas/fisiologia , Proteínas de Ligação a RNA/química , Alinhamento de Sequência , Análise de Sequência de ProteínaRESUMO
BACKGROUND: Despite the immense importance of transmembrane proteins (TMP) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4-5 times underrepresented compared to non-TMPs. Today's top methods such as AlphaFold2 accurately predict 3D structures for many TMPs, but annotating transmembrane regions remains a limiting step for proteome-wide predictions. RESULTS: Here, we present TMbed, a novel method inputting embeddings from protein Language Models (pLMs, here ProtT5), to predict for each residue one of four classes: transmembrane helix (TMH), transmembrane strand (TMB), signal peptide, or other. TMbed completes predictions for entire proteomes within hours on a single consumer-grade desktop machine at performance levels similar or better than methods, which are using evolutionary information from multiple sequence alignments (MSAs) of protein families. On the per-protein level, TMbed correctly identified 94 ± 8% of the beta barrel TMPs (53 of 57) and 98 ± 1% of the alpha helical TMPs (557 of 571) in a non-redundant data set, at false positive rates well below 1% (erred on 30 of 5654 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Our method can handle sequences of up to 4200 residues on standard graphics cards used in desktop PCs (e.g., NVIDIA GeForce RTX 3060). CONCLUSIONS: Based on embeddings from pLMs and two novel filters (Gaussian and Viterbi), TMbed predicts alpha helical and beta barrel TMPs at least as accurately as any other method but at lower false positive rates. Given the few false positives and its outstanding speed, TMbed might be ideal to sieve through millions of 3D structures soon to be predicted, e.g., by AlphaFold2.
Assuntos
Idioma , Proteínas de Membrana , Bases de Dados de Proteínas , Proteínas de Membrana/química , Conformação Proteica em alfa-HéliceRESUMO
The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient-MCC-for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA , and PredictProtein.
Assuntos
COVID-19 , SARS-CoV-2 , Algoritmos , Aminoácidos , COVID-19/genética , Humanos , Idioma , Proteoma , SARS-CoV-2/genéticaRESUMO
MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations. RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers, doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generating FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
RESUMO
We modeled 3D structures of all SARS-CoV-2 proteins, generating 2,060 models that span 69% of the viral proteome and provide details not available elsewhere. We found that Ë6% of the proteome mimicked human proteins, while Ë7% was implicated in hijacking mechanisms that reverse post-translational modifications, block host translation, and disable host defenses; a further Ë29% self-assembled into heteromeric states that provided insight into how the viral replication and translation complex forms. To make these 3D models more accessible, we devised a structural coverage map, a novel visualization method to show what is-and is not-known about the 3D structure of the viral proteome. We integrated the coverage map into an accompanying online resource (https://aquaria.ws/covid) that can be used to find and explore models corresponding to the 79 structural states identified in this work. The resulting Aquaria-COVID resource helps scientists use emerging structural data to understand the mechanisms underlying coronavirus infection and draws attention to the 31% of the viral proteome that remains structurally unknown or dark.
Assuntos
Enzima de Conversão de Angiotensina 2/metabolismo , Interações Hospedeiro-Patógeno/genética , Processamento de Proteína Pós-Traducional , SARS-CoV-2/metabolismo , Glicoproteína da Espícula de Coronavírus/metabolismo , Sistemas de Transporte de Aminoácidos Neutros/química , Sistemas de Transporte de Aminoácidos Neutros/genética , Sistemas de Transporte de Aminoácidos Neutros/metabolismo , Enzima de Conversão de Angiotensina 2/química , Enzima de Conversão de Angiotensina 2/genética , Sítios de Ligação , COVID-19/genética , COVID-19/metabolismo , COVID-19/virologia , Biologia Computacional/métodos , Proteínas do Envelope de Coronavírus/química , Proteínas do Envelope de Coronavírus/genética , Proteínas do Envelope de Coronavírus/metabolismo , Proteínas do Nucleocapsídeo de Coronavírus/química , Proteínas do Nucleocapsídeo de Coronavírus/genética , Proteínas do Nucleocapsídeo de Coronavírus/metabolismo , Humanos , Proteínas de Transporte da Membrana Mitocondrial/química , Proteínas de Transporte da Membrana Mitocondrial/genética , Proteínas de Transporte da Membrana Mitocondrial/metabolismo , Proteínas do Complexo de Importação de Proteína Precursora Mitocondrial , Modelos Moleculares , Mimetismo Molecular , Neuropilina-1/química , Neuropilina-1/genética , Neuropilina-1/metabolismo , Fosfoproteínas/química , Fosfoproteínas/genética , Fosfoproteínas/metabolismo , Ligação Proteica , Conformação Proteica em alfa-Hélice , Conformação Proteica em Folha beta , Domínios e Motivos de Interação entre Proteínas , Mapeamento de Interação de Proteínas/métodos , Multimerização Proteica , SARS-CoV-2/química , SARS-CoV-2/genética , Glicoproteína da Espícula de Coronavírus/química , Glicoproteína da Espícula de Coronavírus/genética , Proteínas da Matriz Viral/química , Proteínas da Matriz Viral/genética , Proteínas da Matriz Viral/metabolismo , Proteínas Viroporinas/química , Proteínas Viroporinas/genética , Proteínas Viroporinas/metabolismo , Replicação ViralRESUMO
The native subcellular location (also referred to as localization or cellular compartment) of a protein is the one in which it acts most frequently; it is one aspect of protein function. Do ten eukaryotic model organisms differ in their location spectrum, i.e., the fraction of its proteome in each of seven major cellular compartments? As experimental annotations of locations remain biased and incomplete, we need prediction methods to answer this question. After systematic bias corrections, the complete but faulty prediction methods appeared to be more appropriate to compare location spectra between species than the incomplete more accurate experimental data. This work compared the location spectra for ten eukaryotes: Homo sapiens (human), Gorilla gorilla (gorilla), Pan troglodytes (chimpanzee), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit/vinegar fly), Anopheles gambiae (African malaria mosquito), Caenorhabitis elegans (nematode), Saccharomyces cerevisiae (baker's yeast), and Schizosaccharomyces pombe (fission yeast). The two largest classes were predicted to be the nucleus and the cytoplasm together accounting for 47-62% of all proteins, while 7-21% of the proteins were predicted in the plasma membrane and 4-15% to be secreted. Overall, the predicted location spectra were largely similar. However, in detail, the differences sufficed to plot trees (UPGMA) and 2D (PCA) maps relating the ten organisms using a simple Euclidean distance in seven states (location classes). The relations based on the simple predicted location spectra captured aspects of cross-species comparisons usually revealed only by much more detailed evolutionary comparisons. Most interestingly, known phylogenetic relations were reproduced better by paralog-only than by ortholog-only trees.
Assuntos
Drosophila melanogaster , Proteoma , Animais , Drosophila , Drosophila melanogaster/genética , Camundongos , Filogenia , Proteoma/genética , Ratos , Saccharomyces cerevisiae/genéticaRESUMO
Evaluating the impact of non-synonymous genetic variants is essential for uncovering disease associations and mechanisms of evolution. An in-depth understanding of sequence changes is also fundamental for synthetic protein design and stability assessments. However, the variant effect predictor performance gain observed in recent years has not kept up with the increased complexity of new methods. One likely reason for this might be that most approaches use similar sets of gene and protein features for modeling variant effects, often emphasizing sequence conservation. While high levels of conservation highlight residues essential for protein activity, much of the variation observable in vivo is arguably weaker in its impact, thus requiring evaluation at a higher level of resolution. Here, we describe functionNeutral/Toggle/Rheostatpredictor (funtrp), a novel computational method that categorizes protein positions based on the position-specific expected range of mutational impacts: Neutral (weak/no effects), Rheostat (function-tuning positions), or Toggle (on/off switches). We show that position types do not correlate strongly with familiar protein features such as conservation or protein disorder. We also find that position type distribution varies across different protein functions. Finally, we demonstrate that position types can improve performance of existing variant effect predictors and suggest a way forward for the development of new ones.
Assuntos
Biologia Computacional/métodos , Sequência Conservada/genética , Mutação/genética , Proteínas , Sequência de Aminoácidos/genética , Sequência de Bases/genética , Bases de Dados de Proteínas , Humanos , Modelos Moleculares , Proteínas/química , Proteínas/genética , Relação Estrutura-AtividadeRESUMO
BACKGROUND: Any two unrelated people differ by about 20,000 missense mutations (also referred to as SAVs: Single Amino acid Variants or missense SNV). Many SAVs have been predicted to strongly affect molecular protein function. Common SAVs (> 5% of population) were predicted to have, on average, more effect on molecular protein function than rare SAVs (< 1% of population). We hypothesized that the prevalence of effect in common over rare SAVs might partially be caused by common SAVs more often occurring at interfaces of proteins with other proteins, DNA, or RNA, thereby creating subgroup-specific phenotypes. We analyzed SAVs from 60,706 people through the lens of two prediction methods, one (SNAP2) predicting the effects of SAVs on molecular protein function, the other (ProNA2020) predicting residues in DNA-, RNA- and protein-binding interfaces. RESULTS: Three results stood out. Firstly, SAVs predicted to occur at binding interfaces were predicted to more likely affect molecular function than those predicted as not binding (p value < 2.2 × 10-16). Secondly, for SAVs predicted to occur at binding interfaces, common SAVs were predicted more strongly with effect on protein function than rare SAVs (p value < 2.2 × 10-16). Restriction to SAVs with experimental annotations confirmed all results, although the resulting subsets were too small to establish statistical significance for any result. Thirdly, the fraction of SAVs predicted at binding interfaces differed significantly between tissues, e.g. urinary bladder tissue was found abundant in SAVs predicted at protein-binding interfaces, and reproductive tissues (ovary, testis, vagina, seminal vesicle and endometrium) in SAVs predicted at DNA-binding interfaces. CONCLUSIONS: Overall, the results suggested that residues at protein-, DNA-, and RNA-binding interfaces contributed toward predicting that common SAVs more likely affect molecular function than rare SAVs.
Assuntos
Aminoácidos/genética , Variação Genética , Ácidos Nucleicos/metabolismo , Proteínas/genética , Proteínas/metabolismo , Sequência de Bases , Feminino , Humanos , Substâncias Macromoleculares/metabolismo , Masculino , Modelos Moleculares , Mutação de Sentido Incorreto/genética , Ligação Proteica , Reprodutibilidade dos TestesRESUMO
BACKGROUND: Deep mutational scanning (DMS) studies exploit the mutational landscape of sequence variation by systematically and comprehensively assaying the effect of single amino acid variants (SAVs; also referred to as missense mutations, or non-synonymous Single Nucleotide Variants - missense SNVs or nsSNVs) for particular proteins. We assembled SAV annotations from 22 different DMS experiments and normalized the effect scores to evaluate variant effect prediction methods. Three trained on traditional variant effect data (PolyPhen-2, SIFT, SNAP2), a regression method optimized on DMS data (Envision), and a naïve prediction using conservation information from homologs. RESULTS: On a set of 32,981 SAVs, all methods captured some aspects of the experimental effect scores, albeit not the same. Traditional methods such as SNAP2 correlated slightly more with measurements and better classified binary states (effect or neutral). Envision appeared to better estimate the precise degree of effect. Most surprising was that the simple naïve conservation approach using PSI-BLAST in many cases outperformed other methods. All methods captured beneficial effects (gain-of-function) significantly worse than deleterious (loss-of-function). For the few proteins with multiple independent experimental measurements, experiments differed substantially, but agreed more with each other than with predictions. CONCLUSIONS: DMS provides a new powerful experimental means of understanding the dynamics of the protein sequence space. As always, promising new beginnings have to overcome challenges. While our results demonstrated that DMS will be crucial to improve variant effect prediction methods, data diversity hindered simplification and generalization.
Assuntos
Biologia Computacional/métodos , Proteínas/genética , Área Sob a Curva , Proteína BRCA1/genética , Humanos , Mutação de Sentido Incorreto , Polimorfismo de Nucleotídeo Único , Curva ROC , SoftwareRESUMO
Ancestral sequence reconstruction has had recent success in decoding the origins and the determinants of complex protein functions. However, phylogenetic analyses of remote homologues must handle extreme amino acid sequence diversity resulting from extended periods of evolutionary change. We exploited the wealth of protein structures to develop an evolutionary model based on protein secondary structure. The approach follows the differences between discrete secondary structure states observed in modern proteins and those hypothesized in their immediate ancestors. We implemented maximum likelihood-based phylogenetic inference to reconstruct ancestral secondary structure. The predictive accuracy from the use of the evolutionary model surpasses that of comparative modeling and sequence-based prediction; the reconstruction extracts information not available from modern structures or the ancestral sequences alone. Based on a phylogenetic analysis of a sequence-diverse protein family, we showed that the model can highlight relationships that are evolutionarily rooted in structure and not evident in amino acid-based analysis.
Assuntos
Proteínas Adaptadoras de Transporte Vesicular/química , Proteínas de Bactérias/química , Evolução Molecular , Modelos Estatísticos , Proteínas Adaptadoras de Transporte Vesicular/história , Animais , Bactérias/química , Bactérias/classificação , Bactérias/metabolismo , Proteínas de Bactérias/história , Simulação por Computador , História do Século XXI , História Antiga , Humanos , Mamíferos/classificação , Mamíferos/metabolismo , Filogenia , Plantas/química , Plantas/classificação , Plantas/metabolismo , Estrutura Secundária de ProteínaRESUMO
NLSdb is a database collecting nuclear export signals (NES) and nuclear localization signals (NLS) along with experimentally annotated nuclear and non-nuclear proteins. NES and NLS are short sequence motifs related to protein transport out of and into the nucleus. The updated NLSdb now contains 2253 NLS and introduces 398 NES. The potential sets of novel NES and NLS have been generated by a simple 'in silico mutagenesis' protocol. We started with motifs annotated by experiments. In step 1, we increased specificity such that no known non-nuclear protein matched the refined motif. In step 2, we increased the sensitivity trying to match several different families with a motif. We then iterated over steps 1 and 2. The final set of 2253 NLS motifs matched 35% of 8421 experimentally verified nuclear proteins (up from 21% for the previous version) and none of 18 278 non-nuclear proteins. We updated the web interface providing multiple options to search protein sequences for NES and NLS motifs, and to evaluate your own signal sequences. NLSdb can be accessed via Rostlab services at: https://rostlab.org/services/nlsdb/.
Assuntos
Transporte Ativo do Núcleo Celular/genética , Bases de Dados Genéticas , Anotação de Sequência Molecular , Sinais de Exportação Nuclear/genética , Sinais de Localização Nuclear/química , Interface Usuário-Computador , Sequência de Aminoácidos , Animais , Arabidopsis/genética , Arabidopsis/metabolismo , Caenorhabditis elegans/genética , Caenorhabditis elegans/metabolismo , Núcleo Celular/metabolismo , Conjuntos de Dados como Assunto , Drosophila melanogaster/genética , Drosophila melanogaster/metabolismo , Células Eucarióticas/metabolismo , Humanos , Internet , Camundongos , Sinais de Localização Nuclear/genética , Sinais de Localização Nuclear/metabolismo , Oryza/genética , Oryza/metabolismo , Ratos , Saccharomyces cerevisiae/genética , Saccharomyces cerevisiae/metabolismo , Schizosaccharomyces/genética , Schizosaccharomyces/metabolismoRESUMO
BACKGROUND: Sub-nuclear structures or locations are associated with various nuclear processes. Proteins localized in these substructures are important to understand the interior nuclear mechanisms. Despite advances in high-throughput methods, experimental protein annotations remain limited. Predictions of cellular compartments have become very accurate, largely at the expense of leaving out substructures inside the nucleus making a fine-grained analysis impossible. RESULTS: Here, we present a new method (LocNuclei) that predicts nuclear substructures from sequence alone. LocNuclei used a string-based Profile Kernel with Support Vector Machines (SVMs). It distinguishes sub-nuclear localization in 13 distinct substructures and distinguishes between nuclear proteins confined to the nucleus and those that are also native to other compartments (traveler proteins). High performance was achieved by implicitly leveraging a large biological knowledge-base in creating predictions by homology-based inference through BLAST. Using this approach, the performance reached AUC = 0.70-0.74 and Q13 = 59-65%. Travelling proteins (nucleus and other) were identified at Q2 = 70-74%. A Gene Ontology (GO) analysis of the enrichment of biological processes revealed that the predicted sub-nuclear compartments matched the expected functionality. Analysis of protein-protein interactions (PPI) show that formation of compartments and functionality of proteins in these compartments highly rely on interactions between proteins. This suggested that the LocNuclei predictions carry important information about function. The source code and data sets are available through GitHub: https://github.com/Rostlab/LocNuclei . CONCLUSIONS: LocNuclei predicts subnuclear compartments and traveler proteins accurately. These predictions carry important information about functionality and PPIs.