ABSTRACT
This paper presents an evaluation of predictions submitted for the "HMBS" challenge, a component of the sixth round of the Critical Assessment of Genome Interpretation (CAGI), held in 2021. The challenge required participants to predict the effects of missense variants of the human HMBS gene on yeast growth. The HMBS enzyme, critical for the biosynthesis of heme in eukaryotic cells, is highly conserved among eukaryotes. Despite the application of a variety of algorithms and methods, the performance of predictors was relatively similar, with Kendall's tau correlation coefficients between predictions and experimental scores around 0.3 for a majority of submissions. Notably, the median correlation (≥ 0.34) observed among these predictors, especially the top predictions from different groups, was greater than the correlation observed between their predictions and the actual experimental results. Most predictors were moderately successful in distinguishing between deleterious and benign variants, as evidenced by an area under the receiver operating characteristic (ROC) curve (AUC) of approximately 0.7. Compared with the two most recent rounds of CAGI competitions, more predictors outperformed the baseline predictor, which is based solely on amino acid frequencies. Nevertheless, the overall accuracy of predictions still falls far short of the positive control, which is derived from experimental scores, indicating the need for considerable improvement in the field. The most inaccurately predicted variants in this round were associated with the insertion loop, which is absent in many orthologs, suggesting that predictors still rely heavily on information from multiple sequence alignments.
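As an illustration of the two metrics named in this abstract, the sketch below computes Kendall's tau (the tau-b variant, which handles ties) and the ROC AUC in plain Python. It is not the assessors' code: the function names, the example scores, and the 0.5 cutoff used to label a variant as deleterious are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither term
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical fitness scores for six variants (not challenge data).
experimental = [0.1, 0.9, 0.4, 1.0, 0.2, 0.7]
predicted = [0.3, 0.8, 0.5, 0.6, 0.1, 0.9]

# Assumed cutoff: experimental fitness below 0.5 is treated as deleterious.
deleterious = [score < 0.5 for score in experimental]

print("Kendall tau-b:", kendall_tau_b(experimental, predicted))
# Lower predicted fitness should indicate a deleterious variant, so negate the scores.
print("ROC AUC:", roc_auc(deleterious, [-p for p in predicted]))
```

Kendall's tau depends only on the ordering of the scores, so it does not require predictions to be on the same scale as the experimental measurements.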
ABSTRACT
Predicting the impact of mutations on proteins remains an important problem. As part of the CAGI5 frataxin challenge, we evaluate the accuracy with which Provean, FoldX, and ELASPIC can predict changes in the Gibbs free energy of a protein using a limited data set of eight mutations. We find that different methods have distinct strengths and limitations, with no method being strictly superior to the others on all metrics. ELASPIC achieves the highest accuracy while also providing a web interface that simplifies the evaluation and analysis of mutations. FoldX is slightly less accurate than ELASPIC but is easier to run locally, as it does not depend on external tools or datasets. Provean achieves reasonable results while being computationally less expensive than the other methods and not requiring a structure of the protein. In addition to the methods submitted to the CAGI5 community experiment, and with the aim of informing readers about other high-accuracy methods, we also evaluate predictions made by Rosetta's ddg_monomer protocol, Rosetta's cartesian_ddg protocol, and thermodynamic integration calculations using the Amber package. ELASPIC still achieves the highest accuracy, while Rosetta's cartesian_ddg protocol appears to perform best in capturing the overall trend in the data.
Subject(s)
Computational Biology/methods, Iron-Binding Proteins/chemistry, Iron-Binding Proteins/genetics, Mutation, Humans, Molecular Models, Protein Conformation, Protein Folding, Protein Stability, Thermodynamics, Frataxin
ABSTRACT
Frataxin (FXN) is a highly conserved protein found in prokaryotes and eukaryotes that is required for efficient regulation of cellular iron homeostasis. Experimental evidence associates amino acid substitutions in FXN with Friedreich ataxia, a neurodegenerative disorder. Recently, new thermodynamic experiments have been performed to study the impact of somatic variations identified in cancer tissues on protein stability. The Critical Assessment of Genome Interpretation (CAGI) data provider at the University of Rome measured the unfolding free energy of a set of variants (the FXN challenge data set) with far-UV circular dichroism and intrinsic fluorescence spectra. These values were used to calculate the change in unfolding free energy between the variant and wild-type proteins at zero concentration of denaturant (ΔΔGH2O). The FXN challenge data set, composed of eight amino acid substitutions, was used to evaluate the performance of current computational methods in predicting the ΔΔGH2O value associated with the variants and in classifying them as destabilizing or not destabilizing. For the fifth edition of CAGI, six independent research groups from Asia, Australia, Europe, and North America submitted 12 sets of predictions from different approaches. In this paper, we report the results of our assessment and discuss the limitations of the tested algorithms.
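The quantity described in this abstract can be sketched as a simple difference of unfolding free energies followed by a threshold-based classification. The sign convention (variant minus wild type, so negative values mean lower stability), the -1.0 kcal/mol cutoff, and all numeric values below are assumptions chosen for illustration; the abstract does not state them.

```python
# Assumed cutoff in kcal/mol; the abstract does not give the threshold used.
DESTABILIZING_CUTOFF = -1.0


def ddg_unfolding(dg_variant, dg_wild_type):
    """Change in unfolding free energy, assuming the convention variant minus wild type."""
    return dg_variant - dg_wild_type


def classify(ddg, cutoff=DESTABILIZING_CUTOFF):
    """Label a variant as destabilizing when its ddG falls at or below the cutoff."""
    return "destabilizing" if ddg <= cutoff else "not destabilizing"


# Hypothetical unfolding free energies in kcal/mol (not challenge measurements).
wild_type_dg = 9.0
variant_dgs = {"variant_A": 7.0, "variant_B": 8.7}

for name, dg in variant_dgs.items():
    ddg = ddg_unfolding(dg, wild_type_dg)
    print(f"{name}: ddG = {ddg:+.1f} kcal/mol, {classify(ddg)}")
```

Under the opposite sign convention (wild type minus variant), the comparison in `classify` would need to be flipped.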
Subject(s)
Amino Acid Substitution, Iron-Binding Proteins/chemistry, Iron-Binding Proteins/genetics, Algorithms, Circular Dichroism, Humans, Molecular Models, Protein Conformation, Protein Folding, Protein Stability, Frataxin
ABSTRACT
ELASPIC is a novel ensemble machine-learning approach that predicts the effects of mutations on protein folding and protein-protein interactions. Here, we present the ELASPIC webserver, which makes the ELASPIC pipeline available through a fast and intuitive interface. The webserver can be used to evaluate the effect of mutations on any protein in the UniProt database, and allows all predicted results, including modeled wild-type and mutated structures, to be managed and viewed online and downloaded if needed. It is backed by a database that contains improved structural domain definitions and a list of curated domain-domain interactions for all known proteins, as well as homology models of domains and domain-domain interactions for the human proteome. Homology models for proteins of other organisms are calculated on the fly, and mutations are evaluated within minutes once the homology model is available. AVAILABILITY AND IMPLEMENTATION: The ELASPIC webserver is available online at http://elaspic.kimlab.org. CONTACT: pm.kim@utoronto.ca or pi@kimlab.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Proteome, Humans, Mutation, Protein Binding, Protein Folding, Protein Stability, Software
ABSTRACT
Cys2His2 zinc finger (ZF) domains engineered to bind specific target sequences in the genome provide an effective strategy for programmable regulation of gene expression, with many potential therapeutic applications. However, the structurally intricate engagement of ZF domains with DNA has made their design challenging. Here we describe the screening of 49 billion protein-DNA interactions and the development of a deep-learning model, ZFDesign, that solves ZF design for any genomic target. ZFDesign is a modern machine learning method that models global and target-specific differences induced by a range of library environments and specifically takes into account compatibility of neighboring fingers using a novel hierarchical transformer architecture. We demonstrate the versatility of designed ZFs as nucleases as well as activators and repressors by seamless reprogramming of human transcription factors. These factors could be used to upregulate an allele of haploinsufficiency, downregulate a gain-of-function mutation or test the consequence of regulation of a single gene as opposed to the many genes that a transcription factor would normally influence.
Subject(s)
Deep Learning, Transcription Factors, Humans, Transcription Factors/genetics, Transcription Factors/metabolism, Zinc Fingers/genetics, Gene Expression Regulation, DNA/genetics
ABSTRACT
Deep learning approaches have produced substantial breakthroughs in fields such as image classification and natural language processing and are making rapid inroads in the area of protein design. Many generative models of proteins have been developed that encompass all known protein sequences, model specific protein families, or extrapolate the dynamics of individual proteins. Those generative models can learn protein representations that are often more informative of protein structure and function than hand-engineered features. Furthermore, they can be used to quickly propose millions of novel proteins that resemble their native counterparts in terms of expression level, stability, or other attributes. The protein design process can further be guided by discriminative oracles to select candidates with the highest probability of having the desired properties. In this review, we discuss five classes of generative models that have been most successful at modeling proteins and provide a framework for model-guided protein design.
Subject(s)
Computer Neural Networks, Proteins
ABSTRACT
The ELASPIC web server allows users to evaluate the effect of mutations on protein folding and protein-protein interaction on a proteome-wide scale. To make its predictions, it uses homology models of proteins and protein-protein interactions, which have been precalculated for several proteomes, together with machine learning models that integrate structural information with sequence conservation scores. Since the original publication of the ELASPIC web server, several advances have motivated a revisiting of the problem of mutation effect prediction. First, progress in neural network architectures and self-supervised pre-training has resulted in models that provide more informative embeddings of protein sequence and structure than those used by the original version of ELASPIC. Second, the amount of training data has increased several-fold, largely driven by advances in deep mutational scanning and other multiplexed assays of variant effect. Here, we describe two machine learning models that leverage these recent advances to achieve superior accuracy in predicting the effect of mutations on protein folding and protein-protein interaction. The models incorporate features generated using pre-trained transformer- and graph convolution-based neural networks, and are trained to optimize a ranking objective function, which permits the use of heterogeneous training data. The outputs from the new models have been incorporated into the ELASPIC web server, available at http://elaspic.kimlab.org.
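One way to see why a ranking objective permits heterogeneous training data is a pairwise loss that only compares variants measured within the same data set, so that measurements on different scales never have to be reconciled. The sketch below uses a generic pairwise logistic loss as a stand-in; it is not ELASPIC's actual objective, and the function name and all values are hypothetical.

```python
from itertools import combinations
from math import exp, log1p


def pairwise_ranking_loss(groups):
    """Mean logistic loss over within-group pairs.

    groups: list of data sets, each a list of (measured, predicted) tuples.
    Measured values are compared only inside their own data set.
    """
    total, n_pairs = 0.0, 0
    for group in groups:
        for (m1, p1), (m2, p2) in combinations(group, 2):
            if m1 == m2:
                continue  # no ordering information in a tied pair
            sign = 1.0 if m1 > m2 else -1.0
            # Small when the predicted difference agrees with the measured order.
            total += log1p(exp(-sign * (p1 - p2)))
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# Two hypothetical data sets on very different measurement scales.
stability_set = [(-2.1, -1.5), (0.3, 0.2), (-0.8, -0.1)]   # e.g. kcal/mol
assay_set = [(1500.0, 0.9), (20.0, -0.4)]                  # e.g. raw assay counts

print("ranking loss:", pairwise_ranking_loss([stability_set, assay_set]))
```

Because only the order of measurements within each data set enters the loss, rescaling one data set's measurements by any positive factor leaves the loss unchanged.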
Subject(s)
Computational Biology/methods, Language, Mutation/genetics, Computer Neural Networks, Software, Algorithms, Protein Databases, Internet, Protein Folding, Reproducibility of Results, User-Computer Interface
ABSTRACT
Computational generation of new proteins with a predetermined three-dimensional shape and computational optimization of existing proteins while maintaining their shape are challenging problems in structural biology. Here, we present a protocol that uses ProteinSolver, a pre-trained graph convolutional neural network, to quickly generate thousands of sequences matching a specific protein topology. We describe computational approaches that can be used to evaluate the generated sequences, and we show how select sequences can be validated experimentally. For complete details on the use and execution of this protocol, please refer to Strokach et al. (2020).
Subject(s)
Computational Biology, Protein Databases, Computer Neural Networks, Proteins, Software, Proteins/chemistry, Proteins/genetics
ABSTRACT
Protein structure and function are determined by the arrangement of the linear sequence of amino acids in 3D space. We show that a deep graph neural network, ProteinSolver, can precisely design sequences that fold into a predetermined shape by phrasing this challenge as a constraint satisfaction problem (CSP), akin to Sudoku puzzles. We trained ProteinSolver on over 70,000,000 real protein sequences corresponding to over 80,000 structures. We show that our method rapidly designs new protein sequences, and we benchmark them in silico using energy-based scores, molecular dynamics, and structure prediction methods. As a proof-of-principle validation, we use ProteinSolver to generate sequences that match the structure of serum albumin, then synthesize the top-scoring design and validate it in vitro using circular dichroism. ProteinSolver is freely available at http://design.proteinsolver.org and https://gitlab.com/ostrokach/proteinsolver. A record of this paper's transparent peer review process is included in the Supplemental Information.
Subject(s)
Protein Engineering/methods, Protein Sequence Analysis/methods, Algorithms, Amino Acid Sequence/genetics, Computer Simulation, Protein Databases, Computer Neural Networks, Proteins/metabolism, Software
ABSTRACT
The function of a protein is largely determined by its three-dimensional structure and its interactions with other proteins. Changes to a protein's amino acid sequence can alter its function by perturbing the energy landscapes of protein folding and binding. Many tools have been developed to predict the energetic effect of amino acid changes, utilizing features describing the sequence of a protein, the structure of a protein, or both. Those tools have many applications, such as distinguishing between deleterious and benign mutations and designing proteins and peptides with attractive properties. In this chapter, we describe how to use one such tool, ELASPIC, to predict the effect of mutations on the stability of proteins and the affinity between proteins, in the context of a human protein-protein interaction network. ELASPIC uses a wide range of sequence and structural features to predict the change in the Gibbs free energy for protein folding and protein-protein interactions. It can be used both through a web server and as a stand-alone application. Since ELASPIC was trained using homology models and not crystal structures, it can be applied to a much broader range of proteins than traditional methods. It can leverage precalculated sequence alignments, homology models, and other features to drastically reduce the time required to evaluate individual mutations, making tractable the analysis of millions of mutations affecting the majority of proteins in a genome.