ABSTRACT
The introduction of AlphaFold 2 [1] has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design [2-6]. Here we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture that is capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. The new AlphaFold model demonstrates substantially improved accuracy over many previous specialized tools: far greater accuracy for protein-ligand interactions compared with state-of-the-art docking tools, much higher accuracy for protein-nucleic acid interactions compared with nucleic-acid-specific predictors and substantially higher antibody-antigen prediction accuracy compared with AlphaFold-Multimer v.2.3 [7,8]. Together, these results show that high-accuracy modelling across biomolecular space is possible within a single unified deep-learning framework.
Subject(s)
Deep Learning , Ligands , Molecular Models , Proteins , Software , Humans , Antibodies/chemistry , Antibodies/metabolism , Antigens/metabolism , Antigens/chemistry , Deep Learning/standards , Ions/chemistry , Ions/metabolism , Molecular Docking Simulation , Nucleic Acids/chemistry , Nucleic Acids/metabolism , Protein Binding , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Reproducibility of Results , Software/standards
ABSTRACT
The mutation and overexpression of the epidermal growth factor receptor (EGFR) are associated with the development of a variety of cancers, making this prototypical dimerization-activated receptor tyrosine kinase a prominent target of cancer drugs. Using long-timescale molecular dynamics simulations, we find that the N lobe dimerization interface of the wild-type EGFR kinase domain is intrinsically disordered and that it becomes ordered only upon dimerization. Our simulations suggest, moreover, that some cancer-linked mutations distal to the dimerization interface, particularly the widespread L834R mutation (also referred to as L858R), facilitate EGFR dimerization by suppressing this local disorder. Corroborating these findings, our biophysical experiments and kinase enzymatic assays indicate that the L834R mutation causes abnormally high activity primarily by promoting EGFR dimerization rather than by allowing activation without dimerization. We also find that phosphorylation of EGFR kinase domain at Tyr845 may suppress the intrinsic disorder, suggesting a molecular mechanism for autonomous EGFR signaling.
Subject(s)
ErbB Receptors/chemistry , ErbB Receptors/genetics , Neoplasms/metabolism , Point Mutation , Signal Transduction , Amino Acid Sequence , X-Ray Crystallography , ErbB Receptors/antagonists & inhibitors , ErbB Receptors/metabolism , Gefitinib , Humans , Lapatinib , Molecular Dynamics Simulation , Molecular Sequence Data , Protein Folding , Protein Kinase Inhibitors/pharmacology , Protein Multimerization , Tertiary Protein Structure , Quinazolines/pharmacology , Sequence Alignment
ABSTRACT
Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure [1]. Here we markedly expand the structural coverage of the proteome by applying the state-of-the-art machine learning method, AlphaFold2, at a scale that covers almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions that are likely to be disordered. Finally, we provide some case studies to illustrate how high-quality predictions could be used to generate biological hypotheses. We are making our predictions freely available to the community and anticipate that routine large-scale and high-accuracy structure prediction will become an important tool that will allow new questions to be addressed from a structural perspective.
Subject(s)
Computational Biology/standards , Deep Learning/standards , Molecular Models , Protein Conformation , Proteome/chemistry , Datasets as Topic/standards , Diacylglycerol O-Acyltransferase/chemistry , Glucose-6-Phosphatase/chemistry , Humans , Membrane Proteins/chemistry , Protein Folding , Reproducibility of Results
ABSTRACT
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1-4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence, the structure prediction component of the 'protein folding problem' [8], has been an important open research problem for more than 50 years [9]. Despite recent progress [10-14], existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) [15], demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
Subject(s)
Computer Neural Networks , Protein Conformation , Protein Folding , Proteins/chemistry , Amino Acid Sequence , Computational Biology/methods , Computational Biology/standards , Protein Databases , Deep Learning/standards , Molecular Models , Reproducibility of Results , Sequence Alignment
ABSTRACT
Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence [1]. This problem is of fundamental importance as the structure of a protein largely determines its function [2]; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures [3]. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force [4] that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction [5] (CASP13), a blind assessment of the state of the field, AlphaFold created high-accuracy structures (with template modelling (TM) scores [6] of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein-structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined [7].
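The idea of optimizing a distance-based potential by gradient descent can be illustrated with a toy sketch. The code below is not the AlphaFold potential: it assumes a simple harmonic penalty on the mismatch between current and target pairwise distances, and it moves Cartesian coordinates directly. The function names, the step size and the number of steps are all hypothetical choices made for illustration.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances for an (N, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    # Small epsilon keeps the square root differentiable at zero distance.
    return np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)

def potential_and_gradient(coords, target):
    """Toy harmonic potential E = 0.5 * sum_{i<j} (d_ij - t_ij)^2 and dE/dx."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = distance_matrix(coords)
    err = dist - target
    np.fill_diagonal(err, 0.0)           # a residue has no distance to itself
    energy = 0.25 * (err ** 2).sum()     # full matrix counts each pair twice
    grad = ((err / dist)[:, :, None] * diff).sum(axis=1)
    return energy, grad

def descend(target, steps=500, lr=0.01, seed=0):
    """Plain gradient descent from random coordinates (assumed settings)."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(target.shape[0], 3))
    history = []
    for _ in range(steps):
        energy, grad = potential_and_gradient(coords, target)
        history.append(energy)
        coords = coords - lr * grad
    return coords, history
```

Given a target distance matrix, `descend(target)` returns coordinates and the energy trace. Because distances are invariant to rotation, translation and reflection, the result matches the target only up to those symmetries.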
Subject(s)
Deep Learning , Molecular Models , Protein Conformation , Proteins/chemistry , Software , Amino Acid Sequence , Caspases/chemistry , Caspases/genetics , Datasets as Topic , Protein Folding , Proteins/genetics
ABSTRACT
The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements in data archiving, covering successive releases encompassing model organisms, global health proteomes, Swiss-Prot integration, and a host of curated protein datasets. We detail the data access mechanisms of AlphaFold DB, from direct file access via FTP to advanced queries using Google Cloud Public Datasets and the programmatic access endpoints of the database. We also discuss the improvements and services added since its initial release, including enhancements to the Predicted Aligned Error viewer, customisation options for the 3D viewer, and improvements in the search engine of AlphaFold DB.
The AlphaFold Protein Structure Database (AlphaFold DB) is a massive digital library of predicted protein structures, with over 214 million entries, marking a 500-times expansion in size since its initial release in 2021. The structures are predicted using Google DeepMind's AlphaFold 2 artificial intelligence (AI) system. Our new report highlights the latest updates we have made to this database. We have added more data on specific organisms and proteins related to global health and expanded to cover almost the complete UniProt database, a primary data resource of protein sequences. We also made it easier for our users to access the data by directly downloading files or using advanced cloud-based tools. Finally, we have also improved how users view and search through these protein structures, making the user experience smoother and more informative. In short, AlphaFold DB has been growing rapidly and has become more user-friendly and robust to support the broader scientific community.
Subject(s)
Artificial Intelligence , Secondary Protein Structure , Proteome , Amino Acid Sequence , Protein Databases , Search Engine , Proteins/chemistry
ABSTRACT
How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer-promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.
Subject(s)
DNA/genetics , Genetic Databases , Genetic Epigenesis , Gene Expression Regulation , Machine Learning , Nerve Net , Animals , Cell Line , Genome , Genomics/methods , Humans , Mice , Quantitative Trait Loci
ABSTRACT
The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions. Powered by AlphaFold v2.0 of DeepMind, it has enabled an unprecedented expansion of the structural coverage of the known protein-sequence space. AlphaFold DB provides programmatic access to and interactive visualization of predicted atomic coordinates, per-residue and pairwise model-confidence estimates and predicted aligned errors. The initial release of AlphaFold DB contains over 360,000 predicted structures across 21 model-organism proteomes, which will soon be expanded to cover most of the (over 100 million) representative sequences from the UniRef90 data set.
Subject(s)
Protein Databases , Protein Folding , Proteins/chemistry , Software , Amino Acid Sequence , Animals , Bacteria/genetics , Bacteria/metabolism , Datasets as Topic , Dictyostelium/genetics , Dictyostelium/metabolism , Fungi/genetics , Fungi/metabolism , Humans , Internet , Molecular Models , Plants/genetics , Plants/metabolism , Alpha-Helical Protein Conformation , Beta-Strand Protein Conformation , Proteins/genetics , Proteins/metabolism , Trypanosoma cruzi/genetics , Trypanosoma cruzi/metabolism
ABSTRACT
We describe the operation and improvement of AlphaFold, the system that was entered by the team AlphaFold2 to the "human" category in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The AlphaFold system entered in CASP14 is entirely different to the one entered in CASP13. It used a novel end-to-end deep neural network trained to produce protein structures from amino acid sequence, multiple sequence alignments, and homologous proteins. In the assessors' ranking by summed z scores (>2.0), AlphaFold scored 244.0 compared to 90.8 by the next best group. The predictions made by AlphaFold had a median domain GDT_TS of 92.4; this is the first time that this level of average accuracy has been achieved during CASP, especially on the more difficult Free Modeling targets, and represents a significant improvement in the state of the art in protein structure prediction. We reported how AlphaFold was run as a human team during CASP14 and improved such that it now achieves an equivalent level of performance without intervention, opening the door to highly accurate large-scale structure prediction.
Subject(s)
Molecular Models , Computer Neural Networks , Protein Folding , Proteins , Software , Amino Acid Sequence , Computational Biology , Deep Learning , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Protein Sequence Analysis
ABSTRACT
Single-molecule force spectroscopy has proven extremely beneficial in elucidating folding pathways for membrane proteins. Here, we simulate these measurements, conducting hundreds of unfolding trajectories using our fast Upside algorithm for slow enough speeds to reproduce key experimental features that may be missed using all-atom methods. The speed also enables us to determine the logarithmic dependence of pulling velocities on the rupture levels to better compare to experimental values. For simulations of atomic force microscope measurements in which force is applied vertically to the C-terminus of bacteriorhodopsin, we reproduce the major experimental features including even the back-and-forth unfolding of single helical turns. When pulling laterally on GlpG to mimic the experiment, we observe quite different behavior depending on the stiffness of the spring. With a soft spring, as used in the experimental studies with magnetic tweezers, the force remains nearly constant after the initial unfolding event, and a few pathways and a high degree of cooperativity are observed in both the experiment and simulation. With a stiff spring, however, the force drops to near zero after each major unfolding event, and numerous intermediates are observed along a wide variety of pathways. Hence, the mode of force application significantly alters the perception of the folding landscape, including the number of intermediates and the degree of folding cooperativity, important issues that should be considered when designing experiments and interpreting unfolding data.
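A logarithmic dependence of rupture force on pulling velocity can be estimated with an ordinary least-squares fit of force against the logarithm of velocity. The sketch below assumes the simple form F = a + b ln(v); the function name is hypothetical and nothing here is taken from the Upside code or from the paper's data.

```python
import numpy as np

def fit_log_dependence(velocities, rupture_forces):
    """Least-squares fit of F = a + b * ln(v); returns (a, b)."""
    log_v = np.log(np.asarray(velocities, dtype=float))
    forces = np.asarray(rupture_forces, dtype=float)
    # polyfit returns the highest-degree coefficient first: slope, then intercept.
    b, a = np.polyfit(log_v, forces, 1)
    return a, b
```

The slope `b` gives the change in rupture force per e-fold change in pulling velocity, which is the quantity one would compare between simulation and experiment.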
Subject(s)
DNA-Binding Proteins/chemistry , Endopeptidases/chemistry , Escherichia coli Proteins/chemistry , Membrane Proteins/chemistry , Molecular Dynamics Simulation , Protein Folding , Lipid Bilayers/chemistry
ABSTRACT
We describe AlphaFold, the protein structure prediction system that was entered by the group A7D in CASP13. Submissions were made by three free-modeling (FM) methods which combine the predictions of three neural networks. All three systems were guided by predictions of distances between pairs of residues produced by a neural network. Two systems assembled fragments produced by a generative neural network, one using scores from a network trained to regress GDT_TS. The third system shows that simple gradient descent on a properly constructed potential is able to perform on par with more expensive traditional search techniques and without requiring domain segmentation. In the CASP13 FM assessors' ranking by summed z-scores, this system scored highest with 68.3 vs 48.2 for the next closest group (an average GDT_TS of 61.4). The system produced high-accuracy structures (with GDT_TS scores of 70 or higher) for 11 out of 43 FM domains. Despite not explicitly using template information, the results in the template category were comparable to the best performing template-based methods.
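For readers unfamiliar with GDT_TS, the score averages the percentage of residues whose C-alpha atoms lie within 1, 2, 4 and 8 Å of the reference structure. The sketch below is a simplification: it assumes the model has already been superposed on the reference, whereas the full GDT procedure searches over superpositions for each cutoff. The function name is hypothetical.

```python
import numpy as np

def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for pre-superposed (N, 3) C-alpha coordinates."""
    model = np.asarray(model, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Per-residue deviation between model and reference, in angstroms.
    deviations = np.linalg.norm(model - reference, axis=1)
    # Fraction of residues within each distance cutoff.
    fractions = [(deviations <= cutoff).mean() for cutoff in cutoffs]
    return 100.0 * float(np.mean(fractions))
```

Under this definition, a score of 70 or higher means that on average 70% of residues fall within the four cutoffs, which is the threshold the abstract uses for a high-accuracy structure.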
Subject(s)
Computational Biology/methods , Computer Neural Networks , Protein Conformation , Protein Folding , Proteins/chemistry , Algorithms , Protein Databases , Molecular Models
ABSTRACT
An ongoing challenge in protein chemistry is to identify the underlying interaction energies that capture protein dynamics. The traditional trade-off in biomolecular simulation between accuracy and computational efficiency is predicated on the assumption that detailed force fields are typically well-parameterized, obtaining a significant fraction of possible accuracy. We re-examine this trade-off in the more realistic regime in which parameterization is a greater source of error than the level of detail in the force field. To address parameterization of coarse-grained force fields, we use the contrastive divergence technique from machine learning to train from simulations of 450 proteins. In our procedure, the computational efficiency of the model enables high accuracy through the precise tuning of the Boltzmann ensemble. This method is applied to our recently developed Upside model, where the free energy for side chains is rapidly calculated at every time-step, allowing for a smooth energy landscape without steric rattling of the side chains. After this contrastive divergence training, the model is able to de novo fold proteins up to 100 residues on a single core in days. This improved Upside model provides a starting point both for investigation of folding dynamics and as an inexpensive Bayesian prior for protein physics that can be integrated with additional experimental or bioinformatic data.
Subject(s)
Computational Biology/methods , Proteins/chemistry , Bayes Theorem , Computer Simulation , Machine Learning , Molecular Dynamics Simulation/statistics & numerical data , Protein Conformation , Protein Folding , Software , Thermodynamics
ABSTRACT
To address the large gap between time scales that can be easily reached by molecular simulations and those required to understand protein dynamics, we present a rapid self-consistent approximation of the side chain free energy at every integration step. In analogy with the adiabatic Born-Oppenheimer approximation for electronic structure, the protein backbone dynamics are simulated as proceeding according to the dictates of the free energy of an instantaneously-equilibrated side chain potential. The side chain free energy is computed on the fly, allowing the protein backbone dynamics to traverse a greatly smoothed energetic landscape. This computation results in extremely rapid equilibration and sampling of the Boltzmann distribution. Our method, termed Upside, employs a reduced model involving the three backbone atoms, along with the carbonyl oxygen and amide proton, and a single (oriented) side chain bead having multiple locations reflecting the conformational diversity of the side chain's rotameric states. We also introduce a novel, maximum-likelihood method to parameterize the side chain interactions using protein structures. We demonstrate state-of-the-art accuracy for predicting χ1 rotamer states while consuming only milliseconds of CPU time. Our method enables rapidly equilibrating coarse-grained simulations that can nonetheless contain significant molecular detail. We also show that the resulting free energies of the side chains are sufficiently accurate for de novo folding of some proteins.
Subject(s)
Molecular Dynamics Simulation/statistics & numerical data , Proteins/chemistry , Amino Acids/chemistry , Entropy , Molecular Models , Probability , Protein Conformation , Thermodynamics
ABSTRACT
We use the statistics of a large and curated training set of transmembrane helical proteins to develop a knowledge-based potential that accounts for the dependence on both the depth of burial of the protein in the membrane and the degree of side-chain exposure. Additionally, the statistical potential includes depth-dependent energies for unsatisfied backbone hydrogen bond donors and acceptors, which are found to be relatively small, ~2 RT. Our potential accurately places known proteins within the bilayer. The potential is applied to the mechanosensing MscL channel in membranes of varying thickness and curvature, as well as to the prediction of protein structure. The potential is incorporated into our new Upside molecular dynamics algorithm. Notably, we account for the exchange of protein-lipid interactions for protein-protein interactions as helices contact each other, thereby avoiding overestimating the energetics of helix association within the membrane. Simulations of most multimeric complexes find that isolated monomers and the oligomers retain the same orientation in the membrane, suggesting that the assembly of prepositioned monomers presents a viable mechanism of oligomerization.
Subject(s)
Cell Membrane/chemistry , Membrane Proteins/chemistry , Molecular Dynamics Simulation , Hydrogen Bonding , Kinetics , Alpha-Helical Protein Conformation , Protein Folding , Thermodynamics
ABSTRACT
In this Viewpoint, 2023 Lasker award winners John Jumper and Demis Hassabis describe their invention, the artificial intelligence-based system AlphaFold, which is able to predict protein structure with great accuracy.
Subject(s)
Awards and Prizes , Biomedical Research , Protein Conformation , Biomedical Research/history , Medicine , Molecular Structure , United Kingdom
ABSTRACT
The loss of conformational entropy is a major contribution in the thermodynamics of protein folding. However, accurate determination of the quantity has proven challenging. We calculate this loss using molecular dynamic simulations of both the native protein and a realistic denatured state ensemble. For ubiquitin, the total change in entropy is TΔS_Total = 1.4 kcal·mol⁻¹ per residue at 300 K with only 20% from the loss of side-chain entropy. Our analysis exhibits mixed agreement with prior studies because of the use of more accurate ensembles and contributions from correlated motions. Buried side chains lose only a factor of 1.4 in the number of conformations available per rotamer upon folding (Ω_U/Ω_N). The entropy loss for helical and sheet residues differs due to the smaller motions of helical residues (TΔS_helix-sheet = 0.5 kcal·mol⁻¹), a property not fully reflected in the amide N-H and carbonyl C=O bond NMR order parameters. The results have implications for the thermodynamics of folding and binding, including estimates of solvent ordering and microscopic entropies obtained from NMR.
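To put the factor of 1.4 in energetic terms, the Boltzmann relation gives a molar entropy change of R ln(Ω_U/Ω_N). The short calculation below is a back-of-the-envelope sketch, not a figure reported in the abstract, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def t_delta_s(omega_ratio, temperature=300.0):
    """T * dS in kcal/mol for a given ratio of accessible conformations."""
    return R_KCAL * temperature * math.log(omega_ratio)

# Factor of 1.4 per rotamer at 300 K: roughly 0.20 kcal/mol.
print(round(t_delta_s(1.4), 2))
```

On this estimate, the loss per buried rotamer is small compared with the 1.4 kcal·mol⁻¹ per residue total, consistent with the abstract's statement that side chains contribute only a minor share of the entropy change.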
Subject(s)
Entropy , Magnetic Resonance Spectroscopy , Protein Folding , Ubiquitin/chemistry , Amino Acids/chemistry , Protein Denaturation , Secondary Protein Structure
ABSTRACT
Silicon nanowires (SiNWs) have emerged as a new class of materials with important applications in biology and medicine, with current efforts having focused primarily on using substrate-bound SiNW devices. However, developing devices capable of free-standing inter- and intracellular operation is an important next step in designing new synthetic cellular materials and tools for biophysical characterization. To demonstrate this, here we show that label-free SiNWs can be internalized in multiple cell lines, forming robust cytoskeletal interfaces, and when kinked can serve as free-standing inter- and intracellular force probes capable of continuous extended (>1 h) force monitoring. Our results show that intercellular interactions exhibit ratcheting-like behavior with force peaks of ~69.6 pN/SiNW, while intracellular force peaks of ~116.9 pN/SiNW were recorded during smooth muscle contraction. To accomplish this, we have introduced a simple single-capture dark-field/phase contrast optical imaging modality, scatter-enhanced phase contrast (SEPC), which enables the simultaneous visualization of both cellular components and inorganic nanostructures. This approach demonstrates that rationally designed devices capable of substrate-independent operation are achievable, providing a simple and scalable method for continuous inter- and intracellular force dynamics studies.
ABSTRACT
The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.
Subject(s)
Amino Acid Substitution , Disease , Missense Mutation , Proteome , Sequence Alignment , Humans , Amino Acid Substitution/genetics , Benchmarking , Conserved Sequence , Genetic Databases , Disease/genetics , Human Genome , Protein Conformation , Proteome/genetics , Sequence Alignment/methods , Machine Learning
ABSTRACT
Recognition of promoters in bacterial RNA polymerases (RNAPs) is controlled by sigma subunits. The key sequence motif recognized by the sigma, the -10 promoter element, is located in the non-template strand of the double-stranded DNA molecule ~10 nucleotides upstream of the transcription start site. Here, we explain the mechanism by which the phage AR9 non-virion RNAP (nvRNAP), a bacterial RNAP homolog, recognizes the -10 element of its deoxyuridine-containing promoter in the template strand. The AR9 sigma-like subunit, the nvRNAP enzyme core, and the template strand together form two nucleotide base-accepting pockets whose shapes dictate the requirement for the conserved deoxyuridines. A single amino acid substitution in the AR9 sigma-like subunit allows one of these pockets to accept a thymine thus expanding the promoter consensus. Our work demonstrates the extent to which viruses can evolve host-derived multisubunit enzymes to make transcription of their own genes independent of the host.