Search | BVS Regional Portal

OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization.

Ahdritz, Gustaf; Bouatta, Nazim; Floristean, Christina; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J; Berenberg, Daniel; Fisk, Ian; Zanichelli, Niccolò; Zhang, Bo; Nowaczynski, Arkadiusz; Wang, Bei; Stepniewska-Dziubinska, Marta M; Zhang, Shang; Ojewole, Adegoke; Guney, Murat Efe; Biderman, Stella; Watkins, Andrew M; Ra, Stephen; Lorenzo, Pablo Ribalta; Nivon, Lucas; Weitzner, Brian; Ban, Yih-En Andrew; Chen, Shiyang; Zhang, Minjia; Li, Conglong; Song, Shuaiwen Leon; He, Yuxiong; Sorger, Peter K; Mostaque, Emad; Zhang, Zhao; Bonneau, Richard; AlQuraishi, Mohammed.

Nat Methods ; 21(8): 1514-1524, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38744917

ABSTRACT

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set is deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.

Subject(s)

Molecular Models, Protein Folding, Proteins, Proteins/chemistry, Computational Biology/methods, Software, Protein Conformation, Algorithms, Protein Secondary Structure

OpenProteinSet: Training data for structural biology at scale.

Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian; Watkins, Andrew M; Ra, Stephen; Bonneau, Richard; AlQuraishi, Mohammed.

ArXiv ; 2023 Aug 10.

Article in English | MEDLINE | ID: mdl-37608940

ABSTRACT

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.

Publisher Correction: Single-sequence protein structure prediction using a language model and deep learning.

Chowdhury, Ratul; Bouatta, Nazim; Biswas, Surojit; Floristean, Christina; Kharkar, Anant; Roy, Koushik; Rochereau, Charlotte; Ahdritz, Gustaf; Zhang, Joanna; Church, George M; Sorger, Peter K; AlQuraishi, Mohammed.

Nat Biotechnol ; 40(11): 1692, 2022 Nov.

Article in English | MEDLINE | ID: mdl-36253538

Single-sequence protein structure prediction using a language model and deep learning.

Nat Biotechnol ; 40(11): 1617-1623, 2022 Nov.

Article in English | MEDLINE | ID: mdl-36192636

ABSTRACT

AlphaFold2 and related computational systems predict protein structure using deep learning and co-evolutionary relationships encoded in multiple sequence alignments (MSAs). Despite high prediction accuracy achieved by these systems, challenges remain in (1) prediction of orphan and rapidly evolving proteins for which an MSA cannot be generated; (2) rapid exploration of designed structures; and (3) understanding the rules governing spontaneous polypeptide folding in solution. Here we report development of an end-to-end differentiable recurrent geometric network (RGN) that uses a protein language model (AminoBERT) to learn latent structural information from unaligned proteins. A linked geometric module compactly represents Cα backbone geometry in a translationally and rotationally invariant way. On average, RGN2 outperforms AlphaFold2 and RoseTTAFold on orphan proteins and classes of designed proteins while achieving up to a 10^6-fold reduction in compute time. These findings demonstrate the practical and theoretical strengths of protein language models relative to MSAs in structure prediction.

Subject(s)

Deep Learning, Language, Proteins/metabolism, Sequence Alignment, Computational Biology, Protein Conformation
