Search | Biblioteca Virtual en Salud Odontología. Uruguay

OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization.

Ahdritz, Gustaf; Bouatta, Nazim; Floristean, Christina; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J; Berenberg, Daniel; Fisk, Ian; Zanichelli, Niccolò; Zhang, Bo; Nowaczynski, Arkadiusz; Wang, Bei; Stepniewska-Dziubinska, Marta M; Zhang, Shang; Ojewole, Adegoke; Guney, Murat Efe; Biderman, Stella; Watkins, Andrew M; Ra, Stephen; Lorenzo, Pablo Ribalta; Nivon, Lucas; Weitzner, Brian; Ban, Yih-En Andrew; Chen, Shiyang; Zhang, Minjia; Li, Conglong; Song, Shuaiwen Leon; He, Yuxiong; Sorger, Peter K; Mostaque, Emad; Zhang, Zhao; Bonneau, Richard; AlQuraishi, Mohammed.

Nat Methods ; 21(8): 1514-1524, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38744917

ABSTRACT

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set are deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.

Subject(s)

Models, Molecular; Protein Folding; Proteins; Proteins/chemistry; Computational Biology/methods; Software; Protein Conformation; Algorithms; Protein Structure, Secondary

RNA3DB: A structurally-dissimilar dataset split for training and benchmarking deep learning models for RNA structure prediction.

Szikszai, Marcell; Magnus, Marcin; Sanghi, Siddhant; Kadyan, Sachin; Bouatta, Nazim; Rivas, Elena.

bioRxiv ; 2024 Mar 11.

Article in English | MEDLINE | ID: mdl-38352531

ABSTRACT

With advances in protein structure prediction thanks to deep learning models like AlphaFold, RNA structure prediction has recently received increased attention from deep learning researchers. RNAs introduce substantial challenges due to the sparser availability and lower structural diversity of the experimentally resolved RNA structures in comparison to protein structures. These challenges are often poorly addressed by the existing literature, much of which reports inflated performance due to using training and testing sets with significant structural overlap. Further, the most recent Critical Assessment of Structure Prediction (CASP15) has shown that deep learning models for RNA structure are currently outperformed by traditional methods. In this paper we present RNA3DB, a dataset of structured RNAs, derived from the Protein Data Bank (PDB), that is designed for training and benchmarking deep learning models. The RNA3DB method arranges the RNA 3D chains into distinct groups (Components) that are non-redundant with regard to both sequence and structure, providing a robust way of dividing training, validation, and testing sets. Any split of these structurally-dissimilar Components is guaranteed to produce test and validation sets that are distinct by sequence and structure from those in the training set. We provide the RNA3DB dataset, a particular train/test split of the RNA3DB Components (in an approximate 70/30 ratio) that will be updated periodically. We also provide the RNA3DB methodology along with the source code, with the goal of creating a reproducible and customizable tool for producing structurally-dissimilar dataset splits for structural RNAs.
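The idea of splitting by Component rather than by individual chain can be illustrated with a minimal sketch. This is not the RNA3DB implementation: the function names, the `similar_pairs` input (pairs of chains assumed to be related by sequence or structure), and the use of union-find with a greedy fill are all assumptions made for illustration. It only shows why assigning whole groups to one side keeps related chains from straddling the train/test boundary.

```python
from collections import defaultdict


def group_into_components(chains, similar_pairs):
    """Group chains so that any two chains joined by a similarity
    pair (directly or transitively) end up in the same group.
    Hypothetical helper; uses a plain union-find."""
    parent = {c: c for c in chains}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for c in chains:
        groups[find(c)].append(c)
    # Largest groups first; ties broken by member names for determinism.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


def split_components(components, train_fraction=0.7):
    """Assign each whole group to train or test, greedily filling
    train up to roughly train_fraction of all chains."""
    total = sum(len(c) for c in components)
    train, test = [], []
    for comp in components:
        if len(train) + len(comp) <= train_fraction * total:
            train.extend(comp)
        else:
            test.extend(comp)
    return train, test


# Toy input: chain labels and similarity pairs are made up.
example_chains = list("ABCDEFGHIJ")
example_pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "G")]
example_components = group_into_components(example_chains, example_pairs)
example_train, example_test = split_components(example_components)
print(example_components)
print(example_train, example_test)
```

Because groups are never divided, the resulting ratio is only approximate, which is consistent with the abstract describing the provided split as an approximate 70/30 ratio.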

RNA3DB: A structurally-dissimilar dataset split for training and benchmarking deep learning models for RNA structure prediction.

Szikszai, Marcell; Magnus, Marcin; Sanghi, Siddhant; Kadyan, Sachin; Bouatta, Nazim; Rivas, Elena.

J Mol Biol ; : 168552, 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38552946

OpenProteinSet: Training data for structural biology at scale.

Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian; Watkins, Andrew M; Ra, Stephen; Bonneau, Richard; AlQuraishi, Mohammed.

ArXiv ; 2023 Aug 10.

Article in English | MEDLINE | ID: mdl-37608940

ABSTRACT

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
