Search | VHL Regional Portal

1.

OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization.

Ahdritz, Gustaf; Bouatta, Nazim; Floristean, Christina; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J; Berenberg, Daniel; Fisk, Ian; Zanichelli, Niccolò; Zhang, Bo; Nowaczynski, Arkadiusz; Wang, Bei; Stepniewska-Dziubinska, Marta M; Zhang, Shang; Ojewole, Adegoke; Guney, Murat Efe; Biderman, Stella; Watkins, Andrew M; Ra, Stephen; Lorenzo, Pablo Ribalta; Nivon, Lucas; Weitzner, Brian; Ban, Yih-En Andrew; Chen, Shiyang; Zhang, Minjia; Li, Conglong; Song, Shuaiwen Leon; He, Yuxiong; Sorger, Peter K; Mostaque, Emad; Zhang, Zhao; Bonneau, Richard; AlQuraishi, Mohammed.

Nat Methods ; 2024 May 14.

Article in English | MEDLINE | ID: mdl-38744917

ABSTRACT

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set is deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.

2.

RNA3DB: A structurally-dissimilar dataset split for training and benchmarking deep learning models for RNA structure prediction.

Szikszai, Marcell; Magnus, Marcin; Sanghi, Siddhant; Kadyan, Sachin; Bouatta, Nazim; Rivas, Elena.

J Mol Biol ; : 168552, 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38552946

ABSTRACT

With advances in protein structure prediction thanks to deep learning models like AlphaFold, RNA structure prediction has recently received increased attention from deep learning researchers. RNAs introduce substantial challenges due to the sparser availability and lower structural diversity of the experimentally resolved RNA structures in comparison to protein structures. These challenges are often poorly addressed by the existing literature, much of which reports inflated performance due to using training and testing sets with significant structural overlap. Further, the most recent Critical Assessment of Structure Prediction (CASP15) has shown that deep learning models for RNA structure are currently outperformed by traditional methods. In this paper we present RNA3DB, a dataset of structured RNAs, derived from the Protein Data Bank (PDB), that is designed for training and benchmarking deep learning models. The RNA3DB method arranges the RNA 3D chains into distinct groups (Components) that are non-redundant both with regard to sequence as well as structure, providing a robust way of dividing training, validation, and testing sets. Any split of these structurally-dissimilar Components is guaranteed to produce test and validation sets that are distinct by sequence and structure from those in the training set. We provide the RNA3DB dataset, a particular train/test split of the RNA3DB Components (in an approximate 70/30 ratio) that will be updated periodically. We also provide the RNA3DB methodology along with the source-code, with the goal of creating a reproducible and customizable tool for producing structurally-dissimilar dataset splits for structural RNAs.

3.

RNA3DB: A structurally-dissimilar dataset split for training and benchmarking deep learning models for RNA structure prediction.

Szikszai, Marcell; Magnus, Marcin; Sanghi, Siddhant; Kadyan, Sachin; Bouatta, Nazim; Rivas, Elena.

bioRxiv ; 2024 Mar 11.

Article in English | MEDLINE | ID: mdl-38352531

ABSTRACT

With advances in protein structure prediction thanks to deep learning models like AlphaFold, RNA structure prediction has recently received increased attention from deep learning researchers. RNAs introduce substantial challenges due to the sparser availability and lower structural diversity of the experimentally resolved RNA structures in comparison to protein structures. These challenges are often poorly addressed by the existing literature, much of which reports inflated performance due to using training and testing sets with significant structural overlap. Further, the most recent Critical Assessment of Structure Prediction (CASP15) has shown that deep learning models for RNA structure are currently outperformed by traditional methods. In this paper we present RNA3DB, a dataset of structured RNAs, derived from the Protein Data Bank (PDB), that is designed for training and benchmarking deep learning models. The RNA3DB method arranges the RNA 3D chains into distinct groups (Components) that are non-redundant both with regard to sequence as well as structure, providing a robust way of dividing training, validation, and testing sets. Any split of these structurally-dissimilar Components is guaranteed to produce test and validation sets that are distinct by sequence and structure from those in the training set. We provide the RNA3DB dataset, a particular train/test split of the RNA3DB Components (in an approximate 70/30 ratio) that will be updated periodically. We also provide the RNA3DB methodology along with the source-code, with the goal of creating a reproducible and customizable tool for producing structurally-dissimilar dataset splits for structural RNAs.

4.

OpenProteinSet: Training data for structural biology at scale.

Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian; Watkins, Andrew M; Ra, Stephen; Bonneau, Richard; AlQuraishi, Mohammed.

ArXiv ; 2023 Aug 10.

Article in English | MEDLINE | ID: mdl-37608940

ABSTRACT

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.

5.

Structural biology at the scale of proteomes.

Bouatta, Nazim; AlQuraishi, Mohammed.

Nat Struct Mol Biol ; 30(2): 129-130, 2023 Feb.

Article in English | MEDLINE | ID: mdl-36797377

Subject(s)

Molecular Biology , Proteome , Proteomics , Computational Biology

6.

Publisher Correction: Single-sequence protein structure prediction using a language model and deep learning.

Chowdhury, Ratul; Bouatta, Nazim; Biswas, Surojit; Floristean, Christina; Kharkar, Anant; Roy, Koushik; Rochereau, Charlotte; Ahdritz, Gustaf; Zhang, Joanna; Church, George M; Sorger, Peter K; AlQuraishi, Mohammed.

Nat Biotechnol ; 40(11): 1692, 2022 Nov.

Article in English | MEDLINE | ID: mdl-36253538

7.

Single-sequence protein structure prediction using a language model and deep learning.

Chowdhury, Ratul; Bouatta, Nazim; Biswas, Surojit; Floristean, Christina; Kharkar, Anant; Roy, Koushik; Rochereau, Charlotte; Ahdritz, Gustaf; Zhang, Joanna; Church, George M; Sorger, Peter K; AlQuraishi, Mohammed.

Nat Biotechnol ; 40(11): 1617-1623, 2022 Nov.

Article in English | MEDLINE | ID: mdl-36192636

ABSTRACT

AlphaFold2 and related computational systems predict protein structure using deep learning and co-evolutionary relationships encoded in multiple sequence alignments (MSAs). Despite high prediction accuracy achieved by these systems, challenges remain in (1) prediction of orphan and rapidly evolving proteins for which an MSA cannot be generated; (2) rapid exploration of designed structures; and (3) understanding the rules governing spontaneous polypeptide folding in solution. Here we report development of an end-to-end differentiable recurrent geometric network (RGN) that uses a protein language model (AminoBERT) to learn latent structural information from unaligned proteins. A linked geometric module compactly represents Cα backbone geometry in a translationally and rotationally invariant way. On average, RGN2 outperforms AlphaFold2 and RoseTTAFold on orphan proteins and classes of designed proteins while achieving up to a 10^6-fold reduction in compute time. These findings demonstrate the practical and theoretical strengths of protein language models relative to MSAs in structure prediction.

Subject(s)

Deep Learning , Language , Proteins/metabolism , Sequence Alignment , Computational Biology , Protein Conformation

8.

Protein structure prediction by AlphaFold2: are attention and symmetries all you need?

Bouatta, Nazim; Sorger, Peter; AlQuraishi, Mohammed.

Acta Crystallogr D Struct Biol ; 77(Pt 8): 982-991, 2021 Aug 01.

Article in English | MEDLINE | ID: mdl-34342271

ABSTRACT

The functions of most proteins result from their 3D structures, but determining their structures experimentally remains a challenge, despite steady advances in crystallography, NMR and single-particle cryoEM. Computationally predicting the structure of a protein from its primary sequence has long been a grand challenge in bioinformatics, intimately connected with understanding protein chemistry and dynamics. Recent advances in deep learning, combined with the availability of genomic data for inferring co-evolutionary patterns, provide a new approach to protein structure prediction that is complementary to longstanding physics-based approaches. The outstanding performance of AlphaFold2 in the recent Critical Assessment of protein Structure Prediction (CASP14) experiment demonstrates the remarkable power of deep learning in structure prediction. In this perspective, we focus on the key features of AlphaFold2, including its use of (i) attention mechanisms and Transformers to capture long-range dependencies, (ii) symmetry principles to facilitate reasoning over protein structures in three dimensions and (iii) end-to-end differentiability as a unifying framework for learning from protein data. The rules of protein folding are ultimately encoded in the physical principles that underpin it; to conclude, the implications of having a powerful computational model for structure prediction that does not explicitly rely on those principles are discussed.

Subject(s)

Proteins/chemistry , Proteins/metabolism , Algorithms , Animals , Caspases/chemistry , Caspases/metabolism , Computational Biology/methods , Databases, Protein , Humans , Protein Conformation
