Search | BVS - MINISTRY OF HEALTH

OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization.

Ahdritz, Gustaf; Bouatta, Nazim; Floristean, Christina; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J; Berenberg, Daniel; Fisk, Ian; Zanichelli, Niccolò; Zhang, Bo; Nowaczynski, Arkadiusz; Wang, Bei; Stepniewska-Dziubinska, Marta M; Zhang, Shang; Ojewole, Adegoke; Guney, Murat Efe; Biderman, Stella; Watkins, Andrew M; Ra, Stephen; Lorenzo, Pablo Ribalta; Nivon, Lucas; Weitzner, Brian; Ban, Yih-En Andrew; Chen, Shiyang; Zhang, Minjia; Li, Conglong; Song, Shuaiwen Leon; He, Yuxiong; Sorger, Peter K; Mostaque, Emad; Zhang, Zhao; Bonneau, Richard; AlQuraishi, Mohammed.

Nat Methods ; 2024 May 14.

Article in English | MEDLINE | ID: mdl-38744917

ABSTRACT

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set are deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.

Protein remote homology detection and structural alignment using deep learning.

Hamamsy, Tymor; Morton, James T; Blackwell, Robert; Berenberg, Daniel; Carriero, Nicholas; Gligorijevic, Vladimir; Strauss, Charlie E M; Leman, Julia Koehler; Cho, Kyunghyun; Bonneau, Richard.

Nat Biotechnol ; 2023 Sep 07.

Article in English | MEDLINE | ID: mdl-37679542

ABSTRACT

Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
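For readers unfamiliar with the TM-score that TM-Vec is trained to predict, the following is a minimal sketch of the standard TM-score formula, computed from per-residue distances after a structural superposition. It is not code from TM-Vec or DeepBLAST: the function names are hypothetical, the distances are assumed to be already available from some alignment, and the 0.5 Å floor on d0 is an assumption made here to keep the formula defined for very short chains.

```python
def tm_score_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (in angstroms) of the TM-score."""
    # For chains of 15 residues or fewer the cube-root term is not usable,
    # so this sketch falls back to a 0.5 angstrom floor (an assumption).
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(aligned_distances, target_length: int) -> float:
    """TM-score for one fixed superposition.

    aligned_distances: distances (angstroms) between aligned residue pairs.
    target_length: number of residues in the target (reference) protein.
    """
    d0 = tm_score_d0(target_length)
    # Each aligned pair contributes a value in (0, 1]; unaligned residues
    # contribute nothing, which is why the sum is divided by target_length.
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in aligned_distances)
    return total / target_length


if __name__ == "__main__":
    # A perfect superposition of a 100-residue protein scores 1.0.
    print(tm_score([0.0] * 100, 100))
```

The full TM-score is the maximum of this quantity over all superpositions, which is the step that structure-based tools spend their time on and that a sequence-based predictor would sidestep.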

OpenProteinSet: Training data for structural biology at scale.

Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian; Watkins, Andrew M; Ra, Stephen; Bonneau, Richard; AlQuraishi, Mohammed.

ArXiv ; 2023 Aug 10.

Article in English | MEDLINE | ID: mdl-37608940

ABSTRACT

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.

Sequence-structure-function relationships in the microbial protein universe.

Koehler Leman, Julia; Szczerbiak, Pawel; Renfrew, P Douglas; Gligorijevic, Vladimir; Berenberg, Daniel; Vatanen, Tommi; Taylor, Bryn C; Chandler, Chris; Janssen, Stefan; Pataki, Andras; Carriero, Nick; Fisk, Ian; Xavier, Ramnik J; Knight, Rob; Bonneau, Richard; Kosciolek, Tomasz.

Nat Commun ; 14(1): 2351, 2023 Apr 26.

Article in English | MEDLINE | ID: mdl-37100781

ABSTRACT

For the past half-century, structural biologists have relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that do not rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We predict ~200,000 structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotate them functionally on a per-residue basis. Structure prediction is accomplished using the World Community Grid, a large-scale citizen science initiative. The resulting database of structural models is complementary to the AlphaFold database, with regard to domains of life as well as sequence diversity and sequence length. We identify 148 novel folds and describe examples where we map specific functions to structural motifs. We also show that the structural space is continuous and largely saturated, highlighting the need for a shift in focus across all branches of biology, from obtaining structures to putting them into context and from sequence-based to sequence-structure-function-based meta-omics analyses.

Subjects

Protein Folding; Proteins; Proteins/metabolism; Amino Acid Sequence; Structure-Activity Relationship; Databases, Protein

Structure-based protein function prediction using graph convolutional networks.

Gligorijevic, Vladimir; Renfrew, P Douglas; Kosciolek, Tomasz; Leman, Julia Koehler; Berenberg, Daniel; Vatanen, Tommi; Chandler, Chris; Taylor, Bryn C; Fisk, Ian M; Vlamakis, Hera; Xavier, Ramnik J; Knight, Rob; Cho, Kyunghyun; Bonneau, Richard.

Nat Commun ; 12(1): 3168, 2021 May 26.

Article in English | MEDLINE | ID: mdl-34039967

ABSTRACT

The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/ .

Subjects

Computational Biology/methods; Deep Learning; Models, Biological; Protein Structure, Tertiary; Proteins/physiology; Amino Acid Sequence; Databases, Protein/statistics & numerical data; Datasets as Topic; Models, Molecular; Proteins/ultrastructure; Structure-Activity Relationship
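As background for the DeepFRI record above, the following is a minimal sketch of one generic graph convolution step over a residue contact graph, in the symmetric-normalized form ReLU(D^-1/2 (A + I) D^-1/2 H W). It is an illustration of the general technique only: the abstract does not specify DeepFRI's layer, and the function name, the toy inputs, and the choice of normalization are all assumptions.

```python
import math


def gcn_layer(adjacency, features, weights):
    """One generic graph convolution step (hypothetical helper, not DeepFRI code).

    adjacency: n x n list of 0/1 contacts between residues (no self-loops).
    features:  n x f list of per-residue feature vectors.
    weights:   f x o list holding the layer's linear map.
    Returns an n x o list of updated per-residue features.
    """
    n = len(adjacency)
    n_in = len(weights)
    n_out = len(weights[0])
    # Add self-loops so each residue also keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization: entry (i, j) is divided by sqrt(deg_i * deg_j).
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate features from neighbouring residues.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n))
            for f in range(n_in)] for i in range(n)]
    # Apply the linear map followed by a ReLU non-linearity.
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(n_in)))
             for o in range(n_out)] for i in range(n)]


if __name__ == "__main__":
    # Two residues in contact, one scalar feature each, identity weight.
    print(gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]]))
```

In a function-prediction setting, the adjacency matrix would come from a structure (for example, residue pairs within a distance cutoff) and the input features from a sequence model; both choices are left open in this sketch.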
