Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 7 de 7
Filtrar
1.
Nature ; 622(7983): 646-653, 2023 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-37704037

RESUMO

We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database1. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4 . By searching for novelties from sequence, structure and semantic perspectives, we uncovered the ß-flower fold, added several protein families to Pfam database2 and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light into uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.


Assuntos
Bases de Dados de Proteínas , Aprendizado Profundo , Anotação de Sequência Molecular , Dobramento de Proteína , Proteínas , Homologia Estrutural de Proteína , Sequência de Aminoácidos , Internet , Proteínas/química , Proteínas/classificação , Proteínas/metabolismo
2.
Comput Struct Biotechnol J ; 21: 238-250, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-36544476

RESUMO

The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields, to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications for the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that allows to go from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.

3.
Nat Struct Mol Biol ; 29(11): 1056-1067, 2022 11.
Artigo em Inglês | MEDLINE | ID: mdl-36344848

RESUMO

Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research.


Assuntos
Biologia Computacional , Furilfuramida , Biologia Computacional/métodos , Sítios de Ligação , Proteínas/química , Bases de Dados de Proteínas , Conformação Proteica
4.
PLoS One ; 16(9): e0253102, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-34591846

RESUMO

In genomics, optical mapping technology provides long-range contiguity information to improve genome sequence assemblies and detect structural variation. Originally a laborious manual process, Bionano Genomics platforms now offer high-throughput, automated optical mapping based on chips packed with nanochannels through which unwound DNA is guided and the fluorescent DNA backbone and specific restriction sites are recorded. Although the raw image data obtained is of high quality, the processing and assembly software accompanying the platforms is closed source and does not seem to make full use of data, labeling approximately half of the measured signals as unusable. Here we introduce two new software tools, independent of Bionano Genomics software, to extract and process molecules from raw images (OptiScan) and to perform molecule-to-molecule and molecule-to-reference alignments using a novel signal-based approach (OptiMap). We demonstrate that the molecules detected by OptiScan can yield better assemblies, and that the approach taken by OptiMap results in higher use of molecules from the raw data. These tools lay the foundation for a suite of open-source methods to process and analyze high-throughput optical mapping data. The Python implementations of the OptiTools are publicly available through http://www.bif.wur.nl/.


Assuntos
Genômica/métodos , Mapeamento por Restrição Óptica/métodos , Mapeamento Cromossômico , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA
5.
Bioinformatics ; 36(Suppl_2): i718-i725, 2020 12 30.
Artigo em Inglês | MEDLINE | ID: mdl-33381814

RESUMO

MOTIVATION: As the number of experimentally solved protein structures rises, it becomes increasingly appealing to use structural information for predictive tasks involving proteins. Due to the large variation in protein sizes, folds and topologies, an attractive approach is to embed protein structures into fixed-length vectors, which can be used in machine learning algorithms aimed at predicting and understanding functional and physical properties. Many existing embedding approaches are alignment based, which is both time-consuming and ineffective for distantly related proteins. On the other hand, library- or model-based approaches depend on a small library of fragments or require the use of a trained model, both of which may not generalize well. RESULTS: We present Geometricus, a novel and universally applicable approach to embedding proteins in a fixed-dimensional space. The approach is fast, accurate, and interpretable. Geometricus uses a set of 3D moment invariants to discretize fragments of protein structures into shape-mers, which are then counted to describe the full structure as a vector of counts. We demonstrate the applicability of this approach in various tasks, ranging from fast structure similarity search, unsupervised clustering and structure classification across proteins from different superfamilies as well as within the same family. AVAILABILITY AND IMPLEMENTATION: Python code available at https://git.wur.nl/durai001/geometricus.


Assuntos
Algoritmos , Proteínas , Análise por Conglomerados , Aprendizado de Máquina
6.
Comput Struct Biotechnol J ; 18: 981-992, 2020.
Artigo em Inglês | MEDLINE | ID: mdl-32368333

RESUMO

The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to not only use sequence features as input for machine learning, but also structure features. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta's performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.

7.
Bioinformatics ; 32(17): i487-i493, 2016 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-27587666

RESUMO

MOTIVATION: Next-generation sequencing technology is generating a wealth of highly similar genome sequences for many species, paving the way for a transition from single-genome to pan-genome analyses. Accordingly, genomics research is going to switch from reference-centric to pan-genomic approaches. We define the pan-genome as a comprehensive representation of multiple annotated genomes, facilitating analyses on the similarity and divergence of the constituent genomes at the nucleotide, gene and genome structure level. Current pan-genomic approaches do not thoroughly address scalability, functionality and usability. RESULTS: We introduce a generalized De Bruijn graph as a pan-genome representation, as well as an online algorithm to construct it. This representation is stored in a Neo4j graph database, which makes our approach scalable to large eukaryotic genomes. Besides the construction algorithm, our software package, called PanTools, currently provides functionality for annotating pan-genomes, adding sequences, grouping genes, retrieving gene sequences or genomic regions, reconstructing genomes and comparing and querying pan-genomes. We demonstrate the performance of the tool using datasets of 62 E. coli genomes, 93 yeast genomes and 19 Arabidopsis thaliana genomes. AVAILABILITY AND IMPLEMENTATION: The Java implementation of PanTools is publicly available at http://www.bif.wur.nl CONTACT: sandra.smit@wur.nl.


Assuntos
Algoritmos , Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Arabidopsis , Biologia Computacional/métodos , Escherichia coli , Genoma Bacteriano , Genômica , Humanos , Software
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA