Search | BVS IEC

MGnify: the microbiome sequence data analysis resource in 2023.

Richardson, Lorna; Allen, Ben; Baldi, Germana; Beracochea, Martin; Bileschi, Maxwell L; Burdett, Tony; Burgin, Josephine; Caballero-Pérez, Juan; Cochrane, Guy; Colwell, Lucy J; Curtis, Tom; Escobar-Zepeda, Alejandra; Gurbich, Tatiana A; Kale, Varsha; Korobeynikov, Anton; Raj, Shriya; Rogers, Alexander B; Sakharova, Ekaterina; Sanchez, Santiago; Wilkinson, Darren J; Finn, Robert D.

Nucleic Acids Res ; 51(D1): D753-D759, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36477304

ABSTRACT

The MGnify platform (https://www.ebi.ac.uk/metagenomics) facilitates the assembly, analysis and archiving of microbiome-derived nucleic acid sequences. The platform provides access to taxonomic assignments and functional annotations for nearly half a million analyses covering metabarcoding, metatranscriptomic, and metagenomic datasets, which are derived from a wide range of different environments. Over the past 3 years, MGnify has not only grown in terms of the number of datasets contained but also increased the breadth of analyses provided, such as the analysis of long-read sequences. The MGnify protein database now exceeds 2.4 billion non-redundant sequences predicted from metagenomic assemblies. This collection is now organised into a relational database making it possible to understand the genomic context of the protein through navigation back to the source assembly and sample metadata, marking a major improvement. To extend beyond the functional annotations already provided in MGnify, we have applied deep learning-based annotation methods. The technology underlying MGnify's Application Programming Interface (API) and website has been upgraded, and we have enabled the ability to perform downstream analysis of the MGnify data through the introduction of a coupled Jupyter Lab environment.

Subjects

Microbiota , Sequence Analysis , Genomics/methods , Metagenome , Metagenomics/methods , Microbiota/genetics , Software , Sequence Analysis/methods

InterPro in 2022.

Paysan-Lafosse, Typhaine; Blum, Matthias; Chuguransky, Sara; Grego, Tiago; Pinto, Beatriz Lázaro; Salazar, Gustavo A; Bileschi, Maxwell L; Bork, Peer; Bridge, Alan; Colwell, Lucy; Gough, Julian; Haft, Daniel H; Letunic, Ivica; Marchler-Bauer, Aron; Mi, Huaiyu; Natale, Darren A; Orengo, Christine A; Pandurangan, Arun P; Rivoire, Catherine; Sigrist, Christian J A; Sillitoe, Ian; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C E; Wu, Cathy H; Bateman, Alex.

Nucleic Acids Res ; 51(D1): D418-D427, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36350672

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.

Subjects

Databases, Protein , Humans , Amino Acid Sequence , Artificial Intelligence , Internet , Proteins/chemistry , Software

Sequential regulatory activity prediction across chromosomes with convolutional neural networks.

Kelley, David R; Reshef, Yakir A; Bileschi, Maxwell; Belanger, David; McLean, Cory Y; Snoek, Jasper.

Genome Res ; 28(5): 739-750, 2018 05.

Article in English | MEDLINE | ID: mdl-29588361

ABSTRACT

Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine-learning system to predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. By use of convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.

Subjects

Chromosomes/genetics , Computational Biology/methods , Neural Networks, Computer , Regulatory Sequences, Nucleic Acid/genetics , Animals , Epigenomics/methods , Gene Expression Profiling/methods , Gene Expression Regulation , Genomics/methods , Humans , Machine Learning , Models, Genetic , Polymorphism, Single Nucleotide , Promoter Regions, Genetic/genetics

Investigation of protein family relationships with deep learning.

Ponamareva, Irina; Andreeva, Antonina; Bileschi, Maxwell L; Colwell, Lucy; Bateman, Alex.

Bioinform Adv ; 4(1): vbae132, 2024.

Article in English | MEDLINE | ID: mdl-39399373

ABSTRACT

Motivation: In this article, we propose a method for finding similarities between Pfam families based on the pre-trained neural network ProtENN2. We use ProtENN2's per-residue embeddings to produce new high-dimensional per-family embeddings, develop an approach for calculating inter-family similarity scores based on these embeddings, and evaluate its predictions using structure comparison. Results: We apply our method to Pfam annotation by refining clan membership for Pfam families, suggesting both new members of existing clans and potential new clans for future Pfam releases. We investigate some of the failure modes of our approach, which suggests directions for future improvements. Our method is relatively simple with few parameters and could be applied to other protein family classification models. Overall, our work suggests potential benefits of employing deep learning for improving our understanding of protein family relationships and functions of previously uncharacterized families. Availability and implementation: github.com/iponamareva/ProtCNNSim, 10.5281/zenodo.10091909.
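
Illustrative sketch (not part of the indexed record): the abstract describes building per-family embeddings from per-residue embeddings and scoring similarity between families, but it does not say how either step is done. The example below assumes mean pooling and cosine similarity purely for illustration. The function names, family labels and toy vectors are hypothetical, and nothing here is taken from the ProtCNNSim code.

```python
import math


def family_embedding(per_residue_embeddings):
    """Mean-pool a list of per-residue vectors into one per-family vector.

    Assumption: mean pooling stands in for whatever aggregation the
    authors actually use.
    """
    dim = len(per_residue_embeddings[0])
    total = [0.0] * dim
    for vec in per_residue_embeddings:
        for i, x in enumerate(vec):
            total[i] += x
    n = len(per_residue_embeddings)
    return [t / n for t in total]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "per-residue embeddings" for three hypothetical families.
families = {
    "FAM_A": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "FAM_B": [[0.9, 0.1, 0.0], [1.0, 0.1, 0.0]],
    "FAM_C": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
embeddings = {name: family_embedding(vecs) for name, vecs in families.items()}

sim_ab = cosine_similarity(embeddings["FAM_A"], embeddings["FAM_B"])
sim_ac = cosine_similarity(embeddings["FAM_A"], embeddings["FAM_C"])
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

In this toy setting, families whose pooled vectors point in similar directions score close to 1 and would be candidates for the same clan, while unrelated families score close to 0.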

ProteInfer, deep neural networks for protein functional inference.

Sanderson, Theo; Bileschi, Maxwell L; Belanger, David; Colwell, Lucy J.

Elife ; 12, 2023 02 27.

Article in English | MEDLINE | ID: mdl-36847334

ABSTRACT

Predicting the function of a protein from its amino acid sequence is a long-standing challenge in bioinformatics. Traditional approaches use sequence alignment to compare a query sequence either to thousands of models of protein families or to large databases of individual protein sequences. Here we introduce ProteInfer, which instead employs deep convolutional neural networks to predict a variety of protein functions - Enzyme Commission (EC) numbers and Gene Ontology (GO) terms - directly from an unaligned amino acid sequence. This approach provides precise predictions which complement alignment-based methods, and the computational efficiency of a single neural network permits novel and lightweight software interfaces, which we demonstrate with an in-browser graphical interface for protein function prediction in which all computation is performed on the user's personal computer with no data uploaded to remote servers. Moreover, these models place full-length amino acid sequences into a generalised functional space, facilitating downstream analysis and interpretation. To read the interactive version of this paper, please visit https://google-research.github.io/proteinfer/.

Subjects

Algorithms , Neural Networks, Computer , Proteins/genetics , Proteins/chemistry , Amino Acid Sequence , Software , Computational Biology/methods

Using deep learning to annotate the protein universe.

Bileschi, Maxwell L; Belanger, David; Bryant, Drew H; Sanderson, Theo; Carter, Brandon; Sculley, D; Bateman, Alex; DePristo, Mark A; Colwell, Lucy J.

Nat Biotechnol ; 40(6): 932-937, 2022 06.

Article in English | MEDLINE | ID: mdl-35190689

ABSTRACT

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.

Subjects

Deep Learning , Amino Acid Sequence , Databases, Protein , Humans , Molecular Sequence Annotation , Proteome/metabolism , Proteomics

Critiquing Protein Family Classification Models Using Sufficient Input Subsets.

Carter, Brandon; Bileschi, Maxwell; Smith, Jamie; Sanderson, Theo; Bryant, Drew; Belanger, David; Colwell, Lucy J.

J Comput Biol ; 27(8): 1219-1231, 2020 08.

Article in English | MEDLINE | ID: mdl-31874057

ABSTRACT

In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.

Subjects

Computational Biology , Models, Biological , Multigene Family/genetics , Proteins/classification , Deep Learning , Humans , Machine Learning , Neural Networks, Computer , Proteins/genetics
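
Illustrative sketch (not part of the indexed record): the abstract above refers to identifying subsets of sequence positions that are alone sufficient for classification. The example below is a simplified, hedged rendering of that general idea: positions are masked one at a time in least-harmful order, then restored in reverse order until the model's confidence reaches a threshold. The scoring function `toy_model`, the mask character `X`, the example sequence and the threshold are all hypothetical, and this is not the authors' extended method or code.

```python
def back_select(model, sequence, mask="X"):
    """Return positions in the order they were masked.

    At each step, mask the still-unmasked position whose masking leaves
    the model score highest (i.e. the least important position).
    """
    current = list(sequence)
    remaining = list(range(len(sequence)))
    order = []
    while remaining:
        best_pos, best_score = None, None
        for pos in remaining:
            trial = current[:]
            trial[pos] = mask
            score = model("".join(trial))
            if best_score is None or score > best_score:
                best_pos, best_score = pos, score
        current[best_pos] = mask
        remaining.remove(best_pos)
        order.append(best_pos)
    return order


def find_sufficient_subset(model, sequence, threshold, mask="X"):
    """Find positions that alone push the model score to >= threshold.

    Returns None if the unmasked sequence does not reach the threshold.
    """
    if model(sequence) < threshold:
        return None
    order = back_select(model, sequence, mask)
    masked = [mask] * len(sequence)
    kept = []
    # Restore positions starting from the last one masked (most important).
    for pos in reversed(order):
        masked[pos] = sequence[pos]
        kept.append(pos)
        if model("".join(masked)) >= threshold:
            break
    return sorted(kept)


def toy_model(seq):
    """Hypothetical classifier score: rewards 'C' at index 2 and 'H' at index 5."""
    score = 0.1
    if len(seq) > 2 and seq[2] == "C":
        score += 0.45
    if len(seq) > 5 and seq[5] == "H":
        score += 0.45
    return score


subset = find_sufficient_subset(toy_model, "AACAAHAA", threshold=0.9)
print(subset)
```

With this toy scorer, only the two positions the scorer actually depends on are retained, which is the kind of evidence one would inspect to judge whether a classifier relies on biologically meaningful residues.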
