RESUMEN
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide a more user friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.
Asunto(s)
Bases de Datos de Proteínas , Humanos , Secuencia de Aminoácidos , Inteligencia Artificial , Internet , Proteínas/química , Programas InformáticosRESUMEN
The Annotation Query (AnnoQ) (http://annoq.org/) is designed to provide comprehensive and up-to-date functional annotations for human genetic variants. The system is supported by an annotation database with â¼39 million human variants from the Haplotype Reference Consortium (HRC) pre-annotated with sequence feature annotations by WGSA and functional annotations to Gene Ontology (GO) and pathways in PANTHER. The database operates on an optimized Elasticsearch framework to support real-time complex searches. This implementation enables users to annotate data with the most up-to-date functional annotations via simple queries instead of setting up individual tools. A web interface allows users to interactively browse the annotations, annotate variants and search variant data. Its easy-to-use interface and search capabilities are well-suited for scientists with fewer bioinformatics skills such as bench scientists and statisticians. AnnoQ also has an API for users to access and annotate the data programmatically. Packages for programming languages, such as the R package, are available for users to embed the annotation queries in their scripts. AnnoQ serves researchers with a wide range of backgrounds and research interests as an integrated annotation platform.
Asunto(s)
Variación Genética , Anotación de Secuencia Molecular , Programas Informáticos , Humanos , Bases de Datos Genéticas , Internet , Anotación de Secuencia Molecular/métodos , Interfaz Usuario-Computador , Variación Genética/genética , Haplotipos/genética , Lenguajes de ProgramaciónRESUMEN
The Orthology Benchmark Service (https://orthology.benchmarkservice.org) is the gold standard for orthology inference evaluation, supported and maintained by the Quest for Orthologs consortium. It is an essential resource to compare existing and new methods of orthology inference (the bedrock for many comparative genomics and phylogenetic analysis) over a standard dataset and through common procedures. The Quest for Orthologs Consortium is dedicated to maintaining the resource up to date, through regular updates of the Reference Proteomes and increasingly accessible data through the OpenEBench platform. For this update, we have added a new benchmark based on curated orthology assertion from the Vertebrate Gene Nomenclature Committee, and provided an example meta-analysis of the public predictions present on the platform.
Asunto(s)
Benchmarking , Genómica , Filogenia , Genómica/métodos , ProteomaRESUMEN
PANTHER (Protein Analysis Through Evolutionary Relationships, http://www.pantherdb.org) is a resource for the evolutionary and functional classification of protein-coding genes from all domains of life. The evolutionary classification is based on a library of over 15,000 phylogenetic trees, and the functional classifications include Gene Ontology terms and pathways. Here, we analyze the current coverage of genes from genomes in different taxonomic groups, so that users can better understand what to expect when analyzing a gene list using PANTHER tools. We also describe extensive improvements to PANTHER made in the past two years. The PANTHER Protein Class ontology has been completely refactored, and 6101 PANTHER families have been manually assigned to a Protein Class, providing a high level classification of protein families and their genes. Users can access the TreeGrafter tool to add their own protein sequences to the reference phylogenetic trees in PANTHER, to infer evolutionary context as well as fine-grained annotations. We have added human enhancer-gene links that associate non-coding regions with the annotated human genes in PANTHER. We have also expanded the available services for programmatic access to PANTHER tools and data via application programming interfaces (APIs). Other improvements include additional plant genomes and an updated PANTHER GO-slim.
Asunto(s)
Biología Computacional/métodos , Elementos de Facilitación Genéticos/genética , Filogenia , Programas Informáticos , Interfaz Usuario-Computador , Evolución Molecular , Ontología de Genes , Genoma , Anotación de Secuencia Molecular , Sistemas de Lectura Abierta/genéticaRESUMEN
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
Asunto(s)
Bases de Datos de Proteínas , Proteínas/química , Secuencia de Aminoácidos , COVID-19/metabolismo , Internet , Anotación de Secuencia Molecular , Dominios Proteicos , Mapas de Interacción de Proteínas , SARS-CoV-2/metabolismo , Alineación de SecuenciaRESUMEN
Accurate determination of the evolutionary relationships between genes is a foundational challenge in biology. Homology-evolutionary relatedness-is in many cases readily determined based on sequence similarity analysis. By contrast, whether or not two genes directly descended from a common ancestor by a speciation event (orthologs) or duplication event (paralogs) is more challenging, yet provides critical information on the history of a gene. Since 2009, this task has been the focus of the Quest for Orthologs (QFO) Consortium. The sixth QFO meeting took place in Okazaki, Japan in conjunction with the 67th National Institute for Basic Biology conference. Here, we report recent advances, applications, and oncoming challenges that were discussed during the conference. Steady progress has been made toward standardization and scalability of new and existing tools. A feature of the conference was the presentation of a panel of accessible tools for phylogenetic profiling and several developments to bring orthology beyond the gene unit-from domains to networks. This meeting brought into light several challenges to come: leveraging orthology computations to get the most of the incoming avalanche of genomic data, integrating orthology from domain to biological network levels, building better gene models, and adapting orthology approaches to the broad evolutionary and genomic diversity recognized in different forms of life and viruses.
Asunto(s)
Especiación Genética , Genómica/tendencias , Filogenia , Genoma Viral , Genómica/métodosRESUMEN
MOTIVATION: Gene Ontology Causal Activity Models (GO-CAMs) assemble individual associations of gene products with cellular components, molecular functions and biological processes into causally linked activity flow models. Pathway databases such as the Reactome Knowledgebase create detailed molecular process descriptions of reactions and assemble them, based on sharing of entities between individual reactions into pathway descriptions. RESULTS: To convert the rich content of Reactome into GO-CAMs, we have developed a software tool, Pathways2GO, to convert the entire set of normal human Reactome pathways into GO-CAMs. This conversion yields standard GO annotations from Reactome content and supports enhanced quality control for both Reactome and GO, yielding a nearly seamless conversion between these two resources for the bioinformatics community. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
RESUMEN
MOTIVATION: A large variety of molecular interactions occurs between biomolecular components in cells. When a molecular interaction results in a regulatory effect, exerted by one component onto a downstream component, a so-called 'causal interaction' takes place. Causal interactions constitute the building blocks in our understanding of larger regulatory networks in cells. These causal interactions and the biological processes they enable (e.g. gene regulation) need to be described with a careful appreciation of the underlying molecular reactions. A proper description of this information enables archiving, sharing and reuse by humans and for automated computational processing. Various representations of causal relationships between biological components are currently used in a variety of resources. RESULTS: Here, we propose a checklist that accommodates current representations, called the Minimum Information about a Molecular Interaction CAusal STatement (MI2CAST). This checklist defines both the required core information, as well as a comprehensive set of other contextual details valuable to the end user and relevant for reusing and reproducing causal molecular interaction information. The MI2CAST checklist can be used as reporting guidelines when annotating and curating causal statements, while fostering uniformity and interoperability of the data across resources. AVAILABILITY AND IMPLEMENTATION: The checklist together with examples is accessible at https://github.com/MI2CAST/MI2CAST. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Programas Informáticos , Causalidad , HumanosRESUMEN
Gene function annotation is important for a variety of downstream analyses of genetic data. But experimental characterization of function remains costly and slow, making computational prediction an important endeavor. Phylogenetic approaches to prediction have been developed, but implementation of a practical Bayesian framework for parameter estimation remains an outstanding challenge. We have developed a computationally efficient model of evolution of gene annotations using phylogenies based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out cross-validation, and we further validated some of the predictions in the experimental scientific literature.
Asunto(s)
Modelos Genéticos , Anotación de Secuencia Molecular/métodos , Filogenia , Algoritmos , Animales , Teorema de Bayes , Biología Computacional , Bases de Datos Genéticas , Evolución Molecular , Ontología de Genes/estadística & datos numéricos , Humanos , Funciones de Verosimilitud , Cadenas de Markov , Ratones , Modelos Estadísticos , Anotación de Secuencia Molecular/estadística & datos numéricos , Método de Montecarlo , Familia de MultigenesRESUMEN
The identification of orthologs-genes in different species which descended from the same gene in their last common ancestor-is a prerequisite for many analyses in comparative genomics and molecular evolution. Numerous algorithms and resources have been conceived to address this problem, but benchmarking and interpreting them is fraught with difficulties (need to compare them on a common input dataset, absence of ground truth, computational cost of calling orthologs). To address this, the Quest for Orthologs consortium maintains a reference set of proteomes and provides a web server for continuous orthology benchmarking (http://orthology.benchmarkservice.org). Furthermore, consensus ortholog calls derived from public benchmark submissions are provided on the Alliance of Genome Resources website, the joint portal of NIH-funded model organism databases.
Asunto(s)
Familia de Multigenes , Proteoma , Programas Informáticos , Animales , Benchmarking , Consenso , Genómica , Humanos , Ratones , Filogenia , RatasRESUMEN
PANTHER (Protein Analysis Through Evolutionary Relationships, http://pantherdb.org) is a resource for the evolutionary and functional classification of genes from organisms across the tree of life. We report the improvements we have made to the resource during the past two years. For evolutionary classifications, we have added more prokaryotic and plant genomes to the phylogenetic gene trees, expanding the representation of gene evolution in these lineages. We have refined many protein family boundaries, and have aligned PANTHER with the MEROPS resource for protease and protease inhibitor families. For functional classifications, we have developed an entirely new PANTHER GO-slim, containing over four times as many Gene Ontology terms as our previous GO-slim, as well as curated associations of genes to these terms. Lastly, we have made substantial improvements to the enrichment analysis tools available on the PANTHER website: users can now analyze over 900 different genomes, using updated statistical tests with false discovery rate corrections for multiple testing. The overrepresentation test is also available as a web service, for easy addition to third-party sites.
Asunto(s)
Bases de Datos Genéticas , Genoma , Proteínas/clasificación , Animales , Evolución Molecular , Ontología de Genes , Genes , Genoma Microbiano , Genoma de Planta , Péptido Hidrolasas/clasificación , Filogenia , Proteínas/genética , Programas InformáticosRESUMEN
A growing number of whole genome sequencing projects, in combination with development of phylogenetic methods for reconstructing gene evolution, have provided us with a window into genomes that existed millions, and even billions, of years ago. Ancestral Genomes (http://ancestralgenomes.org) is a resource for comprehensive reconstructions of these 'fossil genomes'. Comprehensive sets of protein-coding genes have been reconstructed for 78 genomes of now-extinct species that were the common ancestors of extant species from across the tree of life. The reconstructed genes are based on the extensive library of over 15 000 gene family trees from the PANTHER database, and are updated on a yearly basis. For each ancestral gene, we assign a stable identifier, and provide additional information designed to facilitate analysis: an inferred name, a reconstructed protein sequence, a set of inferred Gene Ontology (GO) annotations, and a 'proxy gene' for each ancestral gene, defined as the least-diverged descendant of the ancestral gene in a given extant genome. On the Ancestral Genomes website, users can browse the Ancestral Genomes by selecting nodes in a species tree, and can compare an extant genome with any of its reconstructed ancestors to understand how the genome evolved.
Asunto(s)
Bases de Datos Genéticas , Evolución Molecular , Genes , Genoma , Filogenia , Animales , Eucariontes/genética , Extinción Biológica , Genes Arqueales , Genes Bacterianos , Genes Protozoarios , Anotación de Secuencia Molecular , Programas InformáticosRESUMEN
The InterPro database (http://www.ebi.ac.uk/interpro/) classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms on new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro, and provide greater flexibility in terms of data access. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB, and discuss how our evaluation of residue coverage may help guide future curation activities.
Asunto(s)
Bases de Datos de Proteínas , Anotación de Secuencia Molecular , Animales , Bases de Datos Genéticas , Ontología de Genes , Humanos , Internet , Familia de Multigenes , Dominios Proteicos/genética , Homología de Secuencia de Aminoácido , Programas Informáticos , Interfaz Usuario-ComputadorRESUMEN
Gene families evolve by the processes of speciation (creating orthologs), gene duplication (paralogs), and horizontal gene transfer (xenologs), in addition to sequence divergence and gene loss. Orthologs in particular play an essential role in comparative genomics and phylogenomic analyses. With the continued sequencing of organisms across the tree of life, the data are available to reconstruct the unique evolutionary histories of tens of thousands of gene families. Accurate reconstruction of these histories, however, is a challenging computational problem, and the focus of the Quest for Orthologs Consortium. We review the recent advances and outstanding challenges in this field, as revealed at a symposium and meeting held at the University of Southern California in 2017. Key advances have been made both at the level of orthology algorithm development and with respect to coordination across the community of algorithm developers and orthology end-users. Applications spanned a broad range, including gene function prediction, phylostratigraphy, genome evolution, and phylogenomics. The meetings highlighted the increasing use of meta-analyses integrating results from multiple different algorithms, and discussed ongoing challenges in orthology inference as well as the next steps toward improvement and integration of orthology resources.
Asunto(s)
Evolución Molecular , Genómica/tendencias , Familia de Multigenes , Algoritmos , Animales , Genómica/métodos , HumanosRESUMEN
Summary: TreeGrafter is a new software tool for annotating protein sequences using pre-annotated phylogenetic trees. Currently, the tool provides annotations to Gene Ontology (GO) terms, and PANTHER family and subfamily. The approach is generalizable to any annotations that have been made to internal nodes of a reference phylogenetic tree. TreeGrafter takes each input query protein sequence, finds the best matching homologous family in a library of pre-calculated, pre-annotated gene trees, and then grafts it to the best location in the tree. It then annotates the sequence by propagating annotations from ancestral nodes in the reference tree. We show that TreeGrafter outperforms subfamily HMM scoring for correctly assigning subfamily membership, and that it produces highly specific annotations of GO terms based on annotated reference phylogenetic trees. This method will be further integrated into InterProScan, enabling an even broader user community. Availability and implementation: TreeGrafter is freely available on the web at https://github.com/pantherdb/TreeGrafter, including as a Docker image. Supplementary information: Supplementary data are available at Bioinformatics online.
Asunto(s)
Ontología de Genes , Anotación de Secuencia Molecular , Filogenia , Proteínas/genética , Programas InformáticosRESUMEN
The MEROPS database (http://www.ebi.ac.uk/merops/) is an integrated source of information about peptidases, their substrates and inhibitors. The hierarchical classification is: protein-species, family, clan, with an identifier at each level. The MEROPS website moved to the EMBL-EBI in 2017, requiring refactoring of the code-base and services provided. The interface to sequence searching has changed and the MEROPS protein sequence libraries can be searched at the EMBL-EBI with HMMER, FastA and BLASTP. Cross-references have been established between MEROPS and the PANTHER database at both the family and protein-species level, which will help to improve curation and coverage between the resources. Because of the increasing size of the MEROPS sequence collection, in future only sequences of characterized proteins, and from completely sequenced genomes of organisms of evolutionary, medical or commercial significance will be added. As an example, peptidase homologues in four proteomes from the Asgard superphylum of Archaea have been identified and compared to other archaean, bacterial and eukaryote proteomes. This has given insights into the origins and evolution of peptidase families, including an expansion in the number of proteasome components in Asgard archaeotes and as organisms increase in complexity. Novel structures for proteasome complexes in archaea are postulated.
Asunto(s)
Bases de Datos de Proteínas , Péptido Hidrolasas/metabolismo , Archaea/enzimología , Archaea/genética , Bacterias/enzimología , Bacterias/genética , Eucariontes/enzimología , Eucariontes/genética , Humanos , Péptido Hidrolasas/química , Péptido Hidrolasas/genética , Filogenia , Inhibidores de Proteasas/farmacología , Alineación de Secuencia , Especificidad por SustratoRESUMEN
BACKGROUND: Biological knowledge, and therefore Gene Ontology annotation sets, for human genes is incomplete. Recent studies have reported that biases in available GO annotations result in biased estimates of functional similarities of genes, but it is still unclear what the effect of incompleteness itself may be, even in the absence of bias. Pairwise gene similarities are used in a number of contexts, including gene "functional similarity" clustering and the related problem of functional ontology structure inference, but it is not known how different similarity measures or clustering methods perform on this task, and how the clusters are affected by annotation completeness. RESULTS: We developed representations of both "complete" and "incomplete" GO annotation datasets based on experimentally-supported annotations from the GO database-specifically designed to model the incompleteness of human gene annotations-and computed semantic similarities for each set using a variety of different published measures. We then assessed the clusters derived from these measures using two different clustering methods: hierarchical clustering, and the CliXO algorithm. We find the CliXO algorithm, combined with appropriate measures, performs better than hierarchical clustering in reconstructing GO both when the data are complete, and incomplete. Some measures, particularly those that create a pairwise gene similarity by averaging over all pairwise annotation similarities, had consistently poor performance, and a few measures, such as Lin best-matched average and Relevance maximum, were generally among the best performers for a broad range in annotation completeness and types of GO classes. Finally, we show that for semantic similarity-based clustering, the multicellular organism process branch of the GO biological process ontology is more challenging to represent than the cellular process branch. CONCLUSIONS: We assessed the effects of annotation completeness on the distribution of pairwise gene semantic similarity scores, and subsequent effects on the clusters derived from these scores. Our results suggest combinations of semantic similarity measures, gene-level scoring methods and clustering method that perform best for functional gene clustering using annotation sets of varying completeness. Overall, our results underscore the importance of increasing the completeness of GO annotations to for supporting computational analyses of gene function.
Asunto(s)
Biología Computacional/métodos , Ontología de Genes , Anotación de Secuencia Molecular , Algoritmos , Área Bajo la Curva , Análisis por Conglomerados , Humanos , SemánticaRESUMEN
Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.
Asunto(s)
Biología Computacional/normas , Genómica/normas , Filogenia , Proteómica/normas , Archaea/clasificación , Archaea/genética , Bacterias/clasificación , Bacterias/genética , Biología Computacional/métodos , Bases de Datos Genéticas , Eucariontes/clasificación , Eucariontes/genética , Ontología de Genes , Genómica/métodos , Modelos Genéticos , Proteómica/métodos , Análisis de Secuencia de Proteína , Homología de Secuencia , Especificidad de la EspecieRESUMEN
The PANTHER database (Protein ANalysis THrough Evolutionary Relationships, http://pantherdb.org) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution.
Asunto(s)
Biología Computacional/métodos , Genómica/métodos , Programas Informáticos , Curaduría de Datos , Bases de Datos Genéticas , Ontología de Genes , Filogenia , Polimorfismo de Nucleótido Simple , Navegador WebRESUMEN
InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.