ABSTRACT
The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely-related terms, including for example an interactive multiple sequence alignment. Finally, we describe recent improvement in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate discoverability of and allow aggregation of data relating to protein entities.
Subject(s)
Computational Biology/methods; Databases, Genetic; Proteins; Animals; Humans; Proteins/chemistry; Proteins/genetics; Web Browser
ABSTRACT
The Protein Ontology (PRO; http://proconsortium.org) formally defines protein entities and explicitly represents their major forms and interrelations. Protein entities represented in PRO corresponding to single amino acid chains are categorized by level of specificity into family, gene, sequence and modification metaclasses, and there is a separate metaclass for protein complexes. All metaclasses also have organism-specific derivatives. PRO complements established sequence databases such as UniProtKB, and interoperates with other biomedical and biological ontologies such as the Gene Ontology (GO). PRO relates to UniProtKB in that PRO's organism-specific classes of proteins encoded by a specific gene correspond to entities documented in UniProtKB entries. PRO relates to the GO in that PRO's representations of organism-specific protein complexes are subclasses of the organism-agnostic protein complex terms in the GO Cellular Component Ontology. The past few years have seen growth and changes to the PRO, as well as new points of access to the data and new applications of PRO in immunology and proteomics. Here we describe some of these developments.
Subject(s)
Biological Ontologies; Databases, Protein; Proteins/classification; Animals; Humans; Internet; Mice; Proteins/chemistry
ABSTRACT
The Gene Ontology (GO) is an important component of modern biological knowledge representation with great utility for computational analysis of genomic and genetic data. The Gene Ontology Consortium (GOC) consists of a large team of contributors including curation teams from most model organism database groups as well as curation teams focused on representation of data relevant to specific human diseases. Key to the generation of consistent and comprehensive annotations is the development and use of shared standards and measures of curation quality. The GOC engages all contributors to work to a defined standard of curation that is presented here in the context of annotation of genes in the laboratory mouse. Comprehensive understanding of the origin, epistemology, and coverage of GO annotations is essential for most effective use of GO resources. Here the application of comparative approaches to capturing functional data in the mouse system is described.
Subject(s)
Databases, Genetic; Gene Ontology; Molecular Sequence Annotation; Animals; Computational Biology; Genomics; Humans; Mice; Sequence Analysis, DNA
ABSTRACT
BACKGROUND: The Gene Ontology (GO) facilitates the description of the action of gene products in a biological context. Many GO terms refer to chemical entities that participate in biological processes. To facilitate accurate and consistent systems-wide biological representation, it is necessary to integrate the chemical view of these entities with the biological view of GO functions and processes. We describe a collaborative effort between the GO and the Chemical Entities of Biological Interest (ChEBI) ontology developers to ensure that the representation of chemicals in the GO is both internally consistent and in alignment with the chemical expertise captured in ChEBI. RESULTS: We have examined and integrated the ChEBI structural hierarchy into the GO resource through computationally-assisted manual curation of both GO and ChEBI. Our work has resulted in the creation of computable definitions of GO terms that contain fully defined semantic relationships to corresponding chemical terms in ChEBI. CONCLUSIONS: The set of logical definitions using both the GO and ChEBI has already been used to automate aspects of GO development and has the potential to allow the integration of data across the domains of biology and chemistry. These logical definitions are available as an extended version of the ontology from http://purl.obolibrary.org/obo/go/extensions/go-plus.owl.
Subject(s)
Biology; Chemistry; Genes; Vocabulary, Controlled
ABSTRACT
The Protein Ontology (PRO) provides a formal, logically-based classification of specific protein classes including structured representations of protein isoforms, variants and modified forms. Initially focused on proteins found in human, mouse and Escherichia coli, PRO now includes representations of protein complexes. The PRO Consortium works in concert with the developers of other biomedical ontologies and protein knowledge bases to provide the ability to formally organize and integrate representations of precise protein forms so as to enhance accessibility to results of protein research. PRO (http://pir.georgetown.edu/pro) is part of the Open Biomedical Ontology Foundry.
Subject(s)
Databases, Protein; Proteins/classification; Animals; Escherichia coli Proteins/chemistry; Humans; Mice; Multiprotein Complexes/chemistry; Multiprotein Complexes/classification; Protein Isoforms/chemistry; Protein Isoforms/classification; Proteins/chemistry; Proteins/genetics; User-Computer Interface; Vocabulary, Controlled
ABSTRACT
Gene inactivation can affect the process(es) in which that gene acts and causally downstream ones, yielding diverse mutant phenotypes. Identifying the genetic pathways resulting in a given phenotype helps us understand how individual genes interact in a functional network. Computable representations of biological pathways include detailed process descriptions in the Reactome Knowledgebase and causal activity flows between molecular functions in Gene Ontology-Causal Activity Models (GO-CAMs). A computational process has been developed to convert Reactome pathways to GO-CAMs. Laboratory mice are widely used models of normal and pathological human processes. We have converted human Reactome GO-CAMs to orthologous mouse GO-CAMs, as a resource to transfer pathway knowledge between humans and model organisms. These mouse GO-CAMs allowed us to define sets of genes that function in a causally connected way. To demonstrate that individual variant genes from connected pathways result in similar but distinguishable phenotypes, we used the genes in our pathway models to cross-query mouse phenotype annotations in the Mouse Genome Database (MGD). Using GO-CAM representations of 2 related but distinct pathways, gluconeogenesis and glycolysis, we show that individual causal paths in gene networks give rise to discrete phenotypic outcomes resulting from perturbations of glycolytic and gluconeogenic genes. The accurate and detailed descriptions of gene interactions recovered in this analysis of well-studied processes suggest that this strategy can be applied to less well-understood processes in less well-studied model systems to predict phenotypic outcomes of novel gene variants and to identify potential gene targets in altered processes.
Subject(s)
Computational Biology; Databases, Genetic; Mice; Humans; Animals; Gene Ontology; Mutation; Phenotype; Computational Biology/methods
ABSTRACT
Gene inactivation can affect the process(es) in which that gene acts and causally downstream ones, yielding diverse mutant phenotypes. Identifying the genetic pathways resulting in a given phenotype helps us understand how individual genes interact in a functional network. Computable representations of biological pathways include detailed process descriptions in the Reactome Knowledgebase, and causal activity flows between molecular functions in Gene Ontology-Causal Activity Models (GO-CAMs). A computational process has been developed to convert Reactome pathways to GO-CAMs. Laboratory mice are widely used models of normal and pathological human processes. We have converted human Reactome GO-CAMs to orthologous mouse GO-CAMs, as a resource to transfer pathway knowledge between humans and model organisms. These mouse GO-CAMs allowed us to define sets of genes that function in a connected and well-defined way. To test whether individual genes from well-defined pathways result in similar and distinguishable phenotypes, we used the genes in our pathway models to cross-query mouse phenotype annotations in the Mouse Genome Database (MGD). Using GO-CAM representations of two related but distinct pathways, gluconeogenesis and glycolysis, we can identify causal paths in gene networks that give rise to discrete phenotypic outcomes for perturbations of glycolysis and gluconeogenesis. The accurate and detailed descriptions of gene interactions recovered in this analysis of well-studied processes suggest that this strategy can be applied to less well-understood processes in less well-studied model systems to predict phenotypic outcomes of novel gene variants and to identify potential gene targets in altered processes.
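The cross-query step described above can be pictured with a small sketch. It is not the authors' pipeline: the function names, gene symbols, and phenotype labels below are hypothetical placeholders, and the annotation table stands in for whatever export of MGD phenotype annotations one actually has. The idea is only to collect the phenotype terms annotated to the genes of each pathway model and then separate the shared terms from the pathway-specific ones.

```python
def phenotypes_for(genes, annotations):
    """Collect every phenotype term annotated to any gene in the set."""
    found = set()
    for gene in genes:
        found |= annotations.get(gene, set())
    return found


def compare_pathways(first, second, annotations):
    """Split phenotype terms into shared and pathway-specific groups."""
    p_first = phenotypes_for(first, annotations)
    p_second = phenotypes_for(second, annotations)
    return {
        "shared": p_first & p_second,
        "only_first": p_first - p_second,
        "only_second": p_second - p_first,
    }


# Hypothetical placeholder data, not taken from MGD or Reactome.
annotations = {
    "GeneA": {"phenotype_1", "phenotype_2"},
    "GeneB": {"phenotype_2"},
    "GeneC": {"phenotype_2", "phenotype_3"},
}
pathway_one = {"GeneA", "GeneB"}
pathway_two = {"GeneC"}

result = compare_pathways(pathway_one, pathway_two, annotations)
```

In this toy input, `phenotype_2` is shared by both gene sets while `phenotype_1` and `phenotype_3` are specific to one pathway each, which is the "similar but distinguishable" pattern the abstract refers to.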
ABSTRACT
The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO, a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations, evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs), mechanistic models of molecular "pathways" (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.
Subject(s)
Databases, Genetic; Proteins; Gene Ontology; Proteins/genetics; Molecular Sequence Annotation; Computational Biology
ABSTRACT
BACKGROUND: Representing species-specific proteins and protein complexes in ontologies that are both human- and machine-readable facilitates the retrieval, analysis, and interpretation of genome-scale data sets. Although existing protein-centric informatics resources provide the biomedical research community with well-curated compendia of protein sequence and structure, these resources lack formal ontological representations of the relationships among the proteins themselves. The Protein Ontology (PRO) Consortium is filling this informatics resource gap by developing ontological representations and relationships among proteins and their variants and modified forms. Because proteins are often functional only as members of stable protein complexes, the PRO Consortium, in collaboration with existing protein and pathway databases, has launched a new initiative to implement logical and consistent representation of protein complexes. DESCRIPTION: We describe here how the PRO Consortium is meeting the challenge of representing species-specific protein complexes, how protein complex representation in PRO supports annotation of protein complexes and comparative biology, and how PRO is being integrated into existing community bioinformatics resources. The PRO resource is accessible at http://pir.georgetown.edu/pro/. CONCLUSION: PRO is a unique database resource for species-specific protein complexes. PRO facilitates robust annotation of variations in composition and function contexts for protein complexes within and between species.
Subject(s)
Databases, Protein; Multiprotein Complexes; Proteins/chemistry; Animals; Computational Biology; Humans; Internet; Multienzyme Complexes; Proteins/metabolism
ABSTRACT
Curation of biological data is a multi-faceted task whose goal is to create a structured, comprehensive, integrated, and accurate resource of current biological knowledge. These structured data facilitate the work of the scientific community by providing knowledge about genes or genomes and by generating validated connections between the data that yield new information and stimulate new research approaches. For the model organism databases (MODs), an important source of data is research publications. Every published paper containing experimental information about a particular model organism is a candidate for curation. All such papers are examined carefully by curators for relevant information. Here, four curators from different MODs describe the literature curation process and highlight approaches taken by the four MODs to address: (1) the decision process by which papers are selected, and (2) the identification and prioritization of the data contained in the paper. We will highlight some of the challenges that MOD biocurators face, and point to ways in which researchers and publishers can support the work of biocurators and the value of such support.
Subject(s)
Databases, Genetic; Models, Biological; Animals; Bibliographies as Topic; Genes; Internet; Statistics as Topic; Terminology as Topic
ABSTRACT
BACKGROUND: Cellular processes require the interaction of many proteins across several cellular compartments. Determining the collective network of such interactions is an important aspect of understanding the role and regulation of individual proteins. The Gene Ontology (GO) is used by model organism databases and other bioinformatics resources to provide functional annotation of proteins. The annotation process provides a mechanism to document the binding of one protein with another. We have constructed protein interaction networks for mouse proteins utilizing the information encoded in the GO annotations. The work reported here presents a methodology for integrating and visualizing information on protein-protein interactions. RESULTS: GO annotation at Mouse Genome Informatics (MGI) captures 1318 curated, documented interactions. These include 129 binary interactions and 125 interactions involving three or more gene products. Three networks involve over 30 partners, the largest involving 109 proteins. Several tools are available at MGI to visualize and analyze these data. CONCLUSIONS: Curators at the MGI database annotate protein-protein interaction data from experimental reports from the literature. Integration of these data with the other types of data curated at MGI places protein binding data into the larger context of mouse biology and facilitates the generation of new biological hypotheses based on physical interactions among gene products.
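One way to picture how pairwise binding annotations become networks of differing sizes is to treat each annotated pair as an undirected edge and group gene products into connected components. The sketch below assumes exactly that; it is an illustration rather than the MGI methodology, and the function name and gene labels are hypothetical placeholders rather than curated interactions.

```python
from collections import defaultdict


def build_networks(interactions):
    """Group gene products into connected networks from binding pairs."""
    adjacency = defaultdict(set)
    for left, right in interactions:
        adjacency[left].add(right)
        adjacency[right].add(left)

    seen = set()
    networks = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk to collect everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        networks.append(sorted(component))
    # Largest network first.
    return sorted(networks, key=len, reverse=True)


# Hypothetical placeholder pairs, not actual MGI annotations.
pairs = [("GeneA", "GeneB"), ("GeneB", "GeneC"), ("GeneD", "GeneE")]
networks = build_networks(pairs)
sizes = [len(network) for network in networks]
```

With this toy input, a component of size two corresponds to an isolated binary interaction, while larger components correspond to networks with three or more partners.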
Subject(s)
Computational Biology/methods; Protein Interaction Mapping/methods; Animals; Binding Sites; Cloning, Molecular; Databases, Genetic; Databases, Protein; Genes; Genome; Genomics; Humans; Information Storage and Retrieval; Mice; Models, Theoretical; Molecular Biology/methods; Molecular Conformation; Phosphorylation; Protein Binding; Protein Folding; Proteins/chemistry; Proteomics; Software
ABSTRACT
The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented, particularly with regard to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as 'GO' or 'homology' or 'phenotype'. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as 'papers selected for GO that refer to genes with NO GO annotation'. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications.
Subject(s)
Databases, Genetic; Genome/genetics; Informatics; Molecular Sequence Annotation/methods; Workflow; Access to Information; Animals; Genomics; Humans; Mice; Natural Language Processing; Quality Control
ABSTRACT
The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and made quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented, alongside existing validated annotations, in a publicly accessible and searchable web interface.
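The abstract does not spell out how the guilt-by-association scores were computed, so the following is only a generic sketch of the idea, not the authors' method. It assumes a functional-linkage network stored as weighted neighbor lists and scores a gene for a term by the weighted share of its neighbors already annotated with that term. All names, weights, and term labels are hypothetical placeholders.

```python
def guilt_by_association(gene, term, links, annotations):
    """Score a gene for a term from its annotated network neighbors.

    links: dict mapping gene -> dict of neighbor -> linkage weight
    annotations: dict mapping gene -> set of annotated terms
    Returns the weighted fraction of neighbors carrying the term.
    """
    neighbors = links.get(gene, {})
    total_weight = sum(neighbors.values())
    if total_weight == 0:
        return 0.0  # no linked neighbors, so no evidence either way
    supporting_weight = sum(
        weight
        for neighbor, weight in neighbors.items()
        if term in annotations.get(neighbor, set())
    )
    return supporting_weight / total_weight


# Hypothetical placeholder network and annotations.
links = {"GeneX": {"GeneA": 1.0, "GeneB": 0.5, "GeneC": 0.5}}
annotations = {
    "GeneA": {"TERM_1"},
    "GeneB": {"TERM_1"},
    "GeneC": {"TERM_2"},
}

score_term_1 = guilt_by_association("GeneX", "TERM_1", links, annotations)
score_term_2 = guilt_by_association("GeneX", "TERM_2", links, annotations)
```

A guilt-by-profiling approach would instead score the gene from its own feature profile, so the two kinds of score would need to be combined in some way; how the study combined them is not stated in the abstract.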