ABSTRACT
Bridging the gap between genetic variations, environmental determinants, and phenotypic outcomes is critical for supporting clinical diagnosis and understanding mechanisms of diseases. It requires integrating open data at a global scale. The Monarch Initiative advances these goals by developing open ontologies, semantic data models, and knowledge graphs for translational research. The Monarch App is an integrated platform combining data about genes, phenotypes, and diseases across species. Monarch's APIs enable access to carefully curated datasets and advanced analysis tools that support the understanding and diagnosis of disease for diverse applications such as variant prioritization, deep phenotyping, and patient profile-matching. We have migrated our system into a scalable, cloud-based infrastructure; simplified Monarch's data ingestion and knowledge graph integration systems; enhanced data mapping and integration standards; and developed a new user interface with novel search and graph navigation features. Furthermore, we advanced Monarch's analytic tools by developing a customized plugin for OpenAI's ChatGPT to increase the reliability of its responses about phenotypic data, allowing us to interrogate the knowledge in the Monarch graph using state-of-the-art Large Language Models. The resources of the Monarch Initiative can be found at monarchinitiative.org and its corresponding code repository at github.com/monarch-initiative/monarch-app.
Subjects
Databases, Factual; Disease; Genes; Phenotype; Humans; Internet; Databases, Factual/standards; Software; Genes/genetics; Disease/genetics
ABSTRACT
MOTIVATION: Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS: Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION: SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.
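The recursive, schema-driven extraction described in this abstract can be illustrated with a minimal Python sketch. It is not the OntoGPT implementation: the names `extract`, `ground`, `toy_llm`, and `VOCAB` are hypothetical, the "LLM" is a canned stub so that the example runs offline, and the `EX:` identifiers are placeholders rather than real ontology terms.

```python
from typing import Callable, Dict

# A schema maps a field name to "str" (a leaf) or to a nested schema (a dict).
Schema = Dict[str, object]

# Placeholder vocabulary: labels mapped to made-up identifiers (not real ontology IDs).
VOCAB = {"aspirin": "EX:0001", "headache": "EX:0002"}


def ground(label: str) -> str:
    """Attach an identifier when the label is in the vocabulary, else keep the label."""
    identifier = VOCAB.get(label.lower())
    return f"{identifier} ({label})" if identifier else label


def extract(text: str, schema: Schema, ask: Callable[[str], str]) -> dict:
    """Ask one question per schema field; recurse into nested schemas."""
    result = {}
    for field, spec in schema.items():
        answer = ask(f"Extract the {field}.\nText: {text}").strip()
        if isinstance(spec, dict):
            # Nested field: interrogate the answer again with the sub-schema.
            result[field] = extract(answer, spec, ask)
        else:
            # Leaf field: ground the returned label against the vocabulary.
            result[field] = ground(answer)
    return result


def toy_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: returns canned answers keyed by the field name."""
    canned = {
        "treatment": "drug: aspirin; disease: headache",
        "drug": "aspirin",
        "disease": "headache",
    }
    field = prompt.split("the ", 1)[1].split(".", 1)[0]
    return canned[field]


if __name__ == "__main__":
    schema = {"treatment": {"drug": "str", "disease": "str"}}
    print(extract("Aspirin treats headache.", schema, toy_llm))
```

Swapping `toy_llm` for a function that calls an actual model would leave the recursion unchanged; only the `ask` callable differs.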
Subjects
Knowledge Bases; Semantics; Databases, Factual
ABSTRACT
MOTIVATION: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. RESULTS: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. AVAILABILITY AND IMPLEMENTATION: https://kghub.org.
Subjects
Biological Ontologies; COVID-19; Humans; Pattern Recognition, Automated; Rare Diseases; Machine Learning
ABSTRACT
The Zebrafish Information Network (ZFIN) (https://zfin.org/) is the database for the model organism, zebrafish (Danio rerio). ZFIN expertly curates, organizes, and provides a wide array of zebrafish genetic and genomic data, including genes, alleles, transgenic lines, gene expression, gene function, mutant phenotypes, orthology, human disease models, gene and mutant nomenclature, and reagents. New features at ZFIN include major updates to the home page and the gene page, the two most used pages at ZFIN. Data including disease models, phenotypes, expression, mutants and gene function continue to be contributed to The Alliance of Genome Resources for integration with similar data from other model organisms.
Subjects
Computational Biology/methods; Databases, Genetic; Genome/genetics; Genomics/methods; Zebrafish/genetics; Animals; Animals, Genetically Modified; Data Mining/methods; Gene Expression; Humans; Internet; Models, Animal; Mutation; Phenotype; Zebrafish Proteins/genetics
ABSTRACT
The Zebrafish Information Network (ZFIN) (https://zfin.org/) is the database for the model organism, zebrafish (Danio rerio). ZFIN expertly curates, organizes and provides a wide array of zebrafish genetic and genomic data, including genes, alleles, transgenic lines, gene expression, gene function, mutant phenotypes, orthology, human disease models, nomenclature and reagents. New features at ZFIN include increased support for genomic regions and for non-coding genes, and support for more expressive Gene Ontology annotations. ZFIN has recently taken over maintenance of the zebrafish reference genome sequence as part of the Genome Reference Consortium. ZFIN is also a founding member of the Alliance of Genome Resources, a collaboration of six model organism databases (MODs) and the Gene Ontology Consortium (GO). The recently launched Alliance portal (https://alliancegenome.org) provides a unified, comparative view of MOD, GO, and human data, and facilitates foundational and translational biomedical research.
Subjects
Databases, Genetic; Genome/genetics; Genomics; Zebrafish/genetics; Animals; Gene Expression/genetics; Gene Ontology; Humans; Molecular Sequence Annotation; Mutation/genetics; Phenotype
ABSTRACT
The Zebrafish Model Organism Database (ZFIN; http://zfin.org) is the central resource for zebrafish (Danio rerio) genetic, genomic, phenotypic and developmental data. ZFIN curators provide expert manual curation and integration of comprehensive data involving zebrafish genes, mutants, transgenic constructs and lines, phenotypes, genotypes, gene expressions, morpholinos, TALENs, CRISPRs, antibodies, anatomical structures, models of human disease and publications. We integrate curated, directly submitted, and collaboratively generated data, making these available to the zebrafish research community. Among the vertebrate model organisms, zebrafish are superbly suited for rapid generation of sequence-targeted mutant lines, characterization of phenotypes including gene expression patterns, and generation of human disease models. The recent rapid adoption of zebrafish as human disease models is making management of these data particularly important to both the research and clinical communities. Here, we describe recent enhancements to ZFIN including use of the zebrafish experimental conditions ontology, 'Fish' records in the ZFIN database, support for gene expression phenotypes, models of human disease, mutation details at the DNA, RNA and protein levels, and updates to the ZFIN single box search.
Subjects
Databases, Genetic; Genetic Association Studies/methods; Genomics/methods; Search Engine; Zebrafish/genetics; Animals; Computational Biology/methods; Data Curation; Disease Models, Animal; Gene Expression; Genetic Predisposition to Disease; Genotype; Humans; Mutation; Phenotype
ABSTRACT
The Zebrafish Model Organism Database (ZFIN; http://zfin.org) is the central resource for genetic and genomic data from zebrafish (Danio rerio) research. ZFIN staff curate detailed information about genes, mutants, genotypes, reporter lines, sequences, constructs, antibodies, knockdown reagents, expression patterns, phenotypes, gene product function, and orthology from publications. Researchers can submit mutant, transgenic, expression, and phenotype data directly to ZFIN and use the ZFIN Community Wiki to share antibody and protocol information. Data can be accessed through topic-specific searches, a new site-wide search, and the data-mining resource ZebrafishMine (http://zebrafishmine.org). Data download and web service options are also available. ZFIN collaborates with major bioinformatics organizations to verify and integrate genomic sequence data, provide nomenclature support, establish reciprocal links, and participate in the development of standardized structured vocabularies (ontologies) used for data annotation and searching. ZFIN-curated gene, function, expression, and phenotype data are available for comparative exploration at several multi-species resources. The use of zebrafish as a model for human disease is increasing. ZFIN is supporting this growing area with three major projects: adding easy access to computed orthology data from gene pages, curating details of the gene expression pattern changes in mutant fish, and curating zebrafish models of human diseases.
Subjects
Databases, Genetic; Zebrafish Proteins/genetics; Zebrafish/genetics; Animals; Computational Biology/methods; Data Curation/methods; Genetic Association Studies; Genomics/methods; Internet; Models, Animal
ABSTRACT
InterMine is a data integration warehouse and analysis software system developed for large and complex biological data sets. Designed for integrative analysis, it can be accessed through a user-friendly web interface. For bioinformaticians, extensive web services as well as programming interfaces for most common scripting languages support access to all features. The web interface includes a useful identifier look-up system, and both simple and sophisticated search options. Interactive results tables enable exploration, and data can be filtered, summarized, and browsed. A set of graphical analysis tools provide a rich environment for data exploration including statistical enrichment of sets of genes or other entities. InterMine databases have been developed for the major model organisms, budding yeast, nematode worm, fruit fly, zebrafish, mouse, and rat together with a newly developed human database. Here, we describe how this has facilitated interoperation and development of cross-organism analysis tools and reports. InterMine as a data exploration and analysis tool is also described. All the InterMine-based systems described in this article are resources freely available to the scientific community.
Subjects
Databases, Factual; Software; Animals; Computational Biology/methods; Databases, Genetic; Genomics; Humans; Internet; Systems Integration; User-Computer Interface
ABSTRACT
ZFIN, the Zebrafish Model Organism Database (http://zfin.org), is the central resource for zebrafish genetic, genomic, phenotypic and developmental data. ZFIN curators manually curate and integrate comprehensive data involving zebrafish genes, mutants, transgenics, phenotypes, genotypes, gene expressions, morpholinos, antibodies, anatomical structures and publications. Integrated views of these data, as well as data gathered through collaborations and data exchanges, are provided through a wide selection of web-based search forms. Among the vertebrate model organisms, zebrafish are uniquely well suited for rapid and targeted generation of mutant lines. The recent rapid production of mutants and transgenic zebrafish is making management of data associated with these resources particularly important to the research community. Here, we describe recent enhancements to ZFIN aimed at improving our support for mutant and transgenic lines, including (i) enhanced mutant/transgenic search functionality; (ii) more expressive phenotype curation methods; (iii) new download files and archival data access; (iv) incorporation of new data loads from laboratories undertaking large-scale generation of mutant or transgenic lines and (v) new GBrowse tracks for transgenic insertions, genes with antibodies and morpholinos.
Subjects
Databases, Genetic; Zebrafish/genetics; Animals; Animals, Genetically Modified; Genomics; Internet; Models, Animal; Mutation; Phenotype
ABSTRACT
ZFIN, the Zebrafish Model Organism Database, http://zfin.org, serves as the central repository and web-based resource for zebrafish genetic, genomic, phenotypic and developmental data. ZFIN manually curates comprehensive data for zebrafish genes, phenotypes, genotypes, gene expression, antibodies, anatomical structures and publications. A wide-ranging collection of web-based search forms and tools facilitates access to integrated views of these data promoting analysis and scientific discovery. Data represented in ZFIN are derived from three primary sources: curation of zebrafish publications, individual research laboratories and collaborations with bioinformatics organizations. Data formats include text, images and graphical representations. ZFIN is a dynamic resource with data added daily as part of our ongoing curation process. Software updates are frequent. Here, we describe recent additions to ZFIN including (i) enhanced access to images, (ii) genomic features, (iii) genome browser, (iv) transcripts, (v) antibodies and (vi) a community wiki for protocols and antibodies.
Subjects
Databases, Genetic; Zebrafish/genetics; Animals; Antibodies; Gene Expression; Genomics; Models, Animal; Phenotype; RNA, Messenger/chemistry; RNA, Messenger/metabolism; Zebrafish/immunology; Zebrafish/metabolism
ABSTRACT
Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium, we have developed a knowledge graph-based question-answering system designed to augment human reasoning and accelerate translational scientific discovery: the Translator system. We have applied the Translator system to answer biomedical questions in the context of a broad array of diseases and syndromes, including Fanconi anemia, primary ciliary dyskinesia, multiple sclerosis, and others. A variety of collaborative approaches have been used to research and develop the Translator system. One recent approach involved the establishment of a monthly "Question-of-the-Month (QotM) Challenge" series. Herein, we describe the structure of the QotM Challenge; the six challenges that have been conducted to date on drug-induced liver injury, cannabidiol toxicity, coronavirus infection, diabetes, psoriatic arthritis, and ATP1A3-related phenotypes; the scientific insights that have been gleaned during the challenges; and the technical issues that were identified over the course of the challenges and that can now be addressed to foster further development of the prototype Translator system. We close with a discussion on Large Language Models such as ChatGPT and highlight differences between those models and the Translator system.
ABSTRACT
The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO, a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations, evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs), mechanistic models of molecular "pathways" (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.
Subjects
Databases, Genetic; Proteins; Gene Ontology; Proteins/genetics; Molecular Sequence Annotation; Computational Biology
ABSTRACT
The standardized identification of biomedical entities is a cornerstone of interoperability, reuse, and data integration in the life sciences. Several registries have been developed to catalog resources maintaining identifiers for biomedical entities such as small molecules, proteins, cell lines, and clinical trials. However, existing registries have struggled to provide sufficient coverage and metadata standards that meet the evolving needs of modern life sciences researchers. Here, we introduce the Bioregistry, an integrative, open, community-driven metaregistry that synthesizes and substantially expands upon 23 existing registries. The Bioregistry addresses the need for a sustainable registry by leveraging public infrastructure and automation, and employing a progressive governance model centered around open code and open data to foster community contribution. The Bioregistry can be used to support the standardized annotation of data, models, ontologies, and scientific literature, thereby promoting their interoperability and reuse. The Bioregistry can be accessed through https://bioregistry.io and its source code and data are available under the MIT and CC0 Licenses at https://github.com/biopragmatics/bioregistry .
ABSTRACT
Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally accepted, open-access model for standardization across biomedical KGs have left the task of reconciling data sources to downstream consumers. Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates) representing biomedical entities such as gene, disease, chemical, anatomic structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.
Subjects
Pattern Recognition, Automated; Translational Science, Biomedical; Knowledge
ABSTRACT
Despite progress in the development of standards for describing and exchanging scientific information, the lack of easy-to-use standards for mapping between different representations of the same or similar objects in different databases poses a major impediment to data integration and interoperability. Mappings often lack the metadata needed to be correctly interpreted and applied. For example, are two terms equivalent or merely related? Are they narrow or broad matches? Or are they associated in some other way? Such relationships between the mapped terms are often not documented, which leads to incorrect assumptions and makes them hard to use in scenarios that require a high degree of precision (such as diagnostics or risk prediction). Furthermore, the lack of descriptions of how mappings were done makes it hard to combine and reconcile mappings, particularly curated and automated ones. We have developed the Simple Standard for Sharing Ontological Mappings (SSSOM) which addresses these problems by: (i) Introducing a machine-readable and extensible vocabulary to describe metadata that makes imprecision, inaccuracy and incompleteness in mappings explicit. (ii) Defining an easy-to-use simple table-based format that can be integrated into existing data science pipelines without the need to parse or query ontologies, and that integrates seamlessly with Linked Data principles. (iii) Implementing open and community-driven collaborative workflows that are designed to evolve the standard continuously to address changing requirements and mapping practices. (iv) Providing reference tools and software libraries for working with the standard. In this paper, we present the SSSOM standard, describe several use cases in detail and survey some of the existing work on standardizing the exchange of mappings, with the goal of making mappings Findable, Accessible, Interoperable and Reusable (FAIR). The SSSOM specification can be found at http://w3id.org/sssom/spec. 
Database URL: http://w3id.org/sssom/spec.
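As an illustration of the table-based format described in this abstract, the following Python sketch reads a small SSSOM-style TSV using only the standard library and keeps the mappings suited to high-precision use. The column names follow the SSSOM specification as best understood here; the rows, the `A:`/`B:` identifiers, and the helper names `load_mappings` and `precise_mappings` are invented for illustration, and the official reference tooling is not used.

```python
import csv

# Illustrative mapping table. Lines starting with "#" stand in for the
# commented metadata header that may precede the column row.
SAMPLE_TSV = "\n".join([
    "# mapping_set_id: https://example.org/demo.sssom.tsv",
    "\t".join(["subject_id", "predicate_id", "object_id",
               "mapping_justification", "confidence"]),
    "\t".join(["A:1", "skos:exactMatch", "B:1",
               "semapv:ManualMappingCuration", "0.95"]),
    "\t".join(["A:2", "skos:broadMatch", "B:2",
               "semapv:LexicalMatching", "0.80"]),
    "\t".join(["A:3", "skos:exactMatch", "B:3",
               "semapv:LexicalMatching", "0.60"]),
])


def load_mappings(tsv_text: str) -> list:
    """Parse the table into one dict per mapping, skipping '#' metadata lines."""
    lines = [line for line in tsv_text.splitlines() if not line.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def precise_mappings(rows: list, min_confidence: float = 0.9) -> list:
    """Keep exact matches whose stated confidence meets the threshold."""
    return [
        row for row in rows
        if row["predicate_id"] == "skos:exactMatch"
        and float(row["confidence"]) >= min_confidence
    ]


if __name__ == "__main__":
    for row in precise_mappings(load_mappings(SAMPLE_TSV)):
        print(row["subject_id"], row["predicate_id"], row["object_id"])
```

Because the predicate and confidence are explicit columns, a consumer can separate exact from broad matches without parsing or querying the ontologies themselves.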
Subjects
Metadata; Semantic Web; Data Management; Databases, Factual; Workflow
ABSTRACT
The Zebrafish Information Network (ZFIN, http://zfin.org), the model organism database for zebrafish, provides the central location for curated zebrafish genetic, genomic and developmental data. Extensive data integration of mutant phenotypes, genes, expression patterns, sequences, genetic markers, morpholinos, map positions, publications and community resources facilitates the use of the zebrafish as a model for studying gene function, development, behavior and disease. Access to ZFIN data is provided via web-based query forms and through bulk data files. ZFIN is the definitive source for zebrafish gene and allele nomenclature, the zebrafish anatomical ontology (AO) and for zebrafish gene ontology (GO) annotations. ZFIN plays an active role in the development of cross-species ontologies such as the phenotypic quality ontology (PATO) and the gene ontology (GO). Recent enhancements to ZFIN include (i) a new home page and navigation bar, (ii) expanded support for genotypes and phenotypes, (iii) comprehensive phenotype annotations based on anatomical, phenotypic quality and gene ontologies, (iv) a BLAST server tightly integrated with the ZFIN database via ZFIN-specific datasets, (v) a global site search and (vi) help with hands-on resources.
Subjects
Databases, Genetic; Phenotype; Zebrafish/genetics; Animals; Genotype; Internet; Models, Animal; Mutation; Sequence Alignment; Systems Integration; User-Computer Interface; Zebrafish/anatomy & histology
ABSTRACT
Model organism databases (MODs) have been collecting and integrating biomedical research data for 30 years and were designed to meet specific needs of each model organism research community. The contributions of model organism research to understanding biological systems would be hard to overstate. Modern molecular biology methods and cost reductions in nucleotide sequencing have opened avenues for direct application of model organism research to elucidating mechanisms of human diseases. Thus, the mandate for model organism research and databases has now grown to include facilitating use of these data in translational applications. Challenges in meeting this opportunity include the distribution of research data across many databases and websites, a lack of data format standards for some data types, and sustainability of scale and cost for genomic database resources like MODs. The issues of widely distributed data and application of data standards are some of the challenges addressed by FAIR (Findable, Accessible, Interoperable, and Re-usable) data principles. The Alliance of Genome Resources is now moving to address these challenges by bringing together expertly curated research data from fly, mouse, rat, worm, yeast, zebrafish, and the Gene Ontology consortium. Centralized multi-species data access, integration, and format standardization will lower the data utilization barrier in comparative genomics and translational applications and will provide a framework in which sustainable scale and cost can be addressed. This article presents a brief historical perspective on how the Alliance model organisms are complementary and how they have already contributed to understanding the etiology of human diseases. In addition, we discuss four challenges for using data from MODs in translational applications and how the Alliance is working to address them, in part by applying FAIR data principles. 
Ultimately, combined data from these animal models are more powerful than the sum of the parts.
Subjects
Animals, Laboratory; Databases as Topic; Translational Research, Biomedical/methods; Animals; Models, Animal
ABSTRACT
The Zebrafish Model Organism Database (ZFIN; https://zfin.org) is the central resource for genetic, genomic, and phenotypic data for zebrafish (Danio rerio) research. ZFIN continuously assesses trends in zebrafish research, adding new data types and providing data repositories and tools that members of the research community can use to navigate data. The many research advantages and flexibility of manipulation of zebrafish have made them an increasingly attractive animal to model and study human disease. To facilitate disease-related research, ZFIN developed support to provide human disease information as well as annotation of zebrafish models of human disease. Human disease term pages at ZFIN provide information about disease names, synonyms, and references to other databases as well as a list of publications reporting studies of human diseases in which zebrafish were used. Zebrafish orthologs of human genes that are implicated in human disease etiology are routinely studied to provide an understanding of the molecular basis of disease. Therefore, a list of human genes involved in the disease with their corresponding zebrafish ortholog is displayed on the disease page, with links to additional information regarding the genes and existing mutations. Studying human disease often requires the use of models that recapitulate some or all of the pathologies observed in human diseases. Access to information regarding existing and published models can be critical, because they provide a tractable way to gain insight into the phenotypic outcomes of the disease. ZFIN annotates zebrafish models of human disease and supports retrieval of these published models by listing zebrafish models on the disease term page as well as by providing search interfaces and data download files to access the data.
The improvements ZFIN has made to annotate, display, and search data related to human disease, especially zebrafish models for disease and disease-associated gene information, should be helpful to researchers and clinicians considering the use of zebrafish to study human disease.
Subjects
Databases, Genetic; Disease Models, Animal; Zebrafish Proteins/genetics; Zebrafish/genetics; Animals; Computational Biology/methods; Data Curation/methods; Genetic Association Studies; Genome; Genomics; Humans; Models, Animal; Mutation
ABSTRACT
Model organisms are widely used for understanding basic biology, and have significantly contributed to the study of human disease. In recent years, genomic analysis has provided extensive evidence of widespread conservation of gene sequence and function amongst eukaryotes, allowing insights from model organisms to help decipher gene function in a wider range of species. The InterMOD consortium is developing an infrastructure based around the InterMine data warehouse system to integrate genomic and functional data from a number of key model organisms, leading the way to improved cross-species research. So far including budding yeast, nematode worm, fruit fly, zebrafish, rat and mouse, the project has set up data warehouses, synchronized data models, and created analysis tools and links between data from different species. The project unites a number of major model organism databases, improving both the consistency and accessibility of comparative research, to the benefit of the wider scientific community.