ABSTRACT
MOTIVATION: Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. RESULTS: Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. AVAILABILITY AND IMPLEMENTATION: https://kghub.org.
Subject(s)
Biological Ontologies , COVID-19 , Humans , Pattern Recognition, Automated , Rare Diseases , Machine Learning
ABSTRACT
The Bovine Genome Database (BGD) (http://bovinegenome.org) has been the key community bovine genomics database for more than a decade. To accommodate the increasing amount and complexity of bovine genomics data, BGD continues to advance its practices in data acquisition, curation, integration and efficient data retrieval. BGD provides tools for genome browsing (JBrowse), genome annotation (Apollo), data mining (BovineMine) and sequence database searching (BLAST). To augment the BGD genome annotation capabilities, we have developed a new Apollo plug-in, called the Locus-Specific Alternate Assembly (LSAA) tool, which enables users to identify and report potential genome assembly errors and structural variants. BGD now hosts both the newest bovine reference genome assembly, ARS-UCD1.2, as well as the previous reference genome, UMD3.1.1, with cross-genome navigation and queries supported in JBrowse and BovineMine, respectively. Other notable enhancements to BovineMine include the incorporation of genomes and gene annotation datasets for non-bovine ruminant species (goat and sheep), support for multiple assemblies per organism in the Regions Search tool, integration of additional ontologies and development of many new template queries. To better serve the research community, we continue to focus on improving existing tools, developing new tools, adding new datasets and encouraging researchers to use these resources.
Subject(s)
Cattle/genetics , Computational Biology/methods , Databases, Factual , Genome , Algorithms , Animals , Computer Graphics , Data Mining , Databases, Genetic , Gene Expression Profiling , Genomics , Internet , Molecular Sequence Annotation , RNA-Seq , Reference Values , Ruminants/genetics , Sequence Alignment , Software , User-Computer Interface
ABSTRACT
In biology and biomedicine, relating phenotypic outcomes with genetic variation and environmental factors remains a challenge: patient phenotypes may not match known diseases, candidate variants may be in genes that haven't been characterized, research organisms may not recapitulate human or veterinary diseases, environmental factors affecting disease outcomes are unknown or undocumented, and many resources must be queried to find potentially significant phenotypic associations. The Monarch Initiative (https://monarchinitiative.org) integrates information on genes, variants, genotypes, phenotypes and diseases in a variety of species, and allows powerful ontology-based search. We develop many widely adopted ontologies that together enable sophisticated computational analysis, mechanistic discovery and diagnostics of Mendelian diseases. Our algorithms and tools are widely used to identify animal models of human disease through phenotypic similarity, for differential diagnostics and to facilitate translational research. Launched in 2015, Monarch has grown with regard to data (new organisms, more sources, better modeling); new API and standards; ontologies (new Mondo unified disease ontology, improvements to ontologies such as HPO and uPheno); user interface (a redesigned website); and community development. Monarch data, algorithms and tools are being used and extended by resources such as GA4GH and NCATS Translator, among others, to aid mechanistic discovery and diagnostics.
Subject(s)
Computational Biology/methods , Genotype , Phenotype , Algorithms , Animals , Biological Ontologies , Databases, Genetic , Exome , Genetic Association Studies , Genetic Variation , Genomics , Humans , Internet , Software , Translational Research, Biomedical , User-Computer Interface
ABSTRACT
Genome annotation is the process of identifying the location and function of a genome's encoded features. Improving the biological accuracy of annotation is a complex and iterative process requiring researchers to review and incorporate multiple sources of information such as transcriptome alignments, predictive models based on sequence profiles, and comparisons to features found in related organisms. Because rapidly decreasing costs are enabling an ever-growing number of scientists to incorporate sequencing as a routine laboratory technique, there is widespread demand for tools that can assist in the deliberative analytical review of genomic information. To this end, we present Apollo, an open source software package that enables researchers to efficiently inspect and refine the precise structure and role of genomic features in a graphical browser-based platform. Some of Apollo's newer user interface features include support for real-time collaboration, allowing distributed users to simultaneously edit the same encoded features while also instantly seeing the updates made by other researchers on the same region in a manner similar to Google Docs. Its technical architecture enables Apollo to be integrated into multiple existing genomic analysis pipelines and heterogeneous laboratory workflow platforms. Finally, we consider the implications that Apollo and related applications may have on how the results of genome research are published and made accessible.
Subject(s)
Computational Biology/methods , Molecular Sequence Annotation/methods , Chromosome Mapping/methods , Database Management Systems , Genome/genetics , Genomics , Information Storage and Retrieval , Internet , Software , User-Computer Interface
ABSTRACT
We report an update of the Hymenoptera Genome Database (HGD) (http://HymenopteraGenome.org), a model organism database for insect species of the order Hymenoptera (ants, bees and wasps). HGD maintains genomic data for 9 bee species, 10 ant species and 1 wasp, including the versions of genome and annotation data sets published by the genome sequencing consortiums and those provided by NCBI. A new data-mining warehouse, HymenopteraMine, based on the InterMine data warehousing system, integrates the genome data with data from external sources and facilitates cross-species analyses based on orthology. New genome browsers and annotation tools based on JBrowse/WebApollo provide easy genome navigation, and viewing of high throughput sequence data sets and can be used for collaborative genome annotation. All of the genomes and annotation data sets are combined into a single BLAST server that allows users to select and combine sequence data sets to search.
Subject(s)
Databases, Genetic , Genome, Insect , Hymenoptera/genetics , Molecular Sequence Annotation , Animals , Data Mining , Genomics , Sequence Alignment
ABSTRACT
We report an update of the Bovine Genome Database (BGD) (http://BovineGenome.org). The goal of BGD is to support bovine genomics research by providing genome annotation and data mining tools. We have developed new genome and annotation browsers using JBrowse and WebApollo for two Bos taurus genome assemblies, the reference genome assembly (UMD3.1.1) and the alternate genome assembly (Btau_4.6.1). Annotation tools have been customized to highlight priority genes for annotation, and to aid annotators in selecting gene evidence tracks from 91 tissue specific RNAseq datasets. We have also developed BovineMine, based on the InterMine data warehousing system, to integrate the bovine genome, annotation, QTL, SNP and expression data with external sources of orthology, gene ontology, gene interaction and pathway information. BovineMine provides powerful query building tools, as well as customized query templates, and allows users to analyze and download genome-wide datasets. With BovineMine, bovine researchers can use orthology to leverage the curated gene pathways of model organisms, such as human, mouse and rat. BovineMine will be especially useful for gene ontology and pathway analyses in conjunction with GWAS and QTL studies.
Subject(s)
Cattle/genetics , Databases, Genetic , Genome , Animals , Cattle/metabolism , Data Mining , Gene Expression , Humans , Mice , Molecular Sequence Annotation , Rats , Software
ABSTRACT
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.
Subject(s)
Biological Science Disciplines , Knowledge Bases , Pattern Recognition, Automated , Algorithms , Translational Research, Biomedical
ABSTRACT
Cyclooxygenase-2 (COX-2) and 5-lipoxygenase (5-LOX) enzymes have been found to play a role in promoting growth in colon cancer cell lines. The di-tert-butyl phenol class of compounds has been found to inhibit both COX-2 and 5-LOX enzymes with proven effectiveness in arresting tumor growth. In the present study, the structural analogs of 2,6-di-tert-butyl-p-benzoquinone (BQ) appended with a hydrazide side chain were found to inhibit COX-2 and 5-LOX enzymes at micromolar concentrations. Molecular docking of the compounds into COX-2 and 5-LOX protein cavities indicated strong binding interactions supporting the observed cytotoxicities. The signaling interaction between endogenous hyaluronan and CD44 has been shown to regulate COX-2 activities through ErbB2 receptor tyrosine kinase (RTK) activation. In the present studies it has been observed for the first time that three of our COX/5-LOX dual inhibitors inhibit proliferation upon hydrazide substitution and prevent the activity of pro-angiogenic factors in HCA-7, HT-29, and Apc10.1 cells as well as the hyaluronan synthase-2 (Has2) enzyme over-expressed in colon cancer cells, through inhibition of the hyaluronan/CD44v6 cell survival pathway. Since there is a substantial enhancement in the antiproliferative activities of these compounds upon hydrazide substitution, the present work opens up new opportunities for evolving novel active compounds of BQ series for inhibiting colon cancer.
Subject(s)
Antineoplastic Agents/pharmacology , Arachidonate 5-Lipoxygenase/metabolism , Colonic Neoplasms/drug therapy , Cyclohexanones/pharmacology , Cyclooxygenase 2 Inhibitors/pharmacology , Cyclooxygenase 2/metabolism , Hyaluronan Receptors/metabolism , Hyaluronic Acid/metabolism , Hydrazines/pharmacology , Lipoxygenase Inhibitors/pharmacology , Antineoplastic Agents/chemical synthesis , Antineoplastic Agents/chemistry , Cell Line, Tumor , Cell Proliferation/drug effects , Colonic Neoplasms/metabolism , Colonic Neoplasms/pathology , Cyclohexanones/chemical synthesis , Cyclohexanones/chemistry , Cyclooxygenase 2 Inhibitors/chemical synthesis , Cyclooxygenase 2 Inhibitors/chemistry , Dose-Response Relationship, Drug , Drug Screening Assays, Antitumor , Humans , Hydrazines/chemical synthesis , Hydrazines/chemistry , Lipoxygenase Inhibitors/chemical synthesis , Lipoxygenase Inhibitors/chemistry , Models, Molecular , Molecular Structure , Signal Transduction/drug effects , Structure-Activity Relationship
ABSTRACT
The 24th annual Bioinformatics Open Source Conference (BOSC 2023) was part of the 2023 conference on Intelligent Systems for Molecular Biology and the European Conference on Computational Biology (ISMB/ECCB 2023). Launched in 2000 and held yearly since, BOSC is the premier meeting covering open-source bioinformatics and open science. Like ISMB 2022, the 2023 meeting was a hybrid conference, with the in-person component hosted in Lyon, France. ISMB/ECCB attracted a near-record number of attendees, with over 2100 in person and about 900 more online. Approximately 200 people participated in BOSC sessions. In addition to 43 talks and 49 posters, BOSC featured two keynotes: Sara El-Gebali, who spoke about "A New Odyssey: Pioneering the Future of Scientific Progress Through Open Collaboration", and Joseph Yracheta, who spoke about "The Dissonance between Scientific Altruism & Capitalist Extraction: The Zero Trust and Federated Data Sovereignty Solution." Once again, a joint session brought together BOSC and the Bio-Ontologies COSI. The conference ended with a panel on Open and Ethical Data Sharing. As in prior years, BOSC was preceded by a CollaborationFest, a collaborative work event that brought together about 40 participants interested in synergistically combining ideas, shaping project plans, developing software, and more.
Subject(s)
Computational Biology , Software , Humans , Information Dissemination
ABSTRACT
The Swiss Personalized Health Network (SPHN) is a government-funded initiative developing federated infrastructures for a responsible and efficient secondary use of health data for research purposes in compliance with the FAIR principles (Findable, Accessible, Interoperable and Reusable). We built a common standard infrastructure with a fit-for-purpose strategy to bring together health-related data and ease the work of both data providers to supply data in a standard manner and researchers by enhancing the quality of the collected data. As a result, the SPHN Resource Description Framework (RDF) schema was implemented together with a data ecosystem that encompasses data integration, validation tools, analysis helpers, training and documentation for representing health metadata and data in a consistent manner and reaching nationwide data interoperability goals. Data providers can now efficiently deliver several types of health data in a standardised and interoperable way while a high degree of flexibility is granted for the various demands of individual research projects. Researchers in Switzerland have access to FAIR health data for further use in RDF triplestores.
Subject(s)
Health Services Research , Semantic Web , Metadata , Switzerland , Data Collection
ABSTRACT
The 23rd annual Bioinformatics Open Source Conference (BOSC 2022) was part of this year's conference on Intelligent Systems for Molecular Biology (ISMB). Launched in 2000 and held every year since, BOSC is the premier meeting covering open source bioinformatics and open science. ISMB 2022 was, for the first time, a hybrid conference, with the in-person component hosted in Madison, Wisconsin (USA). About 1000 people attended ISMB 2022 in person, with another 800 online. Approximately 200 people participated in BOSC sessions, which included 28 talks chosen from submitted abstracts, 46 posters, and a panel discussion, "Building and Sustaining Inclusive Open Science Communities". BOSC 2022 included joint keynotes with two other COSIs. Jason Williams gave a BOSC / Education COSI keynote entitled "Riding the bicycle: Including all scientists on a path to excellence". A joint session with Bio-Ontologies featured a keynote by Melissa Haendel, "The open data highway: turbo-boosting translational traffic with ontologies."
Subject(s)
Computational Biology , Systems Biology , Congresses as Topic , Humans
ABSTRACT
The standardized identification of biomedical entities is a cornerstone of interoperability, reuse, and data integration in the life sciences. Several registries have been developed to catalog resources maintaining identifiers for biomedical entities such as small molecules, proteins, cell lines, and clinical trials. However, existing registries have struggled to provide sufficient coverage and metadata standards that meet the evolving needs of modern life sciences researchers. Here, we introduce the Bioregistry, an integrative, open, community-driven metaregistry that synthesizes and substantially expands upon 23 existing registries. The Bioregistry addresses the need for a sustainable registry by leveraging public infrastructure and automation, and employing a progressive governance model centered around open code and open data to foster community contribution. The Bioregistry can be used to support the standardized annotation of data, models, ontologies, and scientific literature, thereby promoting their interoperability and reuse. The Bioregistry can be accessed through https://bioregistry.io and its source code and data are available under the MIT and CC0 Licenses at https://github.com/biopragmatics/bioregistry.
ABSTRACT
Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph-based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. However, knowledge discovery across these "knowledge graphs" (KGs) has remained difficult. Data set heterogeneity and complexity; the proliferation of ad hoc data formats; poor compliance with guidelines on findability, accessibility, interoperability, and reusability; and, in particular, the lack of a universally accepted, open-access model for standardization across biomedical KGs has left the task of reconciling data sources to downstream consumers. Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates) representing biomedical entities such as gene, disease, chemical, anatomic structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models. We demonstrate the utility of Biolink Model in various initiatives, including the Biomedical Data Translator Consortium and the Monarch Initiative, and show how it has supported easier integration and interoperability of biomedical KGs, bringing together knowledge from multiple sources and helping to realize the goals of translational science.
Subject(s)
Pattern Recognition, Automated , Translational Science, Biomedical , Knowledge
ABSTRACT
Integrated, up-to-date data about SARS-CoV-2 and COVID-19 is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time-consuming, since much of it is in siloed databases or in textual format. Furthermore, the data required by the research community vary drastically for different tasks; the optimal data for a machine learning task, for example, is much different from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates heterogeneous biomedical data to produce knowledge graphs (KGs), and applied it to create a KG for COVID-19 response. This KG framework also can be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics.
ABSTRACT
Integrated, up-to-date data about SARS-CoV-2 and coronavirus disease 2019 (COVID-19) is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time consuming, since much of it is in siloed databases or in textual format. Furthermore, the data required by the research community varies drastically for different tasks - the optimal data for a machine learning task, for example, is much different from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates biomedical data to produce knowledge graphs (KGs) for COVID-19 response. This KG framework can also be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics. BIGGER PICTURE: An effective response to the COVID-19 pandemic relies on integration of many different types of data available about SARS-CoV-2 and related viruses. KG-COVID-19 is a framework for producing knowledge graphs that can be customized for downstream applications including machine learning tasks, hypothesis-based querying, and browsable user interface to enable researchers to explore COVID-19 data and discover relationships.
ABSTRACT
MaizeMine is the data mining resource of the Maize Genetics and Genome Database (MaizeGDB; http://maizemine.maizegdb.org). It enables researchers to create and export customized annotation datasets that can be merged with their own research data for use in downstream analyses. MaizeMine uses the InterMine data warehousing system to integrate genomic sequences and gene annotations from the Zea mays B73 RefGen_v3 and B73 RefGen_v4 genome assemblies, Gene Ontology annotations, single nucleotide polymorphisms, protein annotations, homologs, pathways, and precomputed gene expression levels based on RNA-seq data from the Z. mays B73 Gene Expression Atlas. MaizeMine also provides database cross references between genes of alternative gene sets from Gramene and NCBI RefSeq. MaizeMine includes several search tools, including a keyword search, built-in template queries with intuitive search menus, and a QueryBuilder tool for creating custom queries. The Genomic Regions search tool executes queries based on lists of genome coordinates, and supports both the B73 RefGen_v3 and B73 RefGen_v4 assemblies. The List tool allows you to upload identifiers to create custom lists, perform set operations such as unions and intersections, and execute template queries with lists. When used with gene identifiers, the List tool automatically provides gene set enrichment for Gene Ontology (GO) and pathways, with a choice of statistical parameters and background gene sets. With the ability to save query outputs as lists that can be input to new queries, MaizeMine provides limitless possibilities for data integration and meta-analysis.
ABSTRACT
The Bovine Genome Database (BGD; http://bovinegenome.org) is a web-accessible resource that supports bovine genomics research by providing genome annotation and data mining tools. BovineMine is a tool within BGD that integrates BGD data, including the genome, genes, precomputed gene expression levels and variant consequences, with external data sources that include quantitative trait loci (QTL), orthologues, Gene Ontology, gene interactions, and pathways. BovineMine enables researchers without programming skills to create custom integrated datasets for use in downstream analyses. This chapter describes how to enhance a bovine genomics project using the Bovine Genome Database, with data mining examples demonstrating BovineMine.
Subject(s)
Databases, Genetic , Genome , Genomics , Web Browser , Animals , Cattle , Computational Biology/methods , Data Mining/methods , Gene Expression , Genetic Variation , Genome-Wide Association Study , Genomics/methods , Meta-Analysis as Topic , Molecular Sequence Annotation , Quantitative Trait Loci , Software , User-Computer Interface
ABSTRACT
The Hymenoptera Genome Database (HGD; http://hymenopteragenome.org) is a genome informatics resource for insects of the order Hymenoptera, which includes bees, ants and wasps. HGD provides genome browsers with manual annotation tools (JBrowse/Apollo), BLAST, bulk data download, and a data mining warehouse (HymenopteraMine). This chapter focuses on the use of HymenopteraMine to create annotation data sets that can be exported for use in downstream analyses. HymenopteraMine leverages the InterMine platform to combine genome assemblies and official gene sets with data from OrthoDB, RefSeq, FlyBase, Gene Ontology, UniProt, InterPro, KEGG, Reactome, dbSNP, PubMed, and BioGrid, as well as precomputed gene expression information based on publicly available RNAseq. Built-in template queries provide starting points for data exploration, while the QueryBuilder tool supports construction of complex custom queries. The List Analysis and Genomic Regions search tools execute queries based on uploaded lists of identifiers and genome coordinates, respectively. HymenopteraMine facilitates cross-species data mining based on orthology and supports meta-analyses by tracking identifiers across gene sets and genome assemblies.
Subject(s)
Databases, Genetic , Genome, Insect , Genomics , Hymenoptera/genetics , Animals , Computational Biology/methods , Data Mining , Genomics/methods , Software , User-Computer Interface , Web Browser
ABSTRACT
The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.