ABSTRACT
The Human Reference Atlas (HRA) is defined as a comprehensive, three-dimensional (3D) atlas of all the cells in the healthy human body. It is compiled by an international team of experts who develop standard terminologies that they link to 3D reference objects, describing anatomical structures. The third HRA release (v1.2) covers spatial reference data and ontology annotations for 26 organs. Experts access the HRA annotations via spreadsheets and view reference object models in 3D editing tools. This paper introduces the Common Coordinate Framework (CCF) Ontology v2.0.1 that interlinks specimen, biological structure, and spatial data, together with the CCF API that makes the HRA programmatically accessible and interoperable with Linked Open Data (LOD). We detail how real-world user needs and experimental data guide CCF Ontology design and implementation, present CCF Ontology classes and properties together with exemplary usage, and report on validation methods. The CCF Ontology graph database and API are used in the HuBMAP portal, HRA Organ Gallery, and other applications that support data queries across multiple, heterogeneous sources.
Subjects
Cells; Databases, Factual; Humans
ABSTRACT
It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be "rich" and to adhere to "domain-relevant" community standards. Scientific communities should be able to define their own machine-actionable templates for metadata that encode these "rich," discipline-specific elements. We have explored this template-based approach in the context of two software systems. One system is the CEDAR Workbench, which investigators use to author new metadata. The other is the FAIRware Workbench, which evaluates the metadata of archived datasets for their adherence to community standards. Benefits accrue when templates for metadata become central elements in an ecosystem of tools to manage online datasets, both because the templates serve as a community reference for what constitutes FAIR data, and because they embody that perspective in a form that can be distributed among a variety of software applications to assist with data stewardship and data sharing.
ABSTRACT
The limited volume of COVID-19 data from Africa raises concerns for global genome research, which requires a diversity of genotypes for accurate disease prediction, including on the provenance of the new SARS-CoV-2 mutations. The Virus Outbreak Data Network (VODAN)-Africa studied the possibility of increasing the production of clinical data, finding concerns about data ownership, and the limited use of health data for quality treatment at point of care. To address this, VODAN-Africa developed an architecture to record clinical health data and research data collected on the incidence of COVID-19, producing these as human- and machine-readable data objects in a distributed architecture of locally governed, linked, human- and machine-readable data. This architecture supports analytics at the point of care and, through data visiting across facilities, generic analytics. An algorithm was run across FAIR Data Points to visit the distributed data and produce aggregate findings. The FAIR data architecture is deployed in Uganda, Ethiopia, Liberia, Nigeria, Kenya, Somalia, Tanzania, Zimbabwe, and Tunisia.
ABSTRACT
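The data-visiting idea described above can be pictured with a minimal Python sketch. It assumes, purely for illustration, that each facility holds its records locally and returns only aggregate counts to the visiting algorithm; the function names, facility names, and record fields are hypothetical and are not taken from the VODAN-Africa implementation.

```python
from collections import Counter


def local_count(records, field):
    """Run inside one facility: return counts only, never patient-level rows."""
    return Counter(r[field] for r in records if field in r)


def visit(sites, field):
    """Visit every site in turn and combine the per-site aggregates."""
    total = Counter()
    for records in sites.values():
        total += local_count(records, field)
    return total


# Hypothetical toy data standing in for locally governed repositories.
SITES = {
    "facility_1": [{"test_result": "positive"}, {"test_result": "negative"}],
    "facility_2": [{"test_result": "positive"}],
}

AGGREGATE = visit(SITES, "test_result")
```

In this sketch only the counters cross facility boundaries, which is the property the abstract attributes to data visiting.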
OBJECTIVE: Although social and environmental factors are central to provider-patient interactions, the data that reflect these factors can be incomplete, vague, and subjective. We sought to create a conceptual framework to describe and classify data about presence, the domain of interpersonal connection in medicine. METHODS: Our top-down approach for ontology development based on the concept of "relationality" included the following: 1) a broad survey of the social sciences literature and a systematic literature review of >20 000 articles around interpersonal connection in medicine, 2) relational ethnography of clinical encounters (n = 5 pilot, 27 full), and 3) interviews about relational work with 40 medical and nonmedical professionals. We formalized the model using the Web Ontology Language in the Protégé ontology editor. We iteratively evaluated and refined the Presence Ontology through manual expert review and automated annotation of literature. RESULTS AND DISCUSSION: The Presence Ontology facilitates the naming and classification of concepts that would otherwise be vague. Our model categorizes contributors to healthcare encounters and factors such as communication, emotions, tools, and environment. Ontology evaluation indicated that cognitive models (both patients' explanatory models and providers' caregiving approaches) influenced encounters and were subsequently incorporated. We show how ethnographic methods based in relationality can aid the representation of experiential concepts (eg, empathy, trust). Our ontology could support investigative methods to improve healthcare processes for both patients and healthcare providers, including annotation of videotaped encounters, development of clinical instruments to measure presence, or implementation of electronic health record-based reminders for providers. CONCLUSION: The Presence Ontology provides a model for using ethnographic approaches to classify interpersonal data.
Subjects
Anthropology, Cultural; Communication; Health Personnel; Humans; Language; Trust
ABSTRACT
While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 biomedical linked open data sources into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and have elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.
Subjects
Biological Science Disciplines; Information Storage and Retrieval; Animals; Humans; Meta-Analysis as Topic; Semantics
ABSTRACT
Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The largest source of metadata that describes the experimental protocol, funding, and scientific leadership of clinical studies is ClinicalTrials.gov. We evaluated whether values in 302,091 trial records adhere to expected data types and use terms from biomedical ontologies, whether records contain fields required by government regulations, and whether structured elements could replace free-text elements. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontologies, and almost half of the conditions are not denoted by MeSH terms, as recommended. Eligibility criteria are stored as semi-structured free text. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, meta-analyses, and matching of eligible patients to trials.
Subjects
Clinical Trials as Topic; Databases, Factual; Metadata; Research Design/standards; Datasets as Topic
ABSTRACT
An overarching WHO-FIC Content Model will allow uniform modeling of classifications in the WHO Family of International Classifications (WHO-FIC) and promote their joint use. We provide an initial conceptualization of such a model.
Subjects
International Classification of Diseases; World Health Organization
ABSTRACT
The biomedical data landscape is fragmented with several isolated, heterogeneous data and knowledge sources, which use varying formats, syntaxes, schemas, and entity notations, existing on the Web. Biomedical researchers face severe logistical and technical challenges to query, integrate, analyze, and visualize data from multiple diverse sources in the context of available biomedical knowledge. Semantic Web technologies and Linked Data principles may aid Web-scale semantic processing and data integration in biomedicine. The biomedical research community has been one of the earliest adopters of these technologies and principles to publish data and knowledge on the Web as linked graphs and ontologies, hence creating the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we provide our perspective on some opportunities proffered by the use of LSLOD to integrate biomedical data and knowledge in three domains: (1) pharmacology, (2) cancer research, and (3) infectious diseases. We discuss some of the major challenges that hinder the widespread use and consumption of LSLOD by the biomedical research community. Finally, we provide a few technical solutions and insights that can address these challenges. Eventually, LSLOD can enable the development of scalable, intelligent infrastructures that support artificial intelligence methods for augmenting human intelligence to achieve better clinical outcomes for patients, to enhance the quality of biomedical research, and to improve our understanding of living systems.
ABSTRACT
Metadata, the machine-readable descriptions of the data, are increasingly seen as crucial for describing the vast array of biomedical datasets that are currently being deposited in public repositories. While most public repositories have firm requirements that metadata must accompany submitted datasets, the quality of those metadata is generally very poor. A key problem is that the typical metadata acquisition process is onerous and time-consuming, with little interactive guidance or assistance provided to users. Secondary problems include the lack of validation and sparse use of standardized terms or ontologies when authoring metadata. There is a pressing need for improvements to the metadata acquisition process that will help users to enter metadata quickly and accurately. In this paper, we outline a recommendation system for metadata that aims to address this challenge. Our approach uses association rule mining to uncover hidden associations among metadata values and to represent them in the form of association rules. These rules are then used to present users with real-time recommendations when authoring metadata. The novelties of our method are that it is able to combine analyses of metadata from multiple repositories when generating recommendations and can enhance those recommendations by aligning them with ontology terms. We implemented our approach as a service integrated into the CEDAR Workbench metadata authoring platform, and evaluated it using metadata from two public biomedical repositories: US-based National Center for Biotechnology Information BioSample and European Bioinformatics Institute BioSamples. The results show that our approach is able to use analyses of previously entered metadata coupled with ontology-based mappings to present users with accurate recommendations when authoring metadata.
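To make the rule-mining step concrete, here is a minimal Python sketch of pairwise association rules over metadata records, ranked by confidence. It is an illustration under simplifying assumptions (single-item antecedents, no ontology alignment, toy thresholds), not the CEDAR service itself; the function names and the sample records are hypothetical.

```python
from collections import Counter
from itertools import permutations


def mine_rules(records, min_support=0.25, min_confidence=0.6):
    """Mine rules (field, value) -> (field, value) with support and confidence."""
    n = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:
        items = sorted(record.items())
        item_counts.update(items)
        # Ordered pairs of distinct items that co-occur in one record.
        pair_counts.update(permutations(items, 2))
    rules = {}
    for (antecedent, consequent), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules[(antecedent, consequent)] = (support, confidence)
    return rules


def recommend(rules, entered, target_field):
    """Rank candidate values for target_field given fields already entered."""
    scores = {}
    for (antecedent, consequent), (_, confidence) in rules.items():
        if antecedent in entered.items() and consequent[0] == target_field:
            value = consequent[1]
            scores[value] = max(scores.get(value, 0.0), confidence)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical previously entered metadata records.
SAMPLE_RECORDS = [
    {"organism": "Homo sapiens", "tissue": "liver", "disease": "hepatitis"},
    {"organism": "Homo sapiens", "tissue": "liver", "disease": "hepatitis"},
    {"organism": "Homo sapiens", "tissue": "lung", "disease": "asthma"},
    {"organism": "Mus musculus", "tissue": "liver", "disease": "hepatitis"},
]

RULES = mine_rules(SAMPLE_RECORDS)
SUGGESTIONS = recommend(RULES, {"tissue": "liver"}, "disease")  # ['hepatitis']
```

A production recommender would also handle multi-item antecedents and map values to ontology terms, as the abstract describes; this sketch only shows how support and confidence turn past entries into ranked suggestions.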
Subjects
Data Mining/methods; Data Mining/standards; Databases, Factual/standards; Metadata; Computational Biology/standards
ABSTRACT
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, managed by the National Center for Biotechnology Information (NCBI), and BioSamples, managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
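The kind of check described here, testing whether binary and numeric fields actually hold values of the expected type, can be sketched in a few lines of Python. The field names, the accepted boolean spellings, and the sample records below are assumptions made for illustration; they are not the repositories' actual specifications.

```python
import math


def is_boolean(value):
    """Accept a small, assumed set of spellings for a binary field."""
    return value.strip().lower() in {"true", "false", "yes", "no"}


def is_number(value):
    """Accept any finite decimal number; reject text such as 'adult' or 'N/A'."""
    try:
        return math.isfinite(float(value))
    except ValueError:
        return False


def find_invalid(records, validators):
    """Return (record index, field, value) for every value failing its check."""
    problems = []
    for index, record in enumerate(records):
        for field, check in validators.items():
            if field in record and not check(record[field]):
                problems.append((index, field, record[field]))
    return problems


# Hypothetical field requirements and sample records.
VALIDATORS = {"age": is_number, "smoker": is_boolean}
SAMPLE_RECORDS = [
    {"age": "34", "smoker": "yes"},
    {"age": "adult", "smoker": "N/A"},
]

PROBLEMS = find_invalid(SAMPLE_RECORDS, VALIDATORS)
```

Running the same checks at submission time, rather than after the fact, is one form the "principled mechanisms" mentioned above could take.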
Subjects
Biological Specimen Banks; Metadata/standards; Data Accuracy
ABSTRACT
Developing promising treatments in biomedicine often requires aggregation and analysis of data from disparate sources across the healthcare and research spectrum. To facilitate these approaches, there is a growing focus on supporting interoperation of datasets by standardizing data-capture and reporting requirements. Common Data Elements (CDEs), which are precise specifications of questions and the set of allowable answers to each question, are increasingly being adopted to help meet these standardization goals. While CDEs can provide a strong conceptual foundation for interoperation, there are no widely recognized serialization or interchange formats to describe and exchange their definitions. As a result, CDEs defined in one system cannot easily be reused by other systems. An additional problem is that current CDE-based systems tend to be rather heavyweight and cannot be easily adopted and used by third parties. To address these problems, we developed extensions to a metadata management system called the CEDAR Workbench to provide a platform to simplify the creation, exchange, and use of CDEs. We show how the resulting system allows users to quickly define and share CDEs and to immediately use these CDEs to build and deploy Web-based forms to acquire conforming metadata. We also show how we incorporated a large CDE library from the National Cancer Institute's caDSR system and made these CDEs publicly available for general use.
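As an illustration of what a CDE carries (a question plus its allowable answers) and why a shared serialization matters, the following Python sketch round-trips a CDE through JSON. The class, its field names, and the JSON shape are hypothetical; they do not reproduce the CEDAR or caDSR formats.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CommonDataElement:
    """Hypothetical minimal CDE: one question and its permitted answers."""

    identifier: str
    question: str
    allowed_answers: list = field(default_factory=list)

    def accepts(self, answer):
        """True when the answer is one of the permitted values."""
        return answer in self.allowed_answers

    def to_json(self):
        """Serialize to a JSON string (an assumed, illustrative layout)."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Rebuild a CDE from the JSON produced by to_json."""
        return cls(**json.loads(text))


# Hypothetical CDE used only to exercise the round trip.
SMOKING_STATUS = CommonDataElement(
    identifier="example:smoking-status",
    question="What is the participant's smoking status?",
    allowed_answers=["Never", "Former", "Current"],
)

RESTORED = CommonDataElement.from_json(SMOKING_STATUS.to_json())
```

Once both systems agree on such a layout, a CDE defined in one can constrain form input in the other, which is the reuse problem the abstract highlights.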
Subjects
Biomedical Research; Common Data Elements; Data Collection/standards; Data Management/methods; Common Data Elements/standards; Data Management/standards; Humans; Internet; Metadata; National Institutes of Health (U.S.); Registries; United States; User-Computer Interface
ABSTRACT
The adaptation of high-throughput sequencing to the B cell receptor and T cell receptor has made it possible to characterize the adaptive immune receptor repertoire (AIRR) at unprecedented depth. These AIRR sequencing (AIRR-seq) studies offer tremendous potential to increase the understanding of adaptive immune responses in vaccinology, infectious disease, autoimmunity, and cancer. The increasingly wide application of AIRR-seq is leading to a critical mass of studies being deposited in the public domain, offering the possibility of novel scientific insights through secondary analyses and meta-analyses. However, effective sharing of these large-scale data remains a challenge. The AIRR community has proposed minimal information about adaptive immune receptor repertoire (MiAIRR), a standard for reporting AIRR-seq studies. The MiAIRR standard has been operationalized using the National Center for Biotechnology Information (NCBI) repositories. Submissions of AIRR-seq data to the NCBI repositories typically use a combination of web-based and flat-file templates and include only a minimal amount of terminology validation. As a result, AIRR-seq studies at the NCBI are often described using inconsistent terminologies, limiting scientists' ability to access, find, interoperate, and reuse the data sets. In order to improve metadata quality and ease submission of AIRR-seq studies to the NCBI, we have leveraged the software framework developed by the Center for Expanded Data Annotation and Retrieval (CEDAR), which develops technologies involving the use of data standards and ontologies to improve metadata quality. The resulting CEDAR-AIRR (CAIRR) pipeline enables data submitters to: (i) create web-based templates whose entries are controlled by ontology terms, (ii) generate and validate metadata, and (iii) submit the ontology-linked metadata and sequence files (FASTQ) to the NCBI BioProject, BioSample, and Sequence Read Archive databases. 
Overall, CAIRR provides a web-based metadata submission interface that supports compliance with the MiAIRR standard. This pipeline is available at http://cairr.miairr.org, and will facilitate the NCBI submission process and improve the metadata quality of AIRR-seq studies.
Subjects
Computational Biology/methods; Databases, Nucleic Acid; Receptors, Antigen, B-Cell/genetics; Receptors, Antigen, T-Cell/genetics; Software; Computational Biology/organization & administration; Data Mining; Gene Ontology; Humans; Metadata; Reproducibility of Results; User-Computer Interface; Workflow
ABSTRACT
BACKGROUND: Public biomedical data repositories often provide web-based interfaces to collect experimental metadata. However, these interfaces typically reflect the ad hoc metadata specification practices of the associated repositories, leading to a lack of standardization in the collected metadata. This lack of standardization limits the ability of the source datasets to be broadly discovered, reused, and integrated with other datasets. To increase reuse, discoverability, and reproducibility of the described experiments, datasets should be appropriately annotated by using agreed-upon terms, ideally from ontologies or other controlled term sources. RESULTS: This work presents "CEDAR OnDemand", a browser extension powered by the NCBO (National Center for Biomedical Ontology) BioPortal that enables users to seamlessly enter ontology-based metadata through existing web forms native to individual repositories. CEDAR OnDemand analyzes the web page contents to identify the text input fields and associate them with relevant ontologies which are recommended automatically based upon input fields' labels (using the NCBO ontology recommender) and a pre-defined list of ontologies. These field-specific ontologies are used for controlling metadata entry. CEDAR OnDemand works for any web form designed in the HTML format. We demonstrate how CEDAR OnDemand works through the NCBI (National Center for Biotechnology Information) BioSample web-based metadata entry. CONCLUSION: CEDAR OnDemand helps lower the barrier of incorporating ontologies into standardized metadata entry for public data repositories. CEDAR OnDemand is available freely on the Google Chrome store https://chrome.google.com/webstore/search/CEDAROnDemand.
Subjects
Biological Ontologies; Internet; Metadata; Software; Algorithms; Humans
ABSTRACT
Biomedical ontologies are large: Several ontologies in the BioPortal repository contain thousands or even hundreds of thousands of entities. The development and maintenance of such large ontologies is difficult. To support ontology authors and repository developers in their work, it is crucial to improve our understanding of how these ontologies are explored, queried, reused, and used in downstream applications by biomedical researchers. We present an exploratory empirical analysis of user activities in the BioPortal ontology repository by analyzing BioPortal interaction logs across different access modes over several years. We investigate how users of BioPortal query and search for ontologies and their classes, how they explore the ontologies, and how they reuse classes from different ontologies. Additionally, through three real-world scenarios, we not only analyze the usage of ontologies for annotation tasks but also compare it to the browsing and querying behaviors of BioPortal users. For our investigation, we use several different visualization techniques. To inspect large amounts of interaction, reuse, and real-world usage data at a glance, we make use of and extend PolygOnto, a visualization method that has been successfully used to analyze reuse of ontologies in previous work. Our results show that exploration, query, reuse, and actual usage behaviors rarely align, suggesting that different users tend to explore, query and use different parts of an ontology. Finally, we highlight and discuss differences and commonalities among users of BioPortal.
ABSTRACT
Gene Ontology (GO) enrichment analysis is ubiquitously used for interpreting high-throughput molecular data and generating hypotheses about underlying biological phenomena of experiments. However, the two building blocks of this analysis, the ontology and the annotations, evolve rapidly. We used gene signatures derived from 104 disease analyses to systematically evaluate how enrichment analysis results were affected by evolution of the GO over a decade. We found low consistency between enrichment analysis results obtained with early and more recent GO versions. Furthermore, there continues to be a strong annotation bias in the GO annotations where 58% of the annotations are for 16% of the human genes. Our analysis suggests that GO evolution may have affected the interpretation and possibly reproducibility of experiments over time. Hence, researchers must exercise caution when interpreting GO enrichment analyses and should reexamine previous analyses with the most recent GO version.
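A common way to score enrichment is a one-sided hypergeometric test, and a small Python sketch shows why changing annotations can change results. The abstract does not state which test the authors used, so the choice of test is an assumption, and all gene counts below are hypothetical toy numbers.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """P(X >= overlap) when drawing `sample` genes from `population`,
    of which `annotated` carry the GO term (hypergeometric upper tail)."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: same gene signature, same overlap, two annotation versions.
# Only the number of genes annotated to the term differs between versions.
P_EARLY = enrichment_p_value(population=20, annotated=5, sample=4, overlap=2)
P_RECENT = enrichment_p_value(population=20, annotated=8, sample=4, overlap=2)
```

With the overlap held fixed, a term that gains annotations between versions yields a larger p-value, so a term called enriched under an older GO release may no longer pass the same threshold under a newer one.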
Subjects
Computational Biology; Databases, Genetic; Evolution, Molecular; Gene Ontology; Models, Genetic; Molecular Sequence Annotation; Humans; Reproducibility of Results
ABSTRACT
BioPortal is widely regarded to be the world's most comprehensive repository of biomedical ontologies. With a coverage of many biomedical subfields by 716 ontologies (June 27, 2018), BioPortal is an extremely diverse repository. BioPortal maintains easily accessible information about the ontologies submitted by ontology curators. This includes size (concepts/classes, relationships/properties), number of projects, update history, and access history. Ontologies vary by size (from a few concepts to hundreds of thousands), by frequency of update/visit and by number of projects. Interestingly, some ontologies are rarely updated even though they contain thousands of concepts. In an informal email inquiry, we attempted to understand the reasons why ontologies that were built with a major investment of effort are apparently not sustained. Our analysis indicates that lack of funding, unavailability of human resources, and folding of ontologies into other ontologies are the most common among several other factors for discontinued maintenance of these ontologies.
Subjects
Biological Ontologies; Access to Information; Bibliometrics; Humans
ABSTRACT
The Gene Expression Omnibus (GEO) contains more than two million digital samples from functional genomics experiments amassed over almost two decades. However, individual sample metadata remains poorly described by unstructured free-text attributes, preventing its large-scale reanalysis. We introduce the Search Tag Analyze Resource for GEO as a web application (http://STARGEO.org) to curate better annotations of sample phenotypes uniformly across different studies, and to use these sample annotations to define robust genomic signatures of disease pathology by meta-analysis. In this paper, we target a small group of biomedical graduate students to show rapid crowd-curation of precise sample annotations across all phenotypes, and we demonstrate the biological validity of these crowd-curated annotations for breast cancer. STARGEO.org makes GEO data findable, accessible, interoperable and reusable (i.e., FAIR) to ultimately facilitate knowledge discovery. Our work demonstrates the utility of crowd-curation and interpretation of open 'big data' under FAIR principles as a first step towards realizing an ideal paradigm of precision medicine.
Subjects
Data Curation; Databases, Genetic; Gene Expression; Humans
ABSTRACT
BACKGROUND: Structured data acquisition is a common task that is widely performed in biomedicine. However, current solutions for this task are far from providing a means to structure data in such a way that it can be automatically employed in decision making (e.g., in our example application domain of clinical functional assessment, for determining eligibility for disability benefits) based on conclusions derived from acquired data (e.g., assessment of impaired motor function). To use data in these settings, we need it structured in a way that can be exploited by automated reasoning systems, for instance, in the Web Ontology Language (OWL), the de facto ontology language for the Web. RESULTS: We tackle the problem of generating Web-based assessment forms from OWL ontologies, and aggregating input gathered through these forms as an ontology of "semantically-enriched" form data that can be queried using an RDF query language, such as SPARQL. We developed an ontology-based structured data acquisition system, which we present through its specific application to the clinical functional assessment domain. We found that data gathered through our system is highly amenable to automatic analysis using queries. CONCLUSIONS: We demonstrated how ontologies can be used to help structure Web-based forms and to semantically enrich the data elements of the acquired structured data. The ontologies associated with the enriched data elements enable automated inferences and provide a rich vocabulary for performing queries.
Subjects
Biological Ontologies; Information Storage and Retrieval/methods; Internet; Software
ABSTRACT
Clinicians and clinical decision-support systems often follow pharmacotherapy recommendations for patients based on clinical practice guidelines (CPGs). In multimorbid patients, these recommendations can potentially have clinically significant drug-drug interactions (DDIs). In this study, we describe and validate a method for programmatically detecting DDIs among CPG recommendations. The system extracts pharmacotherapy intervention recommendations from narrative CPGs, normalizes the terms, creates a mapping of drugs and drug classes, and then identifies occurrences of DDIs between CPG pairs. We used this system to analyze 75 CPGs written by authoring entities in the United States that discuss outpatient management of common chronic diseases. Using a reference list of high-risk DDIs, we identified 2198 of these DDIs in 638 CPG pairs (46 unique CPGs). Only 9 high-risk DDIs were discussed by both CPGs in a pairing. In 69 of the pairings, neither CPG had a pharmacologic reference or a warning of the possibility of a DDI.
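The final step of the pipeline, identifying DDIs between pairs of guidelines, can be sketched in Python as a pairwise comparison of normalized drug sets against a reference list. The guideline names, their drug lists, and the one-entry reference list below are made-up toy inputs for illustration; they are not the study's data and should not be read as a clinical reference.

```python
from itertools import combinations


def find_ddis(cpg_drugs, high_risk_pairs):
    """Map each guideline pair to the high-risk drug pairs that span it.

    cpg_drugs: dict of guideline name -> set of normalized drug names.
    high_risk_pairs: set of frozensets, each holding two drug names.
    """
    findings = {}
    for cpg_a, cpg_b in combinations(sorted(cpg_drugs), 2):
        hits = {
            frozenset((drug_a, drug_b))
            for drug_a in cpg_drugs[cpg_a]
            for drug_b in cpg_drugs[cpg_b]
            if frozenset((drug_a, drug_b)) in high_risk_pairs
        }
        if hits:
            findings[(cpg_a, cpg_b)] = hits
    return findings


# Hypothetical inputs: three toy guidelines and a one-entry reference list.
CPG_DRUGS = {
    "cpg_afib": {"warfarin", "metoprolol"},
    "cpg_cad": {"aspirin", "atorvastatin"},
    "cpg_asthma": {"albuterol"},
}
HIGH_RISK = {frozenset(("warfarin", "aspirin"))}

FINDINGS = find_ddis(CPG_DRUGS, HIGH_RISK)
```

Storing each interaction as a frozenset makes the lookup order-independent, so a pair is found regardless of which guideline mentions which drug. Extraction from narrative text and drug-class expansion, which the abstract also describes, are outside this sketch.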