ABSTRACT
BACKGROUND: Public biomedical data repositories often provide web-based interfaces to collect experimental metadata. However, these interfaces typically reflect the ad hoc metadata specification practices of the associated repositories, leading to a lack of standardization in the collected metadata. This lack of standardization limits the ability of the source datasets to be broadly discovered, reused, and integrated with other datasets. To increase the reuse, discoverability, and reproducibility of the described experiments, datasets should be annotated with agreed-upon terms, ideally drawn from ontologies or other controlled term sources. RESULTS: This work presents "CEDAR OnDemand", a browser extension powered by the NCBO (National Center for Biomedical Ontology) BioPortal that enables users to seamlessly enter ontology-based metadata through the existing web forms native to individual repositories. CEDAR OnDemand analyzes the contents of a web page to identify text input fields and to associate them with relevant ontologies, which are recommended automatically based on the input fields' labels (using the NCBO Ontology Recommender) and a predefined list of ontologies. These field-specific ontologies are then used to control metadata entry. CEDAR OnDemand works with any web form written in HTML. We demonstrate how CEDAR OnDemand works using the NCBI (National Center for Biotechnology Information) BioSample web-based metadata submission form. CONCLUSION: CEDAR OnDemand helps lower the barrier to incorporating ontologies into standardized metadata entry for public data repositories. CEDAR OnDemand is freely available from the Chrome Web Store at https://chrome.google.com/webstore/search/CEDAROnDemand.
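The core mechanism is easy to sketch. The snippet below is a rough Python analogue of what the extension does in JavaScript: it scans an HTML form for text input fields and asks the NCBO Ontology Recommender which ontologies cover each field's label. The field-parsing logic, the handling of the recommender response, and the sample form markup are illustrative assumptions, not the extension's actual code.

```python
# Sketch: extract text-field labels from an HTML form and query the NCBO
# Ontology Recommender for ontologies relevant to each label.
import requests
from bs4 import BeautifulSoup

BIOPORTAL_API = "https://data.bioontology.org/recommender"
API_KEY = "YOUR_BIOPORTAL_API_KEY"  # obtain one from https://bioportal.bioontology.org

def field_labels(html: str) -> list[str]:
    """Collect the <label> text associated with each text input field."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    for label in soup.find_all("label"):
        target = label.get("for")
        field = soup.find("input", id=target) if target else label.find("input")
        if field is not None and field.get("type", "text") == "text":
            labels.append(label.get_text(strip=True))
    return labels

def recommend_ontologies(label: str) -> list[str]:
    """Ask the NCBO recommender for ontologies relevant to a field label."""
    resp = requests.get(BIOPORTAL_API, params={"input": label, "apikey": API_KEY})
    resp.raise_for_status()
    # Each recommendation bundles one or more ontologies with an overall score;
    # the exact response shape should be checked against the BioPortal docs.
    return [ont["acronym"]
            for rec in resp.json()
            for ont in rec["ontologies"]]

form = '<label for="tissue">tissue</label><input type="text" id="tissue">'
for lbl in field_labels(form):
    print(lbl, "->", recommend_ontologies(lbl)[:3])
```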
Subjects
Biological Ontologies, Internet, Metadata, Software, Algorithms, Humans
ABSTRACT
It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be "rich" and to adhere to "domain-relevant" community standards. Scientific communities should be able to define their own machine-actionable templates for metadata that encode these "rich," discipline-specific elements. We have explored this template-based approach in the context of two software systems. One system is the CEDAR Workbench, which investigators use to author new metadata. The other is the FAIRware Workbench, which evaluates the metadata of archived datasets for adherence to community standards. Benefits accrue when metadata templates become central elements in an ecosystem of tools for managing online datasets: both because the templates serve as a community reference for what constitutes FAIR data, and because they embody that perspective in a form that can be distributed among a variety of software applications to assist with data stewardship and data sharing.
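To make "machine-actionable template" concrete, here is a deliberately cut-down sketch of a template and a conformance check. CEDAR represents templates with JSON Schema and JSON-LD; the simplified structure, field names, and checking logic below are assumptions chosen for readability, not either system's actual model.

```python
# Sketch: a metadata template as data, plus a trivial conformance checker.
template = {
    "name": "Tissue sample",
    "fields": {
        "organism":  {"required": True,  "ontology": "NCBITAXON"},
        "tissue":    {"required": True,  "ontology": "UBERON"},
        "treatment": {"required": False, "ontology": None},
    },
}

def check_conformance(metadata: dict, template: dict) -> list[str]:
    """Return a list of human-readable problems; empty means conformant."""
    problems = []
    for name, spec in template["fields"].items():
        if spec["required"] and not metadata.get(name):
            problems.append(f"missing required field '{name}'")
        elif spec["ontology"] and name in metadata:
            # A real checker would resolve the value against the named
            # ontology; here we only flag obviously empty values.
            if not metadata[name].strip():
                problems.append(f"field '{name}' should use a {spec['ontology']} term")
    return problems

print(check_conformance({"organism": "Homo sapiens"}, template))
# -> ["missing required field 'tissue'"]
```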
ABSTRACT
The Extensible Markup Language (XML) is increasingly being used for biomedical data exchange. The parallel growth in the use of ontologies in biomedicine presents opportunities for combining the two technologies to leverage the semantic reasoning services provided by ontology-based tools. There are currently no standardized approaches for taking XML-encoded biomedical information models and representing and reasoning with them using ontologies. To address this shortcoming, we have developed a workflow and a suite of tools for transforming XML-based information models into domain ontologies encoded in OWL. In this study, we applied semantic reasoning methods to these ontologies to automatically generate domain-level inferences. We successfully applied these methods to information models in the HIV and radiological imaging domains.
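As a rough illustration of one step in such a workflow, the sketch below maps XML element tags to OWL classes with rdflib, treating XML nesting as class specialization. Real information models require much richer, hand-curated mapping rules; the namespace and the nesting-as-subclass policy here are simplifying assumptions.

```python
# Sketch: map each XML element tag to an OWL class; treat nesting as
# rdfs:subClassOf. A naive policy used only to show the transformation shape.
import xml.etree.ElementTree as ET
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/model#")  # hypothetical ontology namespace

def xml_to_owl(xml_text: str) -> Graph:
    g = Graph()
    g.bind("owl", OWL)
    root = ET.fromstring(xml_text)

    def visit(elem, parent_class=None):
        cls = EX[elem.tag]
        g.add((cls, RDF.type, OWL.Class))
        if parent_class is not None:
            # Naive policy: treat XML containment as class specialization.
            g.add((cls, RDFS.subClassOf, parent_class))
        for child in elem:
            visit(child, cls)

    visit(root)
    return g

g = xml_to_owl("<Patient><Demographics><Age/></Demographics></Patient>")
print(g.serialize(format="turtle"))
```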
Subjects
Medical Informatics/methods, Programming Languages, Semantics, Humans, Information Dissemination/methods, Internet
ABSTRACT
Metadata, the machine-readable descriptions of the data, are increasingly seen as crucial for describing the vast array of biomedical datasets currently being deposited in public repositories. While most public repositories firmly require that metadata accompany submitted datasets, the quality of those metadata is generally very poor. A key problem is that the typical metadata acquisition process is onerous and time consuming, with little interactive guidance or assistance provided to users. Secondary problems include the lack of validation and the sparse use of standardized terms or ontologies when authoring metadata. There is a pressing need for improvements to the metadata acquisition process that help users enter metadata quickly and accurately. In this paper, we outline a recommendation system for metadata that aims to address this challenge. Our approach uses association rule mining to uncover hidden associations among metadata values and to represent them in the form of association rules. These rules are then used to present users with real-time recommendations while they author metadata. The novelties of our method are that it can combine analyses of metadata from multiple repositories when generating recommendations and can enhance those recommendations by aligning them with ontology terms. We implemented our approach as a service integrated into the CEDAR Workbench metadata authoring platform, and evaluated it using metadata from two public biomedical repositories: the US-based National Center for Biotechnology Information BioSample and the European Bioinformatics Institute BioSamples. The results show that our approach can use analyses of previously entered metadata, coupled with ontology-based mappings, to present users with accurate recommendations when authoring metadata.
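The following toy example conveys the rule-mining idea: count how often pairs of field values co-occur in previously submitted records, keep the pairs that exceed support and confidence thresholds, and surface them as suggestions. The records and thresholds are invented, and a production system would use a proper algorithm such as Apriori or FP-growth over far larger corpora.

```python
# Toy association-rule miner over prior metadata records.
from collections import Counter
from itertools import permutations

records = [
    {"organism": "Homo sapiens", "tissue": "liver"},
    {"organism": "Homo sapiens", "tissue": "liver"},
    {"organism": "Homo sapiens", "tissue": "blood"},
    {"organism": "Mus musculus", "tissue": "liver"},
]

MIN_SUPPORT, MIN_CONFIDENCE = 2, 0.6
item_counts, pair_counts = Counter(), Counter()
for rec in records:
    items = list(rec.items())             # (field, value) pairs
    item_counts.update(items)
    pair_counts.update(permutations(items, 2))   # ordered co-occurrences

rules = {}
for (lhs, rhs), n in pair_counts.items():
    confidence = n / item_counts[lhs]
    if n >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
        rules.setdefault(lhs, []).append((rhs, confidence))

# When the user enters organism = Homo sapiens, recommend tissue values:
for (field, value), conf in sorted(rules[("organism", "Homo sapiens")],
                                   key=lambda r: -r[1]):
    print(f"suggest {field} = {value}  (confidence {conf:.2f})")
```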
Subjects
Data Mining/methods, Data Mining/standards, Databases, Factual/standards, Metadata, Computational Biology/standards
ABSTRACT
Developing promising treatments in biomedicine often requires aggregating and analyzing data from disparate sources across the healthcare and research spectrum. To facilitate such approaches, there is a growing focus on supporting the interoperation of datasets by standardizing data-capture and reporting requirements. Common Data Elements (CDEs), precise specifications of questions and the sets of allowable answers to each question, are increasingly being adopted to help meet these standardization goals. While CDEs can provide a strong conceptual foundation for interoperation, there are no widely recognized serialization or interchange formats for describing and exchanging their definitions. As a result, CDEs defined in one system cannot easily be reused by other systems. An additional problem is that current CDE-based systems tend to be rather heavyweight and cannot easily be adopted and used by third parties. To address these problems, we developed extensions to a metadata management system called the CEDAR Workbench to provide a platform that simplifies the creation, exchange, and use of CDEs. We show how the resulting system allows users to quickly define and share CDEs and to immediately use these CDEs to build and deploy Web-based forms that acquire conforming metadata. We also show how we incorporated a large CDE library from the National Cancer Institute's caDSR system and made these CDEs publicly available for general use.
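A minimal serialization of a CDE, the kind of lightweight interchange structure the paper argues for, can be sketched as follows; the field names are illustrative assumptions rather than an actual caDSR or CEDAR format.

```python
# Sketch: a Common Data Element as a precise question plus allowable answers.
from dataclasses import dataclass, field

@dataclass
class CommonDataElement:
    public_id: str
    question: str
    datatype: str = "string"
    permissible_values: list[str] = field(default_factory=list)

    def validate(self, answer: str) -> bool:
        """An answer conforms if it is one of the permissible values
        (or any value, when no value list is given)."""
        return not self.permissible_values or answer in self.permissible_values

smoking = CommonDataElement(
    public_id="CDE-0001",
    question="What is the patient's smoking status?",
    permissible_values=["Current smoker", "Former smoker", "Never smoker"],
)
print(smoking.validate("Former smoker"))   # True
print(smoking.validate("Occasionally"))    # False
```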
Subjects
Biomedical Research, Common Data Elements, Data Collection/standards, Data Management/methods, Common Data Elements/standards, Data Management/standards, Humans, Internet, Metadata, National Institutes of Health (U.S.), Registries, United States, User-Computer Interface
ABSTRACT
The adaptation of high-throughput sequencing to the B cell receptor and T cell receptor has made it possible to characterize the adaptive immune receptor repertoire (AIRR) at unprecedented depth. These AIRR sequencing (AIRR-seq) studies offer tremendous potential to increase our understanding of adaptive immune responses in vaccinology, infectious disease, autoimmunity, and cancer. The increasingly wide application of AIRR-seq is producing a critical mass of studies in the public domain, offering the possibility of novel scientific insights through secondary analyses and meta-analyses. However, effective sharing of these large-scale data remains a challenge. The AIRR Community has proposed minimal information about adaptive immune receptor repertoire (MiAIRR), a standard for reporting AIRR-seq studies. The MiAIRR standard has been operationalized using the National Center for Biotechnology Information (NCBI) repositories. Submissions of AIRR-seq data to the NCBI repositories typically use a combination of web-based and flat-file templates and include only a minimal amount of terminology validation. As a result, AIRR-seq studies at the NCBI are often described using inconsistent terminologies, limiting scientists' ability to find, access, interoperate with, and reuse the data sets. To improve metadata quality and ease the submission of AIRR-seq studies to the NCBI, we have leveraged the software framework developed by the Center for Expanded Data Annotation and Retrieval (CEDAR), which builds technologies that use data standards and ontologies to improve metadata quality. The resulting CEDAR-AIRR (CAIRR) pipeline enables data submitters to: (i) create web-based templates whose entries are controlled by ontology terms, (ii) generate and validate metadata, and (iii) submit the ontology-linked metadata and sequence files (FASTQ) to the NCBI BioProject, BioSample, and Sequence Read Archive databases. Overall, CAIRR provides a web-based metadata submission interface that supports compliance with the MiAIRR standard. The pipeline is available at http://cairr.miairr.org; it will facilitate the NCBI submission process and improve the metadata quality of AIRR-seq studies.
Subjects
Computational Biology/methods, Databases, Nucleic Acid, Receptors, Antigen, B-Cell/genetics, Receptors, Antigen, T-Cell/genetics, Software, Computational Biology/organization & administration, Data Mining, Gene Ontology, Humans, Metadata, Reproducibility of Results, User-Computer Interface, Workflow
ABSTRACT
Managing time-stamped data is essential to clinical research activities and often requires the use of considerable domain knowledge. Adequately representing this domain knowledge is difficult in relational database systems. As a result, there is a need for principled methods to overcome the disconnect between the database representation of time-oriented research data and the corresponding knowledge of domain-relevant concepts. In this paper, we present a set of methodologies for undertaking knowledge-level querying of temporal patterns, and discuss their application to the verification of temporal constraints in clinical-trial applications. Our approach allows knowledge generated from query results to be tied to the data and, if necessary, used for further inference. We show how the Semantic Web ontology and rule languages, OWL and SWRL, respectively, can support the temporal knowledge model needed to integrate low-level representations of relational data with the high-level domain concepts used in research data management. We present a scalable, bridge-based software architecture that uses this knowledge model to enable dynamic querying of time-oriented research data.
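The flavor of such knowledge-level constraint checking can be conveyed without OWL or SWRL. The plain-Python sketch below evaluates one protocol constraint, that each drug dose must be followed by a lab test within 14 days, against relational-style rows; the table layout and the 14-day window are invented for illustration.

```python
# Sketch: verify a knowledge-level temporal constraint against relational rows.
from datetime import date, timedelta

doses = [("p1", date(2024, 1, 1)), ("p1", date(2024, 2, 1))]
labs  = [("p1", date(2024, 1, 10))]

def violations(doses, labs, window=timedelta(days=14)):
    """Yield doses with no lab test inside the allowed follow-up window."""
    for patient, dosed in doses:
        ok = any(p == patient and dosed <= drawn <= dosed + window
                 for p, drawn in labs)
        if not ok:
            yield patient, dosed

for patient, dosed in violations(doses, labs):
    print(f"constraint violated: {patient} dose on {dosed} has no lab within 14 days")
```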
Subjects
Databases as Topic, Information Storage and Retrieval, Software, Biomedical Research, Knowledge Bases, Semantics, Time, Vocabulary, Controlled
ABSTRACT
BACKGROUND: Ontologies and controlled terminologies have become increasingly important in biomedical research. Researchers use ontologies to annotate their data with ontology terms, enabling better data integration and interoperability across disparate datasets. However, the number, variety, and complexity of current biomedical ontologies make it cumbersome for researchers to determine which ones to reuse for their specific needs. To overcome this problem, in 2010 the National Center for Biomedical Ontology (NCBO) released the Ontology Recommender, a service that receives a biomedical text corpus or a list of keywords and suggests ontologies appropriate for referencing the indicated terms. METHODS: We developed a new version of the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a novel recommendation approach that evaluates the relevance of an ontology to biomedical text data according to four criteria: (1) the extent to which the ontology covers the input data; (2) the acceptance of the ontology in the biomedical community; (3) the level of detail of the ontology classes that cover the input data; and (4) the specialization of the ontology to the domain of the input data. RESULTS: Our evaluation shows that the enhanced recommender provides higher-quality suggestions than the original approach: better coverage of the input data, more detailed information about their concepts, increased specialization for the domain of the input data, and greater acceptance and use in the community. In addition, it provides users with more explanatory information, along with suggestions of not only individual ontologies but also groups of ontologies to use together. It can also be customized to fit the needs of different ontology recommendation scenarios. CONCLUSIONS: Ontology Recommender 2.0 suggests relevant ontologies for annotating biomedical text data. It combines the strengths of its predecessor with a range of adjustments and new features that improve its reliability and usefulness. Ontology Recommender 2.0 recommends over 500 biomedical ontologies from the NCBO BioPortal platform, where it is openly available, both via the user interface at http://bioportal.bioontology.org/recommender and via a Web service API.
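Schematically, the four criteria can be combined into a single ranking score as a weighted sum, as the sketch below shows. The per-criterion scores for the candidate ontologies are invented, and the weights merely reflect the service's emphasis on coverage; the actual recommender derives each criterion from the input text and BioPortal metadata.

```python
# Sketch: rank candidate ontologies by a weighted sum of four criteria.
WEIGHTS = {"coverage": 0.55, "acceptance": 0.15, "detail": 0.15, "specialization": 0.15}

candidates = {
    # ontology acronym -> per-criterion scores, each normalized to [0, 1]
    "NCIT":     {"coverage": 0.80, "acceptance": 0.90, "detail": 0.60, "specialization": 0.40},
    "DOID":     {"coverage": 0.70, "acceptance": 0.70, "detail": 0.80, "specialization": 0.90},
    "SNOMEDCT": {"coverage": 0.85, "acceptance": 0.95, "detail": 0.50, "specialization": 0.30},
}

def score(criteria: dict) -> float:
    return sum(WEIGHTS[c] * criteria[c] for c in WEIGHTS)

for acronym, crit in sorted(candidates.items(), key=lambda kv: -score(kv[1])):
    print(f"{acronym:10s} {score(crit):.3f}")
```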
Subjects
Biological Ontologies, National Institutes of Health (U.S.), Semantics, United States
ABSTRACT
The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed, the CEDAR Workbench, is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration into scientific applications and reusability on the Web. Users can leverage our APIs to validate and submit metadata to external repositories. The CEDAR Workbench is freely available and open source.
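Programmatic authoring against the Workbench looks roughly like the following sketch. The endpoint path and the apiKey authorization header follow CEDAR's published REST conventions as we understand them, but the exact URL, header format, and minimal instance body should all be treated as assumptions to verify against the current API documentation.

```python
# Hedged sketch: POST a template instance to the CEDAR Workbench REST API.
import requests

CEDAR_API = "https://resource.metadatacenter.org/template-instances"
headers = {
    "Authorization": "apiKey YOUR_CEDAR_API_KEY",
    "Content-Type": "application/json",
}
instance = {
    "schema:isBasedOn": "https://repo.metadatacenter.org/templates/<template-id>",
    "schema:name": "Example metadata instance",
    # ...template-specific fields, each optionally bound to an ontology term...
}
resp = requests.post(CEDAR_API, json=instance, headers=headers)
print(resp.status_code, resp.json().get("@id"))
```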
ABSTRACT
In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error-prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools that speed up the metadata acquisition process and increase the quality of the metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository.
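The essence of value recommendation can be shown in a few lines: rank candidate values for a target field by how often they appeared in prior records that match the fields the user has already filled in, falling back to overall frequency when no context matches. The records below are invented, and the real framework additionally normalizes values against ontology terms.

```python
# Sketch: context-sensitive value recommendation from prior metadata.
from collections import Counter

history = [
    {"organism": "Homo sapiens", "tissue": "liver"},
    {"organism": "Homo sapiens", "tissue": "liver"},
    {"organism": "Homo sapiens", "tissue": "blood"},
    {"organism": "Mus musculus", "tissue": "brain"},
]

def recommend(target: str, context: dict, records: list[dict]) -> list[tuple[str, float]]:
    matching = [r for r in records
                if all(r.get(f) == v for f, v in context.items()) and target in r]
    pool = matching or [r for r in records if target in r]  # frequency fallback
    counts = Counter(r[target] for r in pool)
    total = sum(counts.values())
    return [(value, n / total) for value, n in counts.most_common()]

print(recommend("tissue", {"organism": "Homo sapiens"}, history))
# -> [('liver', 0.67), ('blood', 0.33)], approximately
```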
Subjects
Metadata, Biological Ontologies, Biomedical Research, Data Accuracy, Data Analysis, Metadata/standards, Methods
ABSTRACT
The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical datasets to facilitate data discovery, data interpretation, and data reuse. We take advantage of emerging community-based standard templates for describing different kinds of biomedical datasets, and we investigate the use of computational techniques to help investigators to assemble templates and to fill in their values. We are creating a repository of metadata from which we plan to identify metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the problems of metadata authoring and management that will generalize to other data-management environments.
Subjects
Biomedical Research, Data Mining, Datasets as Topic, Biological Ontologies, Humans, Information Storage and Retrieval, United States
ABSTRACT
Information technology can support the implementation of clinical research findings in practice settings. Technology can address the quality gap in health care by providing automated decision support to clinicians that integrates guideline knowledge with electronic patient data to present real-time, patient-specific recommendations. However, technical success in implementing decision support systems may not translate directly into system use by clinicians. Successful technology integration into clinical work settings requires explicit attention to the organizational context. We describe the application of a "sociotechnical" approach to the integration of ATHENA DSS, a decision support system for the treatment of hypertension, into geographically dispersed primary care clinics. We applied an iterative technical design in response to organizational input and obtained ongoing endorsement of the project from the organization's administrative and clinical leadership. Conscious attention to the organizational context during development, deployment, and maintenance of the system was associated with extensive clinician use of the system.
Subjects
Academic Medical Centers/organization & administration, Decision Support Systems, Clinical/organization & administration, Hypertension/therapy, Therapy, Computer-Assisted, Artificial Intelligence, Humans, Medical Records Systems, Computerized, Organizational Innovation, Systems Integration, User-Computer Interface
ABSTRACT
There are two key challenges hindering the effective use of quantitative imaging assessment in cancer response evaluation: (1) radiologists usually describe cancer lesions in imaging studies subjectively and sometimes ambiguously, and (2) it is difficult to repurpose imaging data, because lesion measurements are not recorded in a format that permits machine interpretation and interoperability. We have developed a freely available software platform based on open standards, the electronic Physician Annotation Device (ePAD), to tackle these challenges in two ways. First, ePAD facilitates the radiologist in carrying out cancer lesion measurements as part of the routine clinical trial image interpretation workflow. Second, ePAD records all image measurements and annotations in a data format that permits repurposing image data for analyses of alternative imaging biomarkers of treatment response. To determine the impact of ePAD on radiologist efficiency in the quantitative assessment of imaging studies, a radiologist evaluated computed tomography (CT) imaging studies from 20 subjects, each with one baseline and three consecutive follow-up imaging studies, with and without ePAD. The radiologist measured target lesions in each imaging study using the Response Evaluation Criteria in Solid Tumors (RECIST) 1.1, initially with the aid of ePAD; after a 30-day washout period, the exams were reread without ePAD. The mean total time required to review the images and summarize measurements of target lesions was 15% shorter (P < .039) using ePAD than without the tool. In addition, it was possible to rapidly reanalyze the images to explore lesion cross-sectional area as an alternative imaging biomarker to the linear measure. We conclude that ePAD shows promise for improving reader efficiency in the quantitative assessment of CT examinations, and it may enable discovery of novel image-based biomarkers of cancer treatment response.
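The two quantities compared in the study are straightforward to compute once measurements are machine-readable, which is precisely what ePAD's annotation format enables. The sketch below derives a RECIST 1.1 response category from sums of target-lesion longest diameters (simplified to compare against baseline, where RECIST proper uses the nadir for progression) and approximates cross-sectional area as an ellipse; the lesion measurements are invented.

```python
# Sketch: RECIST 1.1 category from diameter sums, plus an area-based biomarker.
import math

def recist_category(baseline_sum_mm: float, followup_sum_mm: float) -> str:
    change = (followup_sum_mm - baseline_sum_mm) / baseline_sum_mm
    if followup_sum_mm == 0:
        return "complete response"
    if change <= -0.30:                       # >= 30% decrease
        return "partial response"
    if change >= 0.20 and followup_sum_mm - baseline_sum_mm >= 5:
        return "progressive disease"          # >= 20% and >= 5 mm increase
    return "stable disease"

def ellipse_area_mm2(long_axis_mm: float, short_axis_mm: float) -> float:
    """Approximate lesion cross-sectional area from two orthogonal diameters."""
    return math.pi * (long_axis_mm / 2) * (short_axis_mm / 2)

baseline = [42.0, 18.0, 25.0]   # longest diameters of target lesions (mm)
followup = [30.0, 12.0, 15.0]
print(recist_category(sum(baseline), sum(followup)))   # partial response
print(round(ellipse_area_mm2(30.0, 22.0), 1), "mm^2")
```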
ABSTRACT
BACKGROUND: A variety of informatics approaches have been developed that use information retrieval, NLP, and text-mining techniques to identify biomedical concepts and relations within scientific publications or their sentences. These approaches have not typically addressed the challenge of extracting more complex knowledge, such as biomedical definitions. In our efforts to facilitate knowledge acquisition of rule-based definitions of autism phenotypes, we have developed a novel semantic-based text-mining approach that can automatically identify such definitions within text. RESULTS: Using an existing knowledge base of 156 autism phenotype definitions and an annotated corpus of 26 source articles containing such definitions, we evaluated and compared the average rank of the correctly identified rule definition or corresponding rule template using both our semantic-based approach and a standard term-based approach. We examined three separate scenarios: (1) the snippet of text contained a definition already in the knowledge base; (2) the snippet contained an alternative definition for a concept in the knowledge base; and (3) the snippet contained a definition not in the knowledge base. Our semantic-based approach achieved a better (numerically lower) average rank than the term-based approach in each of the three scenarios (scenario 1: 3.8 vs. 5.0; scenario 2: 2.8 vs. 4.9; and scenario 3: 4.5 vs. 6.2), with each comparison significant at p < 0.05 using the Wilcoxon signed-rank test. CONCLUSIONS: Our work shows that leveraging existing domain knowledge in the information extraction of biomedical definitions significantly improves the correct identification of such knowledge within sentences. Our method can thus help researchers rapidly acquire knowledge about biomedical definitions that are specified and evolving within an ever-growing corpus of scientific publications.
ABSTRACT
Biomedical ontologies are increasingly being used to improve information retrieval methods. In this paper, we present a novel information retrieval approach that exploits knowledge specified in the Semantic Web ontology and rule languages OWL and SWRL. We evaluate our approach using an autism ontology that has 156 SWRL rules defining 145 autism phenotypes. Our approach uses a vector space model to correlate how well these phenotypes relate to the publications used to define them. We compare a vector space phenotype representation using class hierarchies with one that extends this method to incorporate the additional semantics encoded in SWRL rules. On a PubMed-extracted corpus of 75 articles, we show that the average rank of a related paper is 4.6 using the class hierarchy method, whereas it is 3.3 using the extended rule-based method. Our results indicate that incorporating rule-based definitions into information retrieval methods can improve the search for relevant publications.
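The base vector space machinery is standard, and a condensed analogue of the ranking evaluation appears below: phenotype definitions and publications become TF-IDF vectors, publications are scored by cosine similarity, and the rank of the known related paper is reported. The three-document corpus is invented, and the paper's actual contribution, expanding the phenotype representation with semantics from SWRL rules, is not reproduced here.

```python
# Sketch: rank publications against a phenotype description by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

phenotype = "restricted repetitive behavior with insistence on sameness"
papers = [
    "longitudinal study of language acquisition in toddlers",
    "repetitive behaviors and insistence on sameness in autism",
    "a genome-wide association study of autism risk loci",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform([phenotype] + papers)       # row 0 is the phenotype
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

ranking = sorted(range(len(papers)), key=lambda i: -scores[i])
for rank, i in enumerate(ranking, start=1):
    print(f"rank {rank}: score {scores[i]:.3f}  {papers[i]}")
```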
Subjects
Autistic Disorder, Information Storage and Retrieval/methods, Vocabulary, Controlled, Humans, Phenotype, Semantics
ABSTRACT
Identifying, tracking and reasoning about tumor lesions is a central task in cancer research and clinical practice that could potentially be automated. However, information about tumor lesions in imaging studies is not easily accessed by machines for automated reasoning. The Annotation and Image Markup (AIM) information model recently developed for the cancer Biomedical Informatics Grid provides a method for encoding the semantic information related to imaging findings, enabling their storage and transfer. However, it is currently not possible to apply automated reasoning methods to image information encoded in AIM. We have developed a methodology and a suite of tools for transforming AIM image annotations into OWL, and an ontology for reasoning with the resulting image annotations for tumor lesion assessment. Our methods enable automated inference of semantic information about cancer lesions in images.
Subjects
Image Interpretation, Computer-Assisted, Natural Language Processing, Neoplasms/diagnosis, Programming Languages, Clinical Trials as Topic, Diagnostic Imaging, Humans, Semantics
ABSTRACT
Worldwide developments concerning infectious diseases and bioterrorism are driving forces for improving aberrancy detection in public health surveillance. The performance of an aberrancy detection algorithm can be measured in terms of sensitivity, specificity, and timeliness. However, these metrics are probabilistically dependent variables, and there is always a trade-off between them. This situation raises the question of how to quantify this trade-off. The answer depends on the characteristics of the specific disease under surveillance, the characteristics of the data used for surveillance, and the algorithmic properties of the detection methods. In practice, the evidence describing the relative performance of different algorithms remains fragmented and mainly qualitative. In this paper, we present the development and evaluation of a Bayesian network framework for analyzing the performance measures of aberrancy detection algorithms. This framework enables principled comparison of algorithms and identification of suitable algorithms for use in specific public health surveillance settings.
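The three interdependent measures are simple to compute for a single algorithm run against labeled surveillance data, as the sketch below shows; the alarm days and outbreak windows are invented. It is quantifying the trade-off among these measures across diseases, data sources, and algorithms that motivates the Bayesian network framework.

```python
# Sketch: sensitivity, specificity, and timeliness of one detector's alarms.
def evaluate(alarm_days, outbreak_windows, n_days):
    alarm_days = set(alarm_days)
    outbreak_days = {d for start, end in outbreak_windows for d in range(start, end + 1)}

    detected, delays = 0, []
    for start, end in outbreak_windows:
        hits = sorted(alarm_days & set(range(start, end + 1)))
        if hits:
            detected += 1
            delays.append(hits[0] - start)   # days from onset to first alarm

    quiet_days = set(range(n_days)) - outbreak_days
    false_alarms = alarm_days - outbreak_days
    return {
        "sensitivity": detected / len(outbreak_windows),
        "specificity": 1 - len(false_alarms) / len(quiet_days),
        "timeliness_days": sum(delays) / len(delays) if delays else None,
    }

print(evaluate(alarm_days=[12, 40, 41, 77], outbreak_windows=[(10, 20), (75, 85)], n_days=100))
# -> sensitivity 1.0, specificity ~0.974, timeliness 2.0 days
```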
Subjects
Algorithms, Bayes Theorem, Disease Outbreaks, Population Surveillance/methods, Communicable Diseases/diagnosis, Communicable Diseases/epidemiology, Humans, Public Health Informatics
ABSTRACT
Problem solving methods (PSMs) are software components that represent and encode reusable algorithms. They can be combined with representations of domain knowledge to produce intelligent application systems. A goal of research on PSMs is to provide principled methods and tools for composing and reusing algorithms in knowledge-based systems. The ultimate objective is to produce libraries of methods that can be easily adapted for use in these systems. Despite the intuitive appeal of PSMs as conceptual building blocks, in practice these goals are largely unmet. There are no widely available tools for building applications using PSMs and no public libraries of PSMs available for reuse. This paper analyzes some of the reasons for the lack of widespread adoption of PSM techniques and illustrates our analysis by describing our experiences developing a complex, high-throughput software system based on PSM principles. We conclude that many fundamental principles in PSM research are useful for building knowledge-based systems. In particular, the task-method decomposition process, which provides a means for structuring knowledge-based tasks, is a powerful abstraction for building systems of analytic methods. However, despite the power of PSMs in the conceptual modeling of knowledge-based systems, the software engineering challenges have been seriously underestimated. The complexity of integrating control knowledge modeled by developers using PSMs with the domain knowledge that they model using ontologies creates a barrier to the widespread use of PSM-based systems. Nevertheless, the recent surge of interest in ontologies has led to the production of comprehensive domain ontologies and robust ontology-authoring tools. These developments present new opportunities to leverage the PSM approach.
ABSTRACT
Managing time-stamped data is essential to clinical research activities and often requires the use of considerable domain knowledge. Adequately representing and integrating temporal data and domain knowledge is difficult with the database technologies used in most clinical research systems. There is often a disconnect between the database representation of research data and corresponding domain knowledge of clinical research concepts. In this paper, we present a set of methodologies for undertaking ontology-based specification of temporal information, and discuss their application to the verification of protocol-specific temporal constraints among clinical trial activities. Our approach allows knowledge-level temporal constraints to be evaluated against operational trial data stored in relational databases. We show how the Semantic Web ontology and rule languages OWL and SWRL, respectively, can support tools for research data management that automatically integrate low-level representations of relational data with high-level domain concepts used in study design.
Subjects
Clinical Trials as Topic, Systems Integration, Models, Theoretical
ABSTRACT
Many biomedical research databases contain time-oriented data resulting from longitudinal, time-series, and time-dependent study designs; knowledge of such designs is not handled explicitly by most data-analytic methods. To make use of this knowledge about research data, we have developed an ontology-driven temporal mining method called ChronoMiner. Most mining algorithms require that data be provided in a single table. ChronoMiner, in contrast, can search for interesting temporal patterns across multiple input tables and at different levels of hierarchical representation. In this paper, we present the application of our method to the discovery of temporal associations between newly arising mutations in the HIV genome and past drug regimens. We discuss the various components of ChronoMiner, including its user interface, and report the results of a study indicating the efficiency and potential value of ChronoMiner on an existing HIV drug resistance data repository.
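The kind of multi-table temporal pattern ChronoMiner searches for can be illustrated with a toy example: counting patients in whom a given mutation first appears within a fixed window after a given drug regimen begins. The patient timelines, the 180-day window, and the drug-mutation pairings below are invented for illustration.

```python
# Sketch: a temporal association between drug exposure and new mutations,
# computed across two relational-style tables keyed by patient.
from datetime import date, timedelta

regimens = {  # patient -> list of (drug, start date)
    "p1": [("AZT", date(2003, 1, 1))],
    "p2": [("AZT", date(2003, 3, 1)), ("3TC", date(2003, 6, 1))],
}
mutations = {  # patient -> list of (mutation, date first observed)
    "p1": [("M184V", date(2003, 4, 15))],
    "p2": [("M184V", date(2003, 8, 20))],
}

def association_count(drug, mutation, window=timedelta(days=180)):
    """Count patients where `mutation` first appears within `window` of starting `drug`."""
    n = 0
    for patient, drugs in regimens.items():
        starts = [s for d, s in drugs if d == drug]
        hits = [m for m, seen in mutations.get(patient, [])
                if m == mutation and any(s <= seen <= s + window for s in starts)]
        if hits:
            n += 1
    return n

print(association_count("AZT", "M184V"))  # 2
print(association_count("3TC", "M184V"))  # 1
```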