Results 1 - 20 of 26
1.
Eur J Epidemiol ; 38(6): 605-615, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37099244

ABSTRACT

Data discovery, the ability to find datasets relevant to an analysis, increases scientific opportunity, improves rigour and accelerates activity. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets, is data harmonisation. A set of 124 variables, identified as being of broad interest to neurodegeneration, was harmonised using the C-Surv data model. The harmonisation strategies used were simple calibration, algorithmic transformation and standardisation to the Z-distribution. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts. Of the 120 variables that were found in the datasets, correspondence between the harmonised data schema and cohort-specific data models was complete or close for 111 (93%). For the remainder, harmonisation was possible with a marginal loss of granularity. Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work: extending harmonisation to a larger variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.
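One of the harmonisation strategies named above, standardisation to the Z-distribution, can be sketched in a few lines. This is a minimal illustration; the cohort name and values are invented, not drawn from the C-Surv model:

```python
# Minimal sketch: harmonising a numeric variable across cohorts by
# standardising it to the Z-distribution within each cohort.
from statistics import mean, stdev

def z_standardise(values):
    """Map raw measurements within one cohort to Z-scores."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

cohort_a = [4.1, 5.0, 6.2, 5.5]   # e.g. a raw cognitive test score (illustrative)
harmonised = z_standardise(cohort_a)
# After standardisation the cohort mean is ~0 and the SD is ~1,
# so values become comparable with other cohorts treated the same way.
```

The same transformation applied independently per cohort removes cohort-specific scale and offset, which is what makes cross-dataset queries meaningful.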


Subjects
Datasets as Topic, Knowledge Discovery, Humans, Reference Standards
2.
Hum Mutat ; 43(6): 791-799, 2022 06.
Article in English | MEDLINE | ID: mdl-35297548

ABSTRACT

Beacon is a basic data discovery protocol issued by the Global Alliance for Genomics and Health (GA4GH). The main goal addressed by version 1 of the Beacon protocol was to test the feasibility of broadly sharing human genomic data by providing simple "yes" or "no" responses to queries about the presence of a given variant in datasets hosted by Beacon providers. The popularity of this concept has fostered the design of version 2, which better serves real-world requirements and addresses the needs of clinical genomics research and healthcare, as assessed by several contributing projects and organizations. In particular, rare disease genetics and cancer research will benefit from new case-level and genomic-variant-level requests, the enabling of richer phenotype and clinical queries, and support for fuzzy searches. Beacon is designed as a "lingua franca" to bridge data collections hosted in software solutions with different and rich interfaces. Beacon version 2 works alongside popular standards like Phenopackets, OMOP, or FHIR, allowing implementing consortia to return matches in Beacon responses and provide a handover to their preferred data exchange format. The protocol is being explored by other research domains and is being tested in several international projects.
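The core of the version 1 idea described above — a boolean "is this variant present?" answer over a hosted dataset — can be illustrated with a toy in-memory collection. The tuple layout and values here are assumptions for illustration, not the GA4GH wire format:

```python
# Toy illustration of a Beacon v1-style presence query: the provider
# reveals only whether a variant exists, not the underlying records.
def beacon_query(dataset, chrom, pos, ref, alt):
    """Return True if any hosted record matches the queried variant."""
    return any(v == (chrom, pos, ref, alt) for v in dataset)

# Hypothetical hosted variants: (chromosome, position, ref allele, alt allele)
hosted = {("17", 41276045, "C", "T"), ("7", 117559590, "A", "G")}

print(beacon_query(hosted, "17", 41276045, "C", "T"))  # present -> True
print(beacon_query(hosted, "1", 100, "G", "A"))        # absent  -> False
```

Version 2 extends this yes/no core with case-level and variant-level responses and richer phenotype filters, which a real implementation would expose through the standard's defined endpoints.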


Subjects
Genomics, Information Dissemination, Humans, Information Dissemination/methods, Phenotype, Rare Diseases, Software
3.
J Biomed Inform ; 93: 103154, 2019 05.
Article in English | MEDLINE | ID: mdl-30922867

ABSTRACT

BACKGROUND: The global shift from paper health records to electronic ones has led to an impressive growth of biomedical digital data over the past two decades. Exploring and extracting knowledge from these data has the potential to enhance translational research and lead to positive outcomes for the population's health and healthcare. OBJECTIVE: The aim of this study was to conduct a systematic review to identify software platforms that enable discovery, secondary use and interoperability of biomedical data. Additionally, we aimed to evaluate the identified solutions in terms of clinical interest and main healthcare-related outcomes. METHODS: A systematic search of the scientific literature published and indexed in PubMed between January 2014 and September 2018 was performed. Inclusion criteria were as follows: relevance to the topic of biomedical data discovery, English language, and free full text. To increase recall, we developed a semi-automatic and incremental methodology to retrieve articles that cite one or more of the previous set. RESULTS: A total of 500 candidate papers were retrieved through this methodology. Of these, 85 were eligible for abstract assessment. Finally, 37 studies qualified for a full-text review, and 20 provided enough information for the study objectives. CONCLUSIONS: This study revealed that biomedical discovery platforms are both a current necessity and a significantly innovative agent in the area of healthcare. The outcomes that were identified, in terms of scientific publications, clinical studies and research collaborations, stand as evidence.


Subjects
Electronic Health Records, Translational Biomedical Research, Humans, Software
4.
BMC Med ; 15(1): 176, 2017 09 27.
Article in English | MEDLINE | ID: mdl-28950862

ABSTRACT

BACKGROUND: There are growing demands for predicting the prospects of achieving the global elimination of neglected tropical diseases as a result of the institution of large-scale nation-wide intervention programs by the WHO-set target year of 2020. Such predictions will be uncertain due to the impacts that spatial heterogeneity and scaling effects will have on parasite transmission processes, which will introduce significant aggregation errors into any attempt aiming to predict the outcomes of interventions at the broader spatial levels relevant to policy making. We describe a modeling platform that addresses this problem of upscaling from local settings to facilitate predictions at regional levels by the discovery and use of locality-specific transmission models, and we illustrate the utility of using this approach to evaluate the prospects for eliminating the vector-borne disease, lymphatic filariasis (LF), in sub-Saharan Africa by the WHO target year of 2020 using currently applied or newly proposed intervention strategies. METHODS AND RESULTS: We show how a computational platform that couples site-specific data discovery with model fitting and calibration can allow both learning of local LF transmission models and simulations of the impact of interventions that take a fuller account of the fine-scale heterogeneous transmission of this parasitic disease within endemic countries. We highlight how such a spatially hierarchical modeling tool that incorporates actual data regarding the roll-out of national drug treatment programs and spatial variability in infection patterns into the modeling process can produce more realistic predictions of timelines to LF elimination at coarse spatial scales, ranging from district to country to continental levels. 
Our results show that when locally applicable extinction thresholds are used, only three countries are likely to meet the goal of LF elimination by 2020 using currently applied mass drug treatments, and that switching to more intensive drug regimens, increasing the frequency of treatments, or switching to new triple drug regimens will be required if LF elimination is to be accelerated in Africa. The proportion of countries that would meet the goal of eliminating LF by 2020 may, however, reach up to 24/36 if the WHO 1% microfilaremia prevalence threshold is used and sequential mass drug deliveries are applied in countries. CONCLUSIONS: We have developed and applied a data-driven spatially hierarchical computational platform that uses the discovery of locally applicable transmission models in order to predict the prospects for eliminating the macroparasitic disease, LF, at the coarser country level in sub-Saharan Africa. We show that fine-scale spatial heterogeneity in local parasite transmission and extinction dynamics, as well as the exact nature of intervention roll-outs in countries, will impact the timelines to achieving national LF elimination on this continent.


Subjects
Lymphatic Filariasis/prevention & control, Sub-Saharan Africa/epidemiology, Lymphatic Filariasis/epidemiology, History, 21st Century, Humans, Prevalence
5.
Hum Mutat ; 36(10): 957-64, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26224250

ABSTRACT

Biomedical data sharing is desirable, but problematic. Data "discovery" approaches-which establish the existence rather than the substance of data-precisely connect data owners with data seekers, and thereby promote data sharing. Cafe Variome (http://www.cafevariome.org) was therefore designed to provide a general-purpose, Web-based, data discovery tool that can be quickly installed by any genotype-phenotype data owner, or network of data owners, to make safe or sensitive content appropriately discoverable. Data fields or content of any type can be accommodated, from simple ID and label fields through to extensive genotype and phenotype details based on ontologies. The system provides a "shop window" in front of data, with main interfaces being a simple search box and a powerful "query-builder" that enable very elaborate queries to be formulated. After a successful search, counts of records are reported grouped by "openAccess" (data may be directly accessed), "linkedAccess" (a source link is provided), and "restrictedAccess" (facilitated data requests and subsequent provision of approved records). An administrator interface provides a wide range of options for system configuration, enabling highly customized single-site or federated networks to be established. Current uses include rare disease data discovery, patient matchmaking, and a Beacon Web service.


Subjects
Bibliographic Databases, Information Dissemination/methods, Rare Diseases/genetics, Genetic Predisposition to Disease, Genotype, Humans, Phenotype, Software, User-Computer Interface, Web Browser
6.
Stud Health Technol Inform ; 316: 1689-1693, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176535

ABSTRACT

Multicentre studies have become possible with current strategies for solving interoperability problems between databases. As these strategies gained wide adoption, new problems regarding data discovery arose. Some were solved using database catalogues and graphical dashboards for data analysis and comparison. However, as these communities grow, those strategies become obsolete. In this work, we address these challenges by proposing a platform with a chatbot-like mechanism to help medical researchers identify databases of interest. The tool was developed using metadata extracted from OMOP CDM databases.


Subjects
Factual Databases, Humans, Metadata, Electronic Health Records
7.
J Public Health Dent ; 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38953657

ABSTRACT

BACKGROUND/OBJECTIVES: Effective use of longitudinal study data is challenging because of divergences in construct definitions and measurement approaches over time, between studies and across disciplines. One approach to overcoming these challenges is data harmonization, a practice used to improve variable comparability and reduce heterogeneity across studies. This study describes the process used to evaluate the harmonization potential of oral health-related variables across survey waves. METHODS: National child cohort surveys with similar themes/objectives conducted in the last two decades were selected. The Maelstrom Research Guidelines were followed for the evaluation of harmonization potential. RESULTS: Seven nationally representative child cohort surveys were included, and questionnaires from 50 survey waves were examined. Questionnaires were classified into three domains and fifteen constructs and summarized by age group. A DataSchema (a list of core variables representing the suitable version of the oral health outcomes and risk factors) comprising 42 variables was compiled. For each study wave, the potential (or not) to generate each DataSchema variable was evaluated. Of the 2100 harmonization status assessments, 543 (26%) were complete. Approximately 50% of the DataSchema variables can be generated across at least four cohort surveys, while only 10% (n = 4) can be generated across all surveys. For each survey, the proportion of DataSchema variables that can be generated ranged between 26% and 76%. CONCLUSION: Data harmonization can improve the comparability of variables both within and across surveys. For future cohort surveys, the authors advocate more consistency and standardization in survey questionnaires within and between surveys.

8.
J Am Med Inform Assoc ; 30(10): 1693-1700, 2023 09 25.
Article in English | MEDLINE | ID: mdl-37414539

ABSTRACT

OBJECTIVE: Researchers at New York University (NYU) Grossman School of Medicine contacted the Health Sciences Library for help with locating large datasets for reuse. In response, the library developed and maintained the NYU Data Catalog, a public-facing data catalog that has supported not only faculty acquisition of data but also the dissemination of the products of their research in various ways. MATERIALS AND METHODS: The current NYU Data Catalog is built upon the Symfony framework with a tailored metadata schema reflecting the scope of faculty research areas. The project team curates new resources, including datasets and supporting software code, and conducts quarterly and annual evaluations to assess user interactions with the NYU Data Catalog and opportunities for growth. RESULTS: Since its launch in 2015, the NYU Data Catalog underwent a number of changes prompted by an increase in the disciplines represented by faculty contributors. The catalog has also utilized faculty feedback to enhance support of data reuse and researcher collaboration through alterations to its schema, layout, and visibility of records. DISCUSSION: These findings demonstrate the flexibility of data catalogs as a platform for enabling the discovery of disparate sources of data. While not a repository, the NYU Data Catalog is well-positioned to support mandates for data sharing from study sponsors and publishers. CONCLUSION: The NYU Data Catalog makes the most of the data that researchers share and can be harnessed as a modular and adaptable platform to promote data sharing as a cultural practice.


Subjects
Medicine, Software, Humans, New York, Universities
9.
Earth Sci Inform ; 15(3): 1471-1480, 2022.
Article in English | MEDLINE | ID: mdl-36003899

ABSTRACT

NASA's Ice, Cloud, and land Elevation Satellite-2 (ICESat-2) carries a laser altimeter that fires 10,000 pulses per second towards Earth and records the travel time of individual photons to measure the elevation of the surface below. The volume of data produced by ICESat-2, nearly a TB per day, presents significant challenges for users wishing to efficiently explore the dataset. NASA's National Snow and Ice Data Center (NSIDC) Distributed Active Archive Center (DAAC), which is responsible for archiving and distributing ICESat-2 data, provides search and subsetting services on mission data products, but providing interactive data discovery and visualization tools needed to assess data coverage and quality in a given area of interest is outside of NSIDC's mandate. The OpenAltimetry project, a NASA-funded collaboration between NSIDC, UNAVCO and the University of California San Diego, has developed a web-based cyberinfrastructure platform that allows users to locate, visualize, and download ICESat-2 surface elevation data and photon clouds for any location on Earth, on demand. OpenAltimetry also provides access to elevations and waveforms for ICESat (the predecessor mission to ICESat-2). In addition, OpenAltimetry enables data access via APIs, opening opportunities for rapid access, experimentation, and computation via third party applications like Jupyter notebooks. OpenAltimetry emphasizes ease-of-use for new users and rapid access to entire altimetry datasets for experts and has been successful in meeting the needs of different user groups. In this paper we describe the principles that guided the design and development of the OpenAltimetry platform and provide a high-level overview of the cyberinfrastructure components of the system.
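The kind of area-of-interest subsetting the platform performs can be pictured as a bounding-box filter over photon records. The record layout below is an assumption for illustration, not the actual ICESat-2 product format or the OpenAltimetry API:

```python
# Illustrative only: select photon returns inside a latitude/longitude
# bounding box, the basic operation behind on-demand regional subsetting.
def subset_photons(photons, lat_min, lat_max, lon_min, lon_max):
    """Keep (lat, lon, elevation) records falling inside the box."""
    return [p for p in photons
            if lat_min <= p[0] <= lat_max and lon_min <= p[1] <= lon_max]

# Hypothetical photon records: (latitude, longitude, elevation in metres)
photons = [(69.2, -49.5, 812.4), (10.0, 5.0, 3.1), (69.4, -49.1, 815.0)]

print(subset_photons(photons, 69.0, 70.0, -50.0, -49.0))
```

At mission scale (nearly a TB per day) this filtering is done server-side against indexed granules rather than in memory, which is the service the platform provides.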

10.
Biol Psychiatry ; 91(8): 753-768, 2022 04 15.
Article in English | MEDLINE | ID: mdl-35027165

ABSTRACT

BACKGROUND: The functional significance and mechanisms determining the development and individual variability of structural brain asymmetry remain unclear. Here, we systematically analyzed all relevant components of the most prominent structural asymmetry, brain torque (BT), and their relationships with potential genetic and nongenetic modifiers in a sample comprising 24,112 individuals from six cohorts. METHODS: BT features, including petalia, bending, dorsoventral shift, brain tissue distribution asymmetries, and cortical surface positional asymmetries, were directly modeled using a set of automatic three-dimensional brain shape analysis approaches. Age-, sex-, and handedness-related effects on BT were assessed. The genetic architecture and phenomic associations of BT were investigated using genome- and phenome-wide association scans. RESULTS: Our results confirmed the population-level predominance of the typical counterclockwise torque and suggested a first attenuating, then enlarging dynamic across the life span (3-81 years) primarily for frontal, occipital, and perisylvian BT features. Sex/handedness, BT, and cognitive function of verbal-numerical reasoning were found to be interrelated statistically. We observed differential heritability of up to 56% for BT, especially in temporal language areas. Individual variations of BT were also associated with various phenotypic variables of neuroanatomy, cognition, lifestyle, sociodemographics, anthropometry, physical health, and adult and child mental health. Our genomic analyses identified a number of genetic associations at lenient significance levels, which need to be further validated using larger samples in the future. CONCLUSIONS: This study provides a comprehensive description of BT and insights into biological and other factors that may contribute to the development and individual variations of BT.


Subjects
Magnetic Resonance Imaging, Phenomics, Adult, Brain/diagnostic imaging, Brain Mapping, Child, Functional Laterality/genetics, Humans, Torque
11.
J Organ End User Comput ; 23(4): 17-30, 2011.
Article in English | MEDLINE | ID: mdl-24729759

ABSTRACT

In this paper, the authors present the results of a qualitative case-study seeking to characterize data discovery needs and barriers of principal investigators and research support staff in clinical translational science. Several implications for designing and implementing translational research systems have emerged through the authors' analysis. The results also illustrate the benefits of forming early partnerships with scientists to better understand their workflow processes and end-user computing practices in accessing data for research. The authors use this user-centered, iterative development approach to guide the implementation and extension of i2b2, a system they have adapted to support cross-institutional aggregate anonymized clinical data querying. With ongoing evaluation, the goal is to maximize the utility and extension of this system and develop an interface that appropriately fits the swiftly evolving needs of clinical translational scientists.

12.
Data Sci J ; 20: 2021.
Article in English | MEDLINE | ID: mdl-34795758

ABSTRACT

As a result of a number of national initiatives, we are seeing rapid growth in the data important to materials science that are available over the web. Consequently, it is becoming increasingly difficult for researchers to learn what data are available and how to access them. To address this problem, the Research Data Alliance (RDA) Working Group for International Materials Science Registries (IMRR) was established to bring together materials science and information technology experts to develop an international federation of registries that can be used for global discovery of data resources for materials science. A resource registry collects high-level metadata descriptions of resources such as data repositories, archives, websites, and services that are useful for data-driven research. By making the collection searchable, it aids scientists in industry, universities, and government laboratories in discovering data relevant to their research and work interests. We present the results of our successful piloting of a registry federation for materials science data discovery. In particular, we set out a blueprint for creating such a federation that is capable of amassing a global view of all available materials science data, and we enumerate the requirements for the standards that make the registries interoperable within the federation. These standards include a protocol for exchanging resource descriptions and a standard metadata schema for encoding those descriptions. We summarize how we leveraged an existing standard (OAI-PMH) for metadata exchange. Finally, we review the registry software developed to realize the federation and describe the user experience.
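The OAI-PMH exchange mentioned above can be sketched by constructing a harvest request. The base URL is a placeholder; the verb and arguments follow the OAI-PMH 2.0 specification, in which a `resumptionToken` must be the only argument besides the verb when paging through results:

```python
# Sketch of an OAI-PMH harvest-request URL of the kind a registry
# federation uses for metadata exchange between registries.
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL per OAI-PMH 2.0."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # resumptionToken is an exclusive argument in OAI-PMH.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return f"{base_url}?{urlencode(params)}"

print(list_records_url("https://registry.example.org/oai"))
# -> https://registry.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester would fetch this URL, parse the returned XML records, and re-issue the request with the `resumptionToken` from each response until the list is exhausted.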

13.
Patterns (N Y) ; 2(11): 100370, 2021 Nov 12.
Article in English | MEDLINE | ID: mdl-34820651

ABSTRACT

With a rising number of scientific datasets published and the need to test their Findable, Accessible, Interoperable, and Reusable (FAIR) compliance repeatedly, data stakeholders have recognized the importance of an automated FAIR assessment. This paper presents a programmatic solution for assessing the FAIRness of research data. We describe the translation of the FAIR data principles into measurable metrics and the application of the metrics in evaluating FAIR compliance of research data through an open-source tool we developed. For each metric, we conceptualized and implemented practical tests drawn upon prevailing data curation and sharing practices, and the paper discusses their rationales. We demonstrate the work by evaluating multidisciplinary datasets from trustworthy repositories, followed by recommendations and improvements. We believe our experience in developing and applying the metrics in practice and the lessons we learned from it will provide helpful information to others developing similar approaches to assess different types of digital objects and services.
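One practical test of the sort the paragraph describes — checking that a dataset's metadata record carries a resolvable persistent identifier (Findability) and a licence (Reusability) — might look like the sketch below. The record shape and pass criteria are assumptions for illustration, not the tool's actual implementation:

```python
# Hedged sketch of two automated FAIR checks over a metadata record.
def check_identifier(record):
    """F-style check: is the identifier a globally resolvable PID?"""
    ident = record.get("identifier", "")
    return ident.startswith(("https://doi.org/", "https://hdl.handle.net/"))

def check_licence(record):
    """R-style check: does the record declare a licence at all?"""
    return bool(record.get("license"))

record = {  # hypothetical metadata record
    "identifier": "https://doi.org/10.1234/abcd",
    "license": "CC-BY-4.0",
}
print(check_identifier(record), check_licence(record))  # True True
```

A full assessment tool would run a battery of such metric tests against live repository responses and aggregate the results into a FAIRness score.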

14.
Patterns (N Y) ; 2(3): 100210, 2021 Mar 12.
Article in English | MEDLINE | ID: mdl-33748794

ABSTRACT

The institutional review of interdisciplinary bodies of research lacks methods to systematically produce higher-level abstractions. Abstraction methods, like the "distant reading" of corpora, are increasingly important for knowledge discovery in the sciences and humanities. We demonstrate how abstraction methods complement the metrics on which research reviews currently rely. We model cross-disciplinary topics of research publications and projects emerging at multiple levels of detail in the context of an institutional review of the Earth Research Institute (ERI) at the University of California at Santa Barbara. From these, we design science maps that reveal the latent thematic structure of ERI's interdisciplinary research and enable reviewers to "read" a body of research at multiple levels of detail. We find that our approach provides decision support and reveals trends that strengthen the institutional review process by exposing regions of thematic expertise, distributions and clusters of work, and the evolution of these aspects.

15.
Stud Health Technol Inform ; 270: 317-321, 2020 Jun 16.
Article in English | MEDLINE | ID: mdl-32570398

ABSTRACT

Medical studies are usually time-consuming, cumbersome and extremely costly to perform, and for exploratory research, their results are also difficult to predict a priori. This is particularly the case for rare diseases, for which finding enough patients is difficult and usually requires international-scale research. In this case, the process can be even more difficult due to the heterogeneity of data-protection regulations, making the data sharing process particularly hard. In this short paper, we propose MedCo2 (pronounced MedCo square), a distributed system that streamlines the process of a medical study by bridging and enabling both data discovery and data analysis among multiple databases, while protecting data confidentiality and patients' privacy. MedCo2 relies on interactive protocols, homomorphic encryption and differential privacy. It enables the privacy-preserving computation of multiple statistics such as cosine similarity and variance, and the training of machine learning models, on patients that are obliviously selected according to specific criteria among multiple databases.
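The differential-privacy ingredient mentioned above can be sketched as Laplace noise added to an aggregate count before release. The epsilon and sensitivity values are illustrative, and this is only one ingredient — not MedCo2's actual protocol, which also involves homomorphic encryption and interactive protocols:

```python
# Minimal sketch: release a patient count with Laplace noise calibrated
# to a privacy budget epsilon, so no single patient's presence is revealed.
import math
import random

def laplace_noise(scale):
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon=1.0, sensitivity=1.0):
    """A count query has sensitivity 1: one patient changes it by at most 1."""
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(42)             # seeded only to make the demonstration reproducible
noisy = private_count(128)
# The released value sits near 128 while masking any one patient's
# presence or absence in the cohort.
```

Smaller epsilon means larger noise and stronger privacy; the scale `sensitivity / epsilon` is the standard Laplace-mechanism calibration.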


Subjects
Privacy, Cohort Studies, Computer Security, Confidentiality, Humans, Machine Learning
16.
Gigascience ; 9(2)2020 02 01.
Article in English | MEDLINE | ID: mdl-32031623

ABSTRACT

BACKGROUND: Data reuse is often controlled to protect the privacy of subjects and patients. Data discovery tools need ways to inform researchers about restrictions on data access and re-use. RESULTS: We present elements in the Data Tags Suite (DATS) metadata schema describing data access, data use conditions, and consent information. DATS metadata are explained in terms of the administrative, legal, and technical systems used to protect confidential data. CONCLUSIONS: The access and use metadata items in DATS are designed from the perspective of a researcher who wants to find and re-use existing data. We call for standard ways of describing informed consent and data use agreements that will enable automated systems for managing research data.
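The access and use elements described above might be pictured with a simplified record like the one below. The field names are illustrative stand-ins, not the exact DATS schema:

```python
# Simplified stand-in for the access/use metadata a DATS-style record
# carries, and one way a discovery tool could act on it.
dataset_meta = {
    "title": "Example cohort study",          # hypothetical dataset
    "access": {
        "landingPage": "https://repo.example.org/ds/42",
        "authorizations": ["controlled"],     # e.g. "public" or "controlled"
    },
    "useConditions": ["no-commercial-use", "IRB-approval-required"],
    "consentInformation": "broad consent for health research",
}

def requires_request(meta):
    """Flag controlled-access data so researchers know before download."""
    return "controlled" in meta["access"]["authorizations"]

print(requires_request(dataset_meta))  # True
```

Standardizing such fields, as the abstract argues, is what would let systems automate the handling of consent terms and data use agreements rather than leaving them to manual review.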


Subjects
Data Management/methods, Computer Security/standards, Data Management/standards, Data Mining/methods, Data Mining/standards, Metadata
17.
Int J Med Inform ; 126: 35-45, 2019 06.
Article in English | MEDLINE | ID: mdl-31029262

ABSTRACT

OBJECTIVE: Collaboration and knowledge exchange between researchers are often hindered by the lack of accurate information about which databases may support research studies. Even though a considerable amount of patient health information does exist, it is usually distributed and hidden in many institutions. The goal of this project is to provide, for any research community, a holistic view of biomedical datasets of interest, from which researchers can explore several distinct levels of granularity. METHODS: We developed a community-centered approach to facilitate data sharing while ensuring privacy. A dynamic schema allows exposing any metadata model about existing repositories. The framework was developed following a modular plugin-based architecture that facilitates the integration of internal and external tools. RESULTS: We built the EMIF Catalogue, a web platform for sharing and reusing biomedical data. Through this system, data custodians can publish and share different levels of information, while researchers can search for databases that fulfill research requirements. CONCLUSIONS: The EMIF Catalogue currently fosters several distinct research communities, with different levels of data governance, combining, for instance, data available in pan-European EHR and Alzheimer cohorts. This portal is publicly available at https://emif-catalogue.eu.


Subjects
Biomedical Research, Cooperative Behavior, Database Management Systems, Information Dissemination, Humans, Knowledge Management, Publishing
18.
J Am Med Inform Assoc ; 25(1): 13-16, 2018 01 01.
Article in English | MEDLINE | ID: mdl-29228196

ABSTRACT

The DAta Tag Suite (DATS) is a model supporting dataset description, indexing, and discovery. It is available as an annotated serialization with schema.org, a vocabulary used by major search engines, thus making the datasets discoverable on the web. DATS underlies DataMed, the National Institutes of Health Big Data to Knowledge Data Discovery Index prototype, which aims to provide a "PubMed for datasets." The experience gained while indexing a heterogeneous range of >60 repositories in DataMed helped in evaluating DATS's entities, attributes, and scope. In this work, 3 additional exemplary and diverse data sources were mapped to DATS by their representatives or experts, offering a deep scan of DATS fitness against a new set of existing data. The procedure, including feedback from users and implementers, resulted in DATS implementation guidelines and best practices, and identification of a path for evolving and optimizing the model. Finally, the work exposed additional needs when defining datasets for indexing, especially in the context of clinical and observational information.


Subjects
Abstracting and Indexing, Datasets as Topic, Allergy and Immunology, Delivery of Health Care, Humans, Information Storage and Retrieval, Search Engine, Social Sciences, Controlled Vocabulary
19.
J Am Med Inform Assoc ; 25(3): 337-344, 2018 Mar 01.
Article in English | MEDLINE | ID: mdl-29202203

ABSTRACT

OBJECTIVE: To present user needs and usability evaluations of DataMed, a Data Discovery Index (DDI) that allows searching for biomedical data from multiple sources. MATERIALS AND METHODS: We conducted 2 phases of user studies. Phase 1 was a user needs analysis conducted before the development of DataMed, consisting of interviews with researchers. Phase 2 involved iterative usability evaluations of DataMed prototypes. We analyzed data qualitatively to document researchers' information and user interface needs. RESULTS: Biomedical researchers' information needs in data discovery are complex, multidimensional, and shaped by their context, domain knowledge, and technical experience. User needs analyses validate the need for a DDI, while usability evaluations of DataMed show that even though aggregating metadata into a common search engine and applying traditional information retrieval tools are promising first steps, there remain challenges for DataMed due to incomplete metadata and the complexity of data discovery. DISCUSSION: Biomedical data poses distinct problems for search when compared to websites or publications. Making data available is not enough to facilitate biomedical data discovery: new retrieval techniques and user interfaces are necessary for dataset exploration. Consistent, complete, and high-quality metadata are vital to enable this process. CONCLUSION: While available data and researchers' information needs are complex and heterogeneous, a successful DDI must meet those needs and fit into the processes of biomedical researchers. Research directions include formalizing researchers' information needs, standardizing overviews of data to facilitate relevance judgments, implementing user interfaces for concept-based searching, and developing evaluation methods for open-ended discovery systems such as DDIs.

20.
J Am Med Inform Assoc ; 25(3): 300-308, 2018 Mar 01.
Article in English | MEDLINE | ID: mdl-29346583

ABSTRACT

OBJECTIVE: Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain. MATERIALS AND METHODS: DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, was developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine. RESULTS AND CONCLUSION: Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and that core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the proportion of relevant results in the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. Currently, we have made the DataMed system publicly available as an open source package for the biomedical community.
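The reported P@10 of 0.6022 means that, on average, roughly 6 of the top 10 results were relevant. The metric itself is a one-line computation; the relevance labels below are illustrative, not benchmark data:

```python
# Worked example of precision at k: the fraction of relevant results
# among the top k returned by the search engine.
def precision_at_k(relevance, k=10):
    """relevance: list of 0/1 judgments in ranked order."""
    top = relevance[:k]
    return sum(top) / k

labels = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]   # 6 relevant in the top 10
print(precision_at_k(labels))  # 0.6
```

Inferred average precision, the other reported figure, additionally weights relevant results by their rank, which is why it is much lower than P@10 on the same benchmark.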
