Results 1 - 20 of 690
1.
Sci Data ; 11(1): 732, 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38969627

ABSTRACT

To explore complex biological questions, it is often necessary to access various data types from public data repositories. As the volume and complexity of biological sequence data grow, public repositories face significant challenges in ensuring that the data is easily discoverable and usable by the biological research community. To address these challenges, the National Center for Biotechnology Information (NCBI) has created NCBI Datasets. This resource provides straightforward, comprehensive, and scalable access to biological sequences, annotations, and metadata for a wide range of taxa. Following the FAIR (Findable, Accessible, Interoperable, and Reusable) data management principles, NCBI Datasets offers user-friendly web interfaces, command-line tools, and documented APIs, empowering researchers to access NCBI data seamlessly. The data is delivered as packages of sequences and metadata, thus facilitating improved data retrieval, sharing, and usability in research. Moreover, this data delivery method fosters effective data attribution and promotes its further reuse. This paper outlines the current scope of data accessible through NCBI Datasets and explains various options for exploring and downloading the data.
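The NCBI Datasets APIs mentioned above can be scripted against directly. A minimal sketch, assuming the public v2alpha REST base URL and the genome dataset-report endpoint (both should be verified against current NCBI documentation), showing how a request URL might be composed without performing any network call:

```python
# Compose an NCBI Datasets request URL. BASE and the endpoint path are
# assumptions based on the public v2alpha API, not guaranteed stable.
from urllib.parse import quote

BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha"

def dataset_report_url(accession: str) -> str:
    """Build the (assumed) dataset-report URL for a genome accession."""
    return f"{BASE}/genome/accession/{quote(accession)}/dataset_report"

print(dataset_report_url("GCF_000001405.40"))
```

The same URL scheme is what the `datasets` command-line tool wraps for users who prefer not to call the API directly.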


Subjects
Metadata, Genetic Databases, United States, Information Storage and Retrieval
2.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38991851

ABSTRACT

BACKGROUND: As biological data increase, we need additional infrastructure to share them and promote interoperability. While major effort has been put into sharing data, relatively less emphasis is placed on sharing metadata. Yet, sharing metadata is also important and in some ways has a wider scope than sharing the data themselves. RESULTS: Here, we present PEPhub, an approach to improve sharing and interoperability of biological metadata. PEPhub provides an API, natural-language search, and user-friendly web-based sharing and editing of sample metadata tables. We used PEPhub to process more than 100,000 published biological research projects and index them with fast semantic natural-language search. PEPhub thus provides a fast and user-friendly way to find existing biological research data or to share new data. AVAILABILITY: https://pephub.databio.org.


Subjects
Factual Databases, Information Dissemination, Internet, Metadata, Software, User-Computer Interface, Information Dissemination/methods, Computational Biology/methods
3.
Sci Data ; 11(1): 754, 2024 Jul 10.
Article in English | MEDLINE | ID: mdl-38987254

ABSTRACT

Ancient DNA is producing a rich record of past genetic diversity in humans and other species. However, unless the primary data is appropriately archived, its long-term value will not be fully realised. I surveyed publicly archived data from 42 recent ancient genomics studies. Half of the studies archived incomplete datasets, preventing accurate replication and representing a loss of data of potential future use. No studies met all criteria that could be considered best practice. Based on these results, I make six recommendations for data producers: (1) archive all sequencing reads, not just those that aligned to a reference genome, (2) archive read alignments too, but as secondary analysis files, (3) provide correct experiment metadata on samples, libraries and sequencing runs, (4) provide informative sample metadata, (5) archive data from low-coverage and negative experiments, and (6) document archiving choices in papers, and peer review these. Given the reliance on destructive sampling of finite material, ancient genomics studies have a particularly strong responsibility to ensure the longevity and reusability of generated data.
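The six recommendations above lend themselves to a simple compliance checklist. A hypothetical scoring sketch — the field names are illustrative, not taken from the paper:

```python
# Score a study against the six archiving recommendations summarised
# above. The boolean field names are invented for illustration.
RECOMMENDATIONS = [
    "all_reads_archived",            # (1) all reads, not just aligned ones
    "alignments_archived",           # (2) alignments as secondary files
    "experiment_metadata_correct",   # (3) correct experiment metadata
    "sample_metadata_informative",   # (4) informative sample metadata
    "negative_experiments_archived", # (5) low-coverage/negative data kept
    "archiving_documented",          # (6) choices documented and reviewed
]

def archiving_score(study: dict) -> float:
    """Fraction of the six recommendations a study satisfies."""
    met = sum(bool(study.get(k)) for k in RECOMMENDATIONS)
    return met / len(RECOMMENDATIONS)

print(archiving_score({"all_reads_archived": True, "alignments_archived": True}))
```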


Subjects
Ancient DNA, Genomics, Humans, Ancient DNA/analysis, Animals, Metadata
4.
Sci Data ; 11(1): 772, 2024 Jul 13.
Article in English | MEDLINE | ID: mdl-39003329

ABSTRACT

The German initiative "National Research Data Infrastructure for Personal Health Data" (NFDI4Health) focuses on research data management in health research. It aims to foster and develop harmonized informatics standards for public health, epidemiological studies, and clinical trials, facilitating access to relevant data and metadata standards. This publication lists syntactic and semantic data standards of potential use for NFDI4Health and beyond, based on interdisciplinary meetings and workshops, mappings of study questionnaires to the NFDI4Health metadata schema, and a literature search. Included are 7 syntactic, 32 semantic, and 9 combined syntactic and semantic standards. In addition, 101 ISO standards from ISO/TC 215 Health Informatics and ISO/TC 276 Biotechnology were identified as potentially relevant. The work emphasizes the use of standards for epidemiological and health research data, ensuring interoperability as well as compatibility with NFDI4Health, its use cases, and (inter-)national efforts within these sectors. The goal is to foster collaborative and inter-sectoral work in health research and initiate a debate around the potential of using common standards.


Subjects
Health Information Interoperability, Humans, Metadata, Germany, Personal Health Records, Data Management
5.
BMC Med ; 22(1): 296, 2024 Jul 18.
Article in English | MEDLINE | ID: mdl-39020355

ABSTRACT

BACKGROUND: Sexually transmitted infections (STIs) pose a significant global public health challenge. Early diagnosis and treatment reduce STI transmission, but rely on recognising symptoms and on the care-seeking behaviour of the individual. Digital health software that distinguishes STI skin conditions could improve health-seeking behaviour. We developed and evaluated a deep learning model to differentiate STIs from non-STIs based on clinical images and symptoms. METHODS: We used 4913 clinical images of genital lesions and metadata from the Melbourne Sexual Health Centre collected during 2010-2023. We developed two binary classification models to distinguish STIs from non-STIs: (1) a convolutional neural network (CNN) using images only and (2) an integrated model combining a CNN and a fully connected neural network (FCN) using images and metadata. We evaluated model performance by the area under the ROC curve (AUC) and assessed the contribution of metadata relative to the Image-only model. RESULTS: Our study included 1583 STI and 3330 non-STI images. Common STI diagnoses were syphilis (34.6%), genital warts (24.5%) and herpes (19.4%), while most non-STIs (80.3%) were conditions such as dermatitis, lichen sclerosus and balanitis. In both the STI and non-STI groups, the largest subgroups were those aged 25-34 years (48.6% and 38.2%, respectively) and heterosexual males (60.3% and 45.9%, respectively). The Image-only model showed reasonable performance with an AUC of 0.859 (SD 0.013). The Image + Metadata model achieved a significantly higher AUC of 0.893 (SD 0.018) than the Image-only model (p < 0.01). Of 21 metadata items, the integration of demographic and dermatological metadata led to the largest improvement in model performance, increasing the AUC by 6.7% over the baseline Image-only model. CONCLUSIONS: The Image + Metadata model outperformed the Image-only model in distinguishing STIs from other skin conditions. Using it as a screening tool in a clinical setting may require further development and evaluation with larger datasets.


Subjects
Metadata, Sexually Transmitted Infections, Humans, Sexually Transmitted Infections/diagnosis, Male, Female, Adult, Artificial Intelligence, Middle Aged, Neural Networks (Computer), Young Adult, Mass Screening/methods, Skin Diseases/diagnosis, Deep Learning
6.
PLoS One ; 19(6): e0306100, 2024.
Article in English | MEDLINE | ID: mdl-38917182

ABSTRACT

Making data FAIR (findable, accessible, interoperable, reusable) has become the recurring theme behind many research data management efforts. dtool is a lightweight data management tool that packages metadata with immutable data to promote accessibility, interoperability, and reproducibility. Each dataset is self-contained and does not require metadata to be stored in a centralised system. This decentralised approach means that finding datasets can be difficult. dtool's lookup server, dserver for short, defined by a REST API, makes dtool datasets findable, hence rendering the dtool ecosystem fit for a FAIR data management world. Its simplicity, modularity, accessibility and standardisation via an API distinguish dtool and dserver from other solutions and enable them to serve as a common denominator for cross-disciplinary research data management. The dtool ecosystem bridges the gap between standardisation-free data management by individuals and FAIR platform solutions with rigid metadata requirements.
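The core idea — packaging immutable data together with its metadata in one self-contained unit — can be illustrated in a few lines. This is not the dtool API, just a toy sketch of the concept using per-item checksums:

```python
# Toy sketch: bundle data items with metadata and a checksum manifest,
# so each "dataset" is self-describing. Not dtool's actual data model.
import hashlib

def package(items: dict, metadata: dict) -> dict:
    """Bundle raw byte items with metadata plus a per-item SHA-256 manifest."""
    manifest = {name: hashlib.sha256(content).hexdigest()
                for name, content in items.items()}
    return {"metadata": metadata, "manifest": manifest, "items": items}

ds = package({"reads.txt": b"ACGT"}, {"project": "demo"})
print(sorted(ds["manifest"]))  # the manifest covers every packaged item
```

Because the checksums travel with the data, any consumer can verify integrity without contacting a central service — the property a lookup server like dserver then builds findability on top of.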


Subjects
Software, Data Management/methods, Metadata, Ecosystem, Reproducibility of Results, Internet
7.
Health Informatics J ; 30(2): 14604582241262961, 2024.
Article in English | MEDLINE | ID: mdl-38881290

ABSTRACT

Objectives: This study aims to address the critical challenges of data integrity, accuracy, consistency, and precision in the application of electronic medical record (EMR) data within the healthcare sector, particularly within the context of Chinese medical information data management. The research seeks to propose a solution in the form of a medical metadata governance framework that is efficient and suitable for clinical research and transformation. Methods: The article begins by outlining the background of medical information data management and reviews the advancements in artificial intelligence (AI) technology relevant to the field. It then introduces the "Service, Patient, Regression, base/Away, Yeast" (SPRAY)-type AI application as a case study to illustrate the potential of AI in EMR data management. Results: The research identifies the scarcity of scientific research on the transformation of EMR data in Chinese hospitals and proposes a medical metadata governance framework as a solution. This framework is designed to achieve scientific governance of clinical data by integrating metadata management and master data management, grounded in clinical practices, medical disciplines, and scientific exploration. Furthermore, it incorporates an information privacy security architecture to ensure data protection. Conclusion: The proposed medical metadata governance framework, supported by AI technology, offers a structured approach to managing and transforming EMR data into valuable scientific research outcomes. This framework provides guidance for the identification, cleaning, mining, and deep application of EMR data, thereby addressing the bottlenecks currently faced in the healthcare scenario and paving the way for more effective clinical research and data-driven decision-making.


Subjects
Artificial Intelligence, Electronic Health Records, Artificial Intelligence/trends, China, Humans, Electronic Health Records/trends, Data Management/methods, Metadata
8.
Sci Data ; 11(1): 574, 2024 Jun 04.
Article in English | MEDLINE | ID: mdl-38834597

ABSTRACT

Experts from 18 consortia are collaborating on the Human Reference Atlas (HRA) which aims to map the 37 trillion cells in the healthy human body. Information relevant for HRA construction and usage is held by experts, published in scholarly papers, and captured in experimental data. However, these data sources use different metadata schemas and cannot be cross-searched efficiently. This paper documents the compilation of a dataset, named HRAlit, that links the 136 HRA v1.4 digital objects (31 organs with 4,279 anatomical structures, 1,210 cell types, 2,089 biomarkers) to 583,117 experts; 7,103,180 publications; 896,680 funded projects, and 1,816 experimental datasets. The resulting HRAlit has 22 tables with 20,939,937 records including 6 junction tables with 13,170,651 relationships. The HRAlit can be mined to identify leading experts, major papers, funding trends, or alignment with existing ontologies in support of systematic HRA construction and usage.
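The junction-table layout described above can be illustrated with an in-memory SQLite sketch; the table and column names are invented for illustration, not HRAlit's actual schema:

```python
# Illustrative many-to-many link between HRA digital objects and
# experts via a junction table, using Python's built-in sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE digital_object (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE expert (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE object_expert (              -- junction table
    object_id INTEGER REFERENCES digital_object(id),
    expert_id INTEGER REFERENCES expert(id)
);
""")
conn.executemany("INSERT INTO digital_object VALUES (?, ?)",
                 [(1, "kidney"), (2, "heart")])
conn.executemany("INSERT INTO expert VALUES (?, ?)",
                 [(1, "A. Author"), (2, "B. Author")])
conn.executemany("INSERT INTO object_expert VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

# Find all experts linked to the "kidney" object via the junction table.
rows = conn.execute("""
    SELECT e.name FROM expert e
    JOIN object_expert oe ON oe.expert_id = e.id
    JOIN digital_object d ON d.id = oe.object_id
    WHERE d.name = 'kidney'
    ORDER BY e.name
""").fetchall()
print([r[0] for r in rows])  # ['A. Author', 'B. Author']
```

Scaled up, this is the shape of query that lets a dataset like HRAlit be mined for leading experts or major papers per digital object.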


Subjects
Cells, Metadata, Humans
9.
Sci Data ; 11(1): 634, 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38879585

ABSTRACT

In low- and middle-income countries, the substantial costs associated with traditional data collection pose an obstacle to facilitating decision-making in the field of public health. Satellite imagery offers a potential solution, but image extraction and analysis can be costly and require specialized expertise. We introduce SatelliteBench, a scalable framework for satellite image extraction and vector-embedding generation. We also propose a novel multimodal fusion pipeline that utilizes a series of satellite images and metadata. The framework was evaluated by generating a dataset of 12,636 images and embeddings with comprehensive metadata from 81 municipalities in Colombia between 2016 and 2018. The dataset was then evaluated on three tasks: dengue case prediction, poverty assessment, and access to education. The performance showcases the versatility and practicality of SatelliteBench, offering a reproducible, accessible and open tool to enhance decision-making in public health.


Subjects
Dengue, Public Health, Satellite Imagery, Colombia, Humans, Metadata
10.
Nat Ecol Evol ; 8(7): 1224-1232, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38789640

ABSTRACT

Genetic and genomic data are collected for a vast array of scientific and applied purposes. Despite mandates for public archiving, data are typically used only by the generating authors. The reuse of genetic and genomic datasets remains uncommon because it is difficult, if not impossible, due to non-standard archiving practices and lack of contextual metadata. But as the new field of macrogenetics is demonstrating, if genetic data and their metadata were more accessible and FAIR (findable, accessible, interoperable and reusable) compliant, they could be reused for many additional purposes. We discuss the main challenges with existing genetic and genomic data archives, and suggest best practices for archiving genetic and genomic data. Recognizing that this is a longstanding issue due to little formal data management training within the fields of ecology and evolution, we highlight steps that research institutions and publishers could take to improve data archiving.


Subjects
Genomics, Genetic Databases, Data Management, Metadata
12.
J Am Med Inform Assoc ; 31(7): 1578-1582, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38700253

ABSTRACT

OBJECTIVE: Leverage electronic health record (EHR) audit logs to develop a machine learning (ML) model that predicts which notes a clinician wants to review when seeing oncology patients. MATERIALS AND METHODS: We trained logistic regression models using note metadata and a Term Frequency-Inverse Document Frequency (TF-IDF) text representation. We evaluated performance with precision, recall, F1, AUC, and a qualitative clinical assessment. RESULTS: The metadata-only model achieved an AUC of 0.930 and the metadata-plus-TF-IDF model an AUC of 0.937. Qualitative assessment revealed a need for better text representation and for further customizing predictions to the user. DISCUSSION: Our model effectively surfaces the top 10 notes a clinician wants to review when seeing an oncology patient. Further studies can characterize different types of clinician users and better tailor the task to different care settings. CONCLUSION: EHR audit logs can provide important relevance data for training ML models that assist with note-writing in the oncology setting.
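A minimal standard-library sketch of the TF-IDF text representation used alongside the note metadata; this is illustrative only, not the study's actual pipeline (which would feed such vectors into logistic regression):

```python
# Plain-Python TF-IDF: term frequency within a document weighted by
# log inverse document frequency across the corpus.
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    out = []
    for toks in tokenized:
        tf = Counter(toks)
        out.append({t: (tf[t] / len(toks)) * math.log(n / df[t])
                    for t in tf})
    return out

weights = tfidf(["progress note oncology", "progress note radiology"])
# "oncology" is distinctive to doc 0; "progress" occurs in every doc
# and gets zero weight under this idf formulation.
print(weights[0]["oncology"] > weights[0]["progress"])  # True
```

Production systems typically use a library implementation (e.g. scikit-learn's `TfidfVectorizer`) with smoothing; the zero weight for ubiquitous terms here is a property of the unsmoothed `log(n/df)` form.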


Subjects
Electronic Health Records, Machine Learning, Medical Oncology, Humans, Logistic Models, Metadata, Medical Audit, Proof of Concept Study
13.
J Am Med Inform Assoc ; 31(7): 1463-1470, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38722233

ABSTRACT

OBJECTIVE: ModelDB (https://modeldb.science) is a discovery platform for computational neuroscience, containing over 1850 published model codes with standardized metadata. These codes were mainly supplied from unsolicited model author submissions, but this approach is inherently limited. For example, we estimate we have captured only around one-third of NEURON models, the most common type of models in ModelDB. To more completely characterize the state of computational neuroscience modeling work, we aim to identify works containing results derived from computational neuroscience approaches and their standardized associated metadata (eg, cell types, research topics). MATERIALS AND METHODS: Known computational neuroscience work from ModelDB and identified neuroscience work queried from PubMed were included in our study. After pre-screening with SPECTER2 (a free document embedding method), GPT-3.5, and GPT-4 were used to identify likely computational neuroscience work and relevant metadata. RESULTS: SPECTER2, GPT-4, and GPT-3.5 demonstrated varied but high abilities in identification of computational neuroscience work. GPT-4 achieved 96.9% accuracy and GPT-3.5 improved from 54.2% to 85.5% through instruction-tuning and Chain of Thought. GPT-4 also showed high potential in identifying relevant metadata annotations. DISCUSSION: Accuracy in identification and extraction might further be improved by dealing with ambiguity of what are computational elements, including more information from papers (eg, Methods section), improving prompts, etc. CONCLUSION: Natural language processing and large language model techniques can be added to ModelDB to facilitate further model discovery, and will contribute to a more standardized and comprehensive framework for establishing domain-specific resources.


Subjects
Computational Biology, Neurosciences, Computational Biology/methods, Humans, Metadata, Data Curation/methods, Neurological Models, Data Mining/methods, Factual Databases
14.
Front Cell Infect Microbiol ; 14: 1384809, 2024.
Article in English | MEDLINE | ID: mdl-38774631

ABSTRACT

Introduction: Sharing microbiome data among researchers fosters new innovations and reduces the cost of research. Practically, this means that the (meta)data have to be standardized, transparent and readily available to researchers. The microbiome data and associated metadata are then described with regard to composition and origin, in order to maximize the possibilities for application in various research contexts. Here, we propose a set of tools and protocols to develop a real-time FAIR (Findable, Accessible, Interoperable and Reusable) compliant database for the handling and storage of human microbiome and host-associated data. Methods: The conflicts arising from privacy laws with respect to metadata, possible human genome sequences in metagenome shotgun data, and FAIR implementations are discussed. Alternate pathways for achieving compliance in such conflicts are analyzed. Sample-traceable and sensitive microbiome data, such as DNA sequences or geolocalized metadata, are identified, and the role of the GDPR (General Data Protection Regulation) is considered. For the construction of the database, procedures have been implemented to make data FAIR compliant while preserving the privacy of the participants providing the data. Results and discussion: An open-source development platform, Supabase, was used to implement the microbiome database. Researchers can deploy this real-time database to access, upload, download and interact with human microbiome data in a FAIR compliant manner. In addition, a large language model (LLM) interface powered by ChatGPT was developed and deployed to enable knowledge dissemination and non-expert usage of the database.


Subjects
Microbiota, Humans, Microbiota/genetics, Factual Databases, Metadata, Metagenome, Information Dissemination, Computational Biology/methods, Metagenomics/methods, Genetic Databases
15.
Sci Data ; 11(1): 524, 2024 May 22.
Article in English | MEDLINE | ID: mdl-38778016

ABSTRACT

Datasets consist of measurement data and metadata. Metadata provides context, essential for understanding and (re-)using data. Various metadata standards exist for different methods, systems and contexts. However, relevant information resides at differing stages across the data-lifecycle. Often, this information is defined and standardized only at publication stage, which can lead to data loss and workload increase. In this study, we developed Metadatasheet, a metadata standard based on interviews with members of two biomedical consortia and systematic screening of data repositories. It aligns with the data-lifecycle allowing synchronous metadata recording within Microsoft Excel, a widespread data recording software. Additionally, we provide an implementation, the Metadata Workbook, that offers user-friendly features like automation, dynamic adaption, metadata integrity checks, and export options for various metadata standards. By design and due to its extensive documentation, the proposed metadata standard simplifies recording and structuring of metadata for biomedical scientists, promoting practicality and convenience in data management. This framework can accelerate scientific progress by enhancing collaboration and knowledge transfer throughout the intermediate steps of data creation.
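The metadata integrity checks described for the Metadata Workbook can be sketched as a simple record validator; the required fields and the date rule below are assumptions for illustration, not the published Metadatasheet standard:

```python
# Hypothetical metadata integrity check: flag missing required fields
# and a malformed collection date. Field names are illustrative.
REQUIRED_FIELDS = {"sample_id", "organism", "collection_date"}

def check_record(record: dict) -> list:
    """Return a list of human-readable integrity problems (empty = OK)."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "collection_date" in record and len(str(record["collection_date"])) != 10:
        problems.append("collection_date not in YYYY-MM-DD form")
    return problems

print(check_record({"sample_id": "S1", "organism": "Homo sapiens"}))
```

Running such checks at recording time, rather than at publication, is the data-lifecycle point the paper argues for.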


Subjects
Data Management, Metadata, Biomedical Research, Data Management/standards, Metadata/standards, Software
17.
Philos Trans R Soc Lond B Biol Sci ; 379(1904): 20230104, 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38705176

ABSTRACT

Technological advancements in biological monitoring have facilitated the study of insect communities at unprecedented spatial scales. The progress allows more comprehensive coverage of the diversity within a given area while minimizing disturbance and reducing the need for extensive human labour. Compared with traditional methods, these novel technologies offer the opportunity to examine biological patterns that were previously beyond our reach. However, to address the pressing scientific inquiries of the future, data must be easily accessible, interoperable and reusable for the global research community. Biodiversity information standards and platforms provide the necessary infrastructure to standardize and share biodiversity data. This paper explores the possibilities and prerequisites of publishing insect data obtained through novel monitoring methods through GBIF, the most comprehensive global biodiversity data infrastructure. We describe the essential components of metadata standards and existing data standards for occurrence data on insects, including data extensions. By addressing the current opportunities, limitations, and future development of GBIF's publishing framework, we hope to encourage researchers to both share data and contribute to the further development of biodiversity data standards and publishing models. Wider commitments to open data initiatives will promote data interoperability and support cross-disciplinary scientific research and key policy indicators. This article is part of the theme issue 'Towards a toolkit for global insect biodiversity monitoring'.


Subjects
Biodiversity, Information Dissemination, Insects, Animals, Entomology/methods, Entomology/standards, Information Dissemination/methods, Metadata
18.
BMC Bioinformatics ; 25(1): 184, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38724907

ABSTRACT

BACKGROUND: Major advances in sequencing technologies and the sharing of data and metadata in science have resulted in a wealth of publicly available datasets. However, working with and especially curating public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these present limitations that often lead to the need for further in-house curation and processing. RESULTS: Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a python3 package designed to accompany and guide the researcher during the curation process of metadata and fastq files of public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from different sources. CONCLUSIONS: Thus, it offers valuable tools for the in-house curation previously needed to re-use public omics data. Due to its workflow structure and capabilities, it can be easily used and benefit investigators in developing novel omics meta-analyses based on sequencing data.


Subjects
Data Curation, Software, Workflow, Data Curation/methods, Metadata, Genetic Databases, Genomics/methods, Computational Biology/methods
19.
BMC Med Inform Decis Mak ; 24(1): 136, 2024 May 27.
Article in English | MEDLINE | ID: mdl-38802886

ABSTRACT

BACKGROUND: The selection of data elements is a decisive task in the development of a health registry. Having the right metadata is crucial for answering the particular research questions. Furthermore, the set of data elements determines a registry's readiness for interoperability and data reusability to a major extent. Six health registries shared and published their metadata within a German funding initiative. As one step towards a common set of data elements, a selection of those metadata was evaluated with regard to their appropriateness for broader usage. METHODS: Each registry was asked to contribute a 10% selection of its data elements to an evaluation sample. The survey was set up with the online survey tool "LimeSurvey Cloud". The registries and an accompanying project participated in the survey with one vote per project. The data elements were offered in content groups along with the question of whether the data element is appropriate for health registries on a broader scale. The question could be answered using a five-option Likert scale; "no answer" was also allowed. The level of agreement was assessed using weighted Cohen's kappa and Kendall's coefficient of concordance. RESULTS: The evaluation sample consisted of 269 data elements. With a mean grade of "perhaps recommendable" or higher, 169 data elements were selected. These data elements belong mostly to the groups demography, education/occupation, medication, and nutrition. Half of the registries lost share compared with their percentage of data elements in the evaluation sample; one remained stable. The level of concordance was adequate. CONCLUSIONS: The survey revealed a set of 169 data elements recommended for health registries. When developing a registry, this set could be a valuable help in selecting the metadata appropriate to answer the registry's research questions. However, due to the high specificity of research questions, data elements beyond this set will be needed to cover the whole range of interests of a registry. A broader discussion and subsequent surveys are needed to establish a common set of data elements on an international scale.


Subjects
Registries, Registries/standards, Germany, Humans, Surveys and Questionnaires, Metadata
20.
Sci Data ; 11(1): 503, 2024 May 16.
Article in English | MEDLINE | ID: mdl-38755173

ABSTRACT

Nanomaterials hold great promise for improving our society, and it is crucial to understand their effects on biological systems in order to enhance their properties and ensure their safety. However, the lack of consistency in experimental reporting, the absence of universally accepted machine-readable metadata standards, and the challenge of combining such standards hamper the reusability of previously produced data for risk assessment. Fortunately, the research community has responded to these challenges by developing minimum reporting standards that address several of these issues. By converting twelve published minimum reporting standards into a machine-readable representation using FAIR maturity indicators, we have created a machine-friendly approach to annotate and assess datasets' reusability according to those standards. Furthermore, our NanoSafety Data Reusability Assessment (NSDRA) framework includes a metadata generator web application that can be integrated into experimental data management, and a new web application that can summarize the reusability of nanosafety datasets for one or more subsets of maturity indicators, tailored to specific computational risk assessment use cases. This approach enhances the transparency, communication, and reusability of experimental data and metadata. With this improved FAIR approach, we can facilitate the reuse of nanosafety research for exploration, toxicity prediction, and regulation, thereby advancing the field and benefiting society as a whole.


Subjects
Nanostructures, Metadata, Nanostructures/toxicity, Risk Assessment