Results 1 - 20 of 675
1.
BMC Bioinformatics ; 25(1): 184, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38724907

ABSTRACT

BACKGROUND: Major advances in sequencing technologies and the sharing of data and metadata in science have resulted in a wealth of publicly available datasets. However, working with and especially curating public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these initiatives present limitations that often necessitate further in-house curation and processing. RESULTS: Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a Python 3 package designed to accompany and guide the researcher during the curation of metadata and FASTQ files from public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from other sources. CONCLUSIONS: The toolkit thus offers valuable tools for the in-house curation previously required to re-use public omics data. Owing to its workflow structure and capabilities, it is easy to use and will benefit investigators developing novel omics meta-analyses based on sequencing data.
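
As a hedged illustration of the "collection" step such a toolkit automates, the sketch below pulls run-level metadata for a public project from the ENA Portal API's filereport endpoint. It is not the OMD Curation Toolkit's own API; the project accession and field list are examples only.

```python
# Minimal sketch of ENA metadata collection (not the OMD Curation Toolkit API).
import csv
import io
import urllib.request

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

def fetch_run_metadata(project_accession: str) -> list:
    """Download run-level metadata for an ENA project as a list of dicts."""
    fields = "run_accession,sample_accession,fastq_ftp,fastq_md5"
    url = (f"{ENA_FILEREPORT}?accession={project_accession}"
           f"&result=read_run&fields={fields}&format=tsv")
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

runs = fetch_run_metadata("PRJEB1787")  # example public ENA project accession
print(len(runs), "runs found")
```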


Subject(s)
Data Curation , Software , Workflow , Data Curation/methods , Metadata , Databases, Genetic , Genomics/methods , Computational Biology/methods
2.
Front Cell Infect Microbiol ; 14: 1384809, 2024.
Article in English | MEDLINE | ID: mdl-38774631

ABSTRACT

Introduction: Sharing microbiome data among researchers fosters new innovations and reduces the cost of research. Practically, this means that the (meta)data have to be standardized, transparent and readily available to researchers. The microbiome data and associated metadata are then described with regard to composition and origin, in order to maximize the possibilities for application in various research contexts. Here, we propose a set of tools and protocols to develop a real-time FAIR (Findable, Accessible, Interoperable and Reusable) compliant database for the handling and storage of human microbiome and host-associated data. Methods: The conflicts arising from privacy laws with respect to metadata, possible human genome sequences in metagenome shotgun data and FAIR implementations are discussed. Alternate pathways for achieving compliance in such conflicts are analyzed. Sample-traceable and sensitive microbiome data, such as DNA sequences or geolocalized metadata, are identified, and the role of GDPR (General Data Protection Regulation) data regulations is considered. For the construction of the database, procedures have been realized to make data FAIR compliant, while preserving the privacy of the participants providing the data. Results and discussion: An open-source development platform, Supabase, was used to implement the microbiome database. Researchers can deploy this real-time database to access, upload, download and interact with human microbiome data in a FAIR-compliant manner. In addition, a large language model (LLM) interface powered by ChatGPT is developed and deployed to enable knowledge dissemination and non-expert usage of the database.
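
A hedged sketch of how a researcher might query such a Supabase-backed database with the supabase-py client is shown below. The table and column names ("samples", "body_site") and the environment variables are illustrative assumptions, not the authors' actual schema.

```python
# Illustrative query against a hypothetical FAIR microbiome table in Supabase.
import os
from supabase import create_client

url = os.environ["SUPABASE_URL"]       # assumed env vars holding project endpoint
key = os.environ["SUPABASE_ANON_KEY"]  # public anon key; access limited by row-level security
client = create_client(url, key)

# Select non-sensitive metadata only; GDPR-restricted fields (e.g. geolocation)
# would be withheld from the anonymous role at the database level.
response = (client.table("samples")
            .select("sample_id, body_site, sequencing_platform")
            .eq("body_site", "gut")
            .limit(10)
            .execute())
print(response.data)
```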


Subject(s)
Microbiota , Humans , Microbiota/genetics , Databases, Factual , Metadata , Metagenome , Information Dissemination , Computational Biology/methods , Metagenomics/methods , Databases, Genetic
3.
Philos Trans R Soc Lond B Biol Sci ; 379(1904): 20230104, 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38705176

ABSTRACT

Technological advancements in biological monitoring have facilitated the study of insect communities at unprecedented spatial scales. This progress allows more comprehensive coverage of the diversity within a given area while minimizing disturbance and reducing the need for extensive human labour. Compared with traditional methods, these novel technologies offer the opportunity to examine biological patterns that were previously beyond our reach. However, to address the pressing scientific inquiries of the future, data must be easily accessible, interoperable and reusable for the global research community. Biodiversity information standards and platforms provide the necessary infrastructure to standardize and share biodiversity data. This paper explores the possibilities and prerequisites of publishing insect data obtained through novel monitoring methods via GBIF, the most comprehensive global biodiversity data infrastructure. We describe the essential components of metadata standards and existing data standards for occurrence data on insects, including data extensions. By addressing the current opportunities, limitations, and future development of GBIF's publishing framework, we hope to encourage researchers both to share data and to contribute to the further development of biodiversity data standards and publishing models. Wider commitments to open data initiatives will promote data interoperability and support cross-disciplinary scientific research and key policy indicators. This article is part of the theme issue 'Towards a toolkit for global insect biodiversity monitoring'.
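
For a concrete sense of what standardized, GBIF-published occurrence data gives the research community, the hedged sketch below retrieves insect records with the pygbif client; the filter values are examples, and 216 is the Insecta taxon key in the GBIF backbone taxonomy.

```python
# Retrieve standardized insect occurrence records from GBIF (illustrative filters).
from pygbif import occurrences

results = occurrences.search(
    taxonKey=216,                          # Insecta in the GBIF backbone taxonomy
    country="FI",
    basisOfRecord="MACHINE_OBSERVATION",   # e.g. records from automated monitoring devices
    limit=20,
)
for rec in results["results"]:
    print(rec.get("scientificName"), rec.get("eventDate"))
```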


Subject(s)
Biodiversity , Information Dissemination , Insecta , Animals , Entomology/methods , Entomology/standards , Information Dissemination/methods , Metadata
5.
Sci Data ; 11(1): 503, 2024 May 16.
Article in English | MEDLINE | ID: mdl-38755173

ABSTRACT

Nanomaterials hold great promise for improving our society, and it is crucial to understand their effects on biological systems in order to enhance their properties and ensure their safety. However, the lack of consistency in experimental reporting, the absence of universally accepted machine-readable metadata standards, and the challenge of combining such standards hamper the reusability of previously produced data for risk assessment. Fortunately, the research community has responded to these challenges by developing minimum reporting standards that address several of these issues. By converting twelve published minimum reporting standards into a machine-readable representation using FAIR maturity indicators, we have created a machine-friendly approach to annotate and assess datasets' reusability according to those standards. Furthermore, our NanoSafety Data Reusability Assessment (NSDRA) framework includes a metadata generator web application that can be integrated into experimental data management, and a new web application that can summarize the reusability of nanosafety datasets for one or more subsets of maturity indicators, tailored to specific computational risk assessment use cases. This approach enhances the transparency, communication, and reusability of experimental data and metadata. With this improved FAIR approach, we can facilitate the reuse of nanosafety research for exploration, toxicity prediction, and regulation, thereby advancing the field and benefiting society as a whole.
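
A machine-readable maturity indicator is, in essence, a named predicate over a dataset's metadata record. The sketch below illustrates that idea only; the indicator names and metadata fields are invented for illustration and are not the NSDRA's actual indicator set.

```python
# Toy FAIR-maturity check: each indicator is a predicate over a metadata record.
metadata = {
    "title": "Silver nanoparticle cytotoxicity assay",   # invented example record
    "license": "CC-BY-4.0",
    "material_identifier": None,                         # e.g. a nanomaterial registry code
}

indicators = {
    "has_license": lambda m: bool(m.get("license")),
    "has_material_id": lambda m: m.get("material_identifier") is not None,
}

report = {name: check(metadata) for name, check in indicators.items()}
score = sum(report.values()) / len(indicators)
print(report, f"reusability score: {score:.0%}")
```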


Subject(s)
Nanostructures , Nanostructures/adverse effects , Risk Assessment , Metadata
7.
PLoS One ; 19(4): e0295474, 2024.
Article in English | MEDLINE | ID: mdl-38568922

ABSTRACT

Insect monitoring is essential to design effective conservation strategies, which are indispensable to mitigate worldwide declines and biodiversity loss. For this purpose, traditional monitoring methods are widely established and can provide data with a high taxonomic resolution. However, processing of captured insect samples is often time-consuming and expensive, which limits the number of potential replicates. Automated monitoring methods can facilitate data collection at a higher spatiotemporal resolution with a comparatively lower effort and cost. Here, we present the Insect Detect DIY (do-it-yourself) camera trap for non-invasive automated monitoring of flower-visiting insects, which is based on low-cost off-the-shelf hardware components combined with open-source software. Custom trained deep learning models detect and track insects landing on an artificial flower platform in real time on-device and subsequently classify the cropped detections on a local computer. Field deployment of the solar-powered camera trap confirmed its resistance to high temperatures and humidity, which enables autonomous deployment during a whole season. On-device detection and tracking can estimate insect activity/abundance after metadata post-processing. Our insect classification model achieved a high top-1 accuracy on the test dataset and generalized well on a real-world dataset with captured insect images. The camera trap design and open-source software are highly customizable and can be adapted to different use cases. With custom trained detection and classification models, as well as accessible software programming, many possible applications surpassing our proposed deployment method can be realized.
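
The activity/abundance estimation mentioned above boils down to aggregating per-track detections over time. The sketch below shows one plausible post-processing step with pandas; the exported column names ("track_id", "timestamp", "label") are assumptions, not the project's exact metadata schema.

```python
# Hedged sketch: collapse tracked detections into an hourly activity estimate.
import pandas as pd

df = pd.read_csv("detections.csv", parse_dates=["timestamp"])  # assumed export file
# One row per detection; reduce to one row per tracked insect visit.
visits = df.groupby("track_id").agg(
    start=("timestamp", "min"),
    label=("label", lambda s: s.mode().iat[0]),  # majority-vote class per track
)
activity = visits.set_index("start").resample("1h").size()  # visits per hour
print(activity.head())
```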


Subject(s)
Insecta , Software , Animals , Biodiversity , Data Collection , Metadata
8.
Genome Biol ; 25(1): 100, 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38641812

ABSTRACT

Multiplexed assays of variant effect (MAVEs) have emerged as a powerful approach for interrogating thousands of genetic variants in a single experiment. The flexibility and widespread adoption of these techniques across diverse disciplines have led to a heterogeneous mix of data formats and descriptions, which complicates the downstream use of the resulting datasets. To address these issues and promote reproducibility and reuse of MAVE data, we define a set of minimum information standards for MAVE data and metadata and outline a controlled vocabulary aligned with established biomedical ontologies for describing these experimental designs.
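
In practice, a minimum information standard paired with a controlled vocabulary can be enforced mechanically. The sketch below illustrates the pattern with invented field and vocabulary terms; it is not the published MAVE standard itself.

```python
# Toy validator: required fields plus a controlled vocabulary for one field.
REQUIRED = {"target_gene", "assay_type", "variant_library", "sequencing_platform"}
ASSAY_VOCAB = {"growth selection", "binding", "expression"}   # hypothetical terms

def validate(record: dict) -> list:
    errors = [f"missing field: {k}" for k in sorted(REQUIRED - record.keys())]
    if record.get("assay_type") not in ASSAY_VOCAB:
        errors.append(f"assay_type not in controlled vocabulary: {record.get('assay_type')!r}")
    return errors

print(validate({"target_gene": "TP53", "assay_type": "binding"}))
```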


Subject(s)
Metadata , Research Design , Reproducibility of Results
9.
PLoS One ; 19(4): e0302426, 2024.
Article in English | MEDLINE | ID: mdl-38662676

ABSTRACT

Research data sharing has become an expected component of scientific research and scholarly publishing practice over the last few decades, due in part to requirements for federally funded research. As part of a larger effort to better understand the workflows and costs of public access to research data, this project conducted a high-level analysis of where academic research data is most frequently shared. To do this, we leveraged the DataCite and Crossref application programming interfaces (APIs) in search of Publisher field elements indicating which data repositories were utilized by researchers from six academic research institutions between 2012 and 2022. In addition, we ran a preliminary analysis of the quality of the metadata associated with these published datasets, comparing the extent to which information was missing from metadata fields deemed important for public access to research data. Results show that the top 10 publishers accounted for 89.0% to 99.8% of the datasets connected with the institutions in our study. Known data repositories, including institutional data repositories hosted by those institutions, were initially missing from our sample due to varying metadata standards and practices. We conclude that the metadata quality landscape for published research datasets is uneven; key information, such as author affiliation, is often incomplete or missing from source data repositories and aggregators. To enhance the findability, accessibility, interoperability, and reusability (FAIRness) of research data, we provide a set of concrete recommendations that repositories and data authors can adopt to improve the scholarly metadata associated with shared datasets.
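
The harvesting step described above can be reproduced in outline against the public DataCite REST API, as in the hedged sketch below. The affiliation string is a placeholder, the query syntax is assumed to follow DataCite's Elasticsearch-style `query` parameter, and a real harvest would need paging and rate-limit handling.

```python
# Hedged sketch: query DataCite for datasets and inspect key metadata fields.
import requests

resp = requests.get(
    "https://api.datacite.org/dois",
    params={
        "query": 'creators.affiliation.name:"Example University"',  # placeholder affiliation
        "resource-type-id": "dataset",
        "page[size]": 25,
    },
    timeout=30,
)
resp.raise_for_status()
for doi in resp.json()["data"]:
    attrs = doi["attributes"]
    has_affiliation = any(c.get("affiliation") for c in attrs.get("creators", []))
    print(attrs.get("publisher"), "| affiliation present:", has_affiliation)
```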


Subject(s)
Information Dissemination , Metadata , Information Dissemination/methods , Humans , Biomedical Research
10.
Database (Oxford) ; 2024, 2024 Apr 03.
Article in English | MEDLINE | ID: mdl-38581360

ABSTRACT

When a scientific dataset evolves or is reused in workflows creating derived datasets, the integrity of the dataset and its metadata, including provenance, needs to be securely preserved, with assurances that they are not accidentally or maliciously altered in the process. Providing a secure method to efficiently share and verify the data as well as the metadata is essential for the reuse of scientific data. The National Science Foundation (NSF)-funded Open Science Chain (OSC) utilizes a consortium blockchain to provide a cyberinfrastructure solution that maintains the integrity of provenance metadata for published datasets and provides a way to perform independent verification of the dataset while promoting reuse and reproducibility. The NSF- and National Institutes of Health (NIH)-funded Neuroscience Gateway (NSG) provides a freely available web portal that allows neuroscience researchers to execute computational data analysis pipelines on high-performance computing resources. Combined, the OSC and NSG platforms form an efficient, integrated framework to automatically and securely preserve and verify the integrity of the artifacts used in research workflows on the NSG platform. This paper presents the results of the first study to integrate the OSC and NSG frameworks to track the provenance of neurophysiological signal data analysis for studying brain network dynamics using the Neuro-Integrative Connectivity tool, which is deployed in the NSG platform. Database URL: https://www.opensciencechain.org.
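
The verification such a system enables rests on a reproducible cryptographic digest of the dataset together with its canonicalized provenance metadata, which any party can recompute and compare against the anchored value. The sketch below shows that primitive only, under assumed inputs; OSC's actual record format is defined by the project.

```python
# Minimal integrity primitive: digest of data file + canonicalized provenance.
import hashlib
import json

def dataset_fingerprint(data_path: str, provenance: dict) -> str:
    h = hashlib.sha256()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    # Canonicalize metadata so the digest is reproducible across systems.
    h.update(json.dumps(provenance, sort_keys=True).encode())
    return h.hexdigest()

# Placeholder file name and provenance record for illustration.
print(dataset_fingerprint("eeg_session1.dat", {"pipeline": "NIC", "version": "1.0"}))
```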


Subject(s)
Neurosciences , Publications , Reproducibility of Results , Databases, Factual , Metadata
11.
Stud Health Technol Inform ; 313: 198-202, 2024 Apr 26.
Article in English | MEDLINE | ID: mdl-38682530

ABSTRACT

Secondary use of clinical health data implies prior integration of mostly heterogeneous and multidimensional data sets. A clinical data warehouse addresses the technological and organizational framework conditions required for this by making any data available for analysis. However, users of a data warehouse often do not have a comprehensive overview of all available data and only know about their own data in their own systems - a situation also referred to as a 'data siloed state'. This problem can be addressed and ultimately solved by implementing a data catalog. Its core function is a search engine that allows searching the metadata collected from different data sources and thereby accessing all the data there is. With this in mind, we conducted an explorative online market survey followed by a vendor comparison as a prerequisite for system selection of a data catalog. Assessment of vendor performance was based on seven predetermined and weighted selection criteria. Although three vendors achieved the highest scores, the results lay close together. Detailed investigations and test installations are needed to further narrow down the selection process.
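
The weighted-criteria assessment reduces to a weighted sum per vendor, as in the sketch below; the criteria, weights, and ratings are invented placeholders, since the abstract does not publish the scoring matrix.

```python
# Toy weighted scoring over seven selection criteria (weights sum to 1.0).
weights = {"search": 0.25, "metadata_harvest": 0.20, "access_control": 0.15,
           "integration": 0.15, "usability": 0.10, "cost": 0.10, "support": 0.05}

ratings = {  # 1 (poor) .. 5 (excellent) per criterion, per vendor
    "Vendor A": {"search": 5, "metadata_harvest": 4, "access_control": 4,
                 "integration": 3, "usability": 4, "cost": 3, "support": 4},
    "Vendor B": {"search": 4, "metadata_harvest": 5, "access_control": 3,
                 "integration": 4, "usability": 3, "cost": 4, "support": 4},
}

scores = {v: sum(weights[c] * r[c] for c in weights) for v, r in ratings.items()}
for vendor, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{vendor}: {score:.2f}")
```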


Subject(s)
Data Warehousing , Electronic Health Records , Search Engine , Humans , Information Storage and Retrieval/methods , Metadata
12.
Lab Anim (NY) ; 53(3): 67-79, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38438748

ABSTRACT

Although biomedical research is experiencing a data explosion, the accumulation of vast quantities of data alone does not guarantee a primary objective for science: building upon existing knowledge. Data collected that lack appropriate metadata cannot be fully interrogated or integrated into new research projects, leading to wasted resources and missed opportunities for data repurposing. This issue is particularly acute for research using animals, where concerns regarding data reproducibility and ensuring animal welfare are paramount. Here, to address this problem, we propose a minimal metadata set (MNMS) designed to enable the repurposing of in vivo data. MNMS aligns with an existing validated guideline for reporting in vivo data (ARRIVE 2.0) and contributes to making in vivo data FAIR-compliant. Scenarios where MNMS should be implemented in diverse research environments are presented, highlighting opportunities and challenges for data repurposing at different scales. We conclude with a 'call for action' to key stakeholders in biomedical research to adopt and apply MNMS to accelerate both the advancement of knowledge and the betterment of animal welfare.
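
A minimal metadata set makes completeness auditable, as in the hedged sketch below; the field names are plausible ARRIVE-2.0-aligned examples, not the published MNMS itself.

```python
# Toy completeness audit over in vivo study records against a minimal field set.
MNMS_FIELDS = ["species", "strain", "sex", "age", "housing", "ethical_approval"]

records = [
    {"species": "Mus musculus", "strain": "C57BL/6J", "sex": "F", "age": "10w"},
    {"species": "Rattus norvegicus", "sex": "M", "housing": "pair", "age": "12w"},
]

for i, rec in enumerate(records, 1):
    missing = [f for f in MNMS_FIELDS if f not in rec]
    completeness = 1 - len(missing) / len(MNMS_FIELDS)
    print(f"record {i}: {completeness:.0%} complete; missing: {missing}")
```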


Subject(s)
Biomedical Research , Metadata , Animals , Reproducibility of Results , Animal Welfare
13.
PLoS One ; 19(3): e0297404, 2024.
Article in English | MEDLINE | ID: mdl-38446758

ABSTRACT

Film festivals are a key component of the global film industry in terms of trendsetting, publicity, trade, and collaboration. We present an unprecedented analysis of the international film festival circuit, which has so far remained relatively understudied quantitatively, partly due to the limited availability of suitable data sets. We use large-scale data from the Cinando platform of the Cannes Film Market, widely used by industry professionals. We explicitly model festival events as a global network connected by shared films and quantify festivals as aggregates of the metadata of their showcased films. Importantly, we argue against using simple count distributions for discrete labels such as language or production country, as such categories are typically not equidistant. Rather, we propose embedding them in continuous latent vector spaces. We demonstrate how these "festival embeddings" provide insight into changes in programmed content over time, predict festival connections, and can be used to measure diversity in film festival programming across various cultural, social, and geographical variables, all of which constitute aspects of public value creation by film festivals. Our results provide a novel mapping of the film festival circuit from 2009 to 2021 (616 festivals, 31,989 unique films), highlighting festival types that occupy specific niches, diverse series, and those that evolve over time. We also discuss how these quantitative findings fit into media studies and research on public value creation by cultural industries. With festivals occupying a central position in the film industry, investigations into the data they generate hold opportunities for researchers to better understand industry dynamics and cultural impact, and for organizers, policymakers, and industry actors to make more informed, data-driven decisions. We hope our proposed methodological approach to festival data paves the way for more comprehensive film festival studies and large-scale quantitative cultural event analytics in general.
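
The core methodological move, embedding categorical labels in a continuous space learned from co-occurrence, can be sketched with a low-rank factorization of a festival-by-film incidence matrix, as below. The data and dimensionality are toy placeholders, not the paper's actual model.

```python
# Toy "festival embeddings" from shared-film co-occurrence via truncated SVD.
import numpy as np

# Festival-by-film incidence matrix (rows: festivals, cols: films).
X = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1]], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
emb = U[:, :2] * S[:2]            # 2-dimensional festival embeddings

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("similarity(festival 0, festival 1):", round(cosine(emb[0], emb[1]), 3))
```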


Subject(s)
Holidays , Industry , Geography , Language , Metadata
14.
PLoS One ; 19(3): e0296810, 2024.
Article in English | MEDLINE | ID: mdl-38483886

ABSTRACT

Contact matrices are a commonly adopted data representation, used to develop compartmental models for epidemic spreading that account for contact heterogeneities across age groups. Their estimation, however, is generally time- and effort-consuming, and model-driven strategies to quantify the contacts are often needed. In this article we focus on household contact matrices, which describe the contacts among members of a family, and develop a parametric model for them. This model combines demographic and easily quantifiable survey-based data and is tested on high-resolution proximity data collected at two sites in South Africa. Given its simplicity and interpretability, we expect our method to be easily applied to other contexts as well, and we identify relevant questions that need to be addressed during the data collection procedure.
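
As a hedged sketch of what an empirical household contact matrix looks like, the code below assumes every pair of household members is in contact and tallies ordered pairs by age band; the bands and household compositions are toy placeholders, not the authors' parametric model.

```python
# Toy empirical household contact matrix by age band.
import numpy as np
from itertools import permutations

age_bands = [(0, 15), (15, 65), (65, 120)]               # assumed age banding
def band(age):
    return next(i for i, (lo, hi) in enumerate(age_bands) if lo <= age < hi)

households = [[3, 34, 36], [70, 72], [10, 14, 41, 43]]   # member ages (toy data)

C = np.zeros((len(age_bands), len(age_bands)))
pop = np.zeros(len(age_bands))
for hh in households:
    for a in hh:
        pop[band(a)] += 1
    for a, b in permutations(hh, 2):                     # ordered contact pairs
        C[band(a), band(b)] += 1

contact_rate = C / pop[:, None]   # mean within-household contacts per person
print(np.round(contact_rate, 2))
```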


Subject(s)
Epidemics , Metadata , Surveys and Questionnaires , Epidemiological Models , South Africa , Contact Tracing/methods
15.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38445753

ABSTRACT

SUMMARY: Python is the most commonly used language for deep learning (DL). Existing Python packages for mass spectrometry imaging (MSI) data are not optimized for DL tasks. We, therefore, introduce pyM2aia, a Python package for MSI data analysis with a focus on memory-efficient handling, processing and convenient data-access for DL applications. pyM2aia provides interfaces to its parent application M2aia, which offers interactive capabilities for exploring and annotating MSI data in imzML format. pyM2aia utilizes the image input and output routines, data formats, and processing functions of M2aia, ensures data interchangeability, and enables the writing of readable and easy-to-maintain DL pipelines by providing batch generators for typical MSI data access strategies. We showcase the package in several examples, including imzML metadata parsing, signal processing, ion-image generation, and, in particular, DL model training and inference for spectrum-wise approaches, ion-image-based approaches, and approaches that use spectral and spatial information simultaneously. AVAILABILITY AND IMPLEMENTATION: Python package, code and examples are available at (https://m2aia.github.io/m2aia).
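
pyM2aia's batch generators are tailored to DL training loops; as a generic stand-in for the spectrum-wise access pattern it provides, the sketch below uses the separate, widely used pyimzml library (deliberately not pyM2aia itself, whose API differs). The file name is a placeholder.

```python
# Spectrum-wise imzML access and batching, sketched with pyimzml (not pyM2aia).
from pyimzml.ImzMLParser import ImzMLParser, getionimage

parser = ImzMLParser("example.imzML")   # placeholder file path

def spectrum_batches(parser, batch_size=32):
    """Yield batches of (m/z array, intensity array) pairs, e.g. for DL training."""
    batch = []
    for idx, _coord in enumerate(parser.coordinates):
        batch.append(parser.getspectrum(idx))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

image = getionimage(parser, 885.55, tol=0.25)   # ion image at an example m/z
print(image.shape)
```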


Subject(s)
Deep Learning , Software , Mass Spectrometry/methods , Language , Metadata
16.
J Am Med Inform Assoc ; 31(4): 910-918, 2024 Apr 03.
Article in English | MEDLINE | ID: mdl-38308819

ABSTRACT

OBJECTIVES: Despite federally mandated collection of sex and gender demographics in the electronic health record (EHR), longitudinal assessments are lacking. We assessed sex and gender demographic field utilization using EHR metadata. MATERIALS AND METHODS: Patients ≥18 years of age in the Mass General Brigham health system with a first Legal Sex entry (registration requirement) between January 8, 2018 and January 1, 2022 were included in this retrospective study. Metadata for all sex and gender fields (Legal Sex, Sex Assigned at Birth [SAAB], Gender Identity) were quantified by completion rates, user types, and longitudinal change. A nested qualitative study of providers from specialties with high and low field use identified themes related to utilization. RESULTS: 1 576 120 patients met inclusion criteria: 100% had a Legal Sex, 20% a Gender Identity, and 19% a SAAB; 321 185 patients had field changes other than initial Legal Sex entry. About 2% of patients had a subsequent Legal Sex change, and 25% of those had ≥2 changes; 20% of patients had ≥1 update to Gender Identity and 19% to SAAB. Excluding the first Legal Sex entry, administrators made most changes (67%) across all fields, followed by patients (25%), providers (7.2%), and automated Health Level-7 (HL7) interface messages (0.7%). Provider utilization varied by subspecialty; themes related to systems barriers and personal perceptions were identified. DISCUSSION: Sex and gender demographic fields are primarily used by administrators and raise concern about data accuracy; provider use is heterogenous and lacking. Provider awareness of field availability and variable workflows may impede use. CONCLUSION: EHR metadata highlights areas for improvement of sex and gender field utilization.
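
The field-utilization rates reported above amount to per-column completion percentages over patient records, as in the hedged pandas sketch below; the column names stand in for EHR metadata extracts and are not the Mass General Brigham schema.

```python
# Toy completion-rate computation for sex and gender demographic fields.
import pandas as pd

patients = pd.DataFrame({
    "legal_sex":       ["F", "M", "F", "M", "F"],
    "gender_identity": ["Woman", None, None, "Man", None],
    "saab":            ["Female", None, None, None, "Female"],  # sex assigned at birth
})

completion = patients.notna().mean().mul(100).round(1)
print(completion)   # % of patients with each field populated
```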


Subject(s)
Gender Identity , Transgender Persons , Infant, Newborn , Humans , Male , Female , Electronic Health Records , Metadata , Retrospective Studies , Demography
17.
Sci Data ; 11(1): 179, 2024 Feb 08.
Article in English | MEDLINE | ID: mdl-38332144

ABSTRACT

Data standardization promotes a common framework through which researchers can utilize others' data and is one of the leading methods neuroimaging researchers use to share and replicate findings. As of today, standardizing datasets requires technical expertise such as coding and knowledge of file formats. We present ezBIDS, a tool for converting neuroimaging data and associated metadata to the Brain Imaging Data Structure (BIDS) standard. ezBIDS contains four major features: (1) No installation or programming requirements. (2) Handling of both imaging and task events data and metadata. (3) Semi-automated inference and guidance for adherence to BIDS. (4) Multiple data management options: download BIDS data to local system, or transfer to OpenNeuro.org or to brainlife.io. In sum, ezBIDS requires neither coding proficiency nor knowledge of BIDS, and is the first BIDS tool to offer guided standardization, support for task events conversion, and interoperability with OpenNeuro.org and brainlife.io.
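
At the file level, "adherence to BIDS" starts with a required dataset_description.json and sub-<label> subject directories. The sketch below is a deliberately simplified check of those two requirements, not the ezBIDS or BIDS validator.

```python
# Simplified BIDS sanity check (dataset_description.json + subject folders).
import json
from pathlib import Path

def quick_bids_check(root_path: str) -> list:
    root = Path(root_path)
    problems = []
    desc = root / "dataset_description.json"
    if not desc.exists():
        problems.append("missing dataset_description.json")
    elif "Name" not in json.loads(desc.read_text()):
        problems.append("dataset_description.json lacks the required 'Name' field")
    if not any(p.is_dir() and p.name.startswith("sub-") for p in root.iterdir()):
        problems.append("no sub-<label> subject directories found")
    return problems

print(quick_bids_check("."))   # run against a placeholder path
```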


Subject(s)
Metadata , Neuroimaging , Data Display , Data Analysis
18.
Astrobiology ; 24(2): 131-137, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38393827

ABSTRACT

As scientific investigations increasingly adopt Open Science practices, reuse of data becomes paramount. However, despite decades of progress in internet search tools, finding relevant astrobiology datasets for an envisioned investigation remains challenging due to the precise and atypical needs of the astrobiology researcher. In response, we have developed the Astrobiology Resource Metadata Standard (ARMS), a metadata standard designed to uniformly describe astrobiology "resources," that is, virtually any product of astrobiology research. Those resources include datasets, physical samples, software (modeling codes and scripts), publications, websites, images, videos, presentations, and so on. ARMS has been formulated to describe astrobiology resources generated by individual scientists or smaller scientific teams, rather than larger mission teams who may be required to use more complex archival metadata schemes. In the following, we discuss the participatory development process, give an overview of the metadata standard, describe its current use in practice, and close with a discussion of additional possible uses and extensions.


Subject(s)
Exobiology , Metadata , Software
19.
Mol Cell Proteomics ; 23(3): 100731, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38331191

ABSTRACT

Proteomics data sharing has profound benefits at the individual level as well as at the community level. While data sharing has increased over the years, mostly due to journal and funding agency requirements, the reluctance of researchers is evident, as many share only the bare minimum dataset required to publish an article. In many cases, proper metadata is missing, essentially rendering the dataset useless. This behavior can be explained by a lack of incentives, insufficient awareness, or a lack of clarity surrounding ethical issues. Through adequate training at research institutes, researchers can realize the benefits associated with data sharing and can accelerate the norm of data sharing for the field of proteomics, as has been the standard in genomics for decades. In this article, we have put together the various repository options available for proteomics data. We have also added the pros and cons of those repositories to help researchers select the repository most suitable for their data submission. It is also important to note that a few types of proteomics data have the potential to re-identify an individual in certain scenarios. In such cases, extra caution should be taken to remove any personal identifiers before sharing on public repositories. Datasets that would be useless without personal identifiers need to be shared in a controlled-access repository so that only authorized researchers can access the data and personal identifiers are kept safe.


Subject(s)
Privacy , Proteomics , Humans , Genomics , Metadata , Information Dissemination
20.
J Environ Manage ; 354: 120349, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38401497

ABSTRACT

Flow obstructed by bridge piers can increase sediment transport, leading to local scour. This local scour poses a risk to the stability of bridge structures and could lead to structural failures. There are two main approaches for evaluating the scour depth (ds) at bridge piers. The first is based on understanding hydraulic phenomena and developing relationships with the properties affecting scour. The second uses data-driven soft computing models that lack physical interpretation but rely on algorithms to predict outcomes. Researchers choose methods based on their goals and resources. This study aims to create innovative ensemble frameworks comprising support vector machine for regression (SVMR), random forest regression (RFR), and reduced error pruning tree (REPTree) as base learners, alongside bagging regression tree (BRT) and stochastic gradient boosting (SGB) as meta learners. These ensembles were developed to analyse maximum scour depths (dsm) under clear-water conditions, utilizing experimental data from 35 studies published over the last 63 years. The performance of each machine learning (ML) approach was assessed using statistical performance indicators. The proposed models were also compared with the six empirical equations with the strongest predictive ability. Results show that among these empirical equations, the equation from Nandi and Das (2023) performs best. In performance evaluations on the training set, the testing set, and the entire dataset, SGB (REPTree), BRT (SVMR-PUK), and SGB (REPTree), respectively, exhibited the highest performance, securing the top rank among all ML models and empirical equations. Sensitivity analysis identified sediment gradation and flow intensity as the most influential variables for predicting dsm during the training and testing phases, respectively.
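
The bagging and stochastic-gradient-boosting meta-learners named above have close scikit-learn analogues, sketched below on synthetic data; the feature names are illustrative, REPTree (a Weka learner) has no direct sklearn equivalent, and this is not the authors' tuned setup.

```python
# Hedged sklearn analogue: bagged SVR vs. stochastic gradient boosting.
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))   # stand-ins for flow intensity, depth, pier width, d50
y = 1.5 * X[:, 0] + 0.8 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bagged_svr = BaggingRegressor(estimator=SVR(kernel="rbf"), n_estimators=25,
                              random_state=0).fit(X_tr, y_tr)   # bagging meta-learner
sgb = GradientBoostingRegressor(subsample=0.8,                  # subsample < 1 => stochastic GB
                                random_state=0).fit(X_tr, y_tr)

for name, model in [("Bagging(SVR)", bagged_svr), ("SGB", sgb)]:
    print(name, "test R^2:", round(r2_score(y_te, model.predict(X_te)), 3))
```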


Subject(s)
Metadata , Water , Algorithms , Machine Learning