Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 723
Filter
Add more filters

Publication year range
1.
Nat Methods ; 20(3): 400-402, 2023 03.
Article in English | MEDLINE | ID: mdl-36759590

ABSTRACT

The design of biocatalytic reaction systems is highly complex owing to the dependency of the estimated kinetic parameters on the enzyme, the reaction conditions, and the modeling method. Consequently, reproducibility of enzymatic experiments and reusability of enzymatic data are challenging. We developed the XML-based markup language EnzymeML to enable storage and exchange of enzymatic data such as reaction conditions, the time course of the substrate and the product, kinetic parameters and the kinetic model, thus making enzymatic data findable, accessible, interoperable and reusable (FAIR). The feasibility and usefulness of the EnzymeML toolbox is demonstrated in six scenarios, for which data and metadata of different enzymatic reactions are collected and analyzed. EnzymeML serves as a seamless communication channel between experimental platforms, electronic lab notebooks, tools for modeling of enzyme kinetics, publication platforms and enzymatic reaction databases. EnzymeML is open and transparent, and invites the community to contribute. All documents and codes are freely available at https://enzymeml.org .


Subject(s)
Data Management , Metadata , Reproducibility of Results , Databases, Factual , Kinetics
2.
Mol Cell Proteomics ; 23(3): 100731, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38331191

ABSTRACT

Proteomics data sharing has profound benefits at the individual level as well as at the community level. While data sharing has increased over the years, mostly due to journal and funding agency requirements, the reluctance of researchers with regard to data sharing is evident as many shares only the bare minimum dataset required to publish an article. In many cases, proper metadata is missing, essentially making the dataset useless. This behavior can be explained by a lack of incentives, insufficient awareness, or a lack of clarity surrounding ethical issues. Through adequate training at research institutes, researchers can realize the benefits associated with data sharing and can accelerate the norm of data sharing for the field of proteomics, as has been the standard in genomics for decades. In this article, we have put together various repository options available for proteomics data. We have also added pros and cons of those repositories to facilitate researchers in selecting the repository most suitable for their data submission. It is also important to note that a few types of proteomics data have the potential to re-identify an individual in certain scenarios. In such cases, extra caution should be taken to remove any personal identifiers before sharing on public repositories. Data sets that will be useless without personal identifiers need to be shared in a controlled access repository so that only authorized researchers can access the data and personal identifiers are kept safe.


Subject(s)
Privacy , Proteomics , Humans , Genomics , Metadata , Information Dissemination
3.
Nucleic Acids Res ; 52(D1): D640-D646, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37971328

ABSTRACT

MetaboLights is a global database for metabolomics studies including the raw experimental data and the associated metadata. The database is cross-species and cross-technique and covers metabolite structures and their reference spectra as well as their biological roles and locations where available. MetaboLights is the recommended metabolomics repository for a number of leading journals and ELIXIR, the European infrastructure for life science information. In this article, we describe the continued growth and diversity of submissions and the significant developments in recent years. In particular, we highlight MetaboLights Labs, our new Galaxy Project instance with repository-scale standardized workflows, and how data public on MetaboLights are being reused by the community. Metabolomics resources and data are available under the EMBL-EBI's Terms of Use at https://www.ebi.ac.uk/metabolights and under Apache 2.0 at https://github.com/EBI-Metabolights.


Subject(s)
Databases, Genetic , Metabolomics , Metabolomics/methods , Metadata , Internet
4.
Nucleic Acids Res ; 52(D1): D1024-D1032, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37941143

ABSTRACT

The silkworm Bombyx mori is a domesticated insect that serves as an animal model for research and agriculture. The silkworm super-pan-genome dataset, which we published last year, is a unique resource for the study of global genomic diversity and phenotype-genotype association. Here we present SilkMeta (http://silkmeta.org.cn), a comprehensive database covering the available silkworm pan-genome and multi-omics data. The database contains 1082 short-read genomes, 546 long-read assembled genomes, 1168 transcriptomes, 294 phenotype characterizations (phenome), tens of millions of variations (variome), 7253 long non-coding RNAs (lncRNAs), 18 717 full length transcripts and a set of population statistics. We have compiled publications on functional genomics research and genetic stock deciphering (mutant map). A range of bioinformatics tools is also provided for data visualization and retrieval. The large batch of omics data and tools were integrated in twelve functional modules that provide useful strategies and data for comparative and functional genomics research. The interactive bioinformatics platform SilkMeta will benefit not only the silkworm but also the insect biology communities.


Subject(s)
Bombyx , Genome, Insect , Animals , Bombyx/genetics , Computational Biology , Genomics , Metadata , Multiomics
5.
Nucleic Acids Res ; 52(D1): D107-D114, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37992296

ABSTRACT

Expression Atlas (www.ebi.ac.uk/gxa) and its newest counterpart the Single Cell Expression Atlas (www.ebi.ac.uk/gxa/sc) are EMBL-EBI's knowledgebases for gene and protein expression and localisation in bulk and at single cell level. These resources aim to allow users to investigate their expression in normal tissue (baseline) or in response to perturbations such as disease or changes to genotype (differential) across multiple species. Users are invited to search for genes or metadata terms across species or biological conditions in a standardised consistent interface. Alongside these data, new features in Single Cell Expression Atlas allow users to query metadata through our new cell type wheel search. At the experiment level data can be explored through two types of dimensionality reduction plots, t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP), overlaid with either clustering or metadata information to assist users' understanding. Data are also visualised as marker gene heatmaps identifying genes that help confer cluster identity. For some data, additional visualisations are available as interactive cell level anatomograms and cell type gene expression heatmaps.


Subject(s)
Databases, Genetic , Gene Expression Profiling , Proteomics , Genotype , Metadata , Single-Cell Analysis , Internet , Humans , Animals
6.
Nucleic Acids Res ; 52(D1): D164-D173, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37930866

ABSTRACT

Plasmids are mobile genetic elements found in many clades of Archaea and Bacteria. They drive horizontal gene transfer, impacting ecological and evolutionary processes within microbial communities, and hold substantial importance in human health and biotechnology. To support plasmid research and provide scientists with data of an unprecedented diversity of plasmid sequences, we introduce the IMG/PR database, a new resource encompassing 699 973 plasmid sequences derived from genomes, metagenomes and metatranscriptomes. IMG/PR is the first database to provide data of plasmid that were systematically identified from diverse microbiome samples. IMG/PR plasmids are associated with rich metadata that includes geographical and ecosystem information, host taxonomy, similarity to other plasmids, functional annotation, presence of genes involved in conjugation and antibiotic resistance. The database offers diverse methods for exploring its extensive plasmid collection, enabling users to navigate plasmids through metadata-centric queries, plasmid comparisons and BLAST searches. The web interface for IMG/PR is accessible at https://img.jgi.doe.gov/pr. Plasmid metadata and sequences can be downloaded from https://genome.jgi.doe.gov/portal/IMG_PR.


Subject(s)
Metagenome , Microbiota , Humans , Metadata , Software , Databases, Genetic , Plasmids/genetics
7.
Nucleic Acids Res ; 52(D1): D67-D71, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37971299

ABSTRACT

The Bioinformation and DNA Data Bank of Japan (DDBJ) Center (https://www.ddbj.nig.ac.jp) provides database archives that cover a wide range of fields in life sciences. As a founding member of the International Nucleotide Sequence Database Collaboration (INSDC), DDBJ accepts and distributes nucleotide sequence data as well as their study and sample information along with the National Center for Biotechnology Information in the United States and the European Bioinformatics Institute (EBI). Besides INSDC databases, the DDBJ Center provides databases for functional genomics (GEA: Genomic Expression Archive), metabolomics (MetaboBank) and human genetic and phenotypic data (JGA: Japanese Genotype-phenotype Archive). These database systems have been built on the National Institute of Genetics (NIG) supercomputer, which is also open for domestic life science researchers to analyze large-scale sequence data. This paper reports recent updates on the archival databases and the services of the DDBJ Center, highlighting the newly redesigned MetaboBank. MetaboBank uses BioProject and BioSample in its metadata description making it suitable for multi-omics large studies. Its collaboration with MetaboLights at EBI brings synergy in locating and reusing public data.


Subject(s)
Databases, Nucleic Acid , Metabolomics , Metadata , Humans , Computational Biology , Genomics , Internet , Japan , Multiomics/methods
8.
Bioinformatics ; 40(Suppl 2): ii4-ii10, 2024 09 01.
Article in English | MEDLINE | ID: mdl-39230700

ABSTRACT

With the development of high-throughput technologies, genomics datasets rapidly grow in size, including functional genomics data. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at a price of data consistency, often aggregating results from a large number of studies, conducted under varying experimental conditions. While data from large-scale consortia are useful as they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD)-an approach that allows disentangling biologically relevant features from potential technical biases. MFD incorporates target metadata into model training, by conditioning weights of the model output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection, by connecting latent features to experimental factors, without compromising, or even improving performance in downstream tasks, such as enhancer prediction, or genetic variant discovery. The code will be made available at https://github.com/HealthML/MFD.


Subject(s)
Genomics , Metadata , Genomics/methods , Deep Learning , Humans
9.
Bioinformatics ; 40(3)2024 03 04.
Article in English | MEDLINE | ID: mdl-38445753

ABSTRACT

SUMMARY: Python is the most commonly used language for deep learning (DL). Existing Python packages for mass spectrometry imaging (MSI) data are not optimized for DL tasks. We, therefore, introduce pyM2aia, a Python package for MSI data analysis with a focus on memory-efficient handling, processing and convenient data-access for DL applications. pyM2aia provides interfaces to its parent application M2aia, which offers interactive capabilities for exploring and annotating MSI data in imzML format. pyM2aia utilizes the image input and output routines, data formats, and processing functions of M2aia, ensures data interchangeability, and enables the writing of readable and easy-to-maintain DL pipelines by providing batch generators for typical MSI data access strategies. We showcase the package in several examples, including imzML metadata parsing, signal processing, ion-image generation, and, in particular, DL model training and inference for spectrum-wise approaches, ion-image-based approaches, and approaches that use spectral and spatial information simultaneously. AVAILABILITY AND IMPLEMENTATION: Python package, code and examples are available at (https://m2aia.github.io/m2aia).


Subject(s)
Deep Learning , Software , Mass Spectrometry/methods , Language , Metadata
10.
Bioinformatics ; 40(8)2024 08 02.
Article in English | MEDLINE | ID: mdl-39037960

ABSTRACT

MOTIVATION: Software plays a crucial and growing role in research. Unfortunately, the computational component in Life Sciences research is often challenging to reproduce and verify. It could be undocumented, opaque, contain unknown errors that affect the outcome, or be directly unavailable and impossible to use for others. These issues are detrimental to the overall quality of scientific research. One step to address this problem is the formulation of principles that research software in the domain should meet to ensure its quality and sustainability, resembling the FAIR (findable, accessible, interoperable, and reusable) data principles. RESULTS: We present here a comprehensive series of quantitative indicators based on a pragmatic interpretation of the FAIR Principles and their implementation on OpenEBench, ELIXIR's open platform providing both support for scientific benchmarking and an active observatory of quality-related features for Life Sciences research software. The results serve to understand the current practices around research software quality-related features and provide objective indications for improving them. AVAILABILITY AND IMPLEMENTATION: Software metadata, from 11 different sources, collected, integrated, and analysed in the context of this manuscript are available at https://doi.org/10.5281/zenodo.7311067. Code used for software metadata retrieval and processing is available in the following repository: https://gitlab.bsc.es/inb/elixir/software-observatory/FAIRsoft_ETL.


Subject(s)
Software , Computational Biology/methods , Metadata
12.
Nucleic Acids Res ; 51(D1): D733-D743, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36399502

ABSTRACT

Viruses are widely recognized as critical members of all microbiomes. Metagenomics enables large-scale exploration of the global virosphere, progressively revealing the extensive genomic diversity of viruses on Earth and highlighting the myriad of ways by which viruses impact biological processes. IMG/VR provides access to the largest collection of viral sequences obtained from (meta)genomes, along with functional annotation and rich metadata. A web interface enables users to efficiently browse and search viruses based on genome features and/or sequence similarity. Here, we present the fourth version of IMG/VR, composed of >15 million virus genomes and genome fragments, a ≈6-fold increase in size compared to the previous version. These clustered into 8.7 million viral operational taxonomic units, including 231 408 with at least one high-quality representative. Viral sequences in IMG/VR are now systematically identified from genomes, metagenomes, and metatranscriptomes using a new detection approach (geNomad), and IMG standard annotation are complemented with genome quality estimation using CheckV, taxonomic classification reflecting the latest taxonomic standards, and microbial host taxonomy prediction. IMG/VR v4 is available at https://img.jgi.doe.gov/vr, and the underlying data are available to download at https://genome.jgi.doe.gov/portal/IMG_VR.


Subject(s)
Databases, Genetic , Genome, Viral , Metadata , Metagenomics , Software
13.
BMC Bioinformatics ; 25(1): 184, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38724907

ABSTRACT

BACKGROUND: Major advances in sequencing technologies and the sharing of data and metadata in science have resulted in a wealth of publicly available datasets. However, working with and especially curating public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these present limitations that often lead to the need for further in-house curation and processing. RESULTS: Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a python3 package designed to accompany and guide the researcher during the curation process of metadata and fastq files of public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from different sources. CONCLUSIONS: Thus, it offers valuable tools for the in-house curation previously needed to re-use public omics data. Due to its workflow structure and capabilities, it can be easily used and benefit investigators in developing novel omics meta-analyses based on sequencing data.


Subject(s)
Data Curation , Software , Workflow , Data Curation/methods , Metadata , Databases, Genetic , Genomics/methods , Computational Biology/methods
14.
BMC Med ; 22(1): 296, 2024 Jul 18.
Article in English | MEDLINE | ID: mdl-39020355

ABSTRACT

BACKGROUND: Sexually transmitted infections (STIs) pose a significant global public health challenge. Early diagnosis and treatment reduce STI transmission, but rely on recognising symptoms and care-seeking behaviour of the individual. Digital health software that distinguishes STI skin conditions could improve health-seeking behaviour. We developed and evaluated a deep learning model to differentiate STIs from non-STIs based on clinical images and symptoms. METHODS: We used 4913 clinical images of genital lesions and metadata from the Melbourne Sexual Health Centre collected during 2010-2023. We developed two binary classification models to distinguish STIs from non-STIs: (1) a convolutional neural network (CNN) using images only and (2) an integrated model combining both CNN and fully connected neural network (FCN) using images and metadata. We evaluated the model performance by the area under the ROC curve (AUC) and assessed metadata contributions to the Image-only model. RESULTS: Our study included 1583 STI and 3330 non-STI images. Common STI diagnoses were syphilis (34.6%), genital warts (24.5%) and herpes (19.4%), while most non-STIs (80.3%) were conditions such as dermatitis, lichen sclerosis and balanitis. In both STI and non-STI groups, the most frequently observed groups were 25-34 years (48.6% and 38.2%, respectively) and heterosexual males (60.3% and 45.9%, respectively). The Image-only model showed a reasonable performance with an AUC of 0.859 (SD 0.013). The Image + Metadata model achieved a significantly higher AUC of 0.893 (SD 0.018) compared to the Image-only model (p < 0.01). Out of 21 metadata, the integration of demographic and dermatological metadata led to the most significant improvement in model performance, increasing AUC by 6.7% compared to the baseline Image-only model. CONCLUSIONS: The Image + Metadata model outperformed the Image-only model in distinguishing STIs from other skin conditions. Using it as a screening tool in a clinical setting may require further development and evaluation with larger datasets.


Subject(s)
Metadata , Sexually Transmitted Diseases , Humans , Sexually Transmitted Diseases/diagnosis , Male , Female , Adult , Artificial Intelligence , Middle Aged , Neural Networks, Computer , Young Adult , Mass Screening/methods , Skin Diseases/diagnosis , Deep Learning
15.
Nat Methods ; 18(12): 1489-1495, 2021 12.
Article in English | MEDLINE | ID: mdl-34862503

ABSTRACT

For quality, interpretation, reproducibility and sharing value, microscopy images should be accompanied by detailed descriptions of the conditions that were used to produce them. Micro-Meta App is an intuitive, highly interoperable, open-source software tool that was developed in the context of the 4D Nucleome (4DN) consortium and is designed to facilitate the extraction and collection of relevant microscopy metadata as specified by the recent 4DN-BINA-OME tiered-system of Microscopy Metadata specifications. In addition to substantially lowering the burden of quality assurance, the visual nature of Micro-Meta App makes it particularly suited for training purposes.


Subject(s)
Metadata , Microscopy, Confocal/instrumentation , Microscopy, Confocal/methods , Microscopy, Fluorescence/instrumentation , Microscopy, Fluorescence/methods , Mobile Applications , Programming Languages , Software , Animals , Cell Line , Computational Biology/methods , Humans , Image Processing, Computer-Assisted , Mice , Pattern Recognition, Automated , Quality Control , Reproducibility of Results , User-Computer Interface , Workflow
16.
Nat Methods ; 18(12): 1496-1498, 2021 12.
Article in English | MEDLINE | ID: mdl-34845388

ABSTRACT

The rapid pace of innovation in biological imaging and the diversity of its applications have prevented the establishment of a community-agreed standardized data format. We propose that complementing established open formats such as OME-TIFF and HDF5 with a next-generation file format such as Zarr will satisfy the majority of use cases in bioimaging. Critically, a common metadata format used in all these vessels can deliver truly findable, accessible, interoperable and reusable bioimaging data.


Subject(s)
Computational Biology/instrumentation , Computational Biology/standards , Metadata , Microscopy/instrumentation , Microscopy/standards , Software , Benchmarking , Computational Biology/methods , Data Compression , Databases, Factual , Information Storage and Retrieval , Internet , Microscopy/methods , Programming Languages , SARS-CoV-2
17.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35901452

ABSTRACT

Measuring the semantic similarity between Gene Ontology (GO) terms is a fundamental step in numerous functional bioinformatics applications. To fully exploit the metadata of GO terms, word embedding-based methods have been proposed recently to map GO terms to low-dimensional feature vectors. However, these representation methods commonly overlook the key information hidden in the whole GO structure and the relationship between GO terms. In this paper, we propose a novel representation model for GO terms, named GT2Vec, which jointly considers the GO graph structure obtained by graph contrastive learning and the semantic description of GO terms based on BERT encoders. Our method is evaluated on a protein similarity task on a collection of benchmark datasets. The experimental results demonstrate the effectiveness of using a joint encoding graph structure and textual node descriptors to learn vector representations for GO terms.


Subject(s)
Computational Biology , Semantics , Computational Biology/methods , Gene Ontology , Metadata
18.
Brief Bioinform ; 23(3)2022 05 13.
Article in English | MEDLINE | ID: mdl-35438138

ABSTRACT

Since its launch in 2008, the European Genome-Phenome Archive (EGA) has been leading the archiving and distribution of human identifiable genomic data. In this regard, one of the community concerns is the potential usability of the stored data, as of now, data submitters are not mandated to perform any quality control (QC) before uploading their data and associated metadata information. Here, we present a new File QC Portal developed at EGA, along with QC reports performed and created for 1 694 442 files [Fastq, sequence alignment map (SAM)/binary alignment map (BAM)/CRAM and variant call format (VCF)] submitted at EGA. QC reports allow anonymous EGA users to view summary-level information regarding the files within a specific dataset, such as quality of reads, alignment quality, number and type of variants and other features. Researchers benefit from being able to assess the quality of data prior to the data access decision and thereby, increasing the reusability of data (https://ega-archive.org/blog/data-upcycling-powered-by-ega/).


Subject(s)
Genome , Genomics , High-Throughput Nucleotide Sequencing , Humans , Metadata , Quality Control , Software
19.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36857584

ABSTRACT

MOTIVATION: The Gene Expression Omnibus has become an important source of biological data for secondary analysis. However, there is no simple, programmatic way to download data and metadata from Gene Expression Omnibus (GEO) in a standardized annotation format. RESULTS: To address this, we present GEOfetch-a command-line tool that downloads and organizes data and metadata from GEO and SRA. GEOfetch formats the downloaded metadata as a Portable Encapsulated Project, providing universal format for the reanalysis of public data. AVAILABILITY AND IMPLEMENTATION: GEOfetch is available on Bioconda and the Python Package Index (PyPI).


Subject(s)
Gene Expression , Metadata , Computational Biology
20.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36322820

ABSTRACT

MOTIVATION: Drug discovery practitioners in industry and academia use semantic tools to extract information from online scientific literature to generate new insights into targets, therapeutics and diseases. However, due to complexities in access and analysis, patent-based literature is often overlooked as a source of information. As drug discovery is a highly competitive field, naturally, tools that tap into patent literature can provide any actor in the field an advantage in terms of better informed decision-making. Hence, we aim to facilitate access to patent literature through the creation of an automatic tool for extracting information from patents described in existing public resources. RESULTS: Here, we present PEMT, a novel patent enrichment tool, that takes advantage of public databases like ChEMBL and SureChEMBL to extract relevant patent information linked to chemical structures and/or gene names described through FAIR principles and metadata annotations. PEMT aims at supporting drug discovery and research by establishing a patent landscape around genes of interest. The pharmaceutical focus of the tool is mainly due to the subselection of International Patent Classification codes, but in principle, it can be used for other patent fields, provided that a link between a concept and chemical structure is investigated. Finally, we demonstrate a use-case in rare diseases by generating a gene-patent list based on the epidemiological prevalence of these diseases and exploring their underlying patent landscapes. AVAILABILITY AND IMPLEMENTATION: PEMT is an open-source Python tool and its source code and PyPi package are available at https://github.com/Fraunhofer-ITMP/PEMT and https://pypi.org/project/PEMT/, respectively. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Metadata , Software , Databases, Factual
SELECTION OF CITATIONS
SEARCH DETAIL