1.
Nat Methods ; 19(7): 865-870, 2022 07.
Article in English | MEDLINE | ID: mdl-35637304

ABSTRACT

Current methods for structure elucidation of small molecules rely on finding similarity with spectra of known compounds, but do not predict structures de novo for unknown compound classes. We present MSNovelist, which combines fingerprint prediction with an encoder-decoder neural network to generate structures de novo solely from tandem mass spectrometry (MS2) spectra. In an evaluation with 3,863 MS2 spectra from the Global Natural Product Social Molecular Networking site, MSNovelist predicted 25% of structures correctly on first rank, retrieved 45% of structures overall and reproduced 61% of correct database annotations, without having ever seen the structure in the training phase. Similarly, for the CASMI 2016 challenge, MSNovelist correctly predicted 26% and retrieved 57% of structures, recovering 64% of correct database annotations. Finally, we illustrate the application of MSNovelist in a bryophyte MS2 dataset, in which de novo structure prediction substantially outscored the best database candidate for seven spectra. MSNovelist is ideally suited to complement library-based annotation in the case of poorly represented analyte classes and novel compounds.


Subjects
Tandem Mass Spectrometry; Databases, Factual
2.
Proc Natl Acad Sci U S A ; 119(35): e2122636119, 2022 08 30.
Article in English | MEDLINE | ID: mdl-36018838

ABSTRACT

Taxonomic classification, that is, the assignment to biological clades with shared ancestry, is a common task in genetics, mainly based on a genome similarity search of large genome databases. The classification quality depends heavily on the database, since representative relatives must be present. Many genomic sequences cannot be classified at all or only with a high misclassification rate. Here we present BERTax, a deep neural network program based on natural language processing to precisely classify the superkingdom and phylum of DNA sequences taxonomically without the need for a known representative relative from a database. We show BERTax to be at least on par with the state-of-the-art approaches when taxonomically similar species are part of the training data. For novel organisms, however, BERTax clearly outperforms any existing approach. Finally, we show that BERTax can also be combined with database approaches to further increase the prediction quality in almost all cases. Since BERTax is not based on similar entries in databases, it allows precise taxonomic classification of a broader range of genomic sequences, thus increasing the overall information gain.


Subjects
DNA Barcoding, Taxonomic; DNA; Deep Learning; Software; Algorithms; Base Sequence; DNA/classification; DNA/genetics; DNA Barcoding, Taxonomic/methods; Genome; Genomics
3.
Anal Chem ; 95(32): 11901-11907, 2023 08 15.
Article in English | MEDLINE | ID: mdl-37540774

ABSTRACT

The inability to identify the structures of most metabolites detected in environmental or biological samples limits the utility of nontargeted metabolomics. The most widely used analytical approaches combine mass spectrometry and machine learning methods to rank candidate structures contained in large chemical databases. Given the large chemical space typically searched, the use of additional orthogonal data may improve the identification rates and reliability. Here, we present results of combining experimental and computational mass and IR spectral data for high-throughput nontargeted chemical structure identification. Experimental MS/MS and gas-phase IR data for 148 test compounds were obtained from NIST. Candidate structures for each of the test compounds were obtained from PubChem (mean = 4444 candidate structures per test compound). Our workflow used CSI:FingerID to initially score and rank the candidate structures. The top 1000 ranked candidates were subsequently used for IR spectra prediction, scoring, and ranking using density functional theory (DFT-IR). Final ranking of the candidates was based on a composite score calculated as the average of the CSI:FingerID and DFT-IR rankings. This approach resulted in the correct identification of 88 of the 148 test compounds (59%). 129 of the 148 test compounds (87%) were ranked within the top 20 candidates. These identification rates are the highest yet reported when candidate structures are used from PubChem. Combining experimental and computational MS/MS and IR spectral data is a potentially powerful option for prioritizing candidates for final structure verification.
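
The composite ranking described in this abstract (the average of two rankings) can be illustrated with a minimal Python sketch. The function name, candidate labels, and tie-breaking rule are assumptions made for illustration; the abstract does not state how ties are handled or how the authors implemented the step.

```python
from typing import Dict, List


def composite_ranking(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> List[str]:
    """Order candidates by the mean of two rank positions (1 = best).

    Only candidates present in both rankings are kept. Ties are broken
    alphabetically, which is an assumption of this sketch.
    """
    shared = rank_a.keys() & rank_b.keys()
    mean_rank = {c: (rank_a[c] + rank_b[c]) / 2 for c in shared}
    return sorted(shared, key=lambda c: (mean_rank[c], c))


# Hypothetical ranks from an MS/MS-based scorer and an IR-based scorer.
csi_ranks = {"cand_A": 1, "cand_B": 2, "cand_C": 3}
dft_ranks = {"cand_A": 3, "cand_B": 1, "cand_C": 2}

final_order = composite_ranking(csi_ranks, dft_ranks)
print(final_order)
```

Averaging ranks rather than raw scores sidesteps the problem that the two scoring methods produce values on different scales.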


Subjects
Databases, Chemical; Tandem Mass Spectrometry; Reproducibility of Results; Metabolomics/methods; Machine Learning
4.
Nat Chem Biol ; 17(2): 146-151, 2021 02.
Article in English | MEDLINE | ID: mdl-33199911

ABSTRACT

Untargeted mass spectrometry is employed to detect small molecules in complex biospecimens, generating data that are difficult to interpret. We developed Qemistree, a data exploration strategy based on the hierarchical organization of molecular fingerprints predicted from fragmentation spectra. Qemistree allows mass spectrometry data to be represented in the context of sample metadata and chemical ontologies. By expressing molecular relationships as a tree, we can apply ecological tools that are designed to analyze and visualize the relatedness of DNA sequences to metabolomics data. Here we demonstrate the use of tree-guided data exploration tools to compare metabolomics samples across different experimental conditions such as chromatographic shifts. Additionally, we leverage a tree representation to visualize chemical diversity in a heterogeneous collection of samples. The Qemistree software pipeline is freely available to the microbiome and metabolomics communities in the form of a QIIME2 plugin, and a global natural products social molecular networking workflow.


Subjects
Mass Spectrometry/methods; Metabolomics; Algorithms; Cluster Analysis; DNA/chemistry; DNA Fingerprinting; Databases, Factual; Ecology; Food Analysis; Microbiota; Multivariate Analysis; Software; Tandem Mass Spectrometry; Workflow
5.
J Proteome Res ; 21(4): 1204-1207, 2022 04 01.
Article in English | MEDLINE | ID: mdl-35119864

ABSTRACT

Machine learning is increasingly applied in proteomics and metabolomics to predict molecular structure, function, and physicochemical properties, including behavior in chromatography, ion mobility, and tandem mass spectrometry. These must be described in sufficient detail to apply or evaluate the performance of trained models. Here we look at and interpret the recently published and general DOME (Data, Optimization, Model, Evaluation) recommendations for conducting and reporting on machine learning in the specific context of proteomics and metabolomics.


Subjects
Metabolomics; Proteomics; Machine Learning; Metabolomics/methods; Proteomics/methods; Tandem Mass Spectrometry
6.
Environ Microbiol ; 24(11): 5408-5424, 2022 11.
Article in English | MEDLINE | ID: mdl-36222155

ABSTRACT

The exchange of metabolites mediates algal and bacterial interactions that maintain ecosystem function. Yet, while thousands of metabolites are produced, only a few molecules have been identified in these associations. Using the ubiquitous microalgae Pseudo-nitzschia sp. as a model, we employed an untargeted metabolomics strategy to assign structural characteristics to the metabolites that distinguished specific diatom-microbiome associations. We cultured five species of Pseudo-nitzschia, including two species that produced the toxin domoic acid, and examined their microbiomes and metabolomes. A total of 4826 molecular features were detected by tandem mass spectrometry. Only 229 of these could be annotated using available mass spectral libraries, but by applying new in silico annotation tools, characterization was expanded to 2710 features. The metabolomes of the Pseudo-nitzschia-microbiome associations were distinct and distinguished by structurally diverse nitrogen compounds, ranging from simple amines and amides to cyclic compounds such as imidazoles, pyrrolidines and lactams. By illuminating the dark metabolomes, this study expands our capacity to discover new chemical targets that facilitate microbial partnerships and uncovers the chemical diversity that underpins algae-bacteria interactions.


Subjects
Diatoms; Microbiota; Diatoms/metabolism; Tandem Mass Spectrometry; Metabolome
8.
Nat Methods ; 16(4): 299-302, 2019 04.
Article in English | MEDLINE | ID: mdl-30886413

ABSTRACT

Mass spectrometry is a predominant experimental technique in metabolomics and related fields, but metabolite structural elucidation remains highly challenging. We report SIRIUS 4 (https://bio.informatik.uni-jena.de/sirius/), which provides a fast computational approach for molecular structure identification. SIRIUS 4 integrates CSI:FingerID for searching in molecular structure databases. Using SIRIUS 4, we achieved identification rates of more than 70% on challenging metabolomics datasets.


Subjects
Metabolomics/methods; Molecular Structure; Signal Processing, Computer-Assisted; Tandem Mass Spectrometry/methods; Algorithms; Bayes Theorem; Biomarkers; Cluster Analysis; Computational Biology/methods; Computer Graphics; Databases, Factual; Electronic Data Processing; Internet; Isotopes; Likelihood Functions; Metabolome; Neural Networks, Computer; Programming Languages; User-Computer Interface
9.
Metabolomics ; 18(12): 97, 2022 11 27.
Article in English | MEDLINE | ID: mdl-36436113

ABSTRACT

INTRODUCTION: The structural identification of metabolites represents one of the current bottlenecks in non-targeted liquid chromatography-mass spectrometry (LC-MS) based metabolomics. The Metabolomics Standard Initiative has developed a multilevel system to report confidence in metabolite identification, which involves the use of MS, MS/MS and orthogonal data. Limitations due to similar or same fragmentation pattern (e.g. isomeric compounds) can be overcome by the additional orthogonal information of the retention time (RT), since it is a system property that is different for each chromatographic setup. OBJECTIVES: In contrast to MS data, sharing of RT data is not as widespread. The quality of data and its (re-)useability depend very much on the quality of the metadata. We aimed to evaluate the coverage and quality of this metadata from public metabolomics repositories. METHODS: We acquired an overview on the current reporting of chromatographic separation conditions. For this purpose, we defined the following information as important details that have to be provided: column name and dimension, flow rate, temperature, composition of eluents and gradient. RESULTS: We found that 70% of descriptions of the chromatographic setups are incomplete (according to our definition) and an additional 10% of the descriptions contained ambiguous and/or incorrect information. Accordingly, only about 20% of the descriptions allow further (re-)use of the data, e.g. for RT prediction. Therefore, we have started to develop a unified and standardized notation for chromatographic metadata with detailed and specific description of eluents, columns and gradients. CONCLUSION: Reporting of chromatographic metadata is currently not unified. Our recommended suggestions for metadata reporting will enable more standardization and automatization in future reporting.
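
The completeness criterion listed in the METHODS part of this abstract (column name and dimension, flow rate, temperature, eluent composition, gradient) lends itself to a simple automated check. The following Python sketch is only an illustration: the field names and the example record are hypothetical and do not reproduce the notation the authors propose.

```python
# Field names are hypothetical labels for the details the abstract lists.
REQUIRED_FIELDS = (
    "column_name",
    "column_dimension",
    "flow_rate",
    "temperature",
    "eluents",
    "gradient",
)


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]


def is_complete(record: dict) -> bool:
    """A description counts as complete only if no required field is missing."""
    return not missing_fields(record)


# Hypothetical chromatographic description lacking temperature and gradient.
example = {
    "column_name": "C18",
    "column_dimension": "100 x 2.1 mm",
    "flow_rate": 0.3,
    "eluents": ["water + 0.1% formic acid", "acetonitrile + 0.1% formic acid"],
}

gaps = missing_fields(example)
print(gaps)
```

A check like this only detects missing entries; the ambiguous or incorrect entries the abstract also mentions would need value-level validation.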


Subjects
Metabolomics; Metadata; Tandem Mass Spectrometry; Chromatography, Liquid; Temperature
10.
Environ Sci Technol ; 56(15): 11027-11040, 2022 08 02.
Article in English | MEDLINE | ID: mdl-35834352

ABSTRACT

Ultrahigh-resolution Fourier transform mass spectrometry (FTMS) has revealed unprecedented details of natural complex mixtures such as dissolved organic matter (DOM) on a molecular formula level, but we lack approaches to access the underlying structural complexity. We here explore the hypothesis that every DOM precursor ion is potentially linked with all emerging product ions in FTMS2 experiments. The resulting mass difference (Δm) matrix is deconvoluted to isolate individual precursor ion Δm profiles and matched with structural information, which was derived from 42 Δm features from 14 in-house reference compounds and a global set of 11 477 Δm features with assigned structure specificities, using a dataset of ∼18 000 unique structures. We show that Δm matching is highly sensitive in predicting potential precursor ion identities in terms of molecular and structural composition. Additionally, the approach identified unresolved precursor ions and missing elements in molecular formula annotation (P, Cl, F). Our study provides first results on how Δm matching refines structural annotations in van Krevelen space but simultaneously demonstrates the wide overlap between potential structural classes. We show that this effect is likely driven by chemodiversity and offers an explanation for the observed ubiquitous presence of molecules in the center of the van Krevelen space. Our promising first results suggest that Δm matching can both unfold the structural information encrypted in DOM and assess the quality of FTMS-derived molecular formulas of complex mixtures in general.
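
The core idea of linking every precursor ion with every product ion through a mass difference (Δm) matrix can be sketched in Python as follows. This is a simplified illustration, not the authors' deconvolution procedure: the function names, the example m/z values, the tolerance, and the two reference Δm values (monoisotopic losses of H2O and CO2) are assumptions chosen for the example.

```python
def delta_m_matrix(precursors, products):
    """Rows are precursor ions, columns are product ions; each cell is
    precursor m/z minus product m/z, rounded to four decimals."""
    return [[round(p - q, 4) for q in products] for p in precursors]


def match_deltas(matrix, known, tol=0.001):
    """Return (row, column, label) for every cell within `tol` of a
    reference mass difference."""
    hits = []
    for i, row in enumerate(matrix):
        for j, dm in enumerate(row):
            for label, ref in known.items():
                if abs(dm - ref) <= tol:
                    hits.append((i, j, label))
    return hits


# Illustrative reference features: monoisotopic neutral losses.
KNOWN_DELTAS = {"H2O": 18.0106, "CO2": 43.9898}

# Hypothetical precursor and product ion m/z values.
matrix = delta_m_matrix([301.0718], [283.0612, 257.0819])
hits = match_deltas(matrix, KNOWN_DELTAS)
print(matrix, hits)
```

In a real mixture many precursors are co-fragmented, so a cell can match a reference Δm by coincidence; the abstract itself points to this overlap between potential structural classes.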


Subjects
Dissolved Organic Matter; Spectrometry, Mass, Electrospray Ionization; Complex Mixtures; Molecular Structure; Spectrometry, Mass, Electrospray Ionization/methods
11.
J Sep Sci ; 43(9-10): 1746-1754, 2020 May.
Article in English | MEDLINE | ID: mdl-32144942

ABSTRACT

Metabolite identification is a crucial step in nontargeted metabolomics, but also represents one of its current bottlenecks. Accurate identifications are required for correct biological interpretation. To date, annotation and identification are usually based on the use of accurate mass search or tandem mass spectrometry analysis, but neglect orthogonal information such as retention times obtained by chromatographic separation. While several tools are available for the analysis and prediction of tandem mass spectrometry data, prediction of retention times for metabolite identification is not widespread. Here, we review the current state of retention time prediction in liquid chromatography-mass spectrometry-based metabolomics, with a focus on publications published after 2010.

12.
Plant J ; 93(1): 193-206, 2018 Jan.
Article in English | MEDLINE | ID: mdl-29117637

ABSTRACT

Spatially resolved analysis of a multitude of compound classes has become feasible with the rapid advancement in mass spectrometry imaging strategies. In this study, we present a protocol that combines high lateral resolution time-of-flight secondary ion mass spectrometry (TOF-SIMS) imaging with a multivariate data analysis (MVA) approach to probe the complex leaf surface chemistry of Populus trichocarpa. Here, epicuticular waxes (EWs) found on the adaxial leaf surface of P. trichocarpa were blotted on silicon wafers and imaged using TOF-SIMS at 10 µm and 1 µm lateral resolution. Intense M+• and M-• molecular ions were clearly visible, which made it possible to resolve the individual compound classes present in EWs. Series of long-chain aliphatic saturated alcohols (C21-C30), hydrocarbons (C25-C33) and wax esters (WEs; C44-C48) were clearly observed. These data correlated with the 7Li-chelation matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) analysis, which yielded mostly molecular adduct ions of the analyzed compounds. Subsequently, MVA was used to interrogate the TOF-SIMS dataset for identifying hidden patterns on the leaf's surface based on its chemical profile. After the application of principal component analysis (PCA), a small number of principal components (PCs) were found to be sufficient to explain maximum variance in the data. To further confirm the contributions from pure components, a five-factor multivariate curve resolution (MCR) model was applied. Two distinct patterns of small islets, here termed 'crystals', were apparent from the resulting score plots. Based on PCA and MCR results, the crystals were found to be formed by C23 or C29 alcohols. Other less obvious patterns observed in the PCs revealed that the adaxial leaf surface is coated with a relatively homogenous layer of alcohols, hydrocarbons and WEs. The ultra-high-resolution TOF-SIMS imaging combined with the MVA approach helped to highlight the diverse patterns underlying the leaf's surface. Currently, the methods available to analyze the surface chemistry of waxes in conjunction with the spatial information related to the distribution of compounds are limited. This study uses tools that may provide important biological insights into the composition of the wax layer, how this layer is repaired after mechanical damage or insect feeding, and which transport mechanisms are involved in deploying wax constituents to specific regions on the leaf surface.


Subjects
Plant Epidermis/chemistry; Populus/chemistry; Spectrometry, Mass, Secondary Ion/methods; Cluster Analysis; Multivariate Analysis; Plant Leaves/chemistry; Principal Component Analysis; Waxes/chemistry
13.
Bioinformatics ; 34(13): i333-i340, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949965

ABSTRACT

Motivation: Metabolites, small molecules that are involved in cellular reactions, provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem mass spectrometry to identify the thousands of compounds in a biological sample. Recently, we presented CSI:FingerID for searching in molecular structure databases using tandem mass spectrometry data. CSI:FingerID predicts a molecular fingerprint that encodes the structure of the query compound, then uses this to search a molecular structure database such as PubChem. Scoring of the predicted query fingerprint and deterministic target fingerprints is carried out assuming independence between the molecular properties constituting the fingerprint. Results: We present a scoring that takes into account dependencies between molecular properties. As before, we predict posterior probabilities of molecular properties using machine learning. Dependencies between molecular properties are modeled as a Bayesian tree network; the tree structure is estimated on the fly from the instance data. For each edge, we also estimate the expected covariance between the two random variables. For fixed marginal probabilities, we then estimate conditional probabilities using the known covariance. Now, the corrected posterior probability of each candidate can be computed, and candidates are ranked by this score. Modeling dependencies improves identification rates of CSI:FingerID by 2.85 percentage points. Availability and implementation: The new scoring Bayesian (fixed tree) is integrated into SIRIUS 4.0 (https://bio.informatik.uni-jena.de/software/sirius/).
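
For orientation, the baseline that this abstract improves on, scoring a candidate fingerprint under the assumption that molecular properties are independent, can be sketched in Python. This is a generic log-likelihood under independence, not the exact CSI:FingerID scoring and not the Bayesian tree scoring the abstract introduces; all names and numbers are illustrative.

```python
import math


def independence_log_score(predicted_probs, candidate_bits, eps=1e-6):
    """Log-likelihood of a binary candidate fingerprint given predicted
    per-property probabilities, treating all properties as independent."""
    score = 0.0
    for p, bit in zip(predicted_probs, candidate_bits):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        score += math.log(p if bit else 1 - p)
    return score


# Hypothetical predicted probabilities for three molecular properties.
predicted = [0.9, 0.2, 0.7]

# Hypothetical candidate structures with deterministic fingerprints.
candidates = {"X": [1, 0, 1], "Y": [0, 1, 1]}

ranked = sorted(
    candidates,
    key=lambda name: independence_log_score(predicted, candidates[name]),
    reverse=True,
)
print(ranked)
```

Because the per-property terms are simply summed, two strongly correlated properties are effectively counted twice, which is the weakness a dependency-aware scoring addresses.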


Subjects
Databases, Chemical; Metabolomics; Tandem Mass Spectrometry; Bayes Theorem; Machine Learning; Metabolomics/methods; Software
14.
Bioinformatics ; 34(17): i875-i883, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423079

ABSTRACT

Motivation: Liquid Chromatography (LC) followed by tandem Mass Spectrometry (MS/MS) is one of the predominant methods for metabolite identification. In recent years, machine learning has started to transform the analysis of tandem mass spectra and the identification of small molecules. In contrast, LC data is rarely used to improve metabolite identification, despite numerous published methods for retention time prediction using machine learning. Results: We present a machine learning method for predicting the retention order of molecules; that is, the order in which molecules elute from the LC column. Our method has important advantages over previous approaches: We show that retention order is much better conserved between instruments than retention time. To this end, our method can be trained using retention time measurements from different LC systems and configurations without tedious pre-processing, significantly increasing the amount of available training data. Our experiments demonstrate that retention order prediction is an effective way to learn retention behaviour of molecules from heterogeneous retention time data. Finally, we demonstrate how retention order prediction and MS/MS-based scores can be combined for more accurate metabolite identifications when analyzing a complete LC-MS/MS run. Availability and implementation: Implementation of the method is available at https://version.aalto.fi/gitlab/bache1/retention_order_prediction.git.
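
The claim that retention order is better conserved between instruments than retention time can be made concrete with a small Python sketch that measures the fraction of molecule pairs eluting in the same order on two systems. This is an illustrative pairwise concordance measure, not the authors' prediction method; the function name and retention times are hypothetical.

```python
from itertools import combinations


def order_concordance(rt_a: dict, rt_b: dict) -> float:
    """Fraction of molecule pairs with the same elution order in two
    LC systems. Pairs tied in either system are skipped."""
    shared = sorted(rt_a.keys() & rt_b.keys())
    concordant = total = 0
    for x, y in combinations(shared, 2):
        da = rt_a[x] - rt_a[y]
        db = rt_b[x] - rt_b[y]
        if da == 0 or db == 0:
            continue
        total += 1
        if (da > 0) == (db > 0):
            concordant += 1
    return concordant / total if total else float("nan")


# Hypothetical retention times (minutes) on two different LC setups.
system_a = {"m1": 1.2, "m2": 3.4, "m3": 5.0}
system_b = {"m1": 2.0, "m2": 6.1, "m3": 9.5}

agreement = order_concordance(system_a, system_b)
print(agreement)
```

In this example the absolute retention times differ substantially between the two systems while every pair keeps its order, so the concordance is 1.0.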


Subjects
Chromatography, Liquid/methods; Tandem Mass Spectrometry/methods
16.
J Proteome Res ; 17(12): 4051-4060, 2018 12 07.
Article in English | MEDLINE | ID: mdl-30270626

ABSTRACT

The 2017 Dagstuhl Seminar on Computational Proteomics provided an opportunity for a broad discussion on the current state and future directions of the generation and use of peptide tandem mass spectrometry spectral libraries. Their use in proteomics is growing slowly, but there are multiple challenges in the field that must be addressed to further increase the adoption of spectral libraries and related techniques. The primary bottlenecks are the paucity of high quality and comprehensive libraries and the general difficulty of adopting spectral library searching into existing workflows. There are several existing spectral library formats, but none captures a satisfactory level of metadata; therefore, a logical next improvement is to design a more advanced, Proteomics Standards Initiative-approved spectral library format that can encode all of the desired metadata. The group discussed a series of metadata requirements organized into three designations of completeness or quality, tentatively dubbed bronze, silver, and gold. The metadata can be organized at four different levels of granularity: at the collection (library) level, at the individual entry (peptide ion) level, at the peak (fragment ion) level, and at the peak annotation level. Strategies for encoding mass modifications in a consistent manner and the requirement for encoding high-quality and commonly seen but as-yet-unidentified spectra were discussed. The group also discussed related topics, including strategies for comparing two spectra, techniques for generating representative spectra for a library, approaches for selection of optimal signature ions for targeted workflows, and issues surrounding the merging of two or more libraries into one. We present here a review of this field and the challenges that the community must address in order to accelerate the adoption of spectral libraries in routine analysis of proteomics datasets.


Subjects
Databases, Protein/standards; Peptide Library; Proteomics/methods; Animals; Humans; Tandem Mass Spectrometry/methods; Workflow
17.
Mol Biol Evol ; 34(9): 2408-2421, 2017 09 01.
Article in English | MEDLINE | ID: mdl-28873954

ABSTRACT

Supertree methods merge a set of overlapping phylogenetic trees into a supertree containing all taxa of the input trees. The challenge in supertree reconstruction is the way of dealing with conflicting information in the input trees. Many different algorithms for different objective functions have been suggested to resolve these conflicts. In particular, there exist methods based on encoding the source trees in a matrix, where the supertree is constructed applying a local search heuristic to optimize the respective objective function. We present a novel heuristic supertree algorithm called Bad Clade Deletion (BCD) supertrees. It uses minimum cuts to delete a locally minimal number of columns from such a matrix representation so that it is compatible. This is the complement problem to Matrix Representation with Compatibility (Maximum Split Fit). Our algorithm has guaranteed polynomial worst-case running time and performs swiftly in practice. Different from local search heuristics, it guarantees to return the directed perfect phylogeny for the input matrix, corresponding to the parent tree of the input trees, if one exists. Comparing supertrees to model trees for simulated data, BCD shows a better accuracy (F1 score) than the state-of-the-art algorithms SuperFine (up to 3%) and Matrix Representation with Parsimony (up to 7%); at the same time, BCD is up to 7 times faster than SuperFine, and up to 600 times faster than Matrix Representation with Parsimony. Finally, using the BCD supertree as a starting tree for a combined Maximum Likelihood analysis using RAxML, we reach significantly improved accuracy (1% higher F1 score) and running time (1.7-fold speedup).


Subjects
Computational Biology/methods; Algorithms; Computer Simulation; Phylogeny; Software
18.
Mass Spectrom Rev ; 36(5): 624-633, 2017 09.
Article in English | MEDLINE | ID: mdl-26763615

ABSTRACT

Mass spectrometry (MS) is a key technology for the analysis of small molecules. For the identification and structural elucidation of novel molecules, new approaches beyond straightforward spectral comparison are required. In this review, we will cover computational methods that help with the identification of small molecules by analyzing fragmentation MS data. We focus on the four main approaches to mine a database of metabolite structures, that is rule-based fragmentation spectrum prediction, combinatorial fragmentation, competitive fragmentation modeling, and molecular fingerprint prediction. © 2016 Wiley Periodicals, Inc. Mass Spec Rev 36:624-633, 2017.

19.
Nucleic Acids Res ; 44(20): 9600-9610, 2016 Nov 16.
Article in English | MEDLINE | ID: mdl-27679480

ABSTRACT

Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, an open-source software for finding gene clusters in hundreds of bacterial genomes, that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min.


Subjects
Computational Biology/methods; Genomics/methods; Multigene Family; Software; Algorithms; Datasets as Topic; Genes, Bacterial; Genome, Bacterial; Models, Statistical; Web Browser; Workflow
20.
Proc Natl Acad Sci U S A ; 112(41): 12580-5, 2015 Oct 13.
Article in English | MEDLINE | ID: mdl-26392543

ABSTRACT

Metabolites provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem MS to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. We present a method for searching molecular structure databases using tandem MS data of small molecules. Our method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. We use the fragmentation tree to predict the molecular structure fingerprint of the unknown compound using machine learning. This fingerprint is then used to search a molecular structure database such as PubChem. Our method is shown to improve on the competing methods for computational metabolite identification by a considerable margin.
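
The final step of this workflow, using a predicted fingerprint to search a structure database, can be illustrated with a short Python sketch. Tanimoto similarity on binarized fingerprints is used here as a simple stand-in; the abstract does not specify the scoring function, and the bit indices and candidate names below are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(query_bits: set, database: dict) -> list:
    """Rank database structures by similarity to the query fingerprint,
    highest first; ties are broken by name."""
    scored = [(tanimoto(query_bits, bits), name) for name, bits in database.items()]
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Hypothetical predicted fingerprint (indices of properties predicted present).
query = {1, 4, 7, 9}

# Hypothetical structure database with precomputed fingerprints.
database = {
    "s1": {1, 4, 7, 9, 12},
    "s2": {1, 2, 3},
    "s3": {4, 7},
}

ranking = search(query, database)
print(ranking)
```

Binarizing the prediction discards the confidence attached to each property, so a probabilistic scoring would make fuller use of the predicted fingerprint than this stand-in does.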


Subjects
Databases, Protein; Machine Learning; Mass Spectrometry; Metabolomics; Animals; Humans