1.
Bioinformatics ; 40(3), 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38383048

ABSTRACT

MOTIVATION: Random forests (RFs) can deal with a large number of variables, achieve reasonable prediction scores, and yield highly interpretable feature importance values. As such, RFs are appropriate models for feature selection and further dimension reduction. However, RFs are often not appropriate for correlated datasets due to their mode of selecting individual features for splitting. Addressing correlation relationships in high-dimensional datasets is imperative for reducing the number of variables that are assigned high importance, hence making the dimension reduction most efficient. Here, we propose the LAtent VAriable Stochastic Ensemble of Trees (LAVASET) method that derives latent variables based on the distance characteristics of each feature and aims to incorporate the correlation factor in the splitting step. RESULTS: Without compromising on performance in the majority of examples, LAVASET outperforms RF by accurately determining feature importance across all correlated variables and ensuring proper distribution of importance values. LAVASET yields mostly non-inferior prediction accuracies to traditional RFs when tested in simulated and real 1D datasets, as well as more complex and high-dimensional 3D datatypes. Unlike traditional RFs, LAVASET is unaffected by single 'important' noisy features (false positives), as it considers the local neighbourhood. LAVASET, therefore, highlights neighbourhoods of features, reflecting real signals that collectively impact the model's predictive ability. AVAILABILITY AND IMPLEMENTATION: LAVASET is freely available as a standalone package from https://github.com/melkasapi/LAVASET.

2.
BMC Bioinformatics ; 23(1): 481, 2022 Nov 14.
Article in English | MEDLINE | ID: mdl-36376837

ABSTRACT

BACKGROUND: Single sample pathway analysis (ssPA) transforms molecular level omics data to the pathway level, enabling the discovery of patient-specific pathway signatures. Compared to conventional pathway analysis, ssPA overcomes the limitations by enabling multi-group comparisons, alongside facilitating numerous downstream analyses such as pathway-based machine learning. While in transcriptomics ssPA is a widely used technique, there is little literature evaluating its suitability for metabolomics. Here we provide a benchmark of established ssPA methods (ssGSEA, GSVA, SVD (PLAGE), and z-score) alongside the evaluation of two novel methods we propose: ssClustPA and kPCA, using semi-synthetic metabolomics data. We then demonstrate how ssPA can facilitate pathway-based interpretation of metabolomics data by performing a case-study on inflammatory bowel disease mass spectrometry data, using clustering to determine subtype-specific pathway signatures. RESULTS: While GSEA-based and z-score methods outperformed the others in terms of recall, clustering/dimensionality reduction-based methods provided higher precision at moderate-to-high effect sizes. A case study applying ssPA to inflammatory bowel disease data demonstrates how these methods yield a much richer depth of interpretation than conventional approaches, for example by clustering pathway scores to visualise a pathway-based patient subtype-specific correlation network. We also developed the sspa python package (freely available at https://pypi.org/project/sspa/ ), providing implementations of all the methods benchmarked in this study. CONCLUSION: This work underscores the value ssPA methods can add to metabolomic studies and provides a useful reference for those wishing to apply ssPA methods to metabolomics data.
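
As a rough illustration of the z-score style of single-sample pathway scoring named in the abstract above, the sketch below standardises each metabolite across samples and combines the standardised values of a pathway's members into one score per sample (sum divided by the square root of the member count). It is not the API of the sspa package; the function name, the input layout and the handling of constant metabolites are assumptions made for this example.

```python
import math
import statistics


def zscore_pathway_scores(abundances, pathways):
    """Hypothetical helper, not part of the sspa package.

    abundances: dict mapping metabolite name -> list of values, one per sample.
    pathways:   dict mapping pathway name -> list of member metabolite names.
    Returns a dict mapping pathway name -> list of per-sample scores.
    """
    n_samples = len(next(iter(abundances.values())))

    # Standardise each metabolite across samples (mean 0, unit variance).
    z = {}
    for name, values in abundances.items():
        sd = statistics.stdev(values)
        if sd == 0:
            continue  # assumption: constant metabolites carry no signal and are dropped
        mean = statistics.fmean(values)
        z[name] = [(v - mean) / sd for v in values]

    scores = {}
    for pathway, members in pathways.items():
        present = [m for m in members if m in z]
        if not present:
            continue  # no measured members, so no score for this pathway
        k = len(present)
        scores[pathway] = [
            sum(z[m][i] for m in present) / math.sqrt(k) for i in range(n_samples)
        ]
    return scores


example_abundances = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [5.0, 5.0, 5.0]}
example_pathways = {"P1": ["a", "b"], "P2": ["c"]}
example_scores = zscore_pathway_scores(example_abundances, example_pathways)
```

The resulting samples-by-pathways scores can then feed clustering or machine-learning steps of the kind the abstract mentions.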


Subjects
Inflammatory Bowel Diseases, Metabolomics, Humans, Metabolomics/methods, Transcriptome, Cluster Analysis, Mass Spectrometry
3.
Int J Cancer ; 151(12): 2115-2127, 2022 Dec 15.
Article in English | MEDLINE | ID: mdl-35866293

ABSTRACT

Prostate cancer (PCa) is the most common cancer form in males in many European and American countries, but there are still open questions regarding its etiology. Untargeted metabolomics can produce an unbiased global metabolic profile, with the opportunity for uncovering new plasma metabolites prospectively associated with risk of PCa, providing insights into disease etiology. We conducted a prospective untargeted liquid chromatography-mass spectrometry (LC-MS) metabolomics analysis using prediagnostic fasting plasma samples from 752 PCa case-control pairs nested within the Northern Sweden Health and Disease Study (NSHDS). The pairs were matched by age, BMI, and sample storage time. Discriminating features were identified by a combination of orthogonal projection to latent structures-effect projections (OPLS-EP) and Wilcoxon signed-rank tests. Their prospective associations with PCa risk were investigated by conditional logistic regression. Subgroup analyses based on stratification by disease aggressiveness and baseline age were also conducted. Various free fatty acids and phospholipids were positively associated with overall risk of PCa and in various stratification subgroups. Aromatic amino acids were positively associated with overall risk of PCa. Uric acid was positively, and glucose negatively, associated with risk of PCa in the older subgroup. This is the largest untargeted LC-MS based metabolomics study to date on plasma metabolites prospectively associated with risk of developing PCa. Different subgroups of disease aggressiveness and baseline age showed different associations with metabolites. The findings suggest that shifts in plasma concentrations of metabolites in lipid, aromatic amino acid, and glucose metabolism are associated with risk of developing PCa during the following two decades.


Subjects
Nonesterified Fatty Acids, Prostatic Neoplasms, Male, Humans, Case-Control Studies, Uric Acid, Sweden/epidemiology, Metabolomics/methods, Mass Spectrometry, Prostatic Neoplasms/diagnosis, Prostatic Neoplasms/epidemiology, Aromatic Amino Acids, Glucose
4.
Anal Chem ; 94(8): 3446-3455, 2022 Mar 01.
Article in English | MEDLINE | ID: mdl-35180347

ABSTRACT

Untargeted metabolomics and lipidomics LC-MS experiments produce complex datasets, usually containing tens of thousands of features from thousands of metabolites whose annotation requires additional MS/MS experiments and expert knowledge. All-ion fragmentation (AIF) LC-MS/MS acquisition provides fragmentation data at no additional experimental time cost. However, analysis of such datasets requires reconstruction of parent-fragment relationships and annotation of the resulting pseudo-MS/MS spectra. Here, we propose a novel approach for automated annotation of isotopologues, adducts, and in-source fragments from AIF LC-MS datasets by combining correlation-based parent-fragment linking with molecular fragment matching. Our workflow focuses on a subset of features rather than trying to annotate the full dataset, saving time and simplifying the process. We demonstrate the workflow in three human serum datasets containing 599 features manually annotated by experts. Precision and recall values of 82-92% and 82-85%, respectively, were obtained for features found in the highest-rank scores (1-5). These results equal or outperform those obtained using MS-DIAL software, the current state of the art for AIF data annotation. Further validation for other biological matrices and different instrument types showed variable precision (60-89%) and recall (10-88%) particularly for datasets dominated by nonlipid metabolites. The workflow is freely available as an open-source R package, MetaboAnnotatoR, together with the fragment libraries from Github (https://github.com/gggraca/MetaboAnnotatoR).
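
The correlation-based parent-fragment linking mentioned above can be pictured with a small sketch. It assumes that each feature is represented by an intensity profile over a shared set of scans or samples, and it links a candidate fragment to a parent when their Pearson correlation reaches a threshold. The function names and the 0.8 threshold are hypothetical and are not taken from MetaboAnnotatoR.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length intensity profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile cannot be correlated with anything
    return sxy / math.sqrt(sxx * syy)


def link_fragments(parent_profile, candidate_profiles, threshold=0.8):
    """Return (name, r) pairs for candidates correlating with the parent.

    threshold is an illustrative assumption, not a value from the paper.
    """
    linked = []
    for name, profile in candidate_profiles.items():
        r = pearson(parent_profile, profile)
        if r >= threshold:
            linked.append((name, r))
    # Strongest links first, so they can be inspected or scored in order.
    return sorted(linked, key=lambda item: item[1], reverse=True)


example_links = link_fragments(
    [10.0, 20.0, 30.0, 40.0],
    {"frag_a": [1.0, 2.0, 3.0, 4.0], "noise": [4.0, 1.0, 3.0, 2.0]},
)
```

In the workflow described in the abstract, linked features would then be matched against fragment libraries; that step is not sketched here.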


Subjects
Metabolomics, Tandem Mass Spectrometry, Liquid Chromatography/methods, Humans, Metabolomics/methods, Software, Tandem Mass Spectrometry/methods, Workflow
5.
Metabolomics ; 18(12): 102, 2022 Dec 05.
Article in English | MEDLINE | ID: mdl-36469142

ABSTRACT

BACKGROUND: Compound identification remains a critical bottleneck in the process of exploiting Nuclear Magnetic Resonance (NMR) metabolomics data, especially for 1H 1-dimensional (1H 1D) data. As databases of reference compound spectra have grown, workflows have evolved to rely heavily on their search functions to facilitate this process by generating lists of potential metabolites found in complex mixture data, facilitating annotation and identification. However, approaches for validating and communicating annotations are most often guided by expert knowledge, and therefore are highly variable despite repeated efforts to align practices and define community standards. AIM OF REVIEW: This review is aimed at broadening the application of automated annotation tools by discussing the key ideas of spectral matching and beginning to describe a set of terms to classify this information, thus advancing standards for communicating annotation confidence. Additionally, we hope that this review will facilitate the growing collaboration between chemical data scientists, software developers and the NMR metabolomics community aiding development of long-term software solutions. KEY SCIENTIFIC CONCEPTS OF REVIEW: We begin with a brief discussion of the typical untargeted NMR identification workflow. We differentiate between annotation (hypothesis generation, filtering), and identification (hypothesis testing, verification), and note the utility of different NMR data features for annotation. We then touch on three parts of annotation: (1) generation of queries, (2) matching queries to reference data, and (3) scoring and confidence estimation of potential matches for verification. In doing so, we highlight existing approaches to automated and semi-automated annotation from the perspective of the structural information they utilize, as well as how this information can be represented computationally.


Subjects
Metabolomics, Software, Metabolomics/methods, Magnetic Resonance Spectroscopy/methods, Magnetic Resonance Imaging, Factual Databases
6.
BMC Bioinformatics ; 22(1): 67, 2021 Feb 12.
Article in English | MEDLINE | ID: mdl-33579202

ABSTRACT

BACKGROUND: The search for statistically significant relationships between molecular markers and outcomes is challenging when dealing with high-dimensional, noisy and collinear multivariate omics data, such as metabolomic profiles. Permutation procedures allow for the estimation of adjusted significance levels without assuming independence among metabolomic variables. Nevertheless, the complex non-normal structure of metabolic profiles and outcomes may bias the permutation results leading to overly conservative threshold estimates i.e. lower than those from a Bonferroni or Sidak correction. METHODS: Within a univariate permutation procedure we employ parametric simulation methods based on the multivariate (log-)Normal distribution to obtain adjusted significance levels which are consistent across different outcomes while effectively controlling the type I error rate. Next, we derive an alternative closed-form expression for the estimation of the number of non-redundant metabolic variates based on the spectral decomposition of their correlation matrix. The performance of the method is tested for different model parametrizations and across a wide range of correlation levels of the variates using synthetic and real data sets. RESULTS: Both the permutation-based formulation and the more practical closed form expression are found to give an effective indication of the number of independent metabolic effects exhibited by the system, while guaranteeing that the derived adjusted threshold is stable across outcome measures with diverse properties.
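
For reference, the Bonferroni and Šidák corrections that the abstract uses as benchmarks can be written in a few lines, together with the common convention of reading an adjusted threshold back as an "effective number of tests" (the nominal level divided by the adjusted level). This is a sketch only: the function names are invented for this example, and the paper's own closed-form expression based on the correlation matrix is not reproduced here.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


def sidak_threshold(alpha, n_tests):
    """Per-test significance level under the Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def effective_number_of_tests(alpha, adjusted_threshold):
    """Number of independent tests implied by an adjusted threshold.

    Assumes the Bonferroni-style convention alpha / adjusted_threshold.
    """
    return alpha / adjusted_threshold


example_bonferroni = bonferroni_threshold(0.05, 1000)
example_sidak = sidak_threshold(0.05, 1000)
example_effective = effective_number_of_tests(0.05, 1e-4)
```

With correlated variables, a permutation-derived threshold is normally less strict than Bonferroni, so the implied effective number of tests is smaller than the number of measured variables. A threshold stricter than Bonferroni, as described in the abstract, would imply more effective tests than variables, which is the inconsistency the authors set out to avoid.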


Subjects
Metabolome, Metabolomics, Biological Models, Genetic Markers/genetics, Metabolomics/methods, Statistical Distributions
7.
Regul Toxicol Pharmacol ; 125: 105020, 2021 Oct.
Article in English | MEDLINE | ID: mdl-34333066

ABSTRACT

Omics methodologies are widely used in toxicological research to understand modes and mechanisms of toxicity. Increasingly, these methodologies are being applied to questions of regulatory interest such as molecular point-of-departure derivation and chemical grouping/read-across. Despite its value, widespread regulatory acceptance of omics data has not yet occurred. Barriers to the routine application of omics data in regulatory decision making have been: 1) lack of transparency for data processing methods used to convert raw data into an interpretable list of observations; and 2) lack of standardization in reporting to ensure that omics data, associated metadata and the methodologies used to generate results are available for review by stakeholders, including regulators. Thus, in 2017, the Organisation for Economic Co-operation and Development (OECD) Extended Advisory Group on Molecular Screening and Toxicogenomics (EAGMST) launched a project to develop guidance for the reporting of omics data aimed at fostering further regulatory use. Here, we report on the ongoing development of the first formal reporting framework describing the processing and analysis of both transcriptomic and metabolomic data for regulatory toxicology. We introduce the modular structure, content, harmonization and strategy for trialling this reporting framework prior to its publication by the OECD.


Subjects
Metabolomics/standards, Organisation for Economic Co-operation and Development/standards, Toxicogenetics/standards, Toxicology/standards, Transcriptome/physiology, Documentation/standards, Humans
8.
BMC Bioinformatics ; 21(1): 11, 2020 Jan 09.
Article in English | MEDLINE | ID: mdl-31918658

ABSTRACT

BACKGROUND: Metabolomics time-course experiments provide the opportunity to understand the changes to an organism by observing the evolution of metabolic profiles in response to internal or external stimuli. Along with other omic longitudinal profiling technologies, these techniques have great potential to uncover complex relations between variations across diverse omic variables and provide unique insights into the underlying biology of the system. However, many statistical methods currently used to analyse short time-series omic data are i) prone to overfitting, ii) do not fully take into account the experimental design or iii) do not make full use of the multivariate information intrinsic to the data or iv) are unable to uncover multiple associations between different omic data. The model we propose is an attempt to i) overcome overfitting by using a weakly informative Bayesian model, ii) capture experimental design conditions through a mixed-effects model, iii) model interdependencies between variables by augmenting the mixed-effects model with a conditional auto-regressive (CAR) component and iv) identify potential associations between heterogeneous omic variables by using a horseshoe prior. RESULTS: We assess the performance of our model on synthetic and real datasets and show that it can outperform comparable models for metabolomic longitudinal data analysis. In addition, our proposed method provides the analyst with new insights on the data as it is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and find significant associations with additional omic variables. We also show through simulation that our model is fairly robust against inaccuracies in metabolite assignments. On real data, we demonstrate that the number of profiled metabolites slightly affects the predictive ability of the model. 
CONCLUSIONS: Our single model approach to longitudinal analysis of metabolomics data provides an approach simultaneously for integrative analysis and biomarker discovery. In addition, it lends better interpretation by allowing analysis at the pathway level. An accompanying R package for the model has been developed using the probabilistic programming language Stan. The package offers user-friendly functions for simulating data, fitting the model, assessing model fit and postprocessing the results. The main aim of the R package is to offer freely accessible resources for integrative longitudinal analysis for metabolomics scientists and various visualization functions easy-to-use for applied researchers to interpret results.
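
The abstract names a horseshoe prior but does not give its parameterisation. For orientation, the standard horseshoe prior on a set of coefficients is usually written as below. The symbols \beta_j, \lambda_j, \tau and \tau_0 are notation assumed for this note, and the Stan model in the package may use a different or regularised variant.

```latex
\begin{aligned}
\beta_j \mid \lambda_j, \tau &\sim \mathcal{N}\!\left(0,\ \lambda_j^{2}\,\tau^{2}\right), \\
\lambda_j &\sim \mathrm{C}^{+}(0, 1), \\
\tau &\sim \mathrm{C}^{+}(0, \tau_0).
\end{aligned}
```

Here \mathrm{C}^{+} denotes a half-Cauchy distribution. The global scale \tau shrinks most coefficients towards zero, while the heavy-tailed local scales \lambda_j allow a few associations to remain large, which is why this kind of prior suits the search for sparse associations between heterogeneous omic variables.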


Subjects
Biomarkers/metabolism, Metabolomics/methods, Theoretical Models, Bacteria/metabolism, Bayes Theorem, Metabolome
9.
Eur Heart J ; 40(34): 2883-2896, 2019 Sep 07.
Article in English | MEDLINE | ID: mdl-31102408

ABSTRACT

AIMS: To characterize serum metabolic signatures associated with atherosclerosis in the coronary or carotid arteries and subsequently their association with incident cardiovascular disease (CVD). METHODS AND RESULTS: We used untargeted one-dimensional (1D) serum metabolic profiling by proton nuclear magnetic resonance spectroscopy (1H NMR) among 3867 participants from the Multi-Ethnic Study of Atherosclerosis (MESA), with replication among 3569 participants from the Rotterdam and LOLIPOP studies. Atherosclerosis was assessed by coronary artery calcium (CAC) and carotid intima-media thickness (IMT). We used multivariable linear regression to evaluate associations between NMR features and atherosclerosis accounting for multiplicity of comparisons. We then examined associations between metabolites associated with atherosclerosis and incident CVD available in MESA and Rotterdam and explored molecular networks through bioinformatics analyses. Overall, 30 1H NMR measured metabolites were associated with CAC and/or IMT, P = 1.3 × 10⁻¹⁴ to 1.0 × 10⁻⁶ (discovery) and P = 5.6 × 10⁻¹⁰ to 1.1 × 10⁻² (replication). These associations were substantially attenuated after adjustment for conventional cardiovascular risk factors. Metabolites associated with atherosclerosis revealed disturbances in lipid and carbohydrate metabolism, branched chain, and aromatic amino acid metabolism, as well as oxidative stress and inflammatory pathways. Analyses of incident CVD events showed inverse associations with creatine, creatinine, and phenylalanine, and direct associations with mannose, acetaminophen-glucuronide, and lactate as well as apolipoprotein B (P < 0.05). CONCLUSION: Metabolites associated with atherosclerosis were largely consistent between the two vascular beds (coronary and carotid arteries) and predominantly tag pathways that overlap with the known cardiovascular risk factors.
We present an integrated systems network that highlights a series of inter-connected pathways underlying atherosclerosis.


Subjects
Cardiovascular Diseases/etiology, Carotid Artery Diseases/complications, Carotid Artery Diseases/metabolism, Coronary Artery Disease/complications, Coronary Artery Disease/metabolism, Adult, Aged, Cardiovascular Diseases/blood, Carotid Artery Diseases/blood, Coronary Artery Disease/blood, Female, Humans, Male, Middle Aged, Prospective Studies, Proton Magnetic Resonance Spectroscopy
10.
BMC Bioinformatics ; 20(1): 543, 2019 Nov 04.
Article in English | MEDLINE | ID: mdl-31684857

ABSTRACT

BACKGROUND: Transcriptomic data is often used to build statistical models which are predictive of a given phenotype, such as disease status. Genes work together in pathways and it is widely thought that pathway representations will be more robust to noise in the gene expression levels. We aimed to test this hypothesis by constructing models based on either genes alone, or based on sample specific scores for each pathway, thus transforming the data to a 'pathway space'. We progressively degraded the raw data by addition of noise and examined the ability of the models to maintain predictivity. RESULTS: Models in the pathway space indeed had higher predictive robustness than models in the gene space. This result was independent of the workflow, parameters, classifier and data set used. Surprisingly, randomised pathway mappings produced models of similar accuracy and robustness to true mappings, suggesting that the success of pathway space models is not conferred by the specific definitions of the pathway. Instead, predictive models built on the true pathway mappings led to prediction rules with fewer influential pathways than those built on randomised pathways. The extent of this effect was used to differentiate pathway collections coming from a variety of widely used pathway databases. CONCLUSIONS: Prediction models based on pathway scores are more robust to degradation of gene expression information than the equivalent models based on ungrouped genes. While models based on true pathway scores are not more robust or accurate than those based on randomised pathways, true pathways produced simpler prediction rules, emphasizing a smaller number of pathways.


Subjects
Computational Biology/methods, Gene Expression Profiling, Signal Transduction, Factual Databases, Gene Expression, Humans, Statistical Models, Phenotype, Transcriptome
11.
J Proteome Res ; 17(4): 1586-1595, 2018 Apr 06.
Article in English | MEDLINE | ID: mdl-29457906

ABSTRACT

Metabolism is altered by genetics, diet, disease status, environment, and many other factors. Modeling either one of these is often done without considering the effects of the other covariates. Attributing differences in metabolic profile to one of these factors needs to be done while controlling for the metabolic influence of the rest. We describe here a data analysis framework and novel confounder-adjustment algorithm for multivariate analysis of metabolic profiling data. Using simulated data, we show that similar numbers of true associations and significantly less false positives are found compared to other commonly used methods. Covariate-adjusted projections to latent structures (CA-PLS) are exemplified here using a large-scale metabolic phenotyping study of two Chinese populations at different risks for cardiovascular disease. Using CA-PLS, we find that some previously reported differences are actually associated with external factors and discover a number of previously unreported biomarkers linked to different metabolic pathways. CA-PLS can be applied to any multivariate data where confounding may be an issue and the confounder-adjustment procedure is translatable to other multivariate regression techniques.


Subjects
Biomarkers, Epidemiologic Confounding Factors, Metabolome, Statistical Models, Phenotype, Algorithms, Asian People, Cardiovascular Diseases, Computer Simulation, Humans, Multivariate Analysis, Risk, Spectrum Analysis
12.
Anal Chem ; 90(1): 649-656, 2018 Jan 02.
Article in English | MEDLINE | ID: mdl-29035042

ABSTRACT

NMR is a widely used analytical technique with a growing number of repositories available. As a result, demands for a vendor-agnostic, open data format for long-term archiving of NMR data have emerged with the aim to ease and encourage sharing, comparison, and reuse of NMR data. Here we present nmrML, an open XML-based exchange and storage format for NMR spectral data. The nmrML format is intended to be fully compatible with existing NMR data for chemical, biochemical, and metabolomics experiments. nmrML can capture raw NMR data, spectral data acquisition parameters, and where available spectral metadata, such as chemical structures associated with spectral assignments. The nmrML format is compatible with pure-compound NMR data for reference spectral libraries as well as NMR data from complex biomixtures, i.e., metabolomics experiments. To facilitate format conversions, we provide nmrML converters for Bruker, JEOL and Agilent/Varian vendor formats. In addition, easy-to-use Web-based spectral viewing, processing, and spectral assignment tools that read and write nmrML have been developed. Software libraries and Web services for data validation are available for tool developers and end-users. The nmrML format has already been adopted for capturing and disseminating NMR data for small molecules by several open source data processing tools and metabolomics reference spectral libraries, e.g., serving as storage format for the MetaboLights data repository. The nmrML open access data standard has been endorsed by the Metabolomics Standards Initiative (MSI), and we here encourage user participation and feedback to increase usability and make it a successful standard.


Subjects
Chemical Databases/standards, Magnetic Resonance Spectroscopy/statistics & numerical data, Metabolomics/methods, Software
13.
Metabolomics ; 14(3): 32, 2018 Feb 13.
Article in English | MEDLINE | ID: mdl-30830335

ABSTRACT

INTRODUCTION: Differences in the metabolite profiles between serum and plasma are incompletely understood. OBJECTIVES: To evaluate metabolic profile differences between serum and plasma and among plasma sample subtypes. METHODS: We analyzed serum, platelet rich plasma (PRP), platelet poor plasma (PPP), and platelet free plasma (PFP), collected from 8 non-fasting apparently healthy women, using untargeted standard 1D and CPMG 1H NMR and reverse phase and hydrophilic (HILIC) UPLC-MS. Differences between metabolic profiles were evaluated using validated principal component and orthogonal partial least squares discriminant analysis. RESULTS: Explorative analysis showed the main source of variation among samples was due to inter-individual differences with no grouping by sample type. After correcting for inter-individual differences, lipoproteins, lipids in VLDL/LDL, lactate, glutamine, and glucose were found to discriminate serum from plasma in NMR analyses. In UPLC-MS analyses, lysophosphatidylethanolamine (lysoPE)(18:0) and lysophosphatidic acid(20:0) were higher in serum, and phosphatidylcholines (PC)(16:1/18:2, 20:3/18:0, O-20:0/22:4), lysoPC(16:0), PE(O-18:2/20:4), sphingomyelin(18:0/22:0), and linoleic acid were lower. In plasma subtype analyses, isoleucine, leucine, valine, phenylalanine, glutamate, and pyruvate were higher among PRP samples compared with PPP and PFP by NMR while lipids in VLDL/LDL, citrate, and glutamine were lower. By UPLC-MS, PE(18:0/18:2) and PC(P-16:0/20:4) were higher in PRP compared with PFP samples. CONCLUSIONS: Correction for inter-individual variation was required to detect metabolite differences between serum and plasma. Our results suggest the potential importance of inter-individual effects and sample type on the results from serum and plasma metabolic phenotyping studies.


Subjects
Metabolome, Plasma/chemistry, Serum/chemistry, Adult, Amino Acids/analysis, Blood Glucose/analysis, Female, Humans, Lipids/analysis, Lipoproteins/analysis, Mass Spectrometry, Middle Aged, Proton Magnetic Resonance Spectroscopy
14.
J Proteome Res ; 16(10): 3623-3633, 2017 Oct 06.
Article in English | MEDLINE | ID: mdl-28823158

ABSTRACT

1H NMR spectroscopy of biofluids generates reproducible data allowing detection and quantification of small molecules in large population cohorts. Statistical models to analyze such data are now well-established, and the use of univariate metabolome wide association studies (MWAS) investigating the spectral features separately has emerged as a computationally efficient and interpretable alternative to multivariate models. The MWAS rely on the accurate estimation of a metabolome wide significance level (MWSL) to be applied to control the family wise error rate. Subsequent interpretation requires efficient visualization and formal feature annotation, which, in-turn, call for efficient prioritization of spectral variables of interest. Using human serum 1H NMR spectroscopic profiles from 3948 participants from the Multi-Ethnic Study of Atherosclerosis (MESA), we have performed a series of MWAS for serum levels of glucose. We first propose an extension of the conventional MWSL that yields stable estimates of the MWSL across the different model parameterizations and distributional features of the outcome. We propose both efficient visualization methods and a strategy based on subsampling and internal validation to prioritize the associations. Our work proposes and illustrates practical and scalable solutions to facilitate the implementation of the MWAS approach and improve interpretation in large cohort studies.


Subjects
Atherosclerosis/blood, Metabolome/genetics, Metabolomics, Adult, Aged, Aged 80 and over, Atherosclerosis/epidemiology, Atherosclerosis/pathology, Blood Glucose, Cohort Studies, Female, Humans, Male, Middle Aged, Proton Magnetic Resonance Spectroscopy
16.
J Proteome Res ; 15(12): 4188-4194, 2016 Dec 02.
Article in English | MEDLINE | ID: mdl-27628670

ABSTRACT

Large-scale metabolomics studies involving thousands of samples present multiple challenges in data analysis, particularly when an untargeted platform is used. Studies with multiple cohorts and analysis platforms exacerbate existing problems such as peak alignment and normalization. Therefore, there is a need for robust processing pipelines that can ensure reliable data for statistical analysis. The COMBI-BIO project incorporates serum from ∼8000 individuals, in three cohorts, profiled by six assays in two phases using both 1H NMR and UPLC-MS. Here we present the COMBI-BIO NMR analysis pipeline and demonstrate its fitness for purpose using representative quality control (QC) samples. NMR spectra were first aligned and normalized. After eliminating interfering signals, outliers identified using Hotelling's T² were removed and a cohort/phase adjustment was applied, resulting in two NMR data sets (CPMG and NOESY). Alignment of the NMR data was shown to increase the correlation-based alignment quality measure from 0.319 to 0.391 for CPMG and from 0.536 to 0.586 for NOESY, showing that the improvement was present across both large and small peaks. End-to-end quality assessment of the pipeline was achieved using Hotelling's T² distributions. For CPMG spectra, the interquartile range decreased from 1.425 in raw QC data to 0.679 in processed spectra, while the corresponding change for NOESY spectra was from 0.795 to 0.636, indicating an improvement in precision following processing. PCA indicated that gross phase and cohort differences were no longer present. These results illustrate that the pipeline produces robust and reproducible data, successfully addressing the methodological challenges of this large multifaceted study.
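
To make the quality measure above concrete, the sketch below computes Hotelling's T² for each spectrum from its PCA scores, using the usual score-space form (the sum over components of the squared score divided by that component's score variance), and then the interquartile range of the resulting T² values. The PCA step itself, the number of components and any outlier cut-off are not given in the abstract, so the inputs and function names here are assumptions.

```python
import statistics


def hotelling_t2(scores):
    """Hotelling's T² per sample from mean-centred PCA scores.

    scores: list of rows, one row of component scores per spectrum.
    """
    n_components = len(scores[0])
    # Sample variance of each component's scores across all spectra.
    variances = [
        statistics.variance([row[a] for row in scores]) for a in range(n_components)
    ]
    return [
        sum(row[a] ** 2 / variances[a] for a in range(n_components)) for row in scores
    ]


def interquartile_range(values):
    """IQR using the default (exclusive) method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


example_scores = [[-1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [0.0, 0.0]]
example_t2 = hotelling_t2(example_scores)
example_iqr = interquartile_range([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

A narrower spread of T² values among repeated QC samples after processing, as reported in the abstract, indicates that the QC spectra have become more similar to one another.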


Subjects
Statistical Data Interpretation, Metabolomics/methods, Proton Magnetic Resonance Spectroscopy/methods, Humans, Metabolomics/instrumentation, Metabolomics/statistics & numerical data, Molecular Epidemiology, Proton Magnetic Resonance Spectroscopy/standards, Proton Magnetic Resonance Spectroscopy/statistics & numerical data, Quality Control, Reproducibility of Results, Workflow
17.
Anal Chem ; 88(5): 2583-9, 2016 Mar 01.
Article in English | MEDLINE | ID: mdl-26824414

ABSTRACT

NMR spectroscopy and mass spectrometry are the two major analytical platforms for metabolomics, and both generate substantial data with hundreds to thousands of observed peaks for a single sample. Many of these are unknown, and peak assignment is generally complex and time-consuming. Statistical correlations between data types have proven useful in expediting this process, for example, in prioritizing candidate assignments. However, this approach has not been formally assessed for the comparison of direct-infusion mass spectrometry (DIMS) and NMR data. Here, we present a systematic analysis of a sample set (tissue extracts), and the utility of a simple correlation threshold to aid metabolite identification. The correlations were surprisingly successful in linking structurally related signals, with 15 of 26 NMR-detectable metabolites having their highest correlation to a cognate MS ion. However, we found that the distribution of the correlations was highly dependent on the nature of the MS ion, such as the adduct type. This approach should help to alleviate this important bottleneck where both 1D NMR and DIMS data sets have been collected.


Subjects
Magnetic Resonance Spectroscopy , Mass Spectrometry , Oligochaeta/metabolism , Tissue Extracts/analysis , Animals , Lysine/analogs & derivatives , Lysine/analysis , Metabolomics , Oligochaeta/chemistry , Serine/analogs & derivatives , Serine/analysis , Statistics as Topic , Succinic Acid/analysis
18.
Anal Chem ; 88(10): 5179-88, 2016 May 17.
Article in English | MEDLINE | ID: mdl-27116637

ABSTRACT

Estimation of statistical power and sample size is a key aspect of experimental design. However, in metabolic phenotyping, there is currently no accepted approach for these tasks, in large part due to the unknown nature of the expected effect. In such hypothesis-free science, neither the number or class of important analytes nor the effect size is known a priori. We introduce a new approach, based on multivariate simulation, which deals effectively with the highly correlated structure and high dimensionality of metabolic phenotyping data. First, a large data set is simulated based on the characteristics of a pilot study investigating a given biomedical issue. An effect of a given size, corresponding either to a discrete (classification) or continuous (regression) outcome, is then added. Different sample sizes are modeled by randomly selecting data sets of various sizes from the simulated data. We investigate different methods for effect detection, including univariate and multivariate techniques. Our framework allows us to investigate the complex relationship between sample size, power, and effect size for real multivariate data sets. For instance, we demonstrate for an example pilot data set that certain features achieve a power of 0.8 for a sample size of 20 samples or that a cross-validated predictivity Q²Y of 0.8 is reached with an effect size of 0.2 and 200 samples. We exemplify the approach for both nuclear magnetic resonance and liquid chromatography-mass spectrometry data from humans and the model organism C. elegans.
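
The simulate, add an effect, subsample, and test loop described in this abstract can be illustrated with a deliberately simplified sketch. It handles a single feature rather than correlated multivariate data, draws the simulated pool from a standard normal distribution instead of pilot-study characteristics, and uses a two-sample z-test with a normal approximation. All of these are assumptions made to keep the example short, and the function name is hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def estimate_power(population, effect, n_per_group,
                   n_repeats=200, alpha=0.05, seed=0):
    """Estimate power as the fraction of subsamples in which the effect is detected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_repeats):
        # Randomly select a data set of the requested size from the simulated pool.
        sample = rng.sample(population, 2 * n_per_group)
        controls = sample[:n_per_group]
        # Add an effect of the given size to the second group.
        cases = [v + effect for v in sample[n_per_group:]]
        se = ((variance(controls) + variance(cases)) / n_per_group) ** 0.5
        z = (mean(cases) - mean(controls)) / se
        hits += abs(z) > z_crit
    return hits / n_repeats


# Simulated pool standing in for data generated from a pilot study (assumption).
pool_rng = random.Random(42)
pool = [pool_rng.gauss(0.0, 1.0) for _ in range(5000)]

for n in (5, 20, 50):
    print(n, estimate_power(pool, effect=1.0, n_per_group=n))
```

Repeating the loop over a grid of effect sizes and sample sizes gives the kind of power surface the abstract refers to; a multivariate version would replace the z-test with a model-based detection method.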


Subjects
Metabolome , Metabolomics/statistics & numerical data , Multivariate Analysis , Animals , Caenorhabditis elegans , Datasets as Topic/statistics & numerical data , Humans , Models, Statistical , Preliminary Data , Sample Size
19.
Bioinformatics ; 31(1): 102-8, 2015 Jan 01.
Article in English | MEDLINE | ID: mdl-25183485

ABSTRACT

MOTIVATION: Often during the analysis of biological data, it is of importance to interpret the correlation structure that exists between variables. Such correlations may reveal patterns of co-regulation that are indicative of biochemical pathways or common mechanisms of response to a related set of treatments. However, analyses of correlations are usually conducted by either subjective interpretation of the univariate covariance matrix or by applying multivariate modeling techniques, which do not take prior biological knowledge into account. Over-representation analysis (ORA) is a simple method for objectively deciding whether a set of variables of known or suspected biological relevance, such as a gene set or pathway, is more prevalent in a set of variables of interest than we expect by chance. However, ORA is usually applied to a set of variables differentiating a single experimental variable and does not take into account correlations. RESULTS: Over-representation of correlation analysis (ORCA) is a novel combination of ORA and correlation analysis that provides a means to test whether more associations exist between two specific groups of variables than expected by chance. The method is exemplified by application to drug sensitivity and microRNA expression data from a panel of cancer cell lines (NCI60). ORCA highlighted a previously reported correlation between sensitivity to alkylating anticancer agents and topoisomerase inhibitors. We also used this approach to validate microRNA clusters predicted by mRNA correlations. These observations suggest that ORCA has the potential to reveal novel insights from these data, which are not readily apparent using classical ORA. AVAILABILITY AND IMPLEMENTATION: The R code of the method is available at https://github.com/ORCABioinfo/ORCAcode.
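
Over-representation tests of the kind described in this abstract are commonly computed as a hypergeometric tail probability. The sketch below applies that test to counts of variable pairs, asking whether significant correlations are more frequent among pairs linking two groups than expected by chance. It is an assumption-laden illustration, not the authors' ORCA code (which the abstract says is available in R); the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_sf(k, total, successes, draws):
    """P(X >= k) for a hypergeometric variable.

    total     : all variable pairs considered
    successes : pairs with a significant correlation
    draws     : pairs linking the two groups of interest
    k         : significant pairs observed among those draws
    """
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(total - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(total, draws)


# Hypothetical counts: 1000 pairs overall, 50 significant;
# 40 pairs link group A to group B, and 8 of those are significant.
p_value = hypergeom_sf(8, 1000, 50, 40)
print(f"over-representation p-value: {p_value:.3g}")
```

Treating pairs as independent draws is itself a simplification, since correlations that share a variable are not independent.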


Subjects
Biomarkers, Tumor/analysis , Computational Biology/methods , Datasets as Topic , MicroRNAs/genetics , Molecular Sequence Annotation/methods , Neoplasms/genetics , Alkylating Agents/pharmacology , Data Interpretation, Statistical , Databases, Factual , Enzyme Inhibitors/pharmacology , Gene Expression Profiling , Genomics , Humans , Neoplasms/drug therapy , Tumor Cells, Cultured
20.
Bioinformatics ; 31(9): 1505-7, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25505093

ABSTRACT

MOTIVATION: The field of toxicogenomics (the application of '-omics' technologies to risk assessment of compound toxicities) has expanded in the last decade, partly driven by new legislation aimed at reducing animal testing in chemical risk assessment, but mainly as a result of a paradigm change in toxicology towards the use and integration of genome-wide data. Many research groups worldwide have generated large amounts of such toxicogenomics data. However, there is no centralized repository for archiving these data and making them, together with associated analysis tools, easily available. RESULTS: The Data Infrastructure for Chemical Safety Assessment (diXa) is a robust and sustainable infrastructure storing toxicogenomics data. A central data warehouse is connected to a portal with links to chemical information and molecular and phenotype data. diXa is publicly available through a user-friendly web interface. New data can be readily deposited into diXa using guidelines and templates available online. Analysis descriptions and tools for interrogating the data are available via the diXa portal. AVAILABILITY AND IMPLEMENTATION: http://www.dixa-fp7.eu CONTACT: d.hendrickx@maastrichtuniversity.nl; info@dixa-fp7.eu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Databases, Chemical , Toxicogenetics , Animals , Gene Expression Profiling , Humans , Metabolomics , Proteomics , Rats