ABSTRACT
MOTIVATION: Random forests (RFs) can deal with a large number of variables, achieve reasonable prediction scores, and yield highly interpretable feature importance values. As such, RFs are appropriate models for feature selection and further dimension reduction. However, RFs are often not appropriate for correlated datasets due to their mode of selecting individual features for splitting. Addressing correlation relationships in high-dimensional datasets is imperative for reducing the number of variables that are assigned high importance, hence making the dimension reduction most efficient. Here, we propose the LAtent VAriable Stochastic Ensemble of Trees (LAVASET) method that derives latent variables based on the distance characteristics of each feature and aims to incorporate the correlation factor in the splitting step. RESULTS: Without compromising on performance in the majority of examples, LAVASET outperforms RF by accurately determining feature importance across all correlated variables and ensuring proper distribution of importance values. LAVASET yields mostly non-inferior prediction accuracies to traditional RFs when tested in simulated and real 1D datasets, as well as more complex and high-dimensional 3D datatypes. Unlike traditional RFs, LAVASET is unaffected by single 'important' noisy features (false positives), as it considers the local neighbourhood. LAVASET, therefore, highlights neighbourhoods of features, reflecting real signals that collectively impact the model's predictive ability. AVAILABILITY AND IMPLEMENTATION: LAVASET is freely available as a standalone package from https://github.com/melkasapi/LAVASET.
ABSTRACT
Metabolic profiling (metabolomics) aims at measuring small molecules (metabolites) in complex samples like blood or urine for human health studies. While biomarker-based assessment often relies on a single molecule, metabolic profiling combines several metabolites to create a more complex and more specific fingerprint of the disease. However, in contrast to genomics, there is no unique metabolomics setup able to measure the entire metabolome. This challenge leads to tedious and resource consuming preliminary studies to be able to design the right metabolomics experiment. In that context, computer assisted metabolic profiling can be of strong added value to design metabolomics studies more quickly and efficiently. We propose a constraint-based modelling approach which predicts in silico profiles of metabolites that are more likely to be differentially abundant under a given metabolic perturbation (e.g. due to a genetic disease), using flux simulation. In genome-scale metabolic networks, the fluxes of exchange reactions, also known as the flow of metabolites through their external transport reactions, can be simulated and compared between control and disease conditions in order to calculate changes in metabolite import and export. These import/export flux differences would be expected to induce changes in circulating biofluid levels of those metabolites, which can then be interpreted as potential biomarkers or metabolites of interest. In this study, we present SAMBA (SAMpling Biomarker Analysis), an approach which simulates fluxes in exchange reactions following a metabolic perturbation using random sampling, compares the simulated flux distributions between the baseline and modulated conditions, and ranks predicted differentially exchanged metabolites as potential biomarkers for the perturbation. 
We show that there is a good fit between simulated metabolic exchange profiles and experimental differential metabolites detected in plasma, such as patient data from the disease database OMIM, and metabolic trait-SNP associations found in mGWAS studies. These biomarker recommendations can provide insight into the underlying mechanism or metabolic pathway perturbation lying behind observed metabolite differential abundances, and suggest new metabolites as potential avenues for further experimental analyses.
Subjects
Metabolome , Metabolomics , Humans , Metabolome/genetics , Genome , Metabolic Networks and Pathways , Biomarkers
ABSTRACT
As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integration methods typically output lists, clusters, or subnetworks of molecules related to an outcome. Even with expert domain knowledge, discerning the biological processes involved is a time-consuming activity. Here we propose PathIntegrate, a method for integrating multi-omics datasets based on pathways, designed to exploit knowledge of biological systems and thus provide interpretable models for such studies. PathIntegrate employs single-sample pathway analysis to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data. Model outputs include multi-omics pathways ranked by their contribution to the outcome prediction, the contribution of each omics layer, and the importance of each molecule in a pathway. Using semi-synthetic data we demonstrate the benefit of grouping molecules into pathways to detect signals in low signal-to-noise scenarios, as well as the ability of PathIntegrate to precisely identify important pathways at low effect sizes. Finally, using COPD and COVID-19 data we showcase how PathIntegrate enables convenient integration and interpretation of complex high-dimensional multi-omics datasets. PathIntegrate is available as an open-source Python package.
Subjects
Genomics , Multiomics , Genomics/methods
ABSTRACT
J-Resolved (J-Res) nuclear magnetic resonance (NMR) spectroscopy is pivotal in NMR-based metabolomics, but practitioners face a choice between time-consuming high-resolution (HR) experiments or shorter low-resolution (LR) experiments which exhibit significant peak overlap. Deep learning neural networks have been successfully used in many fields to enhance quality of natural images, especially with regard to resolution, and therefore offer the prospect of improving two-dimensional (2D) NMR data. Here, we introduce the J-RESRGAN, an adapted and modified generative adversarial network (GAN) for image super-resolution (SR), which we trained specifically for metabolomic J-Res spectra to enhance peak resolution. A novel symmetric loss function was introduced, exploiting the inherent vertical symmetry of J-Res NMR spectra. Model training used simulated high-resolution J-Res spectra of complex mixtures, with corresponding low-resolution spectra generated via blurring and down-sampling. Evaluation of peak pair resolvability on J-RESRGAN demonstrated remarkable improvement in resolution across a variety of samples. In simulated plasma data, 100% of peak pairs exhibited enhanced resolution in super-resolution spectra compared to their low-resolution counterparts. Similarly, enhanced resolution was observed in 80.8-100% of peak pairs in experimental plasma, 85.0-96.7% in urine, 94.4-98.9% in full fat milk, and 82.6-91.7% in orange juice. J-RESRGAN is not sample type, spectrometer or field strength dependent and improvements on previously acquired data can be seen in seconds on a standard desktop computer. We believe this demonstrates the promise of deep learning methods to enhance NMR metabolomic data, and in particular, the power of J-RESRGAN to elucidate overlapping peaks, advancing precision in a wide variety of NMR-based metabolomics studies. The model, J-RESRGAN, is openly accessible for download on GitHub at https://github.com/yanyan5420/J-RESRGAN.
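The idea of a loss term that exploits the vertical symmetry of J-Res spectra can be illustrated with a small sketch. This is not the J-RESRGAN loss itself, whose exact form the abstract does not give; `symmetry_penalty` is a hypothetical helper that assumes rows index the J-coupling axis and that the spectrum should mirror about its horizontal midline.

```python
def symmetry_penalty(spectrum):
    """Mean absolute difference between a 2D spectrum and its mirror image.

    Assumption: `spectrum` is a list of rows, rows index the J-coupling axis,
    and an ideal spectrum is symmetric about the horizontal midline.
    """
    n_rows = len(spectrum)
    n_cols = len(spectrum[0])
    total = 0.0
    for r in range(n_rows):
        mirrored_row = spectrum[n_rows - 1 - r]  # row reflected about the midline
        for c in range(n_cols):
            total += abs(spectrum[r][c] - mirrored_row[c])
    return total / (n_rows * n_cols)


# A symmetric toy multiplet and a distorted copy of it.
symmetric = [[0.0, 1.0, 0.0], [0.0, 5.0, 0.0], [0.0, 1.0, 0.0]]
skewed = [[0.0, 3.0, 0.0], [0.0, 5.0, 0.0], [0.0, 1.0, 0.0]]

penalty_symmetric = symmetry_penalty(symmetric)
penalty_skewed = symmetry_penalty(skewed)
```

In a training setting, a term of this kind would presumably be added to the usual generator losses so that asymmetric outputs are penalised; the weighting is left open here.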
Subjects
Deep Learning , Magnetic Resonance Spectroscopy , Metabolomics , Metabolomics/methods , Magnetic Resonance Spectroscopy/methods , Animals , Humans
ABSTRACT
BACKGROUND: Avocado consumption is linked to better glucose homeostasis, but small associations suggest potential population heterogeneity. Metabolomic data capture the effects of food intake after digestion and metabolism, thus accounting for individual differences in these processes. OBJECTIVES: To identify metabolomic biomarkers of avocado intake and to examine their associations with glycemia. METHODS: Baseline data from 6224 multi-ethnic older adults (62% female) included self-reported avocado intake, fasting glucose and insulin, and untargeted plasma proton nuclear magnetic resonance metabolomic features (metabolomic data were available for a randomly selected subset; N = 3438). Subsequently, incident type 2 diabetes (T2D) was assessed over an approximately 18 y follow-up period. A metabolome-wide association study of avocado consumption status (consumer compared with nonconsumer) was conducted, and the relationship of these features with glycemia via cross-sectional associations with fasting insulin and glucose and longitudinal associations with incident T2D was examined. RESULTS: Three highly correlated spectral features were associated with avocado intake at metabolome-wide significance levels (P < 5.3 × 10^-7) and combined into a single biomarker. We did not find evidence that these features were additionally associated with overall dietary quality, nor with any of 47 other food groups (all P > 0.001), supporting their suitability as a biomarker of avocado intake. Avocado intake showed a modest association only with lower fasting insulin (β = -0.07 ± 0.03, P = 0.03), an association that was attenuated to nonsignificance when additionally controlling for body mass index (BMI; kg/m²).
However, our biomarker of avocado intake was strongly associated with lower fasting glucose (β = -0.22 ± 0.02, P < 2.0 × 10^-16), lower fasting insulin (β = -0.17 ± 0.02, P < 2.0 × 10^-16), and a lower incidence of T2D (hazard ratio: 0.68; 0.63-0.74, P < 2.0 × 10^-16), even when adjusting for BMI. CONCLUSIONS: Highly significant associations between glycemia and avocado-related metabolomic features, which serve as biomarkers of the physiological impact of dietary intake after digestion and absorption, compared to modest relationships between glycemia and avocado consumption, highlight the importance of considering individual differences in metabolism when considering diet-health relationships.
Subjects
Atherosclerosis , Diabetes Mellitus, Type 2 , Persea , Humans , Female , Aged , Male , Diabetes Mellitus, Type 2/epidemiology , Risk Factors , Cross-Sectional Studies , Biomarkers , Insulin , Glucose
ABSTRACT
BACKGROUND: Single-sample pathway analysis (ssPA) transforms molecular-level omics data to the pathway level, enabling the discovery of patient-specific pathway signatures. Compared to conventional pathway analysis, ssPA overcomes its limitations by enabling multi-group comparisons and facilitating numerous downstream analyses such as pathway-based machine learning. While in transcriptomics ssPA is a widely used technique, there is little literature evaluating its suitability for metabolomics. Here we provide a benchmark of established ssPA methods (ssGSEA, GSVA, SVD (PLAGE), and z-score) alongside the evaluation of two novel methods we propose: ssClustPA and kPCA, using semi-synthetic metabolomics data. We then demonstrate how ssPA can facilitate pathway-based interpretation of metabolomics data by performing a case study on inflammatory bowel disease mass spectrometry data, using clustering to determine subtype-specific pathway signatures. RESULTS: While GSEA-based and z-score methods outperformed the others in terms of recall, clustering/dimensionality reduction-based methods provided higher precision at moderate-to-high effect sizes. A case study applying ssPA to inflammatory bowel disease data demonstrates how these methods yield a much richer depth of interpretation than conventional approaches, for example by clustering pathway scores to visualise a pathway-based patient subtype-specific correlation network. We also developed the sspa Python package (freely available at https://pypi.org/project/sspa/ ), providing implementations of all the methods benchmarked in this study. CONCLUSION: This work underscores the value ssPA methods can add to metabolomic studies and provides a useful reference for those wishing to apply ssPA methods to metabolomics data.
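To make the molecular-to-pathway transformation concrete, the following is a minimal sketch of the z-score approach as it is commonly described: each metabolite is standardised across samples, and a pathway score is the sum of its members' z-scores divided by the square root of the pathway size. It does not use or reproduce the sspa package API; the function name and the input layout (plain dictionaries) are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_pathway_scores(data, pathways):
    """Per-sample pathway scores from metabolite abundances.

    data: dict mapping metabolite name -> list of abundances (one per sample).
    pathways: dict mapping pathway name -> list of metabolite names.
    Assumption: every metabolite varies across samples (non-zero std. dev.).
    """
    # Standardise each metabolite across samples.
    z = {}
    for metabolite, values in data.items():
        mu, sd = mean(values), stdev(values)
        z[metabolite] = [(v - mu) / sd for v in values]

    n_samples = len(next(iter(data.values())))
    scores = {}
    for name, members in pathways.items():
        present = [m for m in members if m in z]  # ignore unmeasured members
        if not present:
            continue
        scale = sqrt(len(present))
        scores[name] = [
            sum(z[m][i] for m in present) / scale for i in range(n_samples)
        ]
    return scores


# Toy example: three samples, three metabolites, one pathway.
toy_data = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [3.0, 2.0, 1.0]}
toy_pathways = {"P1": ["a", "b"], "P2": ["not_measured"]}
toy_scores = zscore_pathway_scores(toy_data, toy_pathways)
```

The resulting samples-by-pathways matrix can then be passed to any downstream method, such as clustering or a classifier, in place of the metabolite-level matrix.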
Subjects
Inflammatory Bowel Diseases , Metabolomics , Humans , Metabolomics/methods , Transcriptome , Cluster Analysis , Mass Spectrometry
ABSTRACT
Prostate cancer (PCa) is the most common cancer form in males in many European and American countries, but there are still open questions regarding its etiology. Untargeted metabolomics can produce an unbiased global metabolic profile, with the opportunity for uncovering new plasma metabolites prospectively associated with risk of PCa, providing insights into disease etiology. We conducted a prospective untargeted liquid chromatography-mass spectrometry (LC-MS) metabolomics analysis using prediagnostic fasting plasma samples from 752 PCa case-control pairs nested within the Northern Sweden Health and Disease Study (NSHDS). The pairs were matched by age, BMI, and sample storage time. Discriminating features were identified by a combination of orthogonal projection to latent structures-effect projections (OPLS-EP) and Wilcoxon signed-rank tests. Their prospective associations with PCa risk were investigated by conditional logistic regression. Subgroup analyses based on stratification by disease aggressiveness and baseline age were also conducted. Various free fatty acids and phospholipids were positively associated with overall risk of PCa and in various stratification subgroups. Aromatic amino acids were positively associated with overall risk of PCa. Uric acid was positively, and glucose negatively, associated with risk of PCa in the older subgroup. This is the largest untargeted LC-MS based metabolomics study to date on plasma metabolites prospectively associated with risk of developing PCa. Different subgroups of disease aggressiveness and baseline age showed different associations with metabolites. The findings suggest that shifts in plasma concentrations of metabolites in lipid, aromatic amino acid, and glucose metabolism are associated with risk of developing PCa during the following two decades.
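For 1:1 matched case-control pairs, the conditional logistic likelihood depends only on the within-pair difference in exposure: each pair contributes 1 / (1 + exp(-β(x_case - x_control))). The sketch below fits a single coefficient by Newton-Raphson under that simplification. It is an illustration with made-up numbers, not the study's analysis code; the function name and the toy data are hypothetical, and no covariate adjustment is included.

```python
from math import exp


def fit_matched_pairs(case_values, control_values, max_iter=50, tol=1e-10):
    """Conditional logistic regression for 1:1 matched pairs, one exposure.

    Returns the log odds ratio per unit of exposure. Assumes the pair
    differences have mixed signs, otherwise the estimate is not finite.
    """
    diffs = [a - b for a, b in zip(case_values, control_values)]
    beta = 0.0
    for _ in range(max_iter):
        grad = 0.0
        hess = 0.0
        for d in diffs:
            p = 1.0 / (1.0 + exp(-beta * d))  # pair's likelihood contribution
            grad += d * (1.0 - p)             # first derivative of log-likelihood
            hess -= d * d * p * (1.0 - p)     # second derivative (always <= 0)
        if hess == 0.0:
            break
        step = grad / hess
        beta -= step                          # Newton-Raphson update
        if abs(step) < tol:
            break
    return beta


# Hypothetical standardised metabolite levels for five matched pairs.
cases = [5.0, 6.0, 4.0, 7.0, 4.5]
controls = [4.0, 5.0, 5.0, 5.0, 5.0]
beta_hat = fit_matched_pairs(cases, controls)
odds_ratio = exp(beta_hat)

# Perfectly balanced differences give no association.
beta_null = fit_matched_pairs([1.0, 2.0], [2.0, 1.0])
```

An odds ratio above 1 corresponds to a positive association of the metabolite with risk, and below 1 to a negative one.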
Subjects
Fatty Acids, Nonesterified , Prostatic Neoplasms , Male , Humans , Case-Control Studies , Uric Acid , Sweden/epidemiology , Metabolomics/methods , Mass Spectrometry , Prostatic Neoplasms/diagnosis , Prostatic Neoplasms/epidemiology , Amino Acids, Aromatic , Glucose
ABSTRACT
Untargeted metabolomics and lipidomics LC-MS experiments produce complex datasets, usually containing tens of thousands of features from thousands of metabolites whose annotation requires additional MS/MS experiments and expert knowledge. All-ion fragmentation (AIF) LC-MS/MS acquisition provides fragmentation data at no additional experimental time cost. However, analysis of such datasets requires reconstruction of parent-fragment relationships and annotation of the resulting pseudo-MS/MS spectra. Here, we propose a novel approach for automated annotation of isotopologues, adducts, and in-source fragments from AIF LC-MS datasets by combining correlation-based parent-fragment linking with molecular fragment matching. Our workflow focuses on a subset of features rather than trying to annotate the full dataset, saving time and simplifying the process. We demonstrate the workflow in three human serum datasets containing 599 features manually annotated by experts. Precision and recall values of 82-92% and 82-85%, respectively, were obtained for features found in the highest-rank scores (1-5). These results equal or outperform those obtained using MS-DIAL software, the current state of the art for AIF data annotation. Further validation for other biological matrices and different instrument types showed variable precision (60-89%) and recall (10-88%) particularly for datasets dominated by nonlipid metabolites. The workflow is freely available as an open-source R package, MetaboAnnotatoR, together with the fragment libraries from Github (https://github.com/gggraca/MetaboAnnotatoR).
Subjects
Metabolomics , Tandem Mass Spectrometry , Chromatography, Liquid/methods , Humans , Metabolomics/methods , Software , Tandem Mass Spectrometry/methods , Workflow
ABSTRACT
Integration of multiple datasets can greatly enhance bioanalytical studies, for example, by increasing power to discover and validate biomarkers. In liquid chromatography-mass spectrometry (LC-MS) metabolomics, it is especially hard to combine untargeted datasets since the majority of metabolomic features are not annotated and thus cannot be matched by chemical identity. Typically, the information available for each feature is retention time (RT), mass-to-charge ratio (m/z), and feature intensity (FI). Pairs of features from the same metabolite in separate datasets can exhibit small but significant differences, making matching very challenging. Current methods to address this issue are too simple or rely on assumptions that cannot be met in all cases. We present a method to find feature correspondence between two similar LC-MS metabolomics experiments or batches using only the features' RT, m/z, and FI. We demonstrate the method on both real and synthetic datasets, using six orthogonal validation strategies to gauge the matching quality. In our main example, 4953 features were uniquely matched, and 585 (96.8%) of 604 manually annotated features were correctly matched. In a second example, 2324 features could be uniquely matched, with 79 (90.8%) out of 87 annotated features correctly matched. Most of the missed annotated matches are between features that behave very differently from the modeled inter-dataset shifts of RT, m/z, and FI. In a third example with simulated data with 4755 features per dataset, 99.6% of the matches were correct. Finally, the results of matching three other dataset pairs using our method are compared with a published alternative method, metabCombiner, showing the advantages of our approach. The method can be applied using M2S (Match 2 Sets), a free, open-source MATLAB toolbox, available at https://github.com/rjdossan/M2S.
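For orientation, the simplest kind of matching that the abstract contrasts itself with can be sketched as a fixed-tolerance search: two features are candidates if their RT and m/z fall within set windows, and only unambiguous one-to-one candidates are kept. This is not the M2S algorithm, which additionally models inter-dataset shifts and uses FI; the function name, the tolerances, and the toy features are assumptions.

```python
from collections import Counter


def match_features(ref, target, rt_tol=0.1, mz_ppm=10.0):
    """Naive matching of (RT, m/z, FI) features between two datasets.

    ref, target: lists of (rt, mz, intensity) tuples.
    Returns index pairs (i, j) that are each other's only candidate.
    """
    candidates = []
    for i, (rt_a, mz_a, _) in enumerate(ref):
        for j, (rt_b, mz_b, _) in enumerate(target):
            ppm_error = abs(mz_a - mz_b) / mz_a * 1e6
            if abs(rt_a - rt_b) <= rt_tol and ppm_error <= mz_ppm:
                candidates.append((i, j))

    # Keep only pairs where neither feature has a competing candidate.
    ref_counts = Counter(i for i, _ in candidates)
    target_counts = Counter(j for _, j in candidates)
    return [
        (i, j)
        for i, j in candidates
        if ref_counts[i] == 1 and target_counts[j] == 1
    ]


# Toy data: RT in minutes, m/z, intensity (intensity is unused here).
reference = [(1.00, 100.0000, 5e4), (2.00, 200.0000, 1e4), (3.00, 300.0000, 1e3)]
other = [
    (1.02, 100.0005, 4e4),  # unique match for reference[0]
    (2.05, 200.0010, 9e3),  # two candidates compete for reference[1]
    (2.03, 200.0005, 8e3),
    (5.00, 300.0000, 1e3),  # same m/z as reference[2] but RT too far away
]
unique_matches = match_features(reference, other)
```

The ambiguous and drifted cases in the toy data show why fixed windows alone discard or miss features, which is the gap a shift-aware method aims to close.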
Subjects
Metabolomics , Biomarkers/analysis , Chromatography, Liquid/methods , Mass Spectrometry/methods , Metabolomics/methods
ABSTRACT
BACKGROUND: Compound identification remains a critical bottleneck in the process of exploiting Nuclear Magnetic Resonance (NMR) metabolomics data, especially for 1H 1-dimensional (1H 1D) data. As databases of reference compound spectra have grown, workflows have evolved to rely heavily on their search functions to facilitate this process by generating lists of potential metabolites found in complex mixture data, facilitating annotation and identification. However, approaches for validating and communicating annotations are most often guided by expert knowledge, and therefore are highly variable despite repeated efforts to align practices and define community standards. AIM OF REVIEW: This review is aimed at broadening the application of automated annotation tools by discussing the key ideas of spectral matching and beginning to describe a set of terms to classify this information, thus advancing standards for communicating annotation confidence. Additionally, we hope that this review will facilitate the growing collaboration between chemical data scientists, software developers and the NMR metabolomics community aiding development of long-term software solutions. KEY SCIENTIFIC CONCEPTS OF REVIEW: We begin with a brief discussion of the typical untargeted NMR identification workflow. We differentiate between annotation (hypothesis generation, filtering), and identification (hypothesis testing, verification), and note the utility of different NMR data features for annotation. We then touch on three parts of annotation: (1) generation of queries, (2) matching queries to reference data, and (3) scoring and confidence estimation of potential matches for verification. In doing so, we highlight existing approaches to automated and semi-automated annotation from the perspective of the structural information they utilize, as well as how this information can be represented computationally.
Subjects
Metabolomics , Software , Metabolomics/methods , Magnetic Resonance Spectroscopy/methods , Magnetic Resonance Imaging , Databases, Factual
ABSTRACT
Over-representation analysis (ORA) is one of the commonest pathway analysis approaches used for the functional interpretation of metabolomics datasets. Despite the widespread use of ORA in metabolomics, the community lacks guidelines detailing its best-practice use. Many factors have a pronounced impact on the results, but to date their effects have received little systematic attention. Using five publicly available datasets, we demonstrated that changes in parameters such as the background set, differential metabolite selection methods, and pathway database used can result in profoundly different ORA results. The use of a non-assay-specific background set, for example, resulted in large numbers of false-positive pathways. Pathway database choice, evaluated using three of the most popular metabolic pathway databases (KEGG, Reactome, and BioCyc), led to vastly different results in both the number and function of significantly enriched pathways. Factors that are specific to metabolomics data, such as the reliability of compound identification and the chemical bias of different analytical platforms also impacted ORA results. Simulated metabolite misidentification rates as low as 4% resulted in both gain of false-positive pathways and loss of truly significant pathways across all datasets. Our results have several practical implications for ORA users, as well as those using alternative pathway analysis methods. We offer a set of recommendations for the use of ORA in metabolomics, alongside a set of minimal reporting guidelines, as a first step towards the standardisation of pathway analysis in metabolomics.
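The sensitivity of ORA to the background set follows directly from the test it is usually based on, a right-tailed hypergeometric (one-sided Fisher) test. The sketch below computes that p-value with the standard library and compares an assay-specific background with a much larger generic one. The function name and all counts are invented for illustration and are not taken from the study's datasets.

```python
from math import comb


def ora_p_value(background_size, pathway_size, n_selected, n_overlap):
    """Right-tailed hypergeometric p-value, P(overlap >= n_overlap).

    background_size: compounds in the background set.
    pathway_size: background compounds that belong to the pathway.
    n_selected: differential compounds of interest.
    n_overlap: differential compounds that belong to the pathway.
    """
    max_overlap = min(pathway_size, n_selected)
    tail = sum(
        comb(pathway_size, k)
        * comb(background_size - pathway_size, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / comb(background_size, n_selected)


# Same pathway and same hits, two hypothetical background choices.
p_assay_specific = ora_p_value(
    background_size=200, pathway_size=20, n_selected=30, n_overlap=8
)
p_generic = ora_p_value(
    background_size=2000, pathway_size=20, n_selected=30, n_overlap=8
)
```

With the larger generic background the same overlap looks far less likely by chance, so the p-value shrinks, which is consistent with the inflation of false-positive pathways reported for non-assay-specific backgrounds.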
Subjects
Metabolomics , Computational Biology/methods , Datasets as Topic , Metabolic Networks and Pathways , Reproducibility of Results
ABSTRACT
BACKGROUND: The search for statistically significant relationships between molecular markers and outcomes is challenging when dealing with high-dimensional, noisy and collinear multivariate omics data, such as metabolomic profiles. Permutation procedures allow for the estimation of adjusted significance levels without assuming independence among metabolomic variables. Nevertheless, the complex non-normal structure of metabolic profiles and outcomes may bias the permutation results leading to overly conservative threshold estimates i.e. lower than those from a Bonferroni or Sidak correction. METHODS: Within a univariate permutation procedure we employ parametric simulation methods based on the multivariate (log-)Normal distribution to obtain adjusted significance levels which are consistent across different outcomes while effectively controlling the type I error rate. Next, we derive an alternative closed-form expression for the estimation of the number of non-redundant metabolic variates based on the spectral decomposition of their correlation matrix. The performance of the method is tested for different model parametrizations and across a wide range of correlation levels of the variates using synthetic and real data sets. RESULTS: Both the permutation-based formulation and the more practical closed form expression are found to give an effective indication of the number of independent metabolic effects exhibited by the system, while guaranteeing that the derived adjusted threshold is stable across outcome measures with diverse properties.
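As an illustration of estimating the number of non-redundant variates from the eigenvalues of a correlation matrix, the sketch below uses Nyholt's well-known estimator, M_eff = 1 + (M - 1)(1 - Var(λ)/M). This is a swapped-in, established formula and not the closed-form expression derived in the paper, which the abstract does not state. Because the eigenvalues of a correlation matrix sum to M and their squares sum to the squared Frobenius norm, the estimator simplifies to M + 1 - (Σ r_ij²)/M and needs no eigendecomposition. Function names are hypothetical.

```python
def effective_tests_nyholt(corr):
    """Nyholt's effective number of independent tests.

    corr: square correlation matrix as a list of lists.
    Uses M + 1 - sum(r_ij^2) / M, which equals
    1 + (M - 1) * (1 - Var(eigenvalues) / M).
    """
    m = len(corr)
    sum_squares = sum(r * r for row in corr for r in row)
    return m + 1 - sum_squares / m


def sidak_threshold(alpha, n_tests):
    """Per-test significance level controlling the family-wise error rate."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collinear = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
moderate = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]

m_eff_moderate = effective_tests_nyholt(moderate)
adjusted_alpha = sidak_threshold(0.05, m_eff_moderate)
```

The adjusted threshold always lies between the unadjusted level (fully collinear variables) and the Šidák threshold for M independent tests.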
Subjects
Metabolome , Metabolomics , Models, Biological , Genetic Markers/genetics , Metabolomics/methods , Statistical Distributions
ABSTRACT
MOTIVATION: Liquid chromatography-mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention times (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods such as in the popular XCMS package often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected in these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments. For a large-scale experiment involving many samples, the existence of misalignment becomes inevitable and concerning. RESULTS: Here, we describe an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on warping functions of neighboring samples. Validated with both realistic synthetic data and internal quality control samples, ncGTW applied to two large-scale metabolomics LC-MS datasets identifies many misaligned features and successfully realigns them. These features would otherwise be discarded or uncorrected using existing methods. The ncGTW software tool is developed currently as a plug-in to detect and realign misaligned features present in standard XCMS output. AVAILABILITY AND IMPLEMENTATION: An R package of ncGTW is freely available at Bioconductor and https://github.com/ChiungTingWu/ncGTW. A detailed user's manual and a vignette are provided within the package. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Metabolomics , Tandem Mass Spectrometry , Algorithms , Chromatography, Liquid , Proteomics , Software
ABSTRACT
Omics methodologies are widely used in toxicological research to understand modes and mechanisms of toxicity. Increasingly, these methodologies are being applied to questions of regulatory interest such as molecular point-of-departure derivation and chemical grouping/read-across. Despite its value, widespread regulatory acceptance of omics data has not yet occurred. Barriers to the routine application of omics data in regulatory decision making have been: 1) lack of transparency for data processing methods used to convert raw data into an interpretable list of observations; and 2) lack of standardization in reporting to ensure that omics data, associated metadata and the methodologies used to generate results are available for review by stakeholders, including regulators. Thus, in 2017, the Organisation for Economic Co-operation and Development (OECD) Extended Advisory Group on Molecular Screening and Toxicogenomics (EAGMST) launched a project to develop guidance for the reporting of omics data aimed at fostering further regulatory use. Here, we report on the ongoing development of the first formal reporting framework describing the processing and analysis of both transcriptomic and metabolomic data for regulatory toxicology. We introduce the modular structure, content, harmonization and strategy for trialling this reporting framework prior to its publication by the OECD.
Subjects
Metabolomics/standards , Organisation for Economic Co-operation and Development/standards , Toxicogenetics/standards , Toxicology/standards , Transcriptome/physiology , Documentation/standards , Humans
ABSTRACT
Metabolomics encompasses the systematic identification and quantification of all metabolic products in the human body. This field could provide clinicians with novel sets of diagnostic biomarkers for disease states in addition to quantifying treatment response to medications at an individualized level. This literature review aims to highlight the technology underpinning metabolic profiling, identify potential applications of metabolomics in clinical practice, and discuss the translational challenges that the field faces. We searched PubMed, MEDLINE, and EMBASE for primary and secondary research articles regarding clinical applications of metabolomics. Metabolic profiling can be performed using mass spectrometry and nuclear magnetic resonance-based techniques using a variety of biological samples. This is carried out in vivo or in vitro following careful sample collection, preparation, and analysis. The potential clinical applications constitute disruptive innovations in their respective specialities, particularly oncology and metabolic medicine. Outstanding issues currently preventing widespread clinical use are scalability of data interpretation, standardization of sample handling practice, and e-infrastructure. Routine utilization of metabolomics at a patient and population level will constitute an integral part of future healthcare provision.
Subjects
Metabolomics , Precision Medicine , Stethoscopes , Humans
ABSTRACT
BACKGROUND: Metabolomics time-course experiments provide the opportunity to understand the changes to an organism by observing the evolution of metabolic profiles in response to internal or external stimuli. Along with other omic longitudinal profiling technologies, these techniques have great potential to uncover complex relations between variations across diverse omic variables and provide unique insights into the underlying biology of the system. However, many statistical methods currently used to analyse short time-series omic data are i) prone to overfitting, ii) do not fully take into account the experimental design or iii) do not make full use of the multivariate information intrinsic to the data or iv) are unable to uncover multiple associations between different omic data. The model we propose is an attempt to i) overcome overfitting by using a weakly informative Bayesian model, ii) capture experimental design conditions through a mixed-effects model, iii) model interdependencies between variables by augmenting the mixed-effects model with a conditional auto-regressive (CAR) component and iv) identify potential associations between heterogeneous omic variables by using a horseshoe prior. RESULTS: We assess the performance of our model on synthetic and real datasets and show that it can outperform comparable models for metabolomic longitudinal data analysis. In addition, our proposed method provides the analyst with new insights on the data as it is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and find significant associations with additional omic variables. We also show through simulation that our model is fairly robust against inaccuracies in metabolite assignments. On real data, we demonstrate that the number of profiled metabolites slightly affects the predictive ability of the model. 
CONCLUSIONS: Our single-model approach to longitudinal analysis of metabolomics data supports integrative analysis and biomarker discovery simultaneously. In addition, it aids interpretation by allowing analysis at the pathway level. An accompanying R package for the model has been developed using the probabilistic programming language Stan. The package offers user-friendly functions for simulating data, fitting the model, assessing model fit and postprocessing the results. The main aim of the R package is to offer metabolomics scientists freely accessible resources for integrative longitudinal analysis, together with easy-to-use visualization functions that help applied researchers interpret results.
Subjects
Biomarkers/metabolism , Metabolomics/methods , Models, Theoretical , Bacteria/metabolism , Bayes Theorem , Metabolome
ABSTRACT
AIMS: To characterize serum metabolic signatures associated with atherosclerosis in the coronary or carotid arteries and subsequently their association with incident cardiovascular disease (CVD). METHODS AND RESULTS: We used untargeted one-dimensional (1D) serum metabolic profiling by proton nuclear magnetic resonance spectroscopy (1H NMR) among 3867 participants from the Multi-Ethnic Study of Atherosclerosis (MESA), with replication among 3569 participants from the Rotterdam and LOLIPOP studies. Atherosclerosis was assessed by coronary artery calcium (CAC) and carotid intima-media thickness (IMT). We used multivariable linear regression to evaluate associations between NMR features and atherosclerosis accounting for multiplicity of comparisons. We then examined associations between metabolites associated with atherosclerosis and incident CVD available in MESA and Rotterdam and explored molecular networks through bioinformatics analyses. Overall, 30 1H NMR measured metabolites were associated with CAC and/or IMT, P = 1.3 × 10^-14 to 1.0 × 10^-6 (discovery) and P = 5.6 × 10^-10 to 1.1 × 10^-2 (replication). These associations were substantially attenuated after adjustment for conventional cardiovascular risk factors. Metabolites associated with atherosclerosis revealed disturbances in lipid and carbohydrate metabolism, branched chain, and aromatic amino acid metabolism, as well as oxidative stress and inflammatory pathways. Analyses of incident CVD events showed inverse associations with creatine, creatinine, and phenylalanine, and direct associations with mannose, acetaminophen-glucuronide, and lactate as well as apolipoprotein B (P < 0.05). CONCLUSION: Metabolites associated with atherosclerosis were largely consistent between the two vascular beds (coronary and carotid arteries) and predominantly tag pathways that overlap with the known cardiovascular risk factors.
We present an integrated systems network that highlights a series of inter-connected pathways underlying atherosclerosis.
Subjects
Cardiovascular Diseases/etiology; Carotid Artery Diseases/complications; Carotid Artery Diseases/metabolism; Coronary Artery Disease/complications; Coronary Artery Disease/metabolism; Adult; Aged; Cardiovascular Diseases/blood; Carotid Artery Diseases/blood; Coronary Artery Disease/blood; Female; Humans; Male; Middle Aged; Prospective Studies; Proton Magnetic Resonance Spectroscopy
ABSTRACT
BACKGROUND: Transcriptomic data is often used to build statistical models which are predictive of a given phenotype, such as disease status. Genes work together in pathways and it is widely thought that pathway representations will be more robust to noise in the gene expression levels. We aimed to test this hypothesis by constructing models based on either genes alone, or based on sample specific scores for each pathway, thus transforming the data to a 'pathway space'. We progressively degraded the raw data by addition of noise and examined the ability of the models to maintain predictivity. RESULTS: Models in the pathway space indeed had higher predictive robustness than models in the gene space. This result was independent of the workflow, parameters, classifier and data set used. Surprisingly, randomised pathway mappings produced models of similar accuracy and robustness to true mappings, suggesting that the success of pathway space models is not conferred by the specific definitions of the pathway. Instead, predictive models built on the true pathway mappings led to prediction rules with fewer influential pathways than those built on randomised pathways. The extent of this effect was used to differentiate pathway collections coming from a variety of widely used pathway databases. CONCLUSIONS: Prediction models based on pathway scores are more robust to degradation of gene expression information than the equivalent models based on ungrouped genes. While models based on true pathway scores are not more robust or accurate than those based on randomised pathways, true pathways produced simpler prediction rules, emphasizing a smaller number of pathways.
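The robustness argument above can be illustrated with a minimal sketch. The abstract does not say how the sample-specific pathway scores were computed, so the mean of the member genes is used here purely as an assumption, and the function names, the synthetic data, and the random pathway memberships are all hypothetical. The sketch only compares how far features move under added Gaussian noise; it does not train a classifier or measure predictivity.

```python
import random
from statistics import mean


def pathway_scores(sample, pathways):
    """Collapse gene-level values into one score per pathway.

    Assumption: the score is the mean of the member genes; the original
    study may have used a different scoring method.
    """
    return {name: mean(sample[g] for g in genes) for name, genes in pathways.items()}


def add_noise(sample, sd, rng):
    """Return a copy of the sample with independent Gaussian noise added to each gene."""
    return {gene: value + rng.gauss(0.0, sd) for gene, value in sample.items()}


def mean_abs_shift(clean, noisy):
    """Average absolute change per feature between clean and degraded data."""
    return mean(abs(noisy[key] - clean[key]) for key in clean)


rng = random.Random(0)
genes = [f"g{i}" for i in range(200)]
sample = {gene: rng.gauss(0.0, 1.0) for gene in genes}

# Hypothetical pathways: 10 arbitrary groups of 20 genes each, i.e. a
# randomised mapping rather than a curated pathway database.
pathways = {f"p{j}": genes[j * 20:(j + 1) * 20] for j in range(10)}
clean_scores = pathway_scores(sample, pathways)

for sd in (0.5, 1.0, 2.0):
    noisy = add_noise(sample, sd, rng)
    gene_shift = mean_abs_shift(sample, noisy)
    path_shift = mean_abs_shift(clean_scores, pathway_scores(noisy, pathways))
    print(f"noise sd={sd}: gene shift={gene_shift:.3f}, pathway shift={path_shift:.3f}")
```

Because the averaging in this sketch applies to any grouping of genes, it is at least consistent with the reported finding that randomised pathway mappings were about as robust as true ones.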
Subjects
Computational Biology/methods; Gene Expression Profiling; Signal Transduction; Databases, Factual; Gene Expression; Humans; Models, Statistical; Phenotype; Transcriptome
ABSTRACT
Metabolism is altered by genetics, diet, disease status, environment, and many other factors. Modeling any one of these is often done without considering the effects of the other covariates. Attributing differences in metabolic profile to one of these factors needs to be done while controlling for the metabolic influence of the rest. We describe here a data analysis framework and novel confounder-adjustment algorithm for multivariate analysis of metabolic profiling data. Using simulated data, we show that similar numbers of true associations and significantly fewer false positives are found compared to other commonly used methods. Covariate-adjusted projections to latent structures (CA-PLS) are exemplified here using a large-scale metabolic phenotyping study of two Chinese populations at different risks for cardiovascular disease. Using CA-PLS, we find that some previously reported differences are actually associated with external factors and discover a number of previously unreported biomarkers linked to different metabolic pathways. CA-PLS can be applied to any multivariate data where confounding may be an issue, and the confounder-adjustment procedure is translatable to other multivariate regression techniques.
Subjects
Biomarkers; Confounding Factors, Epidemiologic; Metabolome; Models, Statistical; Phenotype; Algorithms; Asian People; Cardiovascular Diseases; Computer Simulation; Humans; Multivariate Analysis; Risk; Spectrum Analysis
ABSTRACT
NMR is a widely used analytical technique with a growing number of repositories available. As a result, demands for a vendor-agnostic, open data format for long-term archiving of NMR data have emerged with the aim to ease and encourage sharing, comparison, and reuse of NMR data. Here we present nmrML, an open XML-based exchange and storage format for NMR spectral data. The nmrML format is intended to be fully compatible with existing NMR data for chemical, biochemical, and metabolomics experiments. nmrML can capture raw NMR data, spectral data acquisition parameters, and where available spectral metadata, such as chemical structures associated with spectral assignments. The nmrML format is compatible with pure-compound NMR data for reference spectral libraries as well as NMR data from complex biomixtures, i.e., metabolomics experiments. To facilitate format conversions, we provide nmrML converters for Bruker, JEOL and Agilent/Varian vendor formats. In addition, easy-to-use Web-based spectral viewing, processing, and spectral assignment tools that read and write nmrML have been developed. Software libraries and Web services for data validation are available for tool developers and end-users. The nmrML format has already been adopted for capturing and disseminating NMR data for small molecules by several open source data processing tools and metabolomics reference spectral libraries, e.g., serving as storage format for the MetaboLights data repository. The nmrML open access data standard has been endorsed by the Metabolomics Standards Initiative (MSI), and we here encourage user participation and feedback to increase usability and make it a successful standard.