Results 1 - 20 of 81
1.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38383048

ABSTRACT

MOTIVATION: Random forests (RFs) can deal with a large number of variables, achieve reasonable prediction scores, and yield highly interpretable feature importance values. As such, RFs are appropriate models for feature selection and further dimension reduction. However, RFs are often not appropriate for correlated datasets due to their mode of selecting individual features for splitting. Addressing correlation relationships in high-dimensional datasets is imperative for reducing the number of variables that are assigned high importance, hence making the dimension reduction most efficient. Here, we propose the LAtent VAriable Stochastic Ensemble of Trees (LAVASET) method that derives latent variables based on the distance characteristics of each feature and aims to incorporate the correlation factor in the splitting step. RESULTS: Without compromising on performance in the majority of examples, LAVASET outperforms RF by accurately determining feature importance across all correlated variables and ensuring proper distribution of importance values. LAVASET yields mostly non-inferior prediction accuracies to traditional RFs when tested in simulated and real 1D datasets, as well as more complex and high-dimensional 3D datatypes. Unlike traditional RFs, LAVASET is unaffected by single 'important' noisy features (false positives), as it considers the local neighbourhood. LAVASET, therefore, highlights neighbourhoods of features, reflecting real signals that collectively impact the model's predictive ability. AVAILABILITY AND IMPLEMENTATION: LAVASET is freely available as a standalone package from https://github.com/melkasapi/LAVASET.

2.
Anal Chem ; 96(29): 11707-11715, 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-38990576

ABSTRACT

J-Resolved (J-Res) nuclear magnetic resonance (NMR) spectroscopy is pivotal in NMR-based metabolomics, but practitioners face a choice between time-consuming high-resolution (HR) experiments or shorter low-resolution (LR) experiments which exhibit significant peak overlap. Deep learning neural networks have been successfully used in many fields to enhance quality of natural images, especially with regard to resolution, and therefore offer the prospect of improving two-dimensional (2D) NMR data. Here, we introduce the J-RESRGAN, an adapted and modified generative adversarial network (GAN) for image super-resolution (SR), which we trained specifically for metabolomic J-Res spectra to enhance peak resolution. A novel symmetric loss function was introduced, exploiting the inherent vertical symmetry of J-Res NMR spectra. Model training used simulated high-resolution J-Res spectra of complex mixtures, with corresponding low-resolution spectra generated via blurring and down-sampling. Evaluation of peak pair resolvability on J-RESRGAN demonstrated remarkable improvement in resolution across a variety of samples. In simulated plasma data, 100% of peak pairs exhibited enhanced resolution in super-resolution spectra compared to their low-resolution counterparts. Similarly, enhanced resolution was observed in 80.8-100% of peak pairs in experimental plasma, 85.0-96.7% in urine, 94.4-98.9% in full fat milk, and 82.6-91.7% in orange juice. J-RESRGAN is not sample type, spectrometer or field strength dependent and improvements on previously acquired data can be seen in seconds on a standard desktop computer. We believe this demonstrates the promise of deep learning methods to enhance NMR metabolomic data, and in particular, the power of J-RESRGAN to elucidate overlapping peaks, advancing precision in a wide variety of NMR-based metabolomics studies. The model, J-RESRGAN, is openly accessible for download on GitHub at https://github.com/yanyan5420/J-RESRGAN.
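The symmetry idea behind the loss can be illustrated with a minimal sketch. It assumes the spectrum is stored as a list of rows along the J-coupling axis and is symmetric about its central row. The function name `symmetry_penalty` and the mean-absolute-difference form are illustrative assumptions, not the loss function defined in the paper.

```python
def symmetry_penalty(spectrum):
    """Mean absolute difference between a 2D spectrum and its mirror image
    along the row axis. Zero means the spectrum is perfectly symmetric.

    Assumption: rows index the J-coupling dimension, columns the chemical shift.
    """
    mirrored = spectrum[::-1]  # reverse the row order
    total = 0.0
    count = 0
    for row, mirrored_row in zip(spectrum, mirrored):
        for a, b in zip(row, mirrored_row):
            total += abs(a - b)
            count += 1
    return total / count


# Toy 3 x 3 "spectra" (made-up intensities).
symmetric = [[0.0, 1.0, 0.0],
             [0.0, 5.0, 2.0],
             [0.0, 1.0, 0.0]]
asymmetric = [[0.0, 1.0, 0.0],
              [0.0, 5.0, 2.0],
              [0.0, 3.0, 0.0]]

print(symmetry_penalty(symmetric))
print(symmetry_penalty(asymmetric))
```

A term of this kind could be added to a reconstruction loss so that asymmetric outputs are penalised during training.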


Subject(s)
Deep Learning , Magnetic Resonance Spectroscopy , Metabolomics , Metabolomics/methods , Magnetic Resonance Spectroscopy/methods , Animals , Humans
3.
BMC Bioinformatics ; 23(1): 481, 2022 Nov 14.
Article in English | MEDLINE | ID: mdl-36376837

ABSTRACT

BACKGROUND: Single sample pathway analysis (ssPA) transforms molecular level omics data to the pathway level, enabling the discovery of patient-specific pathway signatures. Compared to conventional pathway analysis, ssPA overcomes the limitations by enabling multi-group comparisons, alongside facilitating numerous downstream analyses such as pathway-based machine learning. While in transcriptomics ssPA is a widely used technique, there is little literature evaluating its suitability for metabolomics. Here we provide a benchmark of established ssPA methods (ssGSEA, GSVA, SVD (PLAGE), and z-score) alongside the evaluation of two novel methods we propose: ssClustPA and kPCA, using semi-synthetic metabolomics data. We then demonstrate how ssPA can facilitate pathway-based interpretation of metabolomics data by performing a case-study on inflammatory bowel disease mass spectrometry data, using clustering to determine subtype-specific pathway signatures. RESULTS: While GSEA-based and z-score methods outperformed the others in terms of recall, clustering/dimensionality reduction-based methods provided higher precision at moderate-to-high effect sizes. A case study applying ssPA to inflammatory bowel disease data demonstrates how these methods yield a much richer depth of interpretation than conventional approaches, for example by clustering pathway scores to visualise a pathway-based patient subtype-specific correlation network. We also developed the sspa python package (freely available at https://pypi.org/project/sspa/ ), providing implementations of all the methods benchmarked in this study. CONCLUSION: This work underscores the value ssPA methods can add to metabolomic studies and provides a useful reference for those wishing to apply ssPA methods to metabolomics data.
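As an illustration of what a single-sample pathway score looks like, here is a minimal sketch of one common formulation of the z-score method: standardise each metabolite across samples, then sum the z-scores of a pathway's measured members and divide by the square root of their count. The function name, data layout, and toy values are assumptions for this example; this is not the API of the sspa package.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_pathway_scores(data, pathways):
    """data: {metabolite: [abundance per sample]}
    pathways: {pathway name: [metabolite names]}
    Returns {pathway name: [score per sample]}.

    Assumes every metabolite varies across samples (non-zero standard deviation).
    """
    # Standardise each metabolite across samples.
    z = {}
    for met, values in data.items():
        mu, sd = mean(values), stdev(values)
        z[met] = [(v - mu) / sd for v in values]

    n_samples = len(next(iter(data.values())))
    scores = {}
    for name, members in pathways.items():
        present = [m for m in members if m in z]
        if not present:
            continue  # no measured metabolites map to this pathway
        k = len(present)
        scores[name] = [sum(z[m][i] for m in present) / sqrt(k)
                        for i in range(n_samples)]
    return scores


# Toy data: three samples, three metabolites (made-up values).
data = {"glucose": [1.0, 2.0, 3.0],
        "lactate": [2.0, 4.0, 6.0],
        "alanine": [3.0, 2.0, 1.0]}
pathways = {"glycolysis": ["glucose", "lactate", "pyruvate"],
            "tca_cycle": ["citrate"]}
print(zscore_pathway_scores(data, pathways))
```

The result is a samples-by-pathways matrix that can feed downstream steps such as clustering or classification.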


Subject(s)
Inflammatory Bowel Diseases , Metabolomics , Humans , Metabolomics/methods , Transcriptome , Cluster Analysis , Mass Spectrometry
4.
Int J Cancer ; 151(12): 2115-2127, 2022 Dec 15.
Article in English | MEDLINE | ID: mdl-35866293

ABSTRACT

Prostate cancer (PCa) is the most common cancer form in males in many European and American countries, but there are still open questions regarding its etiology. Untargeted metabolomics can produce an unbiased global metabolic profile, with the opportunity for uncovering new plasma metabolites prospectively associated with risk of PCa, providing insights into disease etiology. We conducted a prospective untargeted liquid chromatography-mass spectrometry (LC-MS) metabolomics analysis using prediagnostic fasting plasma samples from 752 PCa case-control pairs nested within the Northern Sweden Health and Disease Study (NSHDS). The pairs were matched by age, BMI, and sample storage time. Discriminating features were identified by a combination of orthogonal projection to latent structures-effect projections (OPLS-EP) and Wilcoxon signed-rank tests. Their prospective associations with PCa risk were investigated by conditional logistic regression. Subgroup analyses based on stratification by disease aggressiveness and baseline age were also conducted. Various free fatty acids and phospholipids were positively associated with overall risk of PCa and in various stratification subgroups. Aromatic amino acids were positively associated with overall risk of PCa. Uric acid was positively, and glucose negatively, associated with risk of PCa in the older subgroup. This is the largest untargeted LC-MS based metabolomics study to date on plasma metabolites prospectively associated with risk of developing PCa. Different subgroups of disease aggressiveness and baseline age showed different associations with metabolites. The findings suggest that shifts in plasma concentrations of metabolites in lipid, aromatic amino acid, and glucose metabolism are associated with risk of developing PCa during the following two decades.
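For readers unfamiliar with the paired test mentioned above, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test on matched case-control values. It uses a plain normal approximation without continuity or tie correction of the variance, which is a simplification chosen for brevity. The function name and the toy intensities are assumptions, and nothing here reproduces the study's actual pipeline.

```python
from math import erf, sqrt


def wilcoxon_signed_rank(cases, controls):
    """Two-sided Wilcoxon signed-rank test for matched pairs.

    Zero differences are dropped; tied absolute differences share the
    average rank. Returns (W_plus, W_minus, p_value), where the p-value
    comes from a normal approximation (rough for small samples).
    Assumes at least one non-zero difference.
    """
    diffs = [a - b for a, b in zip(cases, controls) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean_w = n * (n + 1) / 4
    sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean_w) / sd_w  # z <= 0 by construction
    p_value = 1 + erf(z / sqrt(2))  # equals 2 * Phi(z)
    return w_plus, w_minus, p_value


# Made-up feature intensities for six matched pairs.
cases = [51, 63, 70, 48, 69, 55]
controls = [49, 58, 70, 50, 60, 51]
print(wilcoxon_signed_rank(cases, controls))
```

In practice a tested library routine would be preferable; the sketch only shows how pairing enters the statistic through within-pair differences.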


Subject(s)
Fatty Acids, Nonesterified , Prostatic Neoplasms , Male , Humans , Case-Control Studies , Uric Acid , Sweden/epidemiology , Metabolomics/methods , Mass Spectrometry , Prostatic Neoplasms/diagnosis , Prostatic Neoplasms/epidemiology , Amino Acids, Aromatic , Glucose
5.
Anal Chem ; 94(8): 3446-3455, 2022 Mar 01.
Article in English | MEDLINE | ID: mdl-35180347

ABSTRACT

Untargeted metabolomics and lipidomics LC-MS experiments produce complex datasets, usually containing tens of thousands of features from thousands of metabolites whose annotation requires additional MS/MS experiments and expert knowledge. All-ion fragmentation (AIF) LC-MS/MS acquisition provides fragmentation data at no additional experimental time cost. However, analysis of such datasets requires reconstruction of parent-fragment relationships and annotation of the resulting pseudo-MS/MS spectra. Here, we propose a novel approach for automated annotation of isotopologues, adducts, and in-source fragments from AIF LC-MS datasets by combining correlation-based parent-fragment linking with molecular fragment matching. Our workflow focuses on a subset of features rather than trying to annotate the full dataset, saving time and simplifying the process. We demonstrate the workflow in three human serum datasets containing 599 features manually annotated by experts. Precision and recall values of 82-92% and 82-85%, respectively, were obtained for features found in the highest-rank scores (1-5). These results equal or outperform those obtained using MS-DIAL software, the current state of the art for AIF data annotation. Further validation for other biological matrices and different instrument types showed variable precision (60-89%) and recall (10-88%) particularly for datasets dominated by nonlipid metabolites. The workflow is freely available as an open-source R package, MetaboAnnotatoR, together with the fragment libraries from Github (https://github.com/gggraca/MetaboAnnotatoR).
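The precision and recall figures quoted above can be read with the help of a small sketch. It assumes a simple definition in which a predicted annotation counts as correct when it equals the expert annotation for the same feature. The paper may define the metrics differently (for example, over ranked candidate lists), and the feature IDs and lipid names below are made up.

```python
def precision_recall(predicted, reference):
    """predicted, reference: {feature_id: annotation}.

    precision = correct predictions / all predictions
    recall    = correct predictions / all expert-annotated features
    """
    true_pos = sum(1 for f, a in predicted.items() if reference.get(f) == a)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical expert annotations and automated predictions.
reference = {"F1": "PC(34:1)", "F2": "LPC(16:0)",
             "F3": "SM(d18:1/16:0)", "F4": "TG(52:2)"}
predicted = {"F1": "PC(34:1)", "F2": "LPC(18:0)", "F3": "SM(d18:1/16:0)"}
print(precision_recall(predicted, reference))
```

Low recall with high precision, as reported for some nonlipid-dominated datasets, would correspond to few features being annotated but most of those annotations being right.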


Subject(s)
Metabolomics , Tandem Mass Spectrometry , Chromatography, Liquid/methods , Humans , Metabolomics/methods , Software , Tandem Mass Spectrometry/methods , Workflow
6.
Metabolomics ; 18(12): 102, 2022 Dec 05.
Article in English | MEDLINE | ID: mdl-36469142

ABSTRACT

BACKGROUND: Compound identification remains a critical bottleneck in the process of exploiting Nuclear Magnetic Resonance (NMR) metabolomics data, especially for 1H 1-dimensional (1H 1D) data. As databases of reference compound spectra have grown, workflows have evolved to rely heavily on their search functions to facilitate this process by generating lists of potential metabolites found in complex mixture data, facilitating annotation and identification. However, approaches for validating and communicating annotations are most often guided by expert knowledge, and therefore are highly variable despite repeated efforts to align practices and define community standards. AIM OF REVIEW: This review is aimed at broadening the application of automated annotation tools by discussing the key ideas of spectral matching and beginning to describe a set of terms to classify this information, thus advancing standards for communicating annotation confidence. Additionally, we hope that this review will facilitate the growing collaboration between chemical data scientists, software developers and the NMR metabolomics community aiding development of long-term software solutions. KEY SCIENTIFIC CONCEPTS OF REVIEW: We begin with a brief discussion of the typical untargeted NMR identification workflow. We differentiate between annotation (hypothesis generation, filtering), and identification (hypothesis testing, verification), and note the utility of different NMR data features for annotation. We then touch on three parts of annotation: (1) generation of queries, (2) matching queries to reference data, and (3) scoring and confidence estimation of potential matches for verification. In doing so, we highlight existing approaches to automated and semi-automated annotation from the perspective of the structural information they utilize, as well as how this information can be represented computationally.
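The three parts of annotation listed above (query generation, matching, scoring) can be made concrete with a deliberately simple sketch. It assumes the query is a list of peak positions in ppm, matches each reference peak greedily to the closest unused query peak within a tolerance, and scores a candidate by the fraction of its reference peaks that were matched. The function name, the tolerance, and the peak lists are illustrative assumptions, not a method taken from the review.

```python
def match_score(query_ppm, reference_ppm, tolerance=0.02):
    """Fraction of reference peaks that have a query peak within
    `tolerance` ppm. Each query peak can be used at most once."""
    unused = sorted(query_ppm)
    matched = 0
    for ref in sorted(reference_ppm):
        # Closest query peak that has not been consumed yet.
        best = min(unused, key=lambda q: abs(q - ref), default=None)
        if best is not None and abs(best - ref) <= tolerance:
            matched += 1
            unused.remove(best)
    return matched / len(reference_ppm)


# Illustrative peak positions (ppm) for a mixture and two candidate compounds.
query = [1.33, 1.48, 3.05, 4.11, 7.20]
candidates = {"candidate_A": [1.33, 4.11],
              "candidate_B": [1.47, 3.77]}

ranking = sorted(candidates,
                 key=lambda name: match_score(query, candidates[name]),
                 reverse=True)
print(ranking)
```

A score like this only generates hypotheses (annotation); confirming them, for example with spiking or 2D experiments, is the identification step the review distinguishes from it.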


Subject(s)
Metabolomics , Software , Metabolomics/methods , Magnetic Resonance Spectroscopy/methods , Magnetic Resonance Imaging , Databases, Factual
7.
BMC Bioinformatics ; 22(1): 67, 2021 Feb 12.
Article in English | MEDLINE | ID: mdl-33579202

ABSTRACT

BACKGROUND: The search for statistically significant relationships between molecular markers and outcomes is challenging when dealing with high-dimensional, noisy and collinear multivariate omics data, such as metabolomic profiles. Permutation procedures allow for the estimation of adjusted significance levels without assuming independence among metabolomic variables. Nevertheless, the complex non-normal structure of metabolic profiles and outcomes may bias the permutation results leading to overly conservative threshold estimates i.e. lower than those from a Bonferroni or Sidak correction. METHODS: Within a univariate permutation procedure we employ parametric simulation methods based on the multivariate (log-)Normal distribution to obtain adjusted significance levels which are consistent across different outcomes while effectively controlling the type I error rate. Next, we derive an alternative closed-form expression for the estimation of the number of non-redundant metabolic variates based on the spectral decomposition of their correlation matrix. The performance of the method is tested for different model parametrizations and across a wide range of correlation levels of the variates using synthetic and real data sets. RESULTS: Both the permutation-based formulation and the more practical closed form expression are found to give an effective indication of the number of independent metabolic effects exhibited by the system, while guaranteeing that the derived adjusted threshold is stable across outcome measures with diverse properties.
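To make the comparison with Bonferroni and Sidak corrections concrete, here is a small sketch of the two standard per-test thresholds, plus the number of independent tests implied by an adjusted threshold when it is read as a Bonferroni correction. The function names and the example numbers are assumptions for illustration; the paper's closed-form estimator based on the correlation matrix is not reproduced here.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level controlling the family-wise error rate."""
    return alpha / n_tests


def sidak_threshold(alpha, n_tests):
    """Per-test level that is exact when the tests are independent."""
    return 1 - (1 - alpha) ** (1 / n_tests)


def effective_number_of_tests(alpha, adjusted_threshold):
    """Number of independent tests that would yield `adjusted_threshold`
    under a Bonferroni correction."""
    return alpha / adjusted_threshold


alpha = 0.05
n_variables = 1000  # hypothetical number of metabolic variables

print(bonferroni_threshold(alpha, n_variables))
print(sidak_threshold(alpha, n_variables))
# A hypothetical permutation-derived threshold of 1e-4 would imply far
# fewer independent tests than the 1000 variables measured.
print(effective_number_of_tests(alpha, 1e-4))
```

With correlated variables, a well-behaved adjusted threshold is expected to be larger (less strict) than the Bonferroni value, so an estimate below it signals the kind of bias the abstract describes.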


Subject(s)
Metabolome , Metabolomics , Models, Biological , Genetic Markers/genetics , Metabolomics/methods , Statistical Distributions
8.
Regul Toxicol Pharmacol ; 125: 105020, 2021 Oct.
Article in English | MEDLINE | ID: mdl-34333066

ABSTRACT

Omics methodologies are widely used in toxicological research to understand modes and mechanisms of toxicity. Increasingly, these methodologies are being applied to questions of regulatory interest such as molecular point-of-departure derivation and chemical grouping/read-across. Despite its value, widespread regulatory acceptance of omics data has not yet occurred. Barriers to the routine application of omics data in regulatory decision making have been: 1) lack of transparency for data processing methods used to convert raw data into an interpretable list of observations; and 2) lack of standardization in reporting to ensure that omics data, associated metadata and the methodologies used to generate results are available for review by stakeholders, including regulators. Thus, in 2017, the Organisation for Economic Co-operation and Development (OECD) Extended Advisory Group on Molecular Screening and Toxicogenomics (EAGMST) launched a project to develop guidance for the reporting of omics data aimed at fostering further regulatory use. Here, we report on the ongoing development of the first formal reporting framework describing the processing and analysis of both transcriptomic and metabolomic data for regulatory toxicology. We introduce the modular structure, content, harmonization and strategy for trialling this reporting framework prior to its publication by the OECD.


Subject(s)
Metabolomics/standards , Organisation for Economic Co-operation and Development/standards , Toxicogenetics/standards , Toxicology/standards , Transcriptome/physiology , Documentation/standards , Humans
9.
BMC Bioinformatics ; 21(1): 11, 2020 Jan 09.
Article in English | MEDLINE | ID: mdl-31918658

ABSTRACT

BACKGROUND: Metabolomics time-course experiments provide the opportunity to understand the changes to an organism by observing the evolution of metabolic profiles in response to internal or external stimuli. Along with other omic longitudinal profiling technologies, these techniques have great potential to uncover complex relations between variations across diverse omic variables and provide unique insights into the underlying biology of the system. However, many statistical methods currently used to analyse short time-series omic data are i) prone to overfitting, ii) do not fully take into account the experimental design or iii) do not make full use of the multivariate information intrinsic to the data or iv) are unable to uncover multiple associations between different omic data. The model we propose is an attempt to i) overcome overfitting by using a weakly informative Bayesian model, ii) capture experimental design conditions through a mixed-effects model, iii) model interdependencies between variables by augmenting the mixed-effects model with a conditional auto-regressive (CAR) component and iv) identify potential associations between heterogeneous omic variables by using a horseshoe prior. RESULTS: We assess the performance of our model on synthetic and real datasets and show that it can outperform comparable models for metabolomic longitudinal data analysis. In addition, our proposed method provides the analyst with new insights on the data as it is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and find significant associations with additional omic variables. We also show through simulation that our model is fairly robust against inaccuracies in metabolite assignments. On real data, we demonstrate that the number of profiled metabolites slightly affects the predictive ability of the model. 
CONCLUSIONS: Our single-model approach to longitudinal analysis of metabolomics data supports integrative analysis and biomarker discovery simultaneously. In addition, it aids interpretation by allowing analysis at the pathway level. An accompanying R package for the model has been developed using the probabilistic programming language Stan. The package offers user-friendly functions for simulating data, fitting the model, assessing model fit and postprocessing the results. The main aim of the R package is to offer metabolomics scientists freely accessible resources for integrative longitudinal analysis, along with easy-to-use visualization functions that help applied researchers interpret results.


Subject(s)
Biomarkers/metabolism , Metabolomics/methods , Models, Theoretical , Bacteria/metabolism , Bayes Theorem , Metabolome
10.
Eur Heart J ; 40(34): 2883-2896, 2019 Sep 07.
Article in English | MEDLINE | ID: mdl-31102408

ABSTRACT

AIMS: To characterize serum metabolic signatures associated with atherosclerosis in the coronary or carotid arteries and subsequently their association with incident cardiovascular disease (CVD). METHODS AND RESULTS: We used untargeted one-dimensional (1D) serum metabolic profiling by proton nuclear magnetic resonance spectroscopy (1H NMR) among 3867 participants from the Multi-Ethnic Study of Atherosclerosis (MESA), with replication among 3569 participants from the Rotterdam and LOLIPOP studies. Atherosclerosis was assessed by coronary artery calcium (CAC) and carotid intima-media thickness (IMT). We used multivariable linear regression to evaluate associations between NMR features and atherosclerosis accounting for multiplicity of comparisons. We then examined associations between metabolites associated with atherosclerosis and incident CVD available in MESA and Rotterdam and explored molecular networks through bioinformatics analyses. Overall, 30 1H NMR measured metabolites were associated with CAC and/or IMT, P = 1.3 × 10^-14 to 1.0 × 10^-6 (discovery) and P = 5.6 × 10^-10 to 1.1 × 10^-2 (replication). These associations were substantially attenuated after adjustment for conventional cardiovascular risk factors. Metabolites associated with atherosclerosis revealed disturbances in lipid and carbohydrate metabolism, branched chain, and aromatic amino acid metabolism, as well as oxidative stress and inflammatory pathways. Analyses of incident CVD events showed inverse associations with creatine, creatinine, and phenylalanine, and direct associations with mannose, acetaminophen-glucuronide, and lactate as well as apolipoprotein B (P < 0.05). CONCLUSION: Metabolites associated with atherosclerosis were largely consistent between the two vascular beds (coronary and carotid arteries) and predominantly tag pathways that overlap with the known cardiovascular risk factors. We present an integrated systems network that highlights a series of inter-connected pathways underlying atherosclerosis.


Subject(s)
Cardiovascular Diseases/etiology , Carotid Artery Diseases/complications , Carotid Artery Diseases/metabolism , Coronary Artery Disease/complications , Coronary Artery Disease/metabolism , Adult , Aged , Cardiovascular Diseases/blood , Carotid Artery Diseases/blood , Coronary Artery Disease/blood , Female , Humans , Male , Middle Aged , Prospective Studies , Proton Magnetic Resonance Spectroscopy
11.
BMC Bioinformatics ; 20(1): 543, 2019 Nov 04.
Article in English | MEDLINE | ID: mdl-31684857

ABSTRACT

BACKGROUND: Transcriptomic data is often used to build statistical models which are predictive of a given phenotype, such as disease status. Genes work together in pathways and it is widely thought that pathway representations will be more robust to noise in the gene expression levels. We aimed to test this hypothesis by constructing models based on either genes alone, or based on sample specific scores for each pathway, thus transforming the data to a 'pathway space'. We progressively degraded the raw data by addition of noise and examined the ability of the models to maintain predictivity. RESULTS: Models in the pathway space indeed had higher predictive robustness than models in the gene space. This result was independent of the workflow, parameters, classifier and data set used. Surprisingly, randomised pathway mappings produced models of similar accuracy and robustness to true mappings, suggesting that the success of pathway space models is not conferred by the specific definitions of the pathway. Instead, predictive models built on the true pathway mappings led to prediction rules with fewer influential pathways than those built on randomised pathways. The extent of this effect was used to differentiate pathway collections coming from a variety of widely used pathway databases. CONCLUSIONS: Prediction models based on pathway scores are more robust to degradation of gene expression information than the equivalent models based on ungrouped genes. While models based on true pathway scores are not more robust or accurate than those based on randomised pathways, true pathways produced simpler prediction rules, emphasizing a smaller number of pathways.


Subject(s)
Computational Biology/methods , Gene Expression Profiling , Signal Transduction , Databases, Factual , Gene Expression , Humans , Models, Statistical , Phenotype , Transcriptome
12.
J Proteome Res ; 17(4): 1586-1595, 2018 Apr 06.
Article in English | MEDLINE | ID: mdl-29457906

ABSTRACT

Metabolism is altered by genetics, diet, disease status, environment, and many other factors. Modeling either one of these is often done without considering the effects of the other covariates. Attributing differences in metabolic profile to one of these factors needs to be done while controlling for the metabolic influence of the rest. We describe here a data analysis framework and novel confounder-adjustment algorithm for multivariate analysis of metabolic profiling data. Using simulated data, we show that similar numbers of true associations and significantly less false positives are found compared to other commonly used methods. Covariate-adjusted projections to latent structures (CA-PLS) are exemplified here using a large-scale metabolic phenotyping study of two Chinese populations at different risks for cardiovascular disease. Using CA-PLS, we find that some previously reported differences are actually associated with external factors and discover a number of previously unreported biomarkers linked to different metabolic pathways. CA-PLS can be applied to any multivariate data where confounding may be an issue and the confounder-adjustment procedure is translatable to other multivariate regression techniques.


Subject(s)
Biomarkers , Confounding Factors, Epidemiologic , Metabolome , Models, Statistical , Phenotype , Algorithms , Asian People , Cardiovascular Diseases , Computer Simulation , Humans , Multivariate Analysis , Risk , Spectrum Analysis
13.
Anal Chem ; 90(1): 649-656, 2018 Jan 02.
Article in English | MEDLINE | ID: mdl-29035042

ABSTRACT

NMR is a widely used analytical technique with a growing number of repositories available. As a result, demands for a vendor-agnostic, open data format for long-term archiving of NMR data have emerged with the aim to ease and encourage sharing, comparison, and reuse of NMR data. Here we present nmrML, an open XML-based exchange and storage format for NMR spectral data. The nmrML format is intended to be fully compatible with existing NMR data for chemical, biochemical, and metabolomics experiments. nmrML can capture raw NMR data, spectral data acquisition parameters, and where available spectral metadata, such as chemical structures associated with spectral assignments. The nmrML format is compatible with pure-compound NMR data for reference spectral libraries as well as NMR data from complex biomixtures, i.e., metabolomics experiments. To facilitate format conversions, we provide nmrML converters for Bruker, JEOL and Agilent/Varian vendor formats. In addition, easy-to-use Web-based spectral viewing, processing, and spectral assignment tools that read and write nmrML have been developed. Software libraries and Web services for data validation are available for tool developers and end-users. The nmrML format has already been adopted for capturing and disseminating NMR data for small molecules by several open source data processing tools and metabolomics reference spectral libraries, e.g., serving as storage format for the MetaboLights data repository. The nmrML open access data standard has been endorsed by the Metabolomics Standards Initiative (MSI), and we here encourage user participation and feedback to increase usability and make it a successful standard.


Subject(s)
Databases, Chemical/standards , Magnetic Resonance Spectroscopy/statistics & numerical data , Metabolomics/methods , Software
14.
Metabolomics ; 14(3): 32, 2018 Feb 13.
Article in English | MEDLINE | ID: mdl-30830335

ABSTRACT

INTRODUCTION: Differences in the metabolite profiles between serum and plasma are incompletely understood. OBJECTIVES: To evaluate metabolic profile differences between serum and plasma and among plasma sample subtypes. METHODS: We analyzed serum, platelet rich plasma (PRP), platelet poor plasma (PPP), and platelet free plasma (PFP), collected from 8 non-fasting apparently healthy women, using untargeted standard 1D and CPMG 1H NMR and reverse phase and hydrophilic (HILIC) UPLC-MS. Differences between metabolic profiles were evaluated using validated principal component and orthogonal partial least squares discriminant analysis. RESULTS: Explorative analysis showed the main source of variation among samples was due to inter-individual differences with no grouping by sample type. After correcting for inter-individual differences, lipoproteins, lipids in VLDL/LDL, lactate, glutamine, and glucose were found to discriminate serum from plasma in NMR analyses. In UPLC-MS analyses, lysophosphatidylethanolamine (lysoPE)(18:0) and lysophosphatidic acid(20:0) were higher in serum, and phosphatidylcholines (PC)(16:1/18:2, 20:3/18:0, O-20:0/22:4), lysoPC(16:0), PE(O-18:2/20:4), sphingomyelin(18:0/22:0), and linoleic acid were lower. In plasma subtype analyses, isoleucine, leucine, valine, phenylalanine, glutamate, and pyruvate were higher among PRP samples compared with PPP and PFP by NMR while lipids in VLDL/LDL, citrate, and glutamine were lower. By UPLC-MS, PE(18:0/18:2) and PC(P-16:0/20:4) were higher in PRP compared with PFP samples. CONCLUSIONS: Correction for inter-individual variation was required to detect metabolite differences between serum and plasma. Our results suggest the potential importance of inter-individual effects and sample type on the results from serum and plasma metabolic phenotyping studies.
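One simple way to see why a correction for inter-individual differences can expose a sample-type effect is within-individual mean centring: each measurement is expressed relative to that person's own average across sample types. The sketch below uses this swapped-in technique purely for illustration; it is an assumption and not necessarily the correction used in the study. The function name and the values are made up.

```python
def center_within_individual(values):
    """values: {individual: {sample_type: measurement}}.
    Subtracts each individual's mean across her sample types, removing
    the between-person offset while keeping within-person contrasts."""
    centered = {}
    for person, by_type in values.items():
        mu = sum(by_type.values()) / len(by_type)
        centered[person] = {t: v - mu for t, v in by_type.items()}
    return centered


# Made-up metabolite levels: the offset between the two people (10 units)
# dwarfs the serum-versus-plasma difference (2 units).
raw = {"P1": {"serum": 10.0, "PRP": 8.0},
       "P2": {"serum": 20.0, "PRP": 18.0}}
print(center_within_individual(raw))
```

After centring, both individuals show the same serum-minus-plasma contrast, which the raw values obscure.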


Subject(s)
Metabolome , Plasma/chemistry , Serum/chemistry , Adult , Amino Acids/analysis , Blood Glucose/analysis , Female , Humans , Lipids/analysis , Lipoproteins/analysis , Mass Spectrometry , Middle Aged , Proton Magnetic Resonance Spectroscopy
15.
J Proteome Res ; 16(10): 3623-3633, 2017 Oct 06.
Article in English | MEDLINE | ID: mdl-28823158

ABSTRACT

1H NMR spectroscopy of biofluids generates reproducible data allowing detection and quantification of small molecules in large population cohorts. Statistical models to analyze such data are now well-established, and the use of univariate metabolome wide association studies (MWAS) investigating the spectral features separately has emerged as a computationally efficient and interpretable alternative to multivariate models. The MWAS rely on the accurate estimation of a metabolome wide significance level (MWSL) to be applied to control the family wise error rate. Subsequent interpretation requires efficient visualization and formal feature annotation, which, in-turn, call for efficient prioritization of spectral variables of interest. Using human serum 1H NMR spectroscopic profiles from 3948 participants from the Multi-Ethnic Study of Atherosclerosis (MESA), we have performed a series of MWAS for serum levels of glucose. We first propose an extension of the conventional MWSL that yields stable estimates of the MWSL across the different model parameterizations and distributional features of the outcome. We propose both efficient visualization methods and a strategy based on subsampling and internal validation to prioritize the associations. Our work proposes and illustrates practical and scalable solutions to facilitate the implementation of the MWAS approach and improve interpretation in large cohort studies.
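A permutation-based significance level of this kind is commonly estimated from the minimum p-value observed across all spectral variables in each permutation of the outcome. The sketch below assumes those minimum p-values have already been computed and shows only the final step, taking their empirical alpha-quantile. The function name, the quantile convention, and the numbers are assumptions for illustration, not the extension proposed in the paper.

```python
def mwsl_from_min_pvalues(min_pvalues, alpha=0.05):
    """Empirical alpha-quantile of per-permutation minimum p-values.

    Convention (an assumption): with n permutations, return the
    floor(alpha * n)-th smallest value, or the smallest one if that
    index would be zero. Assumes a non-empty input.
    """
    ordered = sorted(min_pvalues)
    k = max(int(alpha * len(ordered)) - 1, 0)
    return ordered[k]


# Hypothetical minimum p-values from 40 permutations of the outcome.
min_p = [i * 1e-4 for i in range(40, 0, -1)]
print(mwsl_from_min_pvalues(min_p, alpha=0.05))
```

Many more permutations than 40 would be needed in practice for a stable estimate of such an extreme quantile.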


Subject(s)
Atherosclerosis/blood , Metabolome/genetics , Metabolomics , Adult , Aged , Aged, 80 and over , Atherosclerosis/epidemiology , Atherosclerosis/pathology , Blood Glucose , Cohort Studies , Female , Humans , Male , Middle Aged , Proton Magnetic Resonance Spectroscopy
17.
J Proteome Res ; 15(12): 4188-4194, 2016 Dec 02.
Article in English | MEDLINE | ID: mdl-27628670

ABSTRACT

Large-scale metabolomics studies involving thousands of samples present multiple challenges in data analysis, particularly when an untargeted platform is used. Studies with multiple cohorts and analysis platforms exacerbate existing problems such as peak alignment and normalization. Therefore, there is a need for robust processing pipelines that can ensure reliable data for statistical analysis. The COMBI-BIO project incorporates serum from ∼8000 individuals, in three cohorts, profiled by six assays in two phases using both 1H NMR and UPLC-MS. Here we present the COMBI-BIO NMR analysis pipeline and demonstrate its fitness for purpose using representative quality control (QC) samples. NMR spectra were first aligned and normalized. After eliminating interfering signals, outliers identified using Hotelling's T2 were removed and a cohort/phase adjustment was applied, resulting in two NMR data sets (CPMG and NOESY). Alignment of the NMR data was shown to increase the correlation-based alignment quality measure from 0.319 to 0.391 for CPMG and from 0.536 to 0.586 for NOESY, showing that the improvement was present across both large and small peaks. End-to-end quality assessment of the pipeline was achieved using Hotelling's T2 distributions. For CPMG spectra, the interquartile range decreased from 1.425 in raw QC data to 0.679 in processed spectra, while the corresponding change for NOESY spectra was from 0.795 to 0.636, indicating an improvement in precision following processing. PCA indicated that gross phase and cohort differences were no longer present. These results illustrate that the pipeline produces robust and reproducible data, successfully addressing the methodological challenges of this large multifaceted study.
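The quality measure described above, the spread of Hotelling's T2 values among QC samples, can be sketched as follows. The example assumes mean-centred PCA scores are already available for each QC sample and relies on the fact that PCA scores are uncorrelated, so T2 reduces to a sum of squared scores divided by the per-component variance. The function names and the toy scores are assumptions, not the COMBI-BIO code.

```python
from statistics import quantiles, variance


def hotelling_t2(scores):
    """scores: list of samples, each a list of mean-centred PCA scores.
    With uncorrelated components, T2_i = sum_a t_ia**2 / var(t_a)."""
    n_comp = len(scores[0])
    var = [variance([s[a] for s in scores]) for a in range(n_comp)]
    return [sum(s[a] ** 2 / var[a] for a in range(n_comp)) for s in scores]


def interquartile_range(values):
    """Difference between the third and first quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Toy two-component scores for five QC samples (made-up values).
qc_scores = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [2.0, -1.0]]
t2 = hotelling_t2(qc_scores)
print(t2)
print(interquartile_range(t2))
```

A smaller interquartile range after processing indicates that repeated measurements of the same QC material sit closer together in the model space, which is how the abstract reads the drop from 1.425 to 0.679 for CPMG spectra.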


Subject(s)
Data Interpretation, Statistical , Metabolomics/methods , Proton Magnetic Resonance Spectroscopy/methods , Humans , Metabolomics/instrumentation , Metabolomics/statistics & numerical data , Molecular Epidemiology , Proton Magnetic Resonance Spectroscopy/standards , Proton Magnetic Resonance Spectroscopy/statistics & numerical data , Quality Control , Reproducibility of Results , Workflow
18.
Anal Chem ; 88(5): 2583-9, 2016 Mar 01.
Article in English | MEDLINE | ID: mdl-26824414

ABSTRACT

NMR spectroscopy and mass spectrometry are the two major analytical platforms for metabolomics, and both generate substantial data with hundreds to thousands of observed peaks for a single sample. Many of these are unknown, and peak assignment is generally complex and time-consuming. Statistical correlations between data types have proven useful in expediting this process, for example, in prioritizing candidate assignments. However, this approach has not been formally assessed for the comparison of direct-infusion mass spectrometry (DIMS) and NMR data. Here, we present a systematic analysis of a sample set (tissue extracts), and the utility of a simple correlation threshold to aid metabolite identification. The correlations were surprisingly successful in linking structurally related signals, with 15 of 26 NMR-detectable metabolites having their highest correlation to a cognate MS ion. However, we found that the distribution of the correlations was highly dependent on the nature of the MS ion, such as the adduct type. This approach should help to alleviate this important bottleneck where both 1D NMR and DIMS data sets have been collected.
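
The idea of using a simple correlation threshold to link NMR signals to MS ions can be sketched as follows. This is a minimal illustration, not the authors' procedure. The threshold of 0.8, the dictionary layout, and the names `pearson` and `best_ms_matches` are all assumptions, and each feature is assumed to vary across samples (a constant feature would give a zero denominator).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def best_ms_matches(nmr, ms, threshold=0.8):
    """For each NMR signal, find the MS ion with the highest correlation.

    `nmr` and `ms` map feature names to intensities across the same samples.
    Returns {nmr_name: (ion_name, r)} or {nmr_name: None} when the best
    correlation falls below `threshold` (an illustrative value).
    """
    matches = {}
    for nmr_name, nmr_values in nmr.items():
        ion, r = max(
            ((name, pearson(nmr_values, values)) for name, values in ms.items()),
            key=lambda pair: pair[1],
        )
        matches[nmr_name] = (ion, r) if r >= threshold else None
    return matches
```

A candidate assignment would then be prioritized when the best-correlated ion is a plausible cognate of the NMR-detected metabolite. As the abstract notes, the correlation distribution depends on the nature of the MS ion, so a single fixed threshold is a simplification.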


Subject(s)
Magnetic Resonance Spectroscopy , Mass Spectrometry , Oligochaeta/metabolism , Tissue Extracts/analysis , Animals , Lysine/analogs & derivatives , Lysine/analysis , Metabolomics , Oligochaeta/chemistry , Serine/analogs & derivatives , Serine/analysis , Statistics as Topic , Succinic Acid/analysis
19.
Anal Chem ; 88(10): 5179-88, 2016 May 17.
Article in English | MEDLINE | ID: mdl-27116637

ABSTRACT

Estimation of statistical power and sample size is a key aspect of experimental design. However, in metabolic phenotyping, there is currently no accepted approach for these tasks, in large part due to the unknown nature of the expected effect. In such hypothesis-free science, neither the number or class of important analytes nor the effect size are known a priori. We introduce a new approach, based on multivariate simulation, which deals effectively with the highly correlated structure and high-dimensionality of metabolic phenotyping data. First, a large data set is simulated based on the characteristics of a pilot study investigating a given biomedical issue. An effect of a given size, corresponding either to a discrete (classification) or continuous (regression) outcome is then added. Different sample sizes are modeled by randomly selecting data sets of various sizes from the simulated data. We investigate different methods for effect detection, including univariate and multivariate techniques. Our framework allows us to investigate the complex relationship between sample size, power, and effect size for real multivariate data sets. For instance, we demonstrate for an example pilot data set that certain features achieve a power of 0.8 for a sample size of 20 samples or that a cross-validated predictivity Q²Y of 0.8 is reached with an effect size of 0.2 and 200 samples. We exemplify the approach for both nuclear magnetic resonance and liquid chromatography-mass spectrometry data from humans and the model organism C. elegans.
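
The simulate, add an effect, then subsample loop described in this abstract can be illustrated for a single feature. The sketch below is a deliberately simplified univariate stand-in for the multivariate method. It assumes standard-normal data with known unit variance, a two-sided z-test, and a simulated pool of 5000 values per group. The function name `estimate_power` and all default values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def estimate_power(effect_size, n_per_group, n_repeats=500, alpha=0.05, seed=0):
    """Estimate power by repeatedly subsampling a large simulated data set.

    Simplifying assumptions (not from the abstract): one feature, unit
    variance in both groups, and a two-sided z-test with known variance.
    """
    rng = random.Random(seed)
    pool_size = 5000
    # Step 1: simulate a large data set; step 2: add an effect to one group.
    controls = [rng.gauss(0.0, 1.0) for _ in range(pool_size)]
    cases = [rng.gauss(effect_size, 1.0) for _ in range(pool_size)]
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = (2 / n_per_group) ** 0.5
    detections = 0
    # Step 3: draw random subsets of the requested size and test each one.
    for _ in range(n_repeats):
        a = rng.sample(controls, n_per_group)
        b = rng.sample(cases, n_per_group)
        z = (mean(b) - mean(a)) / standard_error
        if abs(z) > z_crit:
            detections += 1
    return detections / n_repeats
```

Calling `estimate_power` over a grid of effect sizes and sample sizes gives a power surface for that one feature. Extending this to correlated, high-dimensional data and to multivariate detection methods is the part the published approach addresses and this sketch does not.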


Subject(s)
Metabolome , Metabolomics/statistics & numerical data , Multivariate Analysis , Animals , Caenorhabditis elegans , Datasets as Topic/statistics & numerical data , Humans , Models, Statistical , Preliminary Data , Sample Size
20.
Bioinformatics ; 31(1): 102-8, 2015 Jan 01.
Article in English | MEDLINE | ID: mdl-25183485

ABSTRACT

MOTIVATION: Often during the analysis of biological data, it is of importance to interpret the correlation structure that exists between variables. Such correlations may reveal patterns of co-regulation that are indicative of biochemical pathways or common mechanisms of response to a related set of treatments. However, analyses of correlations are usually conducted by either subjective interpretation of the univariate covariance matrix or by applying multivariate modeling techniques, which do not take prior biological knowledge into account. Over-representation analysis (ORA) is a simple method for objectively deciding whether a set of variables of known or suspected biological relevance, such as a gene set or pathway, is more prevalent in a set of variables of interest than we expect by chance. However, ORA is usually applied to a set of variables differentiating a single experimental variable and does not take into account correlations. RESULTS: Over-representation of correlation analysis (ORCA) is a novel combination of ORA and correlation analysis that provides a means to test whether more associations exist between two specific groups of variables than expected by chance. The method is exemplified by application to drug sensitivity and microRNA expression data from a panel of cancer cell lines (NCI60). ORCA highlighted a previously reported correlation between sensitivity to alkylating anticancer agents and topoisomerase inhibitors. We also used this approach to validate microRNA clusters predicted by mRNA correlations. These observations suggest that ORCA has the potential to reveal novel insights from these data, which are not readily apparent using classical ORA. AVAILABILITY AND IMPLEMENTATION: The R code of the method is available at https://github.com/ORCABioinfo/ORCAcode.
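
Classical ORA is commonly framed as a one-sided hypergeometric test, which can be sketched as below. This shows the generic over-representation calculation only. It is not the published ORCA code (available at the URL given in the abstract), and the function name `overrepresentation_p` and its parameter names are hypothetical.

```python
from math import comb


def overrepresentation_p(total, in_set, drawn, overlap):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    total   -- number of items in the background population
    in_set  -- number of background items belonging to the set of interest
    drawn   -- number of items selected (for example, flagged as significant)
    overlap -- number of selected items that belong to the set of interest
    """
    upper = min(in_set, drawn)
    tail = sum(
        comb(in_set, i) * comb(total - in_set, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(total, drawn)
```

One possible way to carry this over to correlations, stated here purely as an assumption, is to treat variable pairs as the items: `total` is the number of candidate pairs, `in_set` the pairs linking the two groups of interest, `drawn` the pairs whose correlation passes a chosen threshold, and `overlap` the pairs that satisfy both. A small p-value would then suggest more between-group associations than expected by chance.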


Subject(s)
Biomarkers, Tumor/analysis , Computational Biology/methods , Datasets as Topic , MicroRNAs/genetics , Molecular Sequence Annotation/methods , Neoplasms/genetics , Alkylating Agents/pharmacology , Data Interpretation, Statistical , Databases, Factual , Enzyme Inhibitors/pharmacology , Gene Expression Profiling , Genomics , Humans , Neoplasms/drug therapy , Tumor Cells, Cultured