ABSTRACT
As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integration methods typically output lists, clusters, or subnetworks of molecules related to an outcome. Even with expert domain knowledge, discerning the biological processes involved is a time-consuming activity. Here we propose PathIntegrate, a method for integrating multi-omics datasets based on pathways, designed to exploit knowledge of biological systems and thus provide interpretable models for such studies. PathIntegrate employs single-sample pathway analysis to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data. Model outputs include multi-omics pathways ranked by their contribution to the outcome prediction, the contribution of each omics layer, and the importance of each molecule in a pathway. Using semi-synthetic data we demonstrate the benefit of grouping molecules into pathways to detect signals in low signal-to-noise scenarios, as well as the ability of PathIntegrate to precisely identify important pathways at low effect sizes. Finally, using COPD and COVID-19 data we showcase how PathIntegrate enables convenient integration and interpretation of complex high-dimensional multi-omics datasets. PathIntegrate is available as an open-source Python package.
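The molecular-to-pathway transformation described above can be illustrated with a minimal sketch. The abstract does not say which single-sample pathway analysis method is used, so the mean z-score below is only a stand-in, and `pathway_scores`, the toy samples, and the pathway names are hypothetical rather than part of the PathIntegrate package.

```python
from statistics import mean, pstdev


def zscore_columns(data):
    """Standardize each molecule across samples (one dict per sample)."""
    molecules = list(data[0].keys())
    stats = {}
    for m in molecules:
        values = [row[m] for row in data]
        sd = pstdev(values)
        # Guard against constant molecules to avoid division by zero.
        stats[m] = (mean(values), sd if sd > 0 else 1.0)
    return [{m: (row[m] - stats[m][0]) / stats[m][1] for m in molecules} for row in data]


def pathway_scores(data, pathways):
    """Return, per sample, the mean z-score of each pathway's member molecules.

    Assumption: the mean z-score stands in for whichever single-sample
    pathway analysis method is actually applied.
    """
    scores = []
    for row in zscore_columns(data):
        sample_scores = {}
        for name, members in pathways.items():
            present = [row[m] for m in members if m in row]
            if present:
                sample_scores[name] = mean(present)
        scores.append(sample_scores)
    return scores


# Hypothetical toy data: three samples, molecules from two omics layers.
samples = [
    {"geneA": 1.0, "geneB": 2.0, "metX": 0.5},
    {"geneA": 2.0, "geneB": 4.0, "metX": 0.7},
    {"geneA": 3.0, "geneB": 6.0, "metX": 0.9},
]
pathways = {"pathway_1": ["geneA", "metX"], "pathway_2": ["geneB"]}
scores = pathway_scores(samples, pathways)
```

The resulting sample-by-pathway table is what a downstream predictive model would consume in place of the molecule-level matrix.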
Subject(s)
Genomics , Multiomics , Genomics/methods
ABSTRACT
Sparse multiple canonical correlation network analysis (SmCCNet) is a machine learning technique for integrating omics data along with a variable of interest (e.g., phenotype of complex disease), and reconstructing multi-omics networks that are specific to this variable. We present the second-generation SmCCNet (SmCCNet 2.0) that adeptly integrates single or multiple omics data types along with a quantitative or binary phenotype of interest. In addition, this new package offers a streamlined setup process that can be configured manually or automatically, ensuring a flexible and user-friendly experience. AVAILABILITY: This package is available on both CRAN (https://cran.r-project.org/web/packages/SmCCNet/index.html) and GitHub (https://github.com/KechrisLab/SmCCNet) under the MIT license. The network visualization tool is available at https://smccnet.shinyapps.io/smccnetnetwork/.
Subject(s)
Machine Learning , Software , Genomics/methods , Gene Regulatory Networks , Computational Biology/methods , Humans , Multiomics
ABSTRACT
BACKGROUND: Studies have identified individual blood biomarkers associated with chronic obstructive pulmonary disease (COPD) and related phenotypes. However, complex diseases such as COPD typically involve changes in multiple molecules with interconnections that may not be captured when considering single molecular features. METHODS: Leveraging proteomic data from 3,173 COPDGene Non-Hispanic White (NHW) and African American (AA) participants, we applied sparse multiple canonical correlation network analysis (SmCCNet) to 4,776 proteins assayed on the SomaScan v4.0 platform to derive sparse networks of proteins associated with current vs. former smoking status, airflow obstruction, and emphysema quantitated from high-resolution computed tomography scans. We then used NetSHy, a dimension reduction technique leveraging network topology, to produce summary scores of each proteomic network, referred to as NetSHy scores. We next performed a genome-wide association study (GWAS) to identify variants associated with the NetSHy scores, or network quantitative trait loci (nQTLs). Finally, we evaluated the replicability of the networks in an independent cohort, SPIROMICS. RESULTS: We identified networks of 13 to 104 proteins for each phenotype and exposure in NHW and AA, and the derived NetSHy scores were significantly associated with the variables of interest. Networks included known (sRAGE, ALPP, MIP1) and novel molecules (CA10, CPB1, HIS3, PXDN) and interactions involved in COPD pathogenesis. We observed 7 nQTL loci associated with NetSHy scores, 4 of which remained after conditional analysis. Networks for smoking status and emphysema, but not airflow obstruction, demonstrated a high degree of replicability across race groups and cohorts. CONCLUSIONS: In this work, we apply state-of-the-art molecular network generation and summarization approaches to proteomic data from COPDGene participants to uncover protein networks associated with COPD phenotypes.
We further identify genetic associations with networks. This work discovers protein networks containing known and novel proteins and protein interactions associated with clinically relevant COPD phenotypes across race groups and cohorts.
Subject(s)
Genome-Wide Association Study , Proteomics , Pulmonary Disease, Chronic Obstructive , Smoking , Humans , Pulmonary Disease, Chronic Obstructive/genetics , Smoking/genetics , Male , Female , Middle Aged , Aged , Quantitative Trait Loci , Phenotype , Polymorphism, Single Nucleotide , Genetic Variation
ABSTRACT
MOTIVATION: Biological networks can provide a system-level understanding of underlying processes. In many contexts, networks have a high degree of modularity, i.e. they consist of subsets of nodes, often known as subnetworks or modules, which are highly interconnected and may perform separate functions. To perform subsequent analyses that investigate the association between an identified module and a variable of interest, a module summarization that best explains the module's information and reduces dimensionality is often needed. Conventional approaches for obtaining network representation typically rely only on the profiles of the nodes within the network while disregarding the inherent network topological information. RESULTS: In this article, we propose NetSHy, a hybrid approach which is capable of reducing the dimension of a network while incorporating topological properties to aid the interpretation of the downstream analyses. In particular, NetSHy applies principal component analysis (PCA) on a combination of the node profiles and the well-known Laplacian matrix derived directly from the network similarity matrix to extract a summarization at a subject level. Simulation scenarios based on random and empirical networks at varying network sizes and sparsity levels show that NetSHy outperforms the conventional PCA approach applied directly on node profiles, in terms of recovering the true correlation with a phenotype of interest and maintaining a higher amount of explained variation in the data when networks are relatively sparse. The robustness of NetSHy is also demonstrated by a more consistent correlation with the observed phenotype as the sample size decreases. Lastly, a genome-wide association study is performed as an application of a downstream analysis, where NetSHy summarization scores on the biological networks identify more significant single nucleotide polymorphisms than the conventional network representation.
AVAILABILITY AND IMPLEMENTATION: R code implementation of NetSHy is available at https://github.com/thaovu1/NetSHy. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
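As a rough illustration of a topology-aware summary score, the sketch below multiplies the subject-by-node profile matrix by the graph Laplacian and takes the first principal component. The abstract only says that PCA is applied to "a combination" of the two, so the matrix product is an assumption here, and `network_summary` and the toy inputs are hypothetical, not the published R implementation.

```python
def laplacian(adjacency):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix with zero diagonal."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def center_columns(x):
    """Subtract each column's mean."""
    means = [sum(col) / len(col) for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def first_pc_scores(x, iterations=200):
    """Scores on the first principal component, found by power iteration."""
    xc = center_columns(x)
    p = len(xc[0])
    # X^T X has the same leading eigenvector as the covariance matrix.
    xtx = matmul([list(col) for col in zip(*xc)], xc)
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(xtx[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:
            break
        v = [c / norm for c in w]
    return [sum(a * b for a, b in zip(row, v)) for row in xc]


def network_summary(profiles, adjacency):
    """Assumed combination: profiles times Laplacian, then PC1 scores per subject."""
    return first_pc_scores(matmul(profiles, laplacian(adjacency)))


# Hypothetical toy network: 4 subjects, 3 nodes connected as a path.
profiles = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.0], [0.0, 1.0, 4.0], [3.0, 3.0, 1.0]]
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
summary = network_summary(profiles, adjacency)
```

The sign of a principal component is arbitrary, so only the relative ordering and spread of the scores are meaningful.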
Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Computer Simulation , Principal Component Analysis , Sample Size
ABSTRACT
Inferring gene co-expression networks is a useful process for understanding gene regulation and pathway activity. The networks are usually undirected graphs where genes are represented as nodes and an edge represents a significant co-expression relationship. When expression data of multiple (p) genes in multiple (K) conditions (e.g., treatments, tissues, strains) are available, joint estimation of networks harnessing shared information across them can significantly increase the power of analysis. In addition, examining condition-specific patterns of co-expression can provide insights into the underlying cellular processes activated in a particular condition. Condition adaptive fused graphical lasso (CFGL) is an existing method that incorporates condition specificity in a fused graphical lasso (FGL) model for estimating multiple co-expression networks. However, with computational complexity of O(p^2 K log K), the current implementation of CFGL is prohibitively slow even for a moderate number of genes and can only be used for a maximum of three conditions. In this paper, we propose a faster alternative of CFGL named rapid condition adaptive fused graphical lasso (RCFGL). In RCFGL, we incorporate the condition specificity into another popular model for joint network estimation, known as fused multiple graphical lasso (FMGL). We use a more efficient algorithm in the iterative steps compared to CFGL, enabling faster computation with complexity of O(p^2 K) and making it easily generalizable for more than three conditions. We also present a novel screening rule to determine if the full network estimation problem can be broken down into estimation of smaller disjoint sub-networks, thereby reducing the complexity further. We demonstrate the computational advantage and superior performance of our method compared to two non-condition adaptive methods, FGL and FMGL, and one condition adaptive method, CFGL in both simulation study and real data analysis.
We used RCFGL to jointly estimate the gene co-expression networks in different brain regions (conditions) using a cohort of heterogeneous stock rats. We also provide an accommodating C and Python based package that implements RCFGL.
Subject(s)
Algorithms , Brain , Animals , Rats , Computer Simulation , Gene Regulatory Networks/genetics
ABSTRACT
Maternal metabolism during pregnancy shapes offspring health via in utero programming. In the Healthy Start study, we identified five subgroups of pregnant women based on conventional metabolic biomarkers: Reference (n = 360); High HDL-C (n = 289); Dyslipidemic-High TG (n = 149); Dyslipidemic-High FFA (n = 180); Insulin Resistant (IR)-Hyperglycemic (n = 87). These subgroups not only captured metabolic heterogeneity among pregnant participants but were also associated with offspring obesity in early childhood, even among women without obesity or diabetes. Here, we utilize metabolomics data to enrich characterization of the metabolic subgroups and identify key compounds driving between-group differences. We analyzed fasting blood samples from 1065 pregnant women at 18 gestational weeks using untargeted metabolomics. We used weighted gene correlation network analysis (WGCNA) to derive a global network based on the Reference subgroup and characterized distinct metabolite modules representative of the different metabolomic profiles. We used the mummichog algorithm for pathway enrichment and identified key compounds that differed across the subgroups. Eight metabolite modules representing pathways such as the carnitine-acylcarnitine translocase system, fatty acid biosynthesis and activation, and glycerophospholipid metabolism were identified. A module that included 189 compounds related to DHA peroxidation, oxidative stress, and sex hormone biosynthesis was elevated in the Insulin Resistant-Hyperglycemic vs. the Reference subgroup. This module was positively correlated with total cholesterol (R:0.10; p-value < 0.0001) and free fatty acids (R:0.07; p-value < 0.05). Oxidative stress and inflammatory pathways may underlie insulin resistance during pregnancy, even below clinical diabetes thresholds. 
These findings highlight potential therapeutic targets and strategies for pregnancy risk stratification and reveal mechanisms underlying the developmental origins of metabolic disease risk.
Subject(s)
Lipid Metabolism , Metabolomics , Humans , Female , Pregnancy , Metabolomics/methods , Adult , Pediatric Obesity/blood , Pediatric Obesity/metabolism , Biomarkers/blood , Insulin Resistance , Child , Prenatal Exposure Delayed Effects/blood , Prenatal Exposure Delayed Effects/metabolism , Child, Preschool , Metabolome
ABSTRACT
It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern studies features are often highly correlated and noisy (e.g. SNPs, -omics, quantitative imaging markers, and electronic health record data). The practical performance of clustering approaches in these settings remains unclear. Through extensive simulations and empirical examples applying Gaussian Mixture Models and related clustering methods, we show these approaches (including variants of kmeans, VarSelLCM, HDClassifier, and Fisher-EM) can have very poor performance in many settings. We also show the poor performance is often driven by either an explicit or implicit assumption by the clustering algorithm that high variance features are relevant while lower variance features are irrelevant, called the variance as relevance assumption. We develop practical pre-processing approaches that improve analysis performance in some cases. This work offers practical guidance on the strengths and limitations of unsupervised clustering approaches in modern data analysis applications.
ABSTRACT
BACKGROUND: In this paper, we are interested in interactions between a high-dimensional -omics dataset and clinical covariates. The goal is to evaluate the relationship between a phenotype of interest and a high-dimensional omics pathway, where the effect of the omics data depends on subjects' clinical covariates (age, sex, smoking status, etc.). For instance, metabolic pathways can vary greatly between sexes which may also change the relationship between certain metabolic pathways and a clinical phenotype of interest. We propose partitioning the clinical covariate space and performing a kernel association test within those partitions. To illustrate this idea, we focus on hierarchical partitions of the clinical covariate space and kernel tests on metabolic pathways. RESULTS: We see that our proposed method outperforms competing methods in most simulation scenarios. It can identify different relationships among clinical groups with higher power in most scenarios while maintaining a proper Type I error rate. The simulation studies also show a robustness to the grouping structure within the clinical space. We also apply the method to the COPDGene study and find several clinically meaningful interactions between metabolic pathways, the clinical space, and lung function. CONCLUSION: TreeKernel provides a simple and interpretable process for testing for relationships between high-dimensional omics data and clinical outcomes in the presence of interactions within clinical cohorts. The method is broadly applicable to many studies.
Subject(s)
Pulmonary Disease, Chronic Obstructive , Humans , Phenotype , Computer Simulation
ABSTRACT
BACKGROUND: We developed a novel approach to minimize batch effects when assigning samples to batches. Our algorithm selects a batch allocation, among all possible ways of assigning samples to batches, that minimizes differences in average propensity score between batches. This strategy was compared to randomization and stratified randomization in a case-control study (30 per group) with a covariate (case vs control, represented as β1, set to be null) and two biologically relevant confounding variables (age, represented as β2, and hemoglobin A1c (HbA1c), represented as β3). Gene expression values were obtained from a publicly available dataset of expression data obtained from pancreas islet cells. Batch effects were simulated as twice the median biological variation across the gene expression dataset and were added to the publicly available dataset to simulate a batch effect condition. Bias was calculated as the absolute difference between observed betas under the batch allocation strategies and the true beta (no batch effects). Bias was also evaluated after adjustment for batch effects using ComBat as well as a linear regression model. In order to understand performance of our optimal allocation strategy under the alternative hypothesis, we also evaluated bias at a single gene associated with both age and HbA1c levels in the 'true' dataset (CAPN13 gene). RESULTS: Pre-batch correction, under the null hypothesis (β1), maximum absolute bias and root mean square (RMS) of maximum absolute bias, were minimized using the optimal allocation strategy. Under the alternative hypothesis (β2 and β3 for the CAPN13 gene), maximum absolute bias and RMS of maximum absolute bias were also consistently lower using the optimal allocation strategy. ComBat and the regression batch adjustment methods performed well as the bias estimates moved towards the true values in all conditions under both the null and alternative hypotheses.
Although the differences between methods were less pronounced following batch correction, estimates of bias (average and RMS) were consistently lower using the optimal allocation strategy under both the null and alternative hypotheses. CONCLUSIONS: Our algorithm provides an extremely flexible and effective method for assigning samples to batches by exploiting knowledge of covariates prior to sample allocation.
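The allocation idea can be sketched as a search for the assignment with the smallest gap in mean propensity score between batches. The abstract describes choosing among all possible allocations; for brevity this sketch samples random equal-sized allocations instead, and it assumes a precomputed propensity score per sample. `allocate`, `batch_imbalance`, and the toy scores are hypothetical names and values.

```python
import random


def batch_imbalance(scores, allocation, n_batches):
    """Largest gap in mean propensity score between any two batches."""
    means = []
    for b in range(n_batches):
        members = [s for s, a in zip(scores, allocation) if a == b]
        means.append(sum(members) / len(members))
    return max(means) - min(means)


def allocate(scores, n_batches, n_tries=2000, seed=0):
    """Random search over balanced allocations.

    Assumption: sampling replaces the exhaustive search over all
    allocations described in the abstract.
    """
    rng = random.Random(seed)
    # Round-robin labels give near-equal batch sizes; shuffling keeps the sizes fixed.
    base = [i % n_batches for i in range(len(scores))]
    best, best_gap = list(base), batch_imbalance(scores, base, n_batches)
    for _ in range(n_tries):
        candidate = list(base)
        rng.shuffle(candidate)
        gap = batch_imbalance(scores, candidate, n_batches)
        if gap < best_gap:
            best, best_gap = candidate, gap
    return best, best_gap


# Hypothetical propensity scores for 12 samples, split into 3 batches of 4.
propensity = [0.10, 0.15, 0.22, 0.30, 0.41, 0.48, 0.55, 0.63, 0.72, 0.80, 0.86, 0.93]
allocation, gap = allocate(propensity, n_batches=3)
```

A fixed seed keeps the result reproducible; with few samples and batches, an exhaustive enumeration would be a closer match to the described procedure.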
Subject(s)
Algorithms , Health Status , Propensity Score , Case-Control Studies , Glycated Hemoglobin , Humans
ABSTRACT
Humoral immune perturbations contribute to pathogenic outcomes in persons with HIV-1 infection (PWH). Gut barrier dysfunction in PWH is associated with microbial translocation and alterations in microbial communities (dysbiosis), and IgA, the most abundant immunoglobulin (Ig) isotype in the gut, is involved in gut homeostasis by interacting with the microbiome. We determined the impact of HIV-1 infection on the antibody repertoire in the gastrointestinal tract by comparing Ig gene utilization and somatic hypermutation (SHM) in colon biopsies from PWH (n = 19) versus age- and sex-matched controls (n = 13). We correlated these Ig parameters with clinical, immunological, microbiome and virological data. Gene signatures of enhanced B cell activation were accompanied by skewed frequencies of multiple Ig Variable genes in PWH. PWH showed decreased frequencies of SHM in IgA and possibly IgG, with a substantial loss of highly mutated IgA sequences. The decline in IgA SHM in PWH correlated with gut CD4+ T cell loss and inversely correlated with mucosal inflammation and microbial translocation. Diminished gut IgA SHM in PWH was driven by transversion mutations at A or T deoxynucleotides, suggesting a defect not at the AID/APOBEC3 deamination step but at later stages of IgA SHM. These results expand our understanding of humoral immune perturbations in PWH that could have important implications in understanding mucosal immune defects in individuals with chronic HIV-1 infection. IMPORTANCE The gut is a major site of early HIV-1 replication and pathogenesis. Extensive CD4+ T cell depletion in this compartment results in a compromised epithelial barrier that facilitates the translocation of microbes into the underlying lamina propria and systemic circulation, resulting in chronic immune activation. To date, the consequences of microbial translocation on the mucosal humoral immune response (or vice versa) remain poorly integrated into the panoply of mucosal immune defects in PWH.
We utilized next-generation sequencing approaches to profile the Ab repertoire and ascertain frequencies of somatic hypermutation in colon biopsies from antiretroviral therapy-naive PWH versus controls. Our findings identify perturbations in the Ab repertoire of PWH that could contribute to development or maintenance of dysbiosis. Moreover, IgA mutations significantly decreased in PWH and this was associated with adverse clinical outcomes. These data may provide insight into the mechanisms underlying impaired Ab-dependent gut homeostasis during chronic HIV-1 infection.
Subject(s)
Gastrointestinal Tract , HIV Infections , Immunoglobulin A , Somatic Hypermutation, Immunoglobulin , Dysbiosis , Gastrointestinal Tract/immunology , Gastrointestinal Tract/virology , HIV Infections/genetics , HIV Infections/immunology , HIV-1 , Humans , Immunity, Humoral , Immunoglobulin A/genetics
ABSTRACT
Given the differential risk of type 1 diabetes (T1D) in offspring of affected fathers versus affected mothers and our observation that T1D cases have differential DNA methylation near the imprinted DLGAP2 gene compared to controls, we examined whether methylation near DLGAP2 mediates the association between T1D family history and T1D risk. In a nested case-control study of 87 T1D cases and 87 controls from the Diabetes Autoimmunity Study in the Young, we conducted causal mediation analyses at 12 DLGAP2 region CpGs to decompose the effect of family history on T1D risk into indirect and direct effects. These effects were estimated from two regression models adjusted for the human leukocyte antigen DR3/4 genotype: a linear regression of family history on methylation (mediator model) and a logistic regression of family history and methylation on T1D (outcome model). For 8 of the 12 CpGs, we identified a significant interaction between T1D family history and methylation on T1D risk. Accounting for this interaction, we found that the increased risk of T1D for children with affected mothers compared to those with no family history was mediated through differences in methylation at two CpGs (cg27351978, cg00565786) in the DLGAP2 region, as demonstrated by a significant pure natural indirect effect (odds ratio (OR) = 1.98, 95% confidence interval (CI): 1.06-3.71) and nonsignificant total natural direct effect (OR = 1.65, 95% CI: 0.16-16.62) (for cg00565786). In contrast, the increased risk of T1D for children with an affected father or sibling was not explained by DNA methylation changes at these CpGs. Results were similar for cg27351978 and robust in sensitivity analyses. Lastly, we found that DNA methylation in the DLGAP2 region was associated (P < 0.05) with gene expression of nearby protein-coding genes DLGAP2, ARHGEF10, ZNF596, and ERICH1.
Results indicate that the maternal protective effect conferred through exposure to T1D in utero may operate through changes to DNA methylation that have functional downstream consequences.
Subject(s)
DNA Methylation , Diabetes Mellitus, Type 1 , Genetic Predisposition to Disease , Humans , Diabetes Mellitus, Type 1/genetics , Diabetes Mellitus, Type 1/epidemiology , Female , Male , Case-Control Studies , Child , Child, Preschool , Adolescent , GTPase-Activating Proteins/genetics , CpG Islands , Risk Factors , Nerve Tissue Proteins
ABSTRACT
BACKGROUND: Per- and polyfluoroalkyl substances (PFAS) are ubiquitous, environmentally persistent chemicals, and prenatal exposures have been associated with adverse child health outcomes. Prenatal PFAS exposure may lead to epigenetic age acceleration (EAA), defined as the discrepancy between an individual's chronologic and epigenetic or biological age. OBJECTIVES: We estimated associations of maternal serum PFAS concentrations with EAA in umbilical cord blood DNA methylation using linear regression, and a multivariable exposure-response function of the PFAS mixture using Bayesian kernel machine regression. METHODS: Five PFAS were quantified in maternal serum (median: 27 weeks of gestation) among 577 mother-infant dyads from a prospective cohort. Cord blood DNA methylation data were assessed with the Illumina HumanMethylation450 array. EAA was calculated as the residuals from regressing gestational age on epigenetic age, calculated using a cord-blood specific epigenetic clock. Linear regression tested for associations between each maternal PFAS concentration with EAA. Bayesian kernel machine regression with hierarchical selection estimated an exposure-response function for the PFAS mixture. RESULTS: In single pollutant models we observed an inverse relationship between perfluorodecanoate (PFDA) and EAA (-0.148 weeks per log-unit increase, 95% CI: -0.283, -0.013). Mixture analysis with hierarchical selection between perfluoroalkyl carboxylates and sulfonates indicated the carboxylates had the highest group posterior inclusion probability (PIP), or relative importance. Within this group, PFDA had the highest conditional PIP. Univariate predictor-response functions indicated PFDA and perfluorononanoate were inversely associated with EAA, while perfluorohexane sulfonate had a positive association with EAA. 
CONCLUSIONS: Maternal mid-pregnancy serum concentrations of PFDA were negatively associated with EAA in cord blood, suggesting a pathway by which prenatal PFAS exposures may affect infant development. No significant associations were observed with other PFAS. Mixture models suggested opposite directions of association between perfluoroalkyl sulfonates and carboxylates. Future studies are needed to determine the importance of neonatal EAA for later child health outcomes.
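The residual-based definition of EAA used in this abstract can be sketched with a simple least-squares fit. The sketch assumes epigenetic age is the response and gestational age the predictor, which is one reading of the abstract's wording; `ols_residuals` and the toy ages (in weeks) are hypothetical.

```python
def ols_residuals(predictor, response):
    """Residuals from a simple linear regression of response on predictor."""
    n = len(predictor)
    mx = sum(predictor) / n
    my = sum(response) / n
    sxx = sum((x - mx) ** 2 for x in predictor)
    sxy = sum((x - mx) * (y - my) for x, y in zip(predictor, response))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(predictor, response)]


# Hypothetical toy values in weeks; a positive residual would mean the
# epigenetic age is older than expected for the gestational age.
gestational_age = [37.0, 38.0, 39.0, 40.0, 41.0]
epigenetic_age = [36.5, 38.4, 38.9, 40.6, 40.6]
eaa = ols_residuals(gestational_age, epigenetic_age)
```

Because the residuals are uncorrelated with the predictor by construction, downstream associations with exposures are not driven by gestational age itself.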
Subject(s)
Alkanesulfonic Acids , Environmental Pollutants , Fluorocarbons , Prenatal Exposure Delayed Effects , Infant , Infant, Newborn , Pregnancy , Child , Female , Humans , Fetal Blood , Prenatal Exposure Delayed Effects/chemically induced , Prospective Studies , Bayes Theorem , Alkanesulfonates , Mothers , Carboxylic Acids , Epigenesis, Genetic
ABSTRACT
Intrauterine smoke (IUS) exposure during early childhood has been associated with a number of negative health consequences, including reduced lung function and asthma susceptibility. The biological mechanisms underlying these associations have not been established. MicroRNAs regulate the expression of numerous genes involved in lung development. Thus, investigation of the impact of IUS on miRNA expression during human lung development may elucidate the impact of IUS on post-natal respiratory outcomes. We sought to investigate the effect of IUS exposure on miRNA expression during early lung development. We hypothesized that miRNA-mRNA networks are dysregulated by IUS during human lung development and that these miRNAs may be associated with future risk of asthma and allergy. Human fetal lung samples from a prenatal tissue retrieval program were tested for differential miRNA expression with IUS exposure (measured using placental cotinine concentration). RNA was extracted and miRNA-sequencing was performed. We performed differential expression using IUS exposure, with covariate adjustment. We also considered the above model with an additional sex-by-IUS interaction term, allowing IUS effects to differ by male and female samples. Using paired gene expression profiles, we created sex-stratified miRNA-mRNA correlation networks predictive of IUS using DIABLO. We additionally evaluated whether miRNAs were associated with asthma and allergy outcomes in a cohort of childhood asthma. We profiled pseudoglandular lung miRNA in n = 298 samples, 139 (47%) of which had evidence of IUS exposure. Of 515 miRNAs, 25 were significantly associated with intrauterine smoke exposure (q-value < 0.10). The IUS associated miRNAs were correlated with well-known asthma genes (e.g., ORM1-Like Protein 3, ORMDL3) and enriched in disease-relevant pathways (oxidative stress).
Eleven IUS-miRNAs were also correlated with clinical measures (e.g., Immunoglobulin E and lung function) in children with asthma, further supporting their likely disease relevance. Lastly, we found substantial differences in IUS effects by sex, finding 95 significant IUS-miRNAs in male samples, but only four miRNAs in female samples. The miRNA-mRNA correlation networks were predictive of IUS (AUC = 0.78 in males and 0.86 in females) and suggested that IUS-miRNAs are involved in regulation of disease-relevant genes (e.g., A disintegrin and metalloproteinase domain 19 (ADAM19), LBH regulator of WNT signaling (LBH)) and sex hormone signaling (Coactivator associated methyltransferase 1 (CARM1)). Our study demonstrated differential expression of miRNAs by IUS during early prenatal human lung development, which may be modified by sex. Based on their gene targets and correlation to clinical asthma and atopy outcomes, these IUS-miRNAs may be relevant for subsequent allergy and asthma risk. Our study provides insight into the impact of IUS in human fetal lung transcriptional networks and on the developmental origins of asthma and allergic disorders.
Subject(s)
Asthma , MicroRNAs , Child , Humans , Male , Female , Child, Preschool , Pregnancy , Smoke , Placenta/metabolism , Asthma/genetics , Lung/metabolism , MicroRNAs/genetics , MicroRNAs/metabolism , RNA, Messenger/genetics
ABSTRACT
When analyzing large datasets from high-throughput technologies, researchers often encounter missing quantitative measurements, which are particularly frequent in metabolomics datasets. In metabolomics, the comprehensive profiling of metabolite abundances, measurements are typically made using mass spectrometry technologies that often introduce missingness via multiple mechanisms: (1) the metabolite signal may be smaller than the instrument limit of detection; (2) the conditions under which the data are collected and processed may lead to missing values; (3) missing values can be introduced randomly. Missingness resulting from mechanism (1) would be classified as Missing Not At Random (MNAR), that from mechanism (2) would be Missing At Random (MAR), and that from mechanism (3) would be classified as Missing Completely At Random (MCAR). Two common approaches for handling missing data are the following: (1) omit missing data from the analysis; (2) impute the missing values. Both approaches may introduce bias and reduce statistical power in downstream analyses such as testing metabolite associations with clinical variables. Further, standard imputation methods in metabolomics often ignore the mechanisms causing missingness and inaccurately estimate missing values within a data set. We propose a mechanism-aware imputation algorithm that leverages a two-step approach in imputing missing values. First, we use a random forest classifier to classify the missing mechanism for each missing value in the data set. Second, we impute each missing value using imputation algorithms that are specific to the predicted missingness mechanism (i.e., MAR/MCAR or MNAR). Using complete data, we conducted simulations, where we imposed different missingness patterns within the data and tested the performance of combinations of imputation algorithms. Our proposed algorithm provided imputations closer to the original data than those using only one imputation algorithm for all the missing values.
Consequently, our two-step approach was able to reduce bias for improved downstream analyses.
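The second step of the two-step approach, imputing each missing value according to its predicted mechanism, can be sketched as follows. The mechanism labels are taken as given here (in the abstract they come from a random forest classifier), and the two imputers are stand-ins chosen for brevity: half the observed column minimum for MNAR and the observed column mean for MAR/MCAR. The abstract does not name the mechanism-specific algorithms, so `impute_by_mechanism` and the toy matrix are hypothetical.

```python
def impute_by_mechanism(data, mechanisms):
    """Fill missing entries (None) using a rule chosen per predicted mechanism.

    data: list of rows (samples) with one column per metabolite.
    mechanisms: dict mapping (row, column) to "MNAR" or "MAR/MCAR";
        assumed to be the output of a separate classification step.
    """
    n_cols = len(data[0])
    observed = [[row[j] for row in data if row[j] is not None] for j in range(n_cols)]
    filled = [list(row) for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            if mechanisms[(i, j)] == "MNAR":
                # Stand-in for a left-censored imputer: below the observed minimum.
                filled[i][j] = min(observed[j]) / 2
            else:
                # Stand-in for a MAR/MCAR imputer: the observed column mean.
                filled[i][j] = sum(observed[j]) / len(observed[j])
    return filled


# Hypothetical toy matrix: 4 samples, 2 metabolites, two missing values.
abundances = [[10.0, 4.0], [None, 6.0], [14.0, None], [12.0, 8.0]]
predicted = {(1, 0): "MNAR", (2, 1): "MAR/MCAR"}
completed = impute_by_mechanism(abundances, predicted)
```

Keeping classification and imputation separate makes it easy to swap in different imputers per mechanism and compare combinations, as the simulations in the abstract do.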
Subject(s)
Algorithms , Metabolomics , Bias , Mass Spectrometry/methods , Metabolomics/methods
ABSTRACT
Omics studies frequently use samples collected during cohort studies. Conditioning on sample availability can cause selection bias if sample availability is nonrandom. Inverse probability weighting (IPW) is purported to reduce this bias. We evaluated IPW in an epigenome-wide analysis testing the association between DNA methylation (261,435 probes) and age in healthy adolescent subjects (n = 114). We simulated age and sex to be correlated with sample selection and then evaluated four conditions: complete population/no selection bias (all subjects), naïve selection bias (no adjustment), and IPW selection bias (selection bias with IPW adjustment). Assuming the complete population condition represented the "truth," we compared each condition to the complete population condition. Bias or difference in associations between age and methylation was reduced in the IPW condition versus the naïve condition. However, genomic inflation and type 1 error were higher in the IPW condition relative to the naïve condition. Postadjustment using bacon, type 1 error and inflation were similar across all conditions. Power was higher under the IPW condition compared with the naïve condition before and after inflation adjustment. IPW methods can reduce bias in genome-wide analyses. Genomic inflation is a potential concern that can be minimized using methods that adjust for inflation.
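Inverse probability weighting can be illustrated for a single probe with a weighted regression slope, where each selected subject is weighted by the inverse of the probability of being selected. The abstract does not describe how selection probabilities were estimated or which regression model was fitted, so the probabilities here are given as inputs and `ipw_slope` and the toy values are hypothetical.

```python
def weighted_slope(x, y, weights):
    """Slope of a weighted least-squares line of y on x."""
    total = sum(weights)
    mx = sum(w * xi for w, xi in zip(weights, x)) / total
    my = sum(w * yi for w, yi in zip(weights, y)) / total
    sxy = sum(w * (xi - mx) * (yi - my) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - mx) ** 2 for w, xi in zip(weights, x))
    return sxy / sxx


def ipw_slope(x, y, selection_prob):
    """Weight each selected subject by 1 / P(selected)."""
    return weighted_slope(x, y, [1.0 / p for p in selection_prob])


# Hypothetical toy data: older subjects are less likely to have a sample.
ages = [12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
methylation = [0.30, 0.33, 0.35, 0.40, 0.41, 0.46]
selection_prob = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]

naive = weighted_slope(ages, methylation, [1.0] * len(ages))  # no adjustment
adjusted = ipw_slope(ages, methylation, selection_prob)       # IPW adjustment
```

Large weights on rarely selected subjects increase the variance of the estimate, which is consistent with the inflation concern raised in the abstract.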
Subject(s)
Genome-Wide Association Study , Adolescent , Bias , Cohort Studies , Humans , Probability , Selection Bias
ABSTRACT
The Type I Interferons (IFN-Is) are innate antiviral cytokines that include 12 different IFNα subtypes and IFNβ that signal through the IFN-I receptor (IFNAR), inducing hundreds of IFN-stimulated genes (ISGs) that comprise the 'interferome'. Quantitative differences in IFNAR binding correlate with antiviral activity, but whether IFN-Is exhibit qualitative differences remains controversial. Moreover, the IFN-I response is protective during acute HIV-1 infection, but likely pathogenic during the chronic stages. To gain a deeper understanding of the IFN-I response, we compared the interferomes of IFNα subtypes dominantly-expressed in HIV-1-exposed plasmacytoid dendritic cells (1, 2, 5, 8 and 14) and IFNβ in the earliest cellular targets of HIV-1 infection. Primary gut CD4 T cells from 3 donors were treated for 18 hours ex vivo with individual IFN-Is normalized for IFNAR signaling strength. Of 1,969 IFN-regulated genes, 246 'core ISGs' were induced by all IFN-Is tested. However, many IFN-regulated genes were not shared between the IFNα subtypes despite similar induction of canonical antiviral ISGs such as ISG15, RSAD2 and MX1, formally demonstrating qualitative differences between the IFNα subtypes. Notably, IFNβ induced a broader interferome than the individual IFNα subtypes. Since IFNβ, and not IFNα, is upregulated during chronic HIV-1 infection in the gut, we compared core ISGs and IFNβ-specific ISGs from colon pinch biopsies of HIV-1-uninfected (n = 13) versus age- and gender-matched, antiretroviral-therapy naïve persons with HIV-1 (PWH; n = 19). Core ISGs linked to inflammation, T cell activation and immune exhaustion were elevated in PWH, positively correlated with plasma lipopolysaccharide (LPS) levels and gut IFNβ levels, and negatively correlated with gut CD4 T cell frequencies.
In sharp contrast, IFNβ-specific ISGs linked to protein translation and anti-inflammatory responses were significantly downregulated in PWH, negatively correlated with gut IFNβ and LPS, and positively correlated with plasma IL6 and gut CD4 T cell frequencies. Our findings reveal qualitative differences in interferome induction by diverse IFN-Is and suggest potential mechanisms for how IFNβ may drive HIV-1 pathogenesis in the gut.
Subject(s)
Antiviral Agents/pharmacology , Dendritic Cells/pathology , Gastrointestinal Tract/pathology , HIV Infections/pathology , HIV-1/drug effects , Interferon-alpha/pharmacology , Interferon-beta/pharmacology , Adult , Case-Control Studies , Dendritic Cells/drug effects , Female , Gastrointestinal Tract/drug effects , Gene Expression Profiling , HIV Infections/drug therapy , HIV Infections/virology , Humans , Interferon-alpha/classification , Male , Middle Aged , Young Adult
ABSTRACT
High-throughput data such as metabolomics, genomics, transcriptomics, and proteomics have become familiar data types within the "-omics" family. For this work, we focus on subsets that interact with one another and represent these "pathways" as graphs. Observed pathways often have disjoint components, i.e., nodes or sets of nodes (metabolites, etc.) not connected to any other within the pathway, which notably lessens testing power. In this paper we propose the Pathway Integrated Regression-based Kernel Association Test (PaIRKAT), a new kernel machine regression method for incorporating known pathway information into the semi-parametric kernel regression framework. This work extends previous kernel machine approaches. This paper also contributes an application of a graph kernel regularization method for overcoming disconnected pathways. By incorporating a regularized or "smoothed" graph into a score test, PaIRKAT can provide more powerful tests for associations between biological pathways and phenotypes of interest and will be helpful in identifying novel pathways for targeted clinical research. We evaluate this method through several simulation studies and an application to real metabolomics data from the COPDGene study. Our simulation studies illustrate the robustness of this method to incorrect and incomplete pathway knowledge, and the real data analysis shows meaningful improvements of testing power in pathways. PaIRKAT was developed for application to metabolomic pathway data, but the techniques are easily generalizable to other data sources with a graph-like structure.
Subject(s)
Metabolome/genetics , Metabolomics/methods , Pulmonary Disease, Chronic Obstructive , Algorithms , Biomarkers/blood , Databases, Genetic , Humans , Phenotype , Pulmonary Disease, Chronic Obstructive/genetics , Pulmonary Disease, Chronic Obstructive/metabolism , Regression Analysis
ABSTRACT
BACKGROUND: Prenatal exposure to ambient air pollution has been associated with adverse offspring health outcomes. Childhood health effects of prenatal exposures may be mediated through changes to DNA methylation detectable at birth. METHODS: Among 429 non-smoking women in a cohort study of mother-infant pairs in Colorado, USA, we estimated associations between prenatal exposure to ambient fine particulate matter (PM2.5) and ozone (O3), and epigenome-wide DNA methylation of umbilical cord blood cells at delivery (2010-2014). We calculated average PM2.5 and O3 in each trimester of pregnancy and the full pregnancy using inverse-distance-weighted interpolation. We fit linear regression models adjusted for potential confounders and cell proportions to estimate associations between air pollutants and methylation at each of 432,943 CpGs. Differentially methylated regions (DMRs) were identified using comb-p. Previously in this cohort, we reported positive associations between 3rd trimester O3 exposure and infant adiposity at 5 months of age. Here, we quantified the potential for mediation of that association by changes in DNA methylation in cord blood. RESULTS: We identified several DMRs for each pollutant and period of pregnancy. The greatest number of significant DMRs were associated with third trimester PM2.5 (21 DMRs). No single CpGs were associated with air pollutants at a false discovery rate <0.05. We found that up to 8% of the effect of 3rd trimester O3 on 5-month adiposity may be mediated by locus-specific methylation changes, but mediation estimates were not statistically significant. CONCLUSIONS: Differentially methylated regions in cord blood were identified in association with maternal exposure to PM2.5 and O3. Genes annotated to the significant sites played roles in cardiometabolic disease, immune function and inflammation, and neurologic disorders. 
We found limited evidence of mediation by DNA methylation of associations between third trimester O3 exposure and 5-month infant adiposity.
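The inverse-distance-weighted interpolation used for exposure assignment can be sketched as a weighted average of monitor readings, with weights that decay with distance. The abstract does not give the distance metric, the power parameter, or the monitor selection rule; planar Euclidean distance and a power of 2 are assumptions here, and `idw_estimate` is a hypothetical name.

```python
import math


def idw_estimate(target, monitors, power=2.0):
    """Inverse-distance-weighted estimate of a pollutant concentration.

    target:    (x, y) coordinates of the residence
    monitors:  list of (x, y, concentration) tuples, one per monitor
    power:     distance exponent (2.0 is a common default, assumed here)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, concentration in monitors:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            # The residence coincides with a monitor: use its reading directly.
            return concentration
        weight = 1.0 / distance ** power
        weighted_sum += weight * concentration
        weight_total += weight
    return weighted_sum / weight_total


# Two monitors at distances 1 and 2 from the residence.
print(idw_estimate((0.0, 0.0), [(1.0, 0.0, 10.0), (2.0, 0.0, 20.0)]))
```

Trimester and full-pregnancy averages would then be taken over the daily or monthly estimates produced this way.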
Subject(s)
Air Pollutants , Air Pollution , Prenatal Exposure Delayed Effects , Adiposity , Child , Cohort Studies , DNA Methylation , Female , Fetal Blood , Humans , Infant , Infant, Newborn , Maternal Exposure , Obesity , Particulate Matter , Pregnancy
ABSTRACT
BACKGROUND: Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments. We define a metabolite to be reproducible when it demonstrates consistency across replicate experiments. Similarly, metabolites which are not consistent across replicates can be labeled as irreproducible. In this work, we introduce and evaluate the use of (Ma)ximum (R)ank (R)eproducibility (MaRR) to examine reproducibility in mass spectrometry-based metabolomics experiments. We examine reproducibility across technical or biological samples in three different mass spectrometry metabolomics (MS-Metabolomics) data sets. RESULTS: We apply MaRR, a nonparametric approach that detects the change from reproducible to irreproducible signals using a maximal rank statistic. The advantage of using MaRR over model-based methods is that it does not make parametric assumptions on the underlying distributions or dependence structures of reproducible metabolites. Using three MS-Metabolomics data sets generated in the multi-center Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study, we applied the MaRR procedure after data processing to explore reproducibility across technical or biological samples. Under realistic settings of MS-Metabolomics data, the MaRR procedure effectively controls the False Discovery Rate (FDR) when there was a gradual reduction in correlation between replicate pairs for less highly ranked signals. Simulation studies also show that the MaRR procedure tends to have high power for detecting reproducible metabolites in most situations except for smaller values of proportion of reproducible metabolites. Bias (i.e., the difference between the estimated and the true value of reproducible signal proportions) values for simulations are also close to zero.
The results reported from the real data show a higher level of reproducibility for technical replicates compared to biological replicates across all the three different datasets. In summary, we demonstrate that the MaRR procedure application can be adapted to various experimental designs, and that the nonparametric approach performs consistently well. CONCLUSIONS: This research was motivated by reproducibility, which has proven to be a major obstacle in the use of genomic findings to advance clinical practice. In this paper, we developed a data-driven approach to assess the reproducibility of MS-Metabolomics data sets. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr .
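The maximal rank statistic at the core of MaRR can be illustrated with a short Python sketch: each metabolite is ranked within each replicate, and its statistic is the larger (worse) of the two ranks. This shows only the statistic, not the change-point detection or FDR control of the full procedure, and it is not the marr package API. Assigning rank 1 to the strongest signal and breaking ties by input order are assumptions; `max_rank_statistics` is a hypothetical name.

```python
def max_rank_statistics(replicate_1, replicate_2):
    """Return the maximum rank across two replicates for each metabolite.

    replicate_1, replicate_2: equal-length lists of signal values, where
    position i in both lists refers to the same metabolite.
    """
    if len(replicate_1) != len(replicate_2):
        raise ValueError("replicates must measure the same metabolites")

    def descending_ranks(signals):
        # Rank 1 goes to the largest signal (assumed convention).
        order = sorted(range(len(signals)), key=lambda i: -signals[i])
        ranks = [0] * len(signals)
        for position, index in enumerate(order, start=1):
            ranks[index] = position
        return ranks

    ranks_1 = descending_ranks(replicate_1)
    ranks_2 = descending_ranks(replicate_2)
    return [max(a, b) for a, b in zip(ranks_1, ranks_2)]


rep_a = [9.0, 7.0, 1.0, 3.0]
rep_b = [8.0, 6.0, 4.0, 2.0]
print(max_rank_statistics(rep_a, rep_b))
```

A metabolite that ranks highly in both replicates receives a small statistic, whereas one that ranks highly in only one replicate is penalized by its worse rank.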
Subject(s)
Metabolomics , Mass Spectrometry , Reproducibility of Results
ABSTRACT
BACKGROUND: The drive to understand how microbial communities interact with their environments has inspired innovations across many fields. The data generated from sequence-based analyses of microbial communities typically are of high dimensionality and can involve multiple data tables consisting of taxonomic or functional gene/pathway counts. Merging multiple high dimensional tables with study-related metadata can be challenging. Existing microbiome pipelines available in R have created their own data structures to manage this problem. However, these data structures may be unfamiliar to analysts new to microbiome data or R and do not allow for deviations from internal workflows. Existing analysis tools also focus primarily on community-level analyses and exploratory visualizations, as opposed to analyses of individual taxa. RESULTS: We developed the R package "tidyMicro" to serve as a more complete microbiome analysis pipeline. This open source software provides all of the essential tools available in other popular packages (e.g., management of sequence count tables, standard exploratory visualizations, and diversity inference tools) supplemented with multiple options for regression modelling (e.g., negative binomial, beta binomial, and/or rank based testing) and novel visualizations to improve interpretability (e.g., Rocky Mountain plots, longitudinal ordination plots). This comprehensive pipeline for microbiome analysis also maintains data structures familiar to R users to improve analysts' control over workflow. A complete vignette is provided to aid new users in analysis workflow. CONCLUSIONS: tidyMicro provides a reliable alternative to popular microbiome analysis packages in R. We provide standard tools as well as novel extensions on standard analyses to improve the interpretability of results while maintaining object malleability to encourage open source collaboration.
The simple examples and full workflow from the package are reproducible and applicable to external data sets.