Results 1 - 15 of 15
1.
PLoS Genet ; 20(4): e1011246, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38648211

ABSTRACT

Genome-wide association studies (GWAS) have identified many genetic loci associated with complex traits and diseases in the past 20 years. Multiple heritable covariates may be added into GWAS regression models to estimate direct effects of genetic variants on a focal trait, or to improve the power by accounting for environmental effects and other sources of trait variations. When one or more covariates are causally affected by both genetic variants and hidden confounders, adjusting for them in GWAS will produce biased estimation of SNP effects, known as collider bias. Several approaches have been developed to correct collider bias through estimating the bias by Mendelian randomization (MR). However, these methods work for only one covariate, some of which utilize MR methods with relatively strong assumptions, both of which may not hold in practice. In this paper, we extend the bias-correction approaches in two aspects: first we derive an analytical expression for the collider bias in the presence of multiple covariates, then we propose estimating the bias using a robust multivariable MR (MVMR) method based on constrained maximum likelihood (called MVMR-cML), allowing the presence of invalid instrumental variables (IVs) and correlated pleiotropy. We also established the estimation consistency and asymptotic normality of the new bias-corrected estimator. We conducted simulations to show that all methods mitigated collider bias under various scenarios. In real data analyses, we applied the methods to two GWAS examples, the first a GWAS of waist-hip ratio with adjustment for only one covariate, body-mass index (BMI), and the second a GWAS of BMI adjusting for metabolomic principal components as multiple covariates, illustrating the effectiveness of bias correction.
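The collider bias described in this abstract can be illustrated with a small simulation. The following is only a sketch under assumed parameters (sample size, allele frequency, and effect sizes are all hypothetical and not taken from the article): a SNP `g` and a hidden confounder `u` both affect a covariate `x`, while the focal trait `y` depends on `u` only. With these values the population SNP coefficient is 0 without adjustment and -0.25 after adjusting for `x`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000  # hypothetical sample size

g = rng.binomial(2, 0.3, n).astype(float)  # SNP genotype coded 0/1/2
u = rng.normal(size=n)                     # hidden confounder
x = 0.5 * g + u + rng.normal(size=n)       # heritable covariate: affected by g and u
y = u + rng.normal(size=n)                 # focal trait: no true effect of g or x


def ols(columns, response):
    """Ordinary least squares with an intercept; returns all coefficients."""
    design = np.column_stack([np.ones(len(response)), *columns])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


# SNP effect without the covariate (close to the true value of zero)
beta_unadjusted = ols([g], y)[1]
# SNP effect after adjusting for the collider x (biased away from zero)
beta_adjusted = ols([g, x], y)[1]

print(f"unadjusted SNP effect: {beta_unadjusted:.3f}")
print(f"adjusted SNP effect:   {beta_adjusted:.3f}")
```

The sketch only shows how the bias arises; it does not implement the MVMR-cML bias correction proposed in the article.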


Subject(s)
Bias , Genome-Wide Association Study , Mendelian Randomization Analysis , Polymorphism, Single Nucleotide , Genome-Wide Association Study/methods , Mendelian Randomization Analysis/methods , Humans , Models, Genetic , Body Mass Index
2.
Am J Hum Genet ; 110(4): 592-605, 2023 04 06.
Article in English | MEDLINE | ID: mdl-36948188

ABSTRACT

Mendelian randomization (MR) is a powerful tool for causal inference with observational genome-wide association study (GWAS) summary data. Compared to the more commonly used univariable MR (UVMR), multivariable MR (MVMR) not only is more robust to the notorious problem of genetic (horizontal) pleiotropy but also estimates the direct effect of each exposure on the outcome after accounting for possible mediating effects of other exposures. Despite promising applications, there is a lack of studies on MVMR's theoretical properties and robustness in applications. In this work, we propose an efficient and robust MVMR method based on constrained maximum likelihood (cML), called MVMR-cML, with strong theoretical support. Extensive simulations demonstrate that MVMR-cML performs better than other existing MVMR methods while possessing the above two advantages over its univariable counterpart. An application to several large-scale GWAS summary datasets to infer causal relationships between eight cardiometabolic risk factors and coronary artery disease (CAD) highlights the usefulness and some advantages of the proposed method. For example, after accounting for possible pleiotropic and mediating effects, triglyceride (TG), low-density lipoprotein cholesterol (LDL), and systolic blood pressure (SBP) had direct effects on CAD; in contrast, the effects of high-density lipoprotein cholesterol (HDL), diastolic blood pressure (DBP), and body height diminished after accounting for other risk factors.


Subject(s)
Coronary Artery Disease , Mendelian Randomization Analysis , Humans , Mendelian Randomization Analysis/methods , Genome-Wide Association Study , Risk Factors , Causality , Coronary Artery Disease/genetics , Cholesterol, HDL/genetics
3.
PLoS Genet ; 19(5): e1010762, 2023 05.
Article in English | MEDLINE | ID: mdl-37200398

ABSTRACT

Mendelian randomization (MR) has been increasingly applied for causal inference with observational data by using genetic variants as instrumental variables (IVs). However, the current practice of MR has been largely restricted to investigating the total causal effect between two traits, while it would be useful to infer the direct causal effect between any two of many traits (by accounting for indirect or mediating effects through other traits). For this purpose we propose a two-step approach: we first apply an extended MR method to infer (i.e. both estimate and test) a causal network of total effects among multiple traits, then we modify a graph deconvolution algorithm to infer the corresponding network of direct effects. Simulation studies showed much better performance of our proposed method than existing ones. We applied the method to 17 large-scale GWAS summary datasets (with median N = 256879 and median #IVs = 48) to infer the causal networks of both total and direct effects among 11 common cardiometabolic risk factors, 4 cardiometabolic diseases (coronary artery disease, stroke, type 2 diabetes, atrial fibrillation), Alzheimer's disease and asthma, identifying some interesting causal pathways. We also provide an R Shiny app (https://zhaotongl.shinyapps.io/cMLgraph/) for users to explore any subset of the 17 traits of interest.


Subject(s)
Coronary Artery Disease , Diabetes Mellitus, Type 2 , Humans , Diabetes Mellitus, Type 2/genetics , Mendelian Randomization Analysis/methods , Genome-Wide Association Study , Causality , Polymorphism, Single Nucleotide
4.
Hum Mol Genet ; 32(17): 2693-2703, 2023 08 26.
Article in English | MEDLINE | ID: mdl-37369060

ABSTRACT

Recently, a non-parametric method has been proposed to impute the genetic component of a trait for a large set of genotyped individuals based on a separate genome-wide association study (GWAS) summary dataset of the same trait (from the same population). The imputed trait may contain linear, non-linear and epistatic effects of genetic variants, thus can be used for downstream linear or non-linear association analyses and machine learning tasks. Here, we propose an extension of the method to impute both genetic and environmental components of a trait using both single nucleotide polymorphism (SNP)-trait and omics-trait association summary data. We illustrate an application to a UK Biobank subset of individuals (n ≈ 80K) with both body mass index (BMI) GWAS data and metabolomic data. We divided the whole dataset into two equally sized and non-overlapping training and test datasets; we used the training data to build SNP- and metabolite-BMI association summary data and impute BMI on the test data. We compared the performance of the original and new imputation methods. As with the original method, the BMI values imputed by the new method largely retained SNP-BMI association information; however, the latter retained more information about BMI-environment associations and were more highly correlated with the original observed BMI values.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Humans , Genome-Wide Association Study/methods , Phenotype , Genotype , Polymorphism, Single Nucleotide/genetics
5.
Biostatistics ; 25(2): 468-485, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-36610078

ABSTRACT

Transcriptome-wide association studies (TWAS) have been increasingly applied to identify (putative) causal genes for complex traits and diseases. TWAS can be regarded as a two-sample two-stage least squares method for instrumental variable (IV) regression for causal inference. The standard TWAS (called TWAS-L) only considers a linear relationship between a gene's expression and a trait in stage 2, which may lose statistical power when not true. Recently, an extension of TWAS (called TWAS-LQ) considers both the linear and quadratic effects of a gene on a trait, which however is not flexible enough due to its parametric nature and may be low powered for nonquadratic nonlinear effects. On the other hand, a deep learning (DL) approach, called DeepIV, has been proposed to nonparametrically model a nonlinear effect in IV regression. However, it is both slow and unstable due to the ill-posed inverse problem of solving an integral equation with Monte Carlo approximations. Furthermore, in the original DeepIV approach, statistical inference, that is, hypothesis testing, was not studied. Here, we propose a novel DL approach, called DeLIVR, to overcome the major drawbacks of DeepIV, by estimating a related but different target function and including a hypothesis testing framework. We show through simulations that DeLIVR was both faster and more stable than DeepIV. We applied both parametric and DL approaches to the GTEx and UK Biobank data, showcasing that DeLIVR detected 8 and 7 additional genes nonlinearly associated with high-density lipoprotein (HDL) cholesterol and low-density lipoprotein (LDL) cholesterol, respectively, all of which would be missed by TWAS-L, TWAS-LQ, and DeepIV; these genes include BUD13 associated with HDL, SLC44A2 and GMIP with LDL, all supported by previous studies.


Subject(s)
Deep Learning , Transcriptome , Humans , Quantitative Trait Loci , Phenotype , Genome-Wide Association Study/methods , Cholesterol , Genetic Predisposition to Disease , Polymorphism, Single Nucleotide
6.
PLoS Genet ; 18(5): e1010166, 2022 05.
Article in English | MEDLINE | ID: mdl-35507585

ABSTRACT

Mendelian randomization (MR) is an instrumental variable (IV) method using genetic variants such as single nucleotide polymorphisms (SNPs) as IVs to disentangle the causal relationship between an exposure and an outcome. Since any causal conclusion critically depends on the three valid IV assumptions, which will likely be violated in practice, MR methods robust to the IV assumptions are greatly needed. As such a method, Egger regression stands out as one of the most widely used due to its easy use and perceived robustness. Although Egger regression is claimed to be robust to directional pleiotropy under the instrument strength independent of direct effect (InSIDE) assumption, it is known to be dependent on the orientations/coding schemes of SNPs (i.e. which allele of an SNP is selected as the reference group). The current practice, as recommended as the default setting in some popular MR software packages, is to orientate the SNPs to be all positively associated with the exposure, which however, to our knowledge, has not been fully studied to assess its robustness and potential impact. We use both numerical examples (with both real data and simulated data) and analytical results to demonstrate the practical problem of Egger regression with respect to its heavy dependence on the SNP orientations. Under the assumption that InSIDE holds for some specific (and unknown) coding scheme of the SNPs, we analytically show that other coding schemes would in general lead to the violation of InSIDE. Other related MR and IV regression methods may suffer from the same problem. Caution should be taken when applying Egger regression (and related MR and IV regression methods) in practice.
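The dependence of Egger regression on SNP orientation can be seen in a few lines of code. This is a minimal sketch on simulated summary statistics (the number of SNPs, effect sizes, and standard errors are hypothetical, and the two helper functions are written for illustration rather than taken from any MR package). It compares the inverse-variance weighted (IVW) estimate, a weighted regression through the origin, with the Egger estimate, the same regression with an intercept, under two codings of the same SNPs.

```python
import numpy as np


def ivw(bx, by, se_y):
    """Fixed-effect IVW: weighted regression of by on bx through the origin."""
    w = 1.0 / se_y**2
    return np.sum(w * bx * by) / np.sum(w * bx**2)


def egger(bx, by, se_y):
    """Egger regression: weighted regression of by on bx with an intercept."""
    w = 1.0 / se_y**2
    X = np.column_stack([np.ones_like(bx), bx])
    WX = X * w[:, None]
    intercept, slope = np.linalg.solve(X.T @ WX, WX.T @ by)
    return intercept, slope


rng = np.random.default_rng(0)
m = 50                                 # hypothetical number of SNPs
bx = rng.normal(0.0, 0.1, m)           # SNP-exposure associations
alpha = rng.normal(0.02, 0.02, m)      # direct (pleiotropic) effects
by = 0.3 * bx + alpha                  # SNP-outcome associations, causal effect 0.3
se_y = np.full(m, 0.01)

# Coding A: alleles as simulated.
# Coding B: flip alleles so every SNP is positively associated with the exposure.
sign = np.where(bx < 0, -1.0, 1.0)
bx_b, by_b = sign * bx, sign * by

ivw_a, ivw_b = ivw(bx, by, se_y), ivw(bx_b, by_b, se_y)
egger_a, egger_b = egger(bx, by, se_y), egger(bx_b, by_b, se_y)

print("IVW slope, coding A vs B:  ", ivw_a, ivw_b)
print("Egger slope, coding A vs B:", egger_a[1], egger_b[1])
```

Flipping an allele changes the sign of both associations for that SNP, so the products entering the IVW estimate are unchanged, whereas the intercept term makes the Egger fit sensitive to the coding.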


Subject(s)
Genetic Pleiotropy , Mendelian Randomization Analysis , Causality , Genome-Wide Association Study , Mendelian Randomization Analysis/methods , Polymorphism, Single Nucleotide , Regression Analysis
7.
Genet Epidemiol ; 47(8): 585-599, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37573486

ABSTRACT

We propose structural equation models (SEMs) as a general framework to infer causal networks for metabolites and other complex traits. Traditionally SEMs are used only for individual-level data under the assumption that all instrumental variables (IVs) are valid. To overcome these limitations, we propose both one- and two-sample approaches for causal network inference based on SEMs that can: (1) perform causal analysis and discover causal relationships among multiple traits; (2) account for the possible presence of some invalid IVs; (3) allow for data analysis using only genome-wide association studies (GWAS) summary statistics when individual-level data are not available; (4) consider the possibility of bidirectional relationships between traits. Our method employs a simple stepwise selection to identify invalid IVs, thus avoiding false positives while possibly increasing true discoveries based on two-stage least squares (2SLS). We use both real GWAS data and simulated data to demonstrate the superior performance of our method over the standard 2SLS/SEMs. For real data analysis, our proposed approach is applied to a human blood metabolite GWAS summary data set to uncover putative causal relationships among the metabolites; we also identify some metabolites (putative) causal to Alzheimer's disease (AD), which, along with the inferred causal metabolite network, suggest some possible pathways of metabolites involved in AD.


Subject(s)
Alzheimer Disease , Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Models, Genetic , Phenotype , Alzheimer Disease/genetics
8.
Hum Mol Genet ; 31(14): 2462-2470, 2022 07 21.
Article in English | MEDLINE | ID: mdl-35043938

ABSTRACT

Transcriptome-wide association studies (TWAS) integrate genome-wide association study (GWAS) data with gene expression (GE) data to identify (putative) causal genes for complex traits. There are two stages in TWAS: in Stage 1, a model is built to impute gene expression from genotypes, and in Stage 2, gene-trait association is tested using imputed gene expression. Despite many successes with TWAS, in the current practice, one only assumes a linear relationship between GE and the trait, which however may not hold, leading to loss of power. In this study, we extend the standard TWAS by considering a quadratic effect of GE, in addition to the usual linear effect. We train imputation models for both linear and quadratic gene expression levels in Stage 1, then include both the imputed linear and quadratic expression levels in Stage 2. We applied both the standard TWAS and our approach first to the ADNI gene expression data and the IGAP Alzheimer's disease GWAS summary data, then to the GTEx (V8) gene expression data and the UK Biobank individual-level GWAS data for lipids, followed by validation with different GWAS data, suitable model checking and more robust TWAS methods. In all these applications, the new TWAS approach was able to identify additional genes associated with Alzheimer's disease, LDL and HDL cholesterol levels, suggesting its likely power gains and thus the need to account for potentially nonlinear effects of gene expression on complex traits.
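The two-stage procedure with a quadratic term described in this abstract can be sketched as follows. This is an illustration on simulated data only: the sample sizes, number of SNPs, effect sizes, and the use of plain least squares for the Stage 1 imputation models are assumptions made here, not details taken from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ref, n_gwas, p = 500, 2000, 5   # hypothetical sample sizes and number of cis-SNPs


def fit_ols(X, y):
    """Least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


# Reference panel with genotypes and measured gene expression (GE)
snp_weights = rng.normal(0.0, 0.5, p)
G_ref = rng.binomial(2, 0.3, size=(n_ref, p)).astype(float)
ge_ref = G_ref @ snp_weights + rng.normal(size=n_ref)

# Stage 1: imputation models for GE and for GE squared
coef_linear = fit_ols(G_ref, ge_ref)
coef_quadratic = fit_ols(G_ref, ge_ref**2)

# GWAS sample: genotypes and trait only (true GE is used just to simulate the trait)
G_gwas = rng.binomial(2, 0.3, size=(n_gwas, p)).astype(float)
ge_true = G_gwas @ snp_weights + rng.normal(size=n_gwas)
trait = 0.5 * ge_true + 0.3 * ge_true**2 + rng.normal(size=n_gwas)

# Stage 2: regress the trait on both imputed terms
ge_hat = predict(coef_linear, G_gwas)
ge_sq_hat = predict(coef_quadratic, G_gwas)
stage2 = fit_ols(np.column_stack([ge_hat, ge_sq_hat]), trait)

print("Stage 2 coefficients [intercept, linear, quadratic]:", stage2)
```

A standard linear TWAS corresponds to dropping `ge_sq_hat` from Stage 2; testing the quadratic coefficient is what allows a nonlinear gene-trait relationship to be detected.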


Subject(s)
Alzheimer Disease , Transcriptome , Alzheimer Disease/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Humans , Multifactorial Inheritance , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics , Transcriptome/genetics
9.
PLoS Genet ; 17(11): e1009922, 2021 11.
Article in English | MEDLINE | ID: mdl-34793444

ABSTRACT

With the increasing availability of large-scale GWAS summary data on various traits, Mendelian randomization (MR) has become commonly used to infer causality between a pair of traits, an exposure and an outcome. It depends on using genetic variants, typically SNPs, as instrumental variables (IVs). The inverse-variance weighted (IVW) method (with a fixed-effect meta-analysis model) is most powerful when all IVs are valid; however, when horizontal pleiotropy is present, it may lead to biased inference. On the other hand, Egger regression is one of the most widely used methods robust to (uncorrelated) pleiotropy, but it suffers from loss of power. We propose a two-component mixture of regressions to combine and thus take advantage of both IVW and Egger regression; it is often both more efficient (i.e. higher powered) and more robust to pleiotropy (i.e. controlling type I error) than either IVW or Egger regression alone by accounting for both valid and invalid IVs respectively. We propose a model averaging approach and a novel data perturbation scheme to account for uncertainties in model/IV selection, leading to more robust statistical inference for finite samples. Through extensive simulations and applications to the GWAS summary data of 48 risk factor-disease pairs and 63 genetically uncorrelated trait pairs, we showcase that our proposed methods could often control type I error better while achieving much higher power than IVW and Egger regression (and sometimes than several other new/popular MR methods). We expect that our proposed methods will be a useful addition to the toolbox of Mendelian randomization for causal inference.


Subject(s)
Genetic Predisposition to Disease , Genome-Wide Association Study/statistics & numerical data , Mendelian Randomization Analysis/statistics & numerical data , Polymorphism, Single Nucleotide/genetics , Genetic Pleiotropy/genetics , Humans , Regression Analysis
10.
Diabetes Care ; 47(6): 1042-1047, 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38652672

ABSTRACT

OBJECTIVE: To identify genetic risk factors for incident cardiovascular disease (CVD) among people with type 2 diabetes (T2D). RESEARCH DESIGN AND METHODS: We conducted a multiancestry time-to-event genome-wide association study for incident CVD among people with T2D. We also tested 204 known coronary artery disease (CAD) variants for association with incident CVD. RESULTS: Among 49,230 participants with T2D, 8,956 had incident CVD events (event rate 18.2%). We identified three novel genetic loci for incident CVD: rs147138607 (near CACNA1E/ZNF648, hazard ratio [HR] 1.23, P = 3.6 × 10⁻⁹), rs77142250 (near HS3ST1, HR 1.89, P = 9.9 × 10⁻⁹), and rs335407 (near TFB1M/NOX3, HR 1.25, P = 1.5 × 10⁻⁸). Among 204 known CAD loci, 5 were associated with incident CVD in T2D (multiple comparison-adjusted P < 0.00024, 0.05/204). A standardized polygenic score of these 204 variants was associated with incident CVD with HR 1.14 (P = 1.0 × 10⁻¹⁶). CONCLUSIONS: The data point to novel and known genomic regions associated with incident CVD among individuals with T2D.
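The two derived figures quoted in this abstract, the event rate and the Bonferroni-adjusted threshold, follow from simple arithmetic on the reported counts. A short check, using only numbers stated in the abstract (the variable names are chosen here for illustration):

```python
participants = 49_230       # participants with T2D
events = 8_956              # incident CVD events
n_tests = 204               # known CAD variants tested
alpha = 0.05                # family-wise significance level

event_rate_percent = 100 * events / participants
bonferroni_threshold = alpha / n_tests

print(f"event rate: {event_rate_percent:.1f}%")
print(f"Bonferroni threshold: {bonferroni_threshold:.6f}")
```

The threshold 0.05/204 lies just above 0.00024, which is consistent with the cut-off quoted in the abstract.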


Subject(s)
Cardiovascular Diseases , Diabetes Mellitus, Type 2 , Genome-Wide Association Study , Humans , Diabetes Mellitus, Type 2/genetics , Diabetes Mellitus, Type 2/epidemiology , Diabetes Mellitus, Type 2/complications , Cardiovascular Diseases/genetics , Cardiovascular Diseases/epidemiology , Female , Male , Middle Aged , Aged , Polymorphism, Single Nucleotide
11.
HGG Adv ; 4(3): 100197, 2023 07 13.
Article in English | MEDLINE | ID: mdl-37181332

ABSTRACT

Genome-wide association study (GWAS) summary data have become extremely useful in daily routine data analysis, largely facilitating new methods development and new applications. However, a severe limitation with the current use of GWAS summary data is its exclusive restriction to only linear single nucleotide polymorphism (SNP)-trait association analyses. To further expand the use of GWAS summary data, along with a large sample of individual-level genotypes, we propose a nonparametric method for large-scale imputation of the genetic component of the trait for the given genotypes. The imputed individual-level trait values, along with the individual-level genotypes, make it possible to conduct any analysis as with individual-level GWAS data, including nonlinear SNP-trait associations and predictions. We use the UK Biobank data to highlight the usefulness and effectiveness of the proposed method in three applications that currently cannot be done with only GWAS summary data (for SNP-trait associations): marginal SNP-trait association analysis under non-additive genetic models, detection of SNP-SNP interactions, and genetic prediction of a trait using a nonlinear model of SNPs.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Humans , Genome-Wide Association Study/methods , Genotype , Phenotype , Polymorphism, Single Nucleotide/genetics
12.
Wellcome Open Res ; 8: 449, 2023.
Article in English | MEDLINE | ID: mdl-37915953

ABSTRACT

The MendelianRandomization package is a software package written for the R software environment that implements methods for Mendelian randomization based on summarized data. In this manuscript, we describe functions that have been added or edited in the package since version 0.5.0, when we last described the package and its contents. The main additions to the package since that time are: 1) new robust methods for performing Mendelian randomization, particularly in the cases of bias from weak instruments and/or winner's curse, and pleiotropic variants, 2) methods for performing Mendelian randomization with correlated variants using dimension reduction to summarize large numbers of highly correlated variants into a limited set of principal components, 3) functions for calculating first-stage F statistics, representing instrument strength, in both univariable and multivariable contexts, and with uncorrelated and correlated genetic variants. We also discuss some pragmatic issues relating to the use of correlated variants in Mendelian randomization.

13.
medRxiv ; 2023 Jul 28.
Article in English | MEDLINE | ID: mdl-37546893

ABSTRACT

BACKGROUND: Type 2 diabetes mellitus (T2D) confers a two- to three-fold increased risk of cardiovascular disease (CVD). However, the mechanisms underlying increased CVD risk among people with T2D are only partially understood. We hypothesized that a genetic association study among people with T2D at risk for developing incident cardiovascular complications could provide insights into molecular genetic aspects underlying CVD. METHODS: From 16 studies of the Cohorts for Heart & Aging Research in Genomic Epidemiology (CHARGE) Consortium, we conducted a multi-ancestry time-to-event genome-wide association study (GWAS) for incident CVD among people with T2D using Cox proportional hazards models. Incident CVD was defined based on a composite of coronary artery disease (CAD), stroke, and cardiovascular death that occurred at least one year after the diagnosis of T2D. Cohort-level estimated effect sizes were combined using inverse variance weighted fixed effects meta-analysis. We also tested 204 known CAD variants for association with incident CVD among patients with T2D. RESULTS: A total of 49,230 participants with T2D were included in the analyses (31,118 European ancestries and 18,112 non-European ancestries) which consisted of 8,956 incident CVD cases over a range of mean follow-up duration between 3.2 and 33.7 years (event rate 18.2%). We identified three novel, distinct genetic loci for incident CVD among individuals with T2D that reached the threshold for genome-wide significance (P<5.0×10⁻⁸): rs147138607 (intergenic variant between CACNA1E and ZNF648) with a hazard ratio (HR) 1.23, 95% confidence interval (CI) 1.15 - 1.32, P=3.6×10⁻⁹, rs11444867 (intergenic variant near HS3ST1) with HR 1.89, 95% CI 1.52 - 2.35, P=9.9×10⁻⁹, and rs335407 (intergenic variant between TFB1M and NOX3) HR 1.25, 95% CI 1.16 - 1.35, P=1.5×10⁻⁸. Among 204 known CAD loci, 32 were associated with incident CVD in people with T2D with P<0.05, and 5 were significant after Bonferroni correction (P<0.00024, 0.05/204). A polygenic score of these 204 variants was significantly associated with incident CVD with HR 1.14 (95% CI 1.12 - 1.16) per 1 standard deviation increase (P=1.0×10⁻¹⁶). CONCLUSIONS: The data point to novel and known genomic regions associated with incident CVD among individuals with T2D.

14.
Genetics ; 220(4), 2022 04 04.
Article in English | MEDLINE | ID: mdl-35106569

ABSTRACT

Single nucleotide polymorphism heritability of a trait is measured as the proportion of total variance explained by the additive effects of genome-wide single nucleotide polymorphisms. Linear mixed models are routinely used to estimate single nucleotide polymorphism heritability for many complex traits, which requires estimation of a genetic relationship matrix among individuals. Heritability is usually estimated by the restricted maximum likelihood or method of moments approaches such as Haseman-Elston regression. The common practice of accounting for population substructure is to adjust for the top few principal components of the genetic relationship matrix as covariates in the linear mixed model. This can get computationally very intensive on large biobank-scale datasets. Here, we propose a method of moments approach for estimating single nucleotide polymorphism heritability in the presence of population substructure. Our proposed method is computationally scalable on biobank datasets and gives an asymptotically unbiased estimate of heritability in the presence of discrete substructures. It introduces the adjustments for population stratification in a second-order estimating equation. It allows these substructures to vary in their single nucleotide polymorphism allele frequencies and in their trait distributions (means and variances) while the heritability is assumed to be the same across these substructures. Through extensive simulation studies and the application on 7 quantitative traits in the UK Biobank cohort, we demonstrate that our proposed method performs well in the presence of population substructure and is much more computationally efficient than existing approaches.
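For readers unfamiliar with the method-of-moments baseline mentioned in this abstract, the following is a minimal sketch of standard Haseman-Elston regression on simulated, unstructured data. It is not the substructure-adjusted estimator proposed in the article, and the sample size, number of SNPs, and true heritability are hypothetical. The estimator regresses products of standardized phenotypes for pairs of individuals on the corresponding off-diagonal entries of the genetic relationship matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, h2_true = 400, 1000, 0.5   # hypothetical individuals, SNPs, heritability

# Simulate genotypes and standardize each SNP
freq = rng.uniform(0.1, 0.9, m)
G = rng.binomial(2, freq, size=(n, m)).astype(float)
Z = (G - G.mean(axis=0)) / G.std(axis=0)

# Simulate a trait with additive SNP effects, then standardize it
beta = rng.normal(0.0, np.sqrt(h2_true / m), m)
y = Z @ beta + rng.normal(0.0, np.sqrt(1.0 - h2_true), n)
y = (y - y.mean()) / y.std()

# Genetic relationship matrix (GRM)
K = Z @ Z.T / m

# Haseman-Elston regression through the origin over all pairs i < j
upper = np.triu_indices(n, k=1)
k_pairs = K[upper]
y_products = np.outer(y, y)[upper]
h2_hat = np.sum(k_pairs * y_products) / np.sum(k_pairs**2)

print(f"Haseman-Elston heritability estimate: {h2_hat:.3f}")
```

With a sample this small the estimate is noisy; the sketch is meant to show the form of the estimator rather than its accuracy.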


Subject(s)
Models, Genetic , Polymorphism, Single Nucleotide , Quantitative Trait, Heritable , Databases, Factual , Genome , Genome-Wide Association Study/methods , Humans , Multifactorial Inheritance , Phenotype
15.
HGG Adv ; 3(4): 100144, 2022 Oct 13.
Article in English | MEDLINE | ID: mdl-36217425

ABSTRACT

Genome-wide association studies (GWASs) have successfully identified many genetic variants and risk loci for complex traits and common diseases in the last 15 years. However, these identified variants, in general, can explain only a small to moderate proportion of the heritability, thus the task of improving GWAS power for more discoveries remains both critical and challenging. In addition to the usual but costly or even infeasible route of continuing to increase the sample size, many approaches have been proposed to incorporate functional annotations to prioritize SNPs but with only limited success. Here, by taking advantage of increasing availability of various types of omics data, we propose a new and orthogonal approach by integrating individual-level omics data with GWASs. The premise is that since omics data reflect both genetic and environmental (such as diet and other lifestyle) effects on individuals, they can be used to account for (otherwise unexplained) variations among individuals in GWAS analysis, leading to more precise/efficient estimation and thus higher power. As a concrete example, we propose boosting GWAS power by adjusting for metabolomics data in GWAS analysis. We applied the method to the UK Biobank subcohort of n = 90,000 individuals with both GWAS and metabolomics data. The analysis of 7 quantitative traits and one binary trait demonstrated clear power gains. For example, the new method (after adjusting for metabolomics data) identified 13 new loci for diastolic blood pressure that were all missed by the standard GWAS, and most or all of the 13 new signals were validated in two much larger GWAS datasets (n = 340,000 and 700,000); the improved estimation efficiency was equivalent to a 38.4% gain of GWAS sample size. The proposed method is both simple and promising and broadly applicable to integrating GWASs with other omics data.
