ABSTRACT
Understanding the genetic basis of human diseases and traits is dependent on the identification and accurate genotyping of genetic variants. Deep whole-genome sequencing (WGS), the gold standard technology for SNP and indel identification and genotyping, remains very expensive for most large studies. Here, we quantify the extent to which array genotyping followed by genotype imputation can approximate WGS in studies of individuals of African, Hispanic/Latino, and European ancestry in the US and of Finnish ancestry in Finland (a population isolate). For each study, we performed genotype imputation by using the genetic variants present on the Illumina Core, OmniExpress, MEGA, and Omni 2.5M arrays with the 1000G, HRC, and TOPMed imputation reference panels. Using the Omni 2.5M array and the TOPMed panel, ≥90% of bi-allelic single-nucleotide variants (SNVs) are well imputed (r² > 0.8) down to minor-allele frequencies (MAFs) of 0.14% in African, 0.11% in Hispanic/Latino, 0.35% in European, and 0.85% in Finnish ancestries. There was little difference in TOPMed-based imputation quality among the arrays with >700k variants. Individual-level imputation quality varied widely between and within the three US studies. Imputation quality also varied across genomic regions, producing regions where even common (MAF > 5%) variants were consistently not well imputed across ancestries. The extent to which array genotyping and imputation can approximate WGS therefore depends on reference panel, genotype array, sample ancestry, and genomic location. Imputation quality by variant or genomic region can be queried with our new tool, RsqBrowser, now deployed on the Michigan Imputation Server.
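The "well imputed" criterion above can be illustrated with a short sketch. It assumes that the quality measure is the squared Pearson correlation between sequenced genotypes (coded 0/1/2) and imputed dosages (0 to 2); the abstract does not spell out the formula, and the function names and example values below are hypothetical.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Assumption: this is the r^2 meant in the abstract; the abstract itself
    does not define it.
    """
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        # Correlation is undefined for a monomorphic vector; report 0 here.
        return 0.0
    return cov * cov / (var_t * var_d)


def is_well_imputed(r2, threshold=0.8):
    """Apply the abstract's cutoff (r^2 strictly greater than 0.8)."""
    return r2 > threshold


# Hypothetical data for eight individuals at one variant.
sequenced = [0, 0, 1, 0, 2, 0, 1, 0]
imputed = [0.1, 0.0, 0.9, 0.2, 1.8, 0.0, 1.1, 0.1]
r2 = imputation_r2(sequenced, imputed)
print(f"r2 = {r2:.3f}, well imputed: {is_well_imputed(r2)}")
```

Variants at lower minor-allele frequency have fewer non-reference genotypes, so small dosage errors weigh more heavily on this correlation, which is consistent with the MAF-dependent thresholds reported above.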
Subject(s)
High-Throughput Nucleotide Sequencing , Polymorphism, Single Nucleotide , Gene Frequency/genetics , Genome-Wide Association Study , Genotype , Humans , Polymorphism, Single Nucleotide/genetics , Whole Genome Sequencing
ABSTRACT
Transcriptomics data have been integrated with genome-wide association studies (GWASs) to help understand disease/trait molecular mechanisms. The utility of metabolomics, integrated with transcriptomics and disease GWASs, to understand molecular mechanisms for metabolite levels or diseases has not been thoroughly evaluated. We performed probabilistic transcriptome-wide association and locus-level colocalization analyses to integrate transcriptomics results for 49 tissues in 706 individuals from the GTEx project, metabolomics results for 1,391 plasma metabolites in 6,136 Finnish men from the METSIM study, and GWAS results for 2,861 disease traits in 260,405 Finnish individuals from the FinnGen study. We found that genetic variants that regulate metabolite levels were more likely to influence gene expression and disease risk than variants that do not. Integrating transcriptomics with metabolomics results prioritized 397 genes for 521 metabolites, including 496 previously identified gene-metabolite pairs with strong functional connections, and suggested that 33.3% of such gene-metabolite pairs shared the same causal variants with genetic associations of gene expression. Integrating transcriptomics and metabolomics individually with FinnGen GWAS results identified 1,597 genes for 790 disease traits. Integrating transcriptomics and metabolomics jointly with FinnGen GWAS results helped pinpoint metabolic pathways from genes to diseases. We identified putative causal effects of UGT1A1/UGT1A4 expression on gallbladder disorders through regulating plasma (E,E)-bilirubin levels, of SLC22A5 expression on nasal polyps and plasma carnitine levels through distinct pathways, and of LIPC expression on age-related macular degeneration through glycerophospholipid metabolic pathways. Our study highlights the power of integrating multiple sets of molecular traits and GWAS results to deepen understanding of disease pathophysiology.
Subject(s)
Genome-Wide Association Study , Transcriptome , Bilirubin , Carnitine , Glycerophospholipids , Humans , Male , Metabolomics , Quantitative Trait Loci/genetics , Solute Carrier Family 22 Member 5/genetics , Transcriptome/genetics
ABSTRACT
Exome-sequencing studies have generally been underpowered to identify deleterious alleles with a large effect on complex traits as such alleles are mostly rare. Because the population of northern and eastern Finland has expanded considerably and in isolation following a series of bottlenecks, individuals of these populations have numerous deleterious alleles at a relatively high frequency. Here, using exome sequencing of nearly 20,000 individuals from these regions, we investigate the role of rare coding variants in clinically relevant quantitative cardiometabolic traits. Exome-wide association studies for 64 quantitative traits identified 26 newly associated deleterious alleles. Of these 26 alleles, 19 are either unique to or more than 20 times more frequent in Finnish individuals than in other Europeans and show geographical clustering comparable to Mendelian disease mutations that are characteristic of the Finnish population. We estimate that sequencing studies of populations without this unique history would require hundreds of thousands to millions of participants to achieve comparable association power.
Subject(s)
Exome Sequencing , Genetic Association Studies/methods , Genetic Predisposition to Disease/genetics , Genetic Variation/genetics , Quantitative Trait Loci/genetics , Alleles , Cholesterol, HDL/genetics , Cluster Analysis , Endpoint Determination , Finland , Geographic Mapping , Humans , Multifactorial Inheritance/genetics , Reproducibility of Results
ABSTRACT
Tests of association between a phenotype and a set of genes in a biological pathway can provide insights into the genetic architecture of complex phenotypes beyond those obtained from single-variant or single-gene association analysis. However, most existing gene set tests have limited power to detect gene set-phenotype association when a small fraction of the genes are associated with the phenotype and cannot identify the potentially "active" genes that might drive a gene set-based association. To address these issues, we have developed Gene set analysis Association Using Sparse Signals (GAUSS), a method for gene set association analysis that requires only GWAS summary statistics. For each significantly associated gene set, GAUSS identifies the subset of genes that have the maximal evidence of association and can best account for the gene set association. Using pre-computed correlation structure among test statistics from a reference panel, our p value calculation is substantially faster than other permutation- or simulation-based approaches. In simulations with varying proportions of causal genes, we find that GAUSS effectively controls type 1 error rate and has greater power than several existing methods, particularly when a small proportion of genes account for the gene set signal. Using GAUSS, we analyzed UK Biobank GWAS summary statistics for 10,679 gene sets and 1,403 binary phenotypes. We found that GAUSS is scalable and identified 13,466 phenotype and gene set association pairs. Within these gene sets, we identify an average of 17.2 (max = 405) genes that underlie these gene set associations.
Subject(s)
Biological Specimen Banks , Data Interpretation, Statistical , Databases, Genetic , Datasets as Topic , Genome-Wide Association Study/methods , Phenotype , ATP-Binding Cassette Transporters/genetics , Computer Simulation , Gene Expression/genetics , Humans , Research Design , Time Factors , United Kingdom , Web Browser
ABSTRACT
Identifying the molecular mechanisms by which genome-wide association study (GWAS) loci influence traits remains challenging. Chromatin accessibility quantitative trait loci (caQTLs) help identify GWAS loci that may alter GWAS traits by modulating chromatin structure, but caQTLs have been identified in a limited set of human tissues. Here we mapped caQTLs in 20 human liver tissue samples and identified 3,123 caQTLs. The caQTL variants are enriched in liver tissue promoter and enhancer states and frequently disrupt binding motifs of transcription factors expressed in liver. We predicted target genes for 861 caQTL peaks using proximity, chromatin interactions, correlation with promoter accessibility or gene expression, and colocalization with expression QTLs. Using GWAS signals for 19 liver function and/or cardiometabolic traits, we identified 110 colocalized caQTLs and GWAS signals, 56 of which contained a predicted caPeak target gene. At the LITAF LDL-cholesterol GWAS locus, we validated that a caQTL variant showed allelic differences in protein binding and transcriptional activity. These caQTLs contribute to the epigenomic characterization of human liver and help identify molecular mechanisms and genes at GWAS loci.
Subject(s)
Chromatin/metabolism , Liver/metabolism , Quantitative Trait Loci , Amino Acid Motifs , Binding Sites , Chromatin Assembly and Disassembly , Enhancer Elements, Genetic , Genetic Variation , Genome-Wide Association Study , Humans , Promoter Regions, Genetic , Protein Binding , Transcription Factors/chemistry , Transcription Factors/metabolism , Transcriptome
ABSTRACT
Detecting and estimating DNA sample contamination are important steps to ensure high-quality genotype calls and reliable downstream analysis. Existing methods rely on population allele frequency information for accurate estimation of contamination rates. Correctly specifying population allele frequencies for each individual at an early stage of sequence analysis is impractical or even impossible for large-scale sequencing centers that simultaneously process samples from multiple studies across diverse populations. On the other hand, incorrectly specified allele frequencies may result in substantial bias in estimated contamination rates. For example, we observed that existing methods often fail to identify samples with 10% contamination at a typical 3% contamination exclusion threshold when genetic ancestry is misspecified. Such an incomplete screening of contaminated samples substantially inflates the estimated rate of genotyping errors even in deeply sequenced genomes and exomes. We propose a robust statistical method that accurately estimates DNA contamination and is agnostic to genetic ancestry of the intended or contaminating sample. Our method integrates the estimation of genetic ancestry and DNA contamination in a unified likelihood framework by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. Our method can also be used for estimating genetic ancestries, similar to LASER or TRACE, but simultaneously accounting for potential contamination. We demonstrate that our method robustly estimates contamination rates and genetic ancestries across populations and contamination scenarios. We further demonstrate that, in the presence of contamination, genetic ancestry inference can be substantially biased with existing methods that ignore contamination, while our method corrects for such biases.
Subject(s)
DNA Contamination , DNA/genetics , Genotype , Genotyping Techniques/standards , Alleles , Exome/genetics , Gene Frequency/genetics , Genetics, Population , Humans , Polymorphism, Single Nucleotide/genetics , Sequence Analysis, DNA
ABSTRACT
Loci identified in genome-wide association studies (GWAS) can include multiple distinct association signals. We sought to identify the molecular basis of multiple association signals for adiponectin, a hormone involved in glucose regulation secreted almost exclusively from adipose tissue, identified in the Metabolic Syndrome in Men (METSIM) study. With GWAS data for 9,262 men, four loci were significantly associated with adiponectin: ADIPOQ, CDH13, IRS1, and PBRM1. We performed stepwise conditional analyses to identify distinct association signals, a subset of which are also nearly independent (lead variant pairwise r² < 0.01). Two loci exhibited allelic heterogeneity, ADIPOQ and CDH13. Of seven association signals at the ADIPOQ locus, two signals colocalized with adipose tissue expression quantitative trait loci (eQTLs) for three transcripts: trait-increasing alleles at one signal were associated with increased ADIPOQ and LINC02043, while trait-increasing alleles at the other signal were associated with decreased ADIPOQ-AS1. In reporter assays, adiponectin-increasing alleles at two signals showed corresponding directions of effect on transcriptional activity. Putative mechanisms for the seven ADIPOQ signals include a missense variant (ADIPOQ G90S), a splice variant, a promoter variant, and four enhancer variants. Of two association signals at the CDH13 locus, the first signal consisted of promoter variants, including the lead adipose tissue eQTL variant for CDH13, while a second signal included a distal intron 1 enhancer variant that showed ~2-fold allelic differences in transcriptional reporter activity. Fine-mapping and experimental validation demonstrated that multiple, distinct association signals at these loci can influence multiple transcripts through multiple molecular mechanisms.
Subject(s)
Adiponectin/genetics , Adiponectin/metabolism , Adipose Tissue/metabolism , Alleles , Cadherins/genetics , Cadherins/metabolism , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Gene Frequency/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Humans , Insulin Receptor Substrate Proteins/genetics , Insulin Receptor Substrate Proteins/metabolism , Male , Metabolic Syndrome/genetics , Phenotype , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics , Regulatory Sequences, Nucleic Acid , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
Genome-wide association studies (GWASs) have identified thousands of genetic loci associated with cardiometabolic traits including type 2 diabetes (T2D), lipid levels, body fat distribution, and adiposity, although most causal genes remain unknown. We used subcutaneous adipose tissue RNA-seq data from 434 Finnish men from the METSIM study to identify 9,687 primary and 2,785 secondary cis-expression quantitative trait loci (eQTL; <1 Mb from TSS, FDR < 1%). Compared to primary eQTL signals, secondary eQTL signals were located further from transcription start sites, had smaller effect sizes, and were less enriched in adipose tissue regulatory elements. Among 2,843 cardiometabolic GWAS signals, 262 colocalized by LD and conditional analysis with 318 transcripts as primary and conditionally distinct secondary cis-eQTLs, including some across ancestries. Of cardiometabolic traits examined for adipose tissue eQTL colocalizations, waist-hip ratio (WHR) and circulating lipid traits had the highest percentage of colocalized eQTLs (15% and 14%, respectively). Among alleles associated with increased cardiometabolic GWAS risk, approximately half (53%) were associated with decreased gene expression level. Mediation analyses of colocalized genes and cardiometabolic traits within the 434 individuals provided further evidence that gene expression influences variant-trait associations. These results identify hundreds of candidate genes that may act in adipose tissue to influence cardiometabolic traits.
Subject(s)
Adipose Tissue/metabolism , Diabetes Mellitus, Type 2/genetics , Gene Expression , Obesity/genetics , Alleles , Body Mass Index , Finland , Genome-Wide Association Study , Humans , Male , Quantitative Trait Loci , Waist-Hip Ratio
ABSTRACT
BACKGROUND: COVID-19 severity varies widely. Although some demographic and cardio-metabolic factors, including age and obesity, are associated with increasing risk of severe illness, the underlying mechanism(s) are uncertain. SUBJECTS/METHODS: In a meta-analysis of three independent studies of 1471 participants in total, we investigated phenotypic and genetic factors associated with subcutaneous adipose tissue expression of Angiotensin I Converting Enzyme 2 (ACE2), measured by RNA-Seq, which acts as a receptor for SARS-CoV-2 cellular entry. RESULTS: Lower adipose tissue ACE2 expression was associated with multiple adverse cardio-metabolic health indices, including type 2 diabetes (T2D) (P = 9.14 × 10⁻⁶), obesity status (P = 4.81 × 10⁻⁵), higher serum fasting insulin (P = 5.32 × 10⁻⁴), BMI (P = 3.94 × 10⁻⁴), and lower serum HDL levels (P = 1.92 × 10⁻⁷). ACE2 expression was also associated with estimated proportions of cell types in adipose tissue: lower expression was associated with a lower proportion of microvascular endothelial cells (P = 4.25 × 10⁻⁴) and higher proportion of macrophages (P = 2.74 × 10⁻⁵). Despite an estimated heritability of 32%, we did not identify any proximal or distal expression quantitative trait loci (eQTLs) associated with adipose tissue ACE2 expression. CONCLUSIONS: Our results demonstrate that individuals with cardio-metabolic features known to increase risk of severe COVID-19 have lower background ACE2 levels in this highly relevant tissue. Reduced adipose tissue ACE2 expression may contribute to the pathophysiology of cardio-metabolic diseases, as well as the associated increased risk of severe COVID-19.
Subject(s)
Adipose Tissue , Angiotensin-Converting Enzyme 2 , COVID-19 , Adipose Tissue/metabolism , Angiotensin-Converting Enzyme 2/genetics , Angiotensin-Converting Enzyme 2/metabolism , COVID-19/complications , COVID-19/genetics , Cardiometabolic Risk Factors , Diabetes Mellitus, Type 2/genetics , Endothelial Cells/metabolism , Humans , Obesity , SARS-CoV-2
ABSTRACT
We integrate comeasured gene expression and DNA methylation (DNAme) in 265 human skeletal muscle biopsies from the FUSION study with >7 million genetic variants and eight physiological traits: height, waist, weight, waist-hip ratio, body mass index, fasting serum insulin, fasting plasma glucose, and type 2 diabetes. We find hundreds of genes and DNAme sites associated with fasting insulin, waist, and body mass index, as well as thousands of DNAme sites associated with gene expression (eQTM). We find that controlling for heterogeneity in tissue/muscle fiber type reduces the number of physiological trait associations, and that long-range eQTMs (>1 Mb) are reduced when controlling for tissue/muscle fiber type or latent factors. We map genetic regulators (quantitative trait loci; QTLs) of expression (eQTLs) and DNAme (mQTLs). Using Mendelian randomization (MR) and mediation techniques, we leverage these genetic maps to predict 213 causal relationships between expression and DNAme, approximately two-thirds of which predict methylation to causally influence expression. We use MR to integrate FUSION mQTLs, FUSION eQTLs, and GTEx eQTLs for 48 tissues with genetic associations for 534 diseases and quantitative traits. We identify hundreds of genes and thousands of DNAme sites that may drive the reported disease/quantitative trait genetic associations. We identify 300 gene expression MR associations that are present in both FUSION and GTEx skeletal muscle and that show stronger evidence of MR association in skeletal muscle than other tissues, which may partially reflect differences in power across tissues. As one example, we find that increased RXRA muscle expression may decrease lean tissue mass.
Subject(s)
DNA Methylation/genetics , Gene Expression/genetics , Muscle, Skeletal , Blood Glucose/analysis , Body Weights and Measures , Diabetes Mellitus, Type 2 , Genome-Wide Association Study/methods , Genomics/methods , Humans , Insulin/analysis , Muscle, Skeletal/chemistry , Muscle, Skeletal/physiology , Quantitative Trait Loci/genetics
ABSTRACT
Integration of genome-wide association study (GWAS) signals with expression quantitative trait loci (eQTL) studies enables identification of candidate genes. However, evaluating whether nearby signals may share causal variants, termed colocalization, is affected by the presence of allelic heterogeneity, different variants at the same locus impacting the same phenotype. We previously identified eQTL in subcutaneous adipose tissue from 770 participants in the Metabolic Syndrome in Men (METSIM) study and detected 15 eQTL signals that colocalized with GWAS signals for waist-hip ratio adjusted for body mass index (WHRadjBMI) from the Genetic Investigation of Anthropometric Traits consortium. Here, we reevaluated evidence of colocalization using two approaches, conditional analysis and the Bayesian test COLOC, and show that providing COLOC with approximate conditional summary statistics at multi-signal GWAS loci can reconcile disagreements in colocalization classification between the two tests. Next, we performed conditional analysis on the METSIM subcutaneous adipose tissue data to identify conditionally distinct or secondary eQTL signals. We used the two approaches to test for colocalization with WHRadjBMI GWAS signals and evaluated the differences in colocalization classification between the two tests. Through these analyses, we identified four GWAS signals colocalized with secondary eQTL signals for FAM13A, SSR3, GRB14 and FMO1. Thus, at loci with multiple eQTL and/or GWAS signals, analyzing each signal independently enabled additional candidate genes to be identified.
Subject(s)
Adipose Tissue/physiology , Body Fat Distribution , Genome-Wide Association Study/methods , Metabolic Syndrome/genetics , Quantitative Trait Loci , Adult , Bayes Theorem , Body Mass Index , Female , Genetic Predisposition to Disease , Humans , Linkage Disequilibrium , Male , Phenotype , Polymorphism, Single Nucleotide , Subcutaneous Fat/metabolism , Waist-Hip Ratio/methods
ABSTRACT
A major challenge in evaluating the contribution of rare variants to complex disease is identifying enough copies of the rare alleles to permit informative statistical analysis. To investigate the contribution of rare variants to the risk of type 2 diabetes (T2D) and related traits, we performed deep whole-genome analysis of 1,034 members of 20 large Mexican-American families with high prevalence of T2D. If rare variants of large effect accounted for much of the diabetes risk in these families, our experiment was powered to detect association. Using gene expression data on 21,677 transcripts for 643 pedigree members, we identified evidence for large-effect rare-variant cis-expression quantitative trait loci that could not be detected in population studies, validating our approach. However, we did not identify any rare variants of large effect associated with T2D, or the related traits of fasting glucose and insulin, suggesting that large-effect rare variants account for only a modest fraction of the genetic risk of these traits in this sample of families. Reliable identification of large-effect rare variants will require larger samples of extended pedigrees or different study designs that further enrich for such variants.
Subject(s)
Diabetes Mellitus, Type 2/genetics , Genetic Predisposition to Disease/genetics , Genetic Variation , Mexican Americans/genetics , Diabetes Mellitus, Type 2/ethnology , Diabetes Mellitus, Type 2/pathology , Family Health , Female , Gene Frequency , Genetic Predisposition to Disease/ethnology , Genome-Wide Association Study/methods , Genotype , Humans , Male , Pedigree , Phenotype , Quantitative Trait Loci/genetics , Whole Genome Sequencing/methods
ABSTRACT
Subcutaneous adipose tissue stores excess lipids and maintains energy balance. We performed expression quantitative trait locus (eQTL) analyses by using abdominal subcutaneous adipose tissue of 770 extensively phenotyped participants of the METSIM study. We identified cis-eQTLs for 12,400 genes at a 1% false-discovery rate. Among approximately 680 known genome-wide association study (GWAS) loci for cardio-metabolic traits, we identified 140 coincident cis-eQTLs at 109 GWAS loci, including 93 eQTLs not previously described. At 49 of these 140 eQTLs, gene expression was nominally associated (p < 0.05) with levels of the GWAS trait. The size of our dataset enabled identification of five loci associated (p < 5 × 10⁻⁸) with at least five genes located >5 Mb away. These trans-eQTL signals confirmed and extended the previously reported KLF14-mediated network to 55 target genes, validated the CIITA regulation of class II MHC genes, and identified ZNF800 as a candidate master regulator. Finally, we observed similar expression-clinical trait correlations of genes associated with GWAS loci in both humans and a panel of genetically diverse mice. These results provide candidate genes for further investigation of their potential roles in adipose biology and in regulating cardio-metabolic traits.
Subject(s)
Cardiovascular Diseases/genetics , Gene Expression Regulation , Metabolic Syndrome/genetics , Quantitative Trait Loci , Subcutaneous Fat/metabolism , Aged , Animals , Databases, Genetic , Gene Expression Profiling , Genome-Wide Association Study , Genotyping Techniques , Humans , Male , Mice , Middle Aged , Nuclear Proteins/genetics , Nuclear Proteins/metabolism , Phenotype , Reproducibility of Results , Trans-Activators/genetics , Trans-Activators/metabolism
ABSTRACT
Genome-wide association studies (GWAS) have identified >100 independent SNPs that modulate the risk of type 2 diabetes (T2D) and related traits. However, the pathogenic mechanisms of most of these SNPs remain elusive. Here, we examined genomic, epigenomic, and transcriptomic profiles in human pancreatic islets to understand the links between genetic variation, chromatin landscape, and gene expression in the context of T2D. We first integrated genome and transcriptome variation across 112 islet samples to produce dense cis-expression quantitative trait loci (cis-eQTL) maps. Additional integration with chromatin-state maps for islets and other diverse tissue types revealed that cis-eQTLs for islet-specific genes are specifically and significantly enriched in islet stretch enhancers. High-resolution chromatin accessibility profiling using assay for transposase-accessible chromatin sequencing (ATAC-seq) in two islet samples enabled us to identify specific transcription factor (TF) footprints embedded in active regulatory elements, which are highly enriched for islet cis-eQTL. Aggregate allelic bias signatures in TF footprints enabled us to reconstruct TF binding affinities de novo from genetic data, which supports the high quality of the TF footprint predictions. Interestingly, we found that T2D GWAS loci were strikingly and specifically enriched in islet Regulatory Factor X (RFX) footprints. Remarkably, within and across independent loci, T2D risk alleles that overlap with RFX footprints uniformly disrupt the RFX motifs at high-information content positions. Together, these results suggest that common regulatory variations have shaped islet TF footprints and the transcriptome and that a confluent RFX regulatory grammar plays a significant role in the genetic component of T2D predisposition.
Subject(s)
Diabetes Mellitus, Type 2/genetics , Genetic Predisposition to Disease , Genome, Human , Islets of Langerhans/metabolism , Quantitative Trait Loci , Transcriptome , Alleles , Base Sequence , Binding Sites , Chromatin/chemistry , Chromatin/metabolism , Diabetes Mellitus, Type 2/metabolism , Diabetes Mellitus, Type 2/pathology , Epigenesis, Genetic , Gene Expression Profiling , Genetic Variation , Genome-Wide Association Study , Genomic Imprinting , Humans , Islets of Langerhans/pathology , Polymorphism, Single Nucleotide , Protein Binding , Protein Isoforms/genetics , Protein Isoforms/metabolism , Regulatory Factor X Transcription Factors/genetics , Regulatory Factor X Transcription Factors/metabolism
ABSTRACT
Gene set enrichment testing can enhance the biological interpretation of ChIP-seq data. Here, we develop a method, ChIP-Enrich, for this analysis which empirically adjusts for gene locus length (the length of the gene body and its surrounding non-coding sequence). Adjustment for gene locus length is necessary because it is often positively associated with the presence of one or more peaks and because many biologically defined gene sets have an excess of genes with longer or shorter gene locus lengths. Unlike alternative methods, ChIP-Enrich can account for the wide range of gene locus length-to-peak presence relationships (observed in ENCODE ChIP-seq data sets). We show that ChIP-Enrich has a well-calibrated type I error rate using permuted ENCODE ChIP-seq data sets; in contrast, two commonly used gene set enrichment methods, Fisher's exact test and the binomial test implemented in Genomic Regions Enrichment of Annotations Tool (GREAT), can have highly inflated type I error rates and biases in ranking. We identify DNA-binding proteins, including CTCF, JunD and glucocorticoid receptor α (GRα), that show different enrichment patterns for peaks closer to versus further from transcription start sites. We also identify known and potential new biological functions of GRα. ChIP-Enrich is available as a web interface (http://chip-enrich.med.umich.edu) and Bioconductor package.
Subject(s)
Chromatin Immunoprecipitation/methods , Genes , Genetic Loci , Sequence Analysis, DNA/methods , DNA-Binding Proteins/analysis , Logistic Models , Receptors, Glucocorticoid/analysis
ABSTRACT
MOTIVATION: Functional enrichment testing facilitates the interpretation of Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) data in terms of pathways and other biological contexts. Previous methods developed and used to test for key gene sets affected in ChIP-seq experiments treat peaks as points, and are based on the number of peaks associated with a gene or a binary score for each gene. These approaches work well for transcription factors, but histone modifications often occur over broad domains, and across multiple genes. RESULTS: To incorporate the unique properties of broad domains into functional enrichment testing, we developed Broad-Enrich, a method that uses the proportion of each gene's locus covered by a peak. We show that our method has a well-calibrated false-positive rate, performing well with ChIP-seq data having broad domains compared with alternative approaches. We illustrate Broad-Enrich with 55 ENCODE ChIP-seq datasets using different methods to define gene loci. Broad-Enrich can also be applied to other datasets consisting of broad genomic domains such as copy number variations. AVAILABILITY AND IMPLEMENTATION: http://broad-enrich.med.umich.edu for Web version and R package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Chromatin Immunoprecipitation/methods , Genomics/methods , Histones/metabolism , Cell Line , Genetic Loci , High-Throughput Nucleotide Sequencing , Humans , Logistic Models , Sequence Analysis, DNA , Transcription Factors/metabolism
ABSTRACT
Common diseases such as type 2 diabetes are phenotypically heterogeneous. Obesity is a major risk factor for type 2 diabetes, but patients vary appreciably in body mass index. We hypothesized that the genetic predisposition to the disease may be different in lean (BMI<25 kg/m²) compared to obese cases (BMI≥30 kg/m²). We performed two case-control genome-wide studies using two accepted cut-offs for defining individuals as overweight or obese. We used 2,112 lean type 2 diabetes cases (BMI<25 kg/m²) or 4,123 obese cases (BMI≥30 kg/m²), and 54,412 un-stratified controls. Replication was performed in 2,881 lean cases or 8,702 obese cases, and 18,957 un-stratified controls. To assess the effects of known signals, we tested the individual and combined effects of SNPs representing 36 type 2 diabetes loci. After combining data from discovery and replication datasets, we identified two signals not previously reported in Europeans. A variant (rs8090011) in the LAMA1 gene was associated with type 2 diabetes in lean cases (P = 8.4×10⁻⁹, OR = 1.13 [95% CI 1.09-1.18]), and this association was stronger than that in obese cases (P = 0.04, OR = 1.03 [95% CI 1.00-1.06]). A variant in HMG20A--previously identified in South Asians but not Europeans--was associated with type 2 diabetes in obese cases (P = 1.3×10⁻⁸, OR = 1.11 [95% CI 1.07-1.15]), although this association was not significantly stronger than that in lean cases (P = 0.02, OR = 1.09 [95% CI 1.02-1.17]). For 36 known type 2 diabetes loci, 29 had a larger odds ratio in the lean compared to obese (binomial P = 0.0002). In the lean analysis, we observed a weighted per-risk allele OR = 1.13 [95% CI 1.10-1.17], P = 3.2×10⁻¹⁴. This was larger than the same model fitted in the obese analysis where the OR = 1.06 [95% CI 1.05-1.08], P = 2.2×10⁻¹⁶.
This study provides evidence that stratification of type 2 diabetes cases by BMI may help identify additional risk variants and that lean cases may have a stronger genetic predisposition to type 2 diabetes.
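The binomial comparison of the 36 known loci can be checked with a short sign-test calculation. The sketch below assumes a null probability of 0.5 that a locus has the larger odds ratio in lean cases; the function name `sign_test_p` is illustrative, and the abstract does not say whether the reported test was one- or two-sided, so both values are computed.

```python
from math import comb

def sign_test_p(k, n):
    """Exact binomial (sign test) P values for observing k or more
    successes out of n when each outcome has probability 0.5.
    Returns (one_sided, two_sided)."""
    # Upper tail: P(X >= k) under Binomial(n, 0.5)
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Doubling gives the two-sided value for this symmetric null
    # (valid here because k is above n / 2)
    return upper, min(1.0, 2.0 * upper)

# 29 of the 36 loci had a larger odds ratio in lean cases
one_sided, two_sided = sign_test_p(29, 36)
print(f"one-sided P = {one_sided:.2e}, two-sided P = {two_sided:.2e}")
```

Under these assumptions the one-sided value rounds to the reported 0.0002.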
Subject(s)
Body Mass Index , Diabetes Mellitus, Type 2/genetics , High Mobility Group Proteins/genetics , Laminin/genetics , Obesity/genetics , Aged , Alleles , Asian People/genetics , Case-Control Studies , Diabetes Mellitus, Type 2/complications , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Male , Middle Aged , Obesity/complications , Polymorphism, Single Nucleotide , Risk Factors , White People/genetics
ABSTRACT
Genome-wide association is a promising approach to identify common genetic variants that predispose to human disease. Because of the high cost of genotyping hundreds of thousands of markers on thousands of subjects, genome-wide association studies often follow a staged design in which a proportion (π_samples) of the available samples are genotyped on a large number of markers in stage 1, and a proportion (π_markers) of these markers are later followed up by genotyping them on the remaining samples in stage 2. The standard strategy for analyzing such two-stage data is to view stage 2 as a replication study and focus on findings that reach statistical significance when stage 2 data are considered alone. We demonstrate that the alternative strategy of jointly analyzing the data from both stages almost always results in increased power to detect genetic association, despite the need to use more stringent significance levels, even when effect sizes differ between the two stages. We recommend joint analysis for all two-stage genome-wide association studies, especially when a relatively large proportion of the samples are genotyped in stage 1 (π_samples ≥ 0.30), and a relatively large proportion of markers are selected for follow-up in stage 2 (π_markers ≥ 0.01).
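The contrast between replication-based and joint analysis can be illustrated with a small Monte Carlo sketch. This is not the authors' calculation. It assumes normally distributed test statistics, a hypothetical full-sample non-centrality `mu`, 300,000 markers, a genome-wide significance level of 0.05, and a deliberately conservative joint threshold (a Bonferroni correction over all markers that ignores the stage 1 selection). The function name and defaults are illustrative.

```python
import random
from statistics import NormalDist

def two_stage_power(mu, pi_samples, pi_markers,
                    n_markers=300_000, alpha=0.05,
                    n_sim=100_000, seed=1):
    """Simulated power of replication-based and joint analysis for one
    truly associated marker. mu is the expected z-statistic if all
    samples were genotyped (an assumed effect-size summary)."""
    nd = NormalDist()
    # Stage 1 threshold: a fraction pi_markers of null markers pass
    c1 = -nd.inv_cdf(pi_markers / 2)
    # Replication: Bonferroni over the markers carried into stage 2
    c_rep = -nd.inv_cdf(alpha / (n_markers * pi_markers) / 2)
    # Joint: Bonferroni over all markers (conservative, because it
    # ignores the requirement to pass stage 1 first)
    c_joint = -nd.inv_cdf(alpha / n_markers / 2)
    w1, w2 = pi_samples ** 0.5, (1 - pi_samples) ** 0.5
    rng = random.Random(seed)
    rep = joint = 0
    for _ in range(n_sim):
        z1 = rng.gauss(mu * w1, 1.0)  # stage 1 statistic
        z2 = rng.gauss(mu * w2, 1.0)  # stage 2 statistic
        if abs(z1) <= c1:
            continue  # marker not followed up in stage 2
        # Replication: stage 2 alone significant, same direction
        if abs(z2) > c_rep and z1 * z2 > 0:
            rep += 1
        # Joint: sample-size-weighted combination of both stages
        if abs(w1 * z1 + w2 * z2) > c_joint:
            joint += 1
    return rep / n_sim, joint / n_sim

rep_power, joint_power = two_stage_power(mu=6.0, pi_samples=0.3,
                                         pi_markers=0.01)
print(f"replication power = {rep_power:.3f}, joint power = {joint_power:.3f}")
```

Even with the conservative joint threshold, the joint statistic uses the stage 1 evidence that the replication analysis discards, which is the source of the power gain under these assumed settings.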
Subject(s)
DNA Replication/genetics , Genetic Predisposition to Disease/genetics , Genetics, Medical/methods , Genome, Human/genetics , Alleles , Case-Control Studies , Gene Frequency/genetics , Genetic Heterogeneity , Genetic Markers/genetics , Genotype , Humans
ABSTRACT
In genome-wide association studies of binary traits, investigators typically use logistic regression to test common variants for disease association within studies, and combine association results across studies using meta-analysis. For common variants, logistic regression tests are well calibrated, and meta-analysis of study-specific association results is only slightly less powerful than joint analysis of the combined individual-level data. In recent sequencing and dense chip based association studies, investigators increasingly test low-frequency variants for disease association. In this paper, we seek to (1) identify the association test with maximal power among tests with well controlled type I error rate and (2) compare the relative power of joint and meta-analysis tests. We use analytic calculation and simulation to compare the empirical type I error rate and power of four logistic regression based tests: Wald, score, likelihood ratio, and Firth bias-corrected. We demonstrate for low-count variants (roughly minor allele count [MAC] < 400) that: (1) for joint analysis, the Firth test has the best combination of type I error and power; (2) for meta-analysis of balanced studies (equal numbers of cases and controls), the score test is best, but is less powerful than Firth test based joint analysis; and (3) for meta-analysis of sufficiently unbalanced studies, all four tests can be anti-conservative, particularly the score test. We also establish MAC as the key parameter determining test calibration for joint and meta-analysis.
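To show why a bias correction matters at low minor allele counts, the following is a minimal sketch of Firth-penalized logistic regression for a single genotype covariate plus intercept. It covers only the penalized estimate, not the Firth test evaluated in the abstract, and it is not the authors' implementation. The function name `firth_logistic` and the toy data are assumptions for illustration.

```python
import math

def firth_logistic(x, y, max_iter=100, tol=1e-10):
    """Firth bias-corrected estimates (intercept, slope) for a logistic
    model with one covariate, fitted by Fisher-scoring iterations on
    the modified score."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pr * (1.0 - pr) for pr in p]
        # Fisher information X'WX for the design [1, x], then its inverse
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        d = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * d - b * b
        i00, i01, i11 = d / det, -b / det, a / det
        u0 = u1 = 0.0
        for xi, yi, pr, wi in zip(x, y, p, w):
            # Leverage: diagonal element of the weighted hat matrix
            h = wi * (i00 + 2.0 * i01 * xi + i11 * xi * xi)
            # Firth-modified score contribution
            r = yi - pr + h * (0.5 - pr)
            u0 += r
            u1 += r * xi
        s0 = i00 * u0 + i01 * u1
        s1 = i01 * u0 + i11 * u1
        b0 += s0
        b1 += s1
        if max(abs(s0), abs(s1)) < tol:
            break
    return b0, b1

# Toy data (hypothetical): 5 carriers, all cases; 195 non-carriers,
# of whom 95 are cases. The ordinary maximum-likelihood slope is
# infinite here because no carrier is a control.
x = [1] * 5 + [0] * 195
y = [1] * 5 + [1] * 95 + [0] * 100
b0, b1 = firth_logistic(x, y)
print(f"Firth log odds ratio = {b1:.4f}")
```

For a single binary covariate, the penalized estimate coincides with the log odds ratio obtained after adding 0.5 to each cell of the 2×2 table, which gives a convenient check on the fit.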