1.
Nature ; 606(7912): 120-128, 2022 06.
Article in English | MEDLINE | ID: mdl-35545678

ABSTRACT

Non-coding genetic variants may cause disease by modulating gene expression. However, identifying these expression quantitative trait loci (eQTLs) is complicated by differences in gene regulation across fluid functional cell states within cell types. These states (for example, neurotransmitter-driven programs in astrocytes or perivascular fibroblast differentiation) are obscured in eQTL studies that aggregate cells [1,2]. Here we modelled eQTLs at single-cell resolution in one complex cell type: memory T cells. Using more than 500,000 unstimulated memory T cells from 259 Peruvian individuals, we show that around one-third of 6,511 cis-eQTLs had effects that were mediated by continuous multimodally defined cell states, such as cytotoxicity and regulatory capacity. In some loci, independent eQTL variants had opposing cell-state relationships. Autoimmune variants were enriched in cell-state-dependent eQTLs, including risk variants for rheumatoid arthritis near ORMDL3 and CTLA4; this indicates that cell-state context is crucial to understanding potential eQTL pathogenicity. Moreover, continuous cell states explained more variation in eQTLs than did conventional discrete categories, such as CD4+ versus CD8+, suggesting that modelling eQTLs and cell states at single-cell resolution can expand insight into gene regulation in functionally heterogeneous cell types.
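
The idea of a cell-state-dependent eQTL can be illustrated with a genotype-by-state interaction term. The sketch below is not the study's model, which the abstract does not specify; it fits ordinary least squares to simulated data, and the function name, variable names and effect sizes are all hypothetical.

```python
import numpy as np

def fit_state_dependent_eqtl(expr, genotype, state):
    """OLS fit of expr ~ 1 + genotype + state + genotype*state (illustrative only)."""
    design = np.column_stack([
        np.ones(len(expr)),   # intercept
        genotype,             # main eQTL effect
        state,                # main effect of the continuous cell state
        genotype * state,     # interaction: eQTL effect that varies with state
    ])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    names = ["intercept", "genotype", "state", "genotype_x_state"]
    return dict(zip(names, coef))

# Simulated cells: allele dosage in {0, 1, 2} and a continuous state score.
rng = np.random.default_rng(0)
n_cells = 5000
genotype = rng.integers(0, 3, size=n_cells).astype(float)
state = rng.normal(size=n_cells)
# Assumed ground truth: main effect 0.5, interaction effect 0.3.
expr = 0.5 * genotype + 0.3 * genotype * state + rng.normal(scale=0.1, size=n_cells)

coef = fit_state_dependent_eqtl(expr, genotype, state)
print(coef)
```

A nonzero `genotype_x_state` coefficient is what "cell-state-dependent" means in this toy setting: the per-allele effect on expression changes along the state axis.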


Subject(s)
Genetic Predisposition to Disease , Memory T Cells , Quantitative Trait Loci , Gene Expression Regulation , Genetic Predisposition to Disease/genetics , Humans , Memory T Cells/immunology , Memory T Cells/metabolism , Peru , Quantitative Trait Loci/genetics
2.
Nature ; 593(7858): 238-243, 2021 05.
Article in English | MEDLINE | ID: mdl-33828297

ABSTRACT

Genome-wide association studies (GWAS) have identified thousands of noncoding loci that are associated with human diseases and complex traits, each of which could reveal insights into the mechanisms of disease [1]. Many of the underlying causal variants may affect enhancers [2,3], but we lack accurate maps of enhancers and their target genes to interpret such variants. We recently developed the activity-by-contact (ABC) model to predict which enhancers regulate which genes and validated the model using CRISPR perturbations in several cell types [4]. Here we apply this ABC model to create enhancer-gene maps in 131 human cell types and tissues, and use these maps to interpret the functions of GWAS variants. Across 72 diseases and complex traits, ABC links 5,036 GWAS signals to 2,249 unique genes, including a class of 577 genes that appear to influence multiple phenotypes through variants in enhancers that act in different cell types. In inflammatory bowel disease (IBD), causal variants are enriched in predicted enhancers by more than 20-fold in particular cell types such as dendritic cells, and ABC achieves higher precision than other regulatory methods at connecting noncoding variants to target genes. These variant-to-function maps reveal an enhancer that contains an IBD risk variant and that regulates the expression of PPIF to alter the membrane potential of mitochondria in macrophages. Our study reveals principles of genome regulation, identifies genes that affect IBD and provides a resource and generalizable strategy to connect risk variants of common diseases to their molecular and cellular functions.
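
As a rough illustration of what an activity-by-contact score computes, the sketch below multiplies each candidate element's activity by its contact frequency with a gene's promoter and normalizes over all candidate elements for that gene. The abstract does not give the formula, so this normalized-product form is an assumption, and the function name and input values are hypothetical.

```python
import numpy as np

def abc_scores(activity, contact):
    """Share of one gene's total activity-by-contact signal held by each element.

    activity: per-element activity (e.g. a chromatin-accessibility signal).
    contact:  per-element contact frequency with the gene's promoter.
    Both inputs are assumed to cover the same candidate elements near one gene.
    """
    activity = np.asarray(activity, dtype=float)
    contact = np.asarray(contact, dtype=float)
    weighted = activity * contact
    total = weighted.sum()
    if total == 0:
        return np.zeros_like(weighted)  # no signal at all: every score is zero
    return weighted / total

# Three hypothetical candidate enhancers for a single gene.
scores = abc_scores(activity=[10.0, 5.0, 1.0], contact=[0.2, 0.8, 0.5])
print(scores)
```

In this toy input the second element scores highest even though the first is more active, because it contacts the promoter far more often.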


Subject(s)
Enhancer Elements, Genetic/genetics , Genetic Predisposition to Disease , Genetic Variation/genetics , Genome, Human/genetics , Genome-Wide Association Study , Inflammatory Bowel Diseases/genetics , Cell Line , Chromosomes, Human, Pair 10/genetics , Cyclophilins/genetics , Dendritic Cells , Female , Humans , Macrophages/metabolism , Male , Mitochondria/metabolism , Organ Specificity/genetics , Phenotype
3.
Am J Hum Genet ; 109(3): 393-404, 2022 03 03.
Article in English | MEDLINE | ID: mdl-35108496

ABSTRACT

Identifying gene sets that are associated with disease can provide valuable biological knowledge, but a fundamental challenge of gene set analyses of GWAS data is linking disease-associated SNPs to genes. Transcriptome-wide association studies (TWASs) detect associations between the genetically predicted expression of a gene and disease risk, thus implicating candidate disease genes. However, causal disease genes at TWAS-associated loci generally remain unknown due to gene co-regulation, which leads to correlations across genes in predicted expression. We developed a method, gene co-regulation score (GCSC) regression, to identify gene sets that are enriched for disease heritability explained by predicted expression. GCSC regresses TWAS chi-square statistics on gene co-regulation scores reflecting correlations in predicted gene expression; a gene set is enriched for heritability if genes with high co-regulation to the set have higher TWAS chi-square statistics than genes with low co-regulation to the set, beyond what is expected based on co-regulation to all genes. We verified via simulations that GCSC is well calibrated and well powered. We applied GCSC to gene expression data from GTEx (48 tissues) and GWAS summary statistics for 43 independent diseases and complex traits analyzing a broad set of biological pathways and specifically expressed gene sets. We identified many enriched sets, recapitulating known biology. For Alzheimer disease, we detected evidence of an immune basis, and specifically a role for antigen presentation, in analyses of both biological pathways and specifically expressed gene sets. Our results highlight the advantages of leveraging gene co-regulation within the TWAS framework to identify enriched gene sets.
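
The regression described above can be sketched in a heavily simplified form: regress per-gene chi-square statistics on a co-regulation score to the gene set while also including a co-regulation score to all genes. This is only a toy version on simulated data, not the published GCSC estimator; the unweighted least-squares fit, the function name and the simulated effect sizes are assumptions.

```python
import numpy as np

def regress_chisq_on_coreg(chisq, coreg_set, coreg_all):
    """Unweighted OLS of chi-square on set-level and all-gene co-regulation scores."""
    design = np.column_stack([np.ones(len(chisq)), coreg_set, coreg_all])
    coef, *_ = np.linalg.lstsq(design, chisq, rcond=None)
    return {"intercept": coef[0], "set_slope": coef[1], "all_slope": coef[2]}

# Simulated genes: the set-level score is a fraction of the all-gene score.
rng = np.random.default_rng(1)
n_genes = 4000
coreg_all = rng.gamma(shape=2.0, scale=1.0, size=n_genes)
coreg_set = rng.uniform(0.0, 1.0, size=n_genes) * coreg_all
# Assumed ground truth: extra signal of 0.8 per unit of set co-regulation.
chisq = 1.0 + 0.2 * coreg_all + 0.8 * coreg_set + rng.normal(scale=0.5, size=n_genes)

fit = regress_chisq_on_coreg(chisq, coreg_set, coreg_all)
print(fit)
```

A positive `set_slope` after accounting for `coreg_all` corresponds, in this toy setting, to the abstract's notion of enrichment "beyond what is expected based on co-regulation to all genes".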


Subject(s)
Genome-Wide Association Study , Quantitative Trait Loci , Genetic Predisposition to Disease , Humans , Multifactorial Inheritance , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics , Transcriptome
4.
Nature ; 559(7714): 350-355, 2018 07.
Article in English | MEDLINE | ID: mdl-29995854

ABSTRACT

The selective pressures that shape clonal evolution in healthy individuals are largely unknown. Here we investigate 8,342 mosaic chromosomal alterations, from 50 kb to 249 Mb long, that we uncovered in blood-derived DNA from 151,202 UK Biobank participants using phase-based computational techniques (estimated false discovery rate, 6-9%). We found six loci at which inherited variants associated strongly with the acquisition of deletions or loss of heterozygosity in cis. At three such loci (MPL, TM2D3-TARSL2, and FRA10B), we identified a likely causal variant that acted with high penetrance (5-50%). Inherited alleles at one locus appeared to affect the probability of somatic mutation, and at three other loci to be objects of positive or negative clonal selection. Several specific mosaic chromosomal alterations were strongly associated with future haematological malignancies. Our results reveal a multitude of paths towards clonal expansions with a wide range of effects on human health.


Subject(s)
Chromosome Aberrations , Clone Cells/cytology , Clone Cells/metabolism , Hematopoiesis/genetics , Mosaicism , Adult , Aged , Alleles , Biological Specimen Banks , Chromosome Breakage , Chromosome Fragile Sites/genetics , Chromosomes, Human, Pair 10/genetics , Female , Health , Hematologic Neoplasms/genetics , Hematologic Neoplasms/mortality , Humans , Male , Middle Aged , Penetrance , United Kingdom
5.
Hum Mol Genet ; 30(16): 1521-1534, 2021 07 28.
Article in English | MEDLINE | ID: mdl-33987664

ABSTRACT

It is important to study the genetics of complex traits in diverse populations. Here, we introduce covariate-adjusted linkage disequilibrium (LD) score regression (cov-LDSC), a method to estimate SNP-heritability ($h_g^2$) and its enrichment in homogeneous and admixed populations with summary statistics and in-sample LD estimates. In-sample LD can be estimated from a subset of the genome-wide association studies samples, allowing our method to be applied efficiently to very large cohorts. In simulations, we show that unadjusted LDSC underestimates $h_g^2$ by 10-60% in admixed populations; in contrast, cov-LDSC is robustly accurate. We apply cov-LDSC to genotyping data from 8124 individuals, mostly of admixed ancestry, from the Slim Initiative in Genomic Medicine for the Americas study, and to approximately 161 000 Latino-ancestry individuals, 47 000 African American-ancestry individuals and 135 000 European-ancestry individuals, as classified by 23andMe. We estimate $h_g^2$ and detect heritability enrichment in three quantitative and five dichotomous phenotypes, making this, to our knowledge, the most comprehensive heritability-based analysis of admixed individuals to date. Most traits have high concordance of $h_g^2$ and consistent tissue-specific heritability enrichment among different populations. However, for age at menarche, we observe population-specific heritability estimates of $h_g^2$. We observe consistent patterns of tissue-specific heritability enrichment across populations; for example, in the limbic system for BMI, the per-standardized-annotation effect size $\tau^*$ is 0.16 ± 0.04, 0.28 ± 0.11 and 0.18 ± 0.03 in the Latino-, African American- and European-ancestry populations, respectively. Our approach is a powerful way to analyze genetic data for complex traits from admixed populations.
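
As background for the regression that cov-LDSC adjusts, the standard LD score regression relation can be written as follows. The abstract does not spell it out, so this is stated as an assumption in its widely used form; $N$ (sample size), $M$ (number of SNPs), $\ell_j$ (LD score of SNP $j$) and $a$ (a confounding term) are symbols introduced here for illustration.

$$
\mathbb{E}\left[\chi_j^2\right] = \frac{N h_g^2}{M}\,\ell_j + N a + 1
$$

Under this relation the slope of $\chi^2$ on LD score determines $h_g^2$, so LD scores that misrepresent the LD actually present in the sample would be expected to distort the estimate. Per the abstract, cov-LDSC addresses this by using covariate-adjusted, in-sample LD estimates.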


Subject(s)
Genetics, Population , Genome-Wide Association Study/statistics & numerical data , Linkage Disequilibrium/genetics , Multifactorial Inheritance/genetics , Genotyping Techniques/statistics & numerical data , Humans , Phenotype , Polymorphism, Single Nucleotide/genetics , Quantitative Trait, Heritable
6.
Nat Rev Genet ; 18(2): 117-127, 2017 02.
Article in English | MEDLINE | ID: mdl-27840428

ABSTRACT

During the past decade, genome-wide association studies (GWAS) have been used to successfully identify tens of thousands of genetic variants associated with complex traits and diseases. These studies have produced extensive repositories of genetic variation and trait measurements across large numbers of individuals, providing tremendous opportunities for further analyses. However, privacy concerns and other logistical considerations often limit access to individual-level genetic data, motivating the development of methods that analyse summary association statistics. Here, we review recent progress on statistical methods that leverage summary association data to gain insights into the genetic basis of complex traits and diseases.


Subject(s)
Genetic Variation/genetics , Genome-Wide Association Study/statistics & numerical data , Models, Statistical , Quantitative Trait Loci , Quantitative Trait, Heritable , Computer Simulation , Genotype , Humans , Models, Genetic , Phenotype
7.
Hum Mol Genet ; 29(7): 1057-1067, 2020 05 08.
Article in English | MEDLINE | ID: mdl-31595288

ABSTRACT

Regulatory variation plays a major role in complex disease, and cell type-specific binding of transcription factors (TF) is critical to gene regulation. However, assessing the contribution of genetic variation in TF-binding sites to disease heritability is challenging, as binding is often cell type-specific and annotations from directly measured TF binding are not currently available for most cell type-TF pairs. We investigate approaches to annotate TF binding, including directly measured chromatin data and sequence-based predictions. We find that TF-binding annotations constructed by intersecting sequence-based TF-binding predictions with cell type-specific chromatin data explain a large fraction of heritability across a broad set of diseases and corresponding cell types; this strategy of constructing annotations addresses both the limitation that identical sequences may be bound or unbound depending on surrounding chromatin context and the limitation that sequence-based predictions are generally not cell type-specific. We partitioned the heritability of 49 diseases and complex traits using stratified linkage disequilibrium (LD) score regression with the baseline-LD model (which is not cell type-specific) plus the new annotations. We determined that 100 bp windows around MotifMap sequence-based TF-binding predictions intersected with a union of six cell type-specific chromatin marks (imputed using ChromImpute) performed best, with a 58% increase in heritability enrichment compared to the chromatin marks alone (11.6× vs. 7.3×, $P = 9 \times 10^{-14}$ for difference) and a 20% increase in cell type-specific signal conditional on annotations from the baseline-LD model ($P = 8 \times 10^{-11}$ for difference). Our results show that TF-binding annotations explain substantial disease heritability and can help refine genome-wide association signals.


Subject(s)
Chromatin/genetics , Genetic Diseases, Inborn/genetics , Molecular Sequence Annotation , Transcription Factors/genetics , Binding Sites/genetics , Computational Biology , Gene Expression Regulation/genetics , Genetic Diseases, Inborn/classification , Genetic Diseases, Inborn/pathology , Humans , Linkage Disequilibrium/genetics , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics , Protein Binding/genetics
8.
Am J Hum Genet ; 104(4): 611-624, 2019 04 04.
Article in English | MEDLINE | ID: mdl-30905396

ABSTRACT

Regulatory elements, e.g., enhancers and promoters, have been widely reported to be enriched for disease and complex trait heritability. We investigated how this enrichment varies with the age of the underlying genome sequence, the conservation of regulatory function across species, and the target gene of the regulatory element. We estimated heritability enrichment by applying stratified LD score regression to summary statistics from 41 independent diseases and complex traits (average N = 320K) and meta-analyzing results across traits. Enrichment of human putative enhancers and promoters was larger in elements with older sequence age, assessed via alignment with other species irrespective of conserved functionality: putative enhancer elements with ancient sequence age (older than the split between marsupial and placental mammals) were 8.8× enriched (versus 2.5× for all putative enhancers; p = 3e-14), and promoter elements with ancient sequence age were 13.5× enriched (versus 5.1× for all promoters; p = 5e-16). Enrichment of human putative enhancers and promoters was also larger in elements whose regulatory function was conserved across species, e.g., human putative enhancers that were enhancers in ≥5 of 9 other mammals were 4.6× enriched (p = 5e-12 versus all putative enhancers). Enrichment of human promoters was larger in promoters of loss-of-function intolerant genes: 12.0× enrichment (p = 8e-15 versus all promoters). The mean value of several measures of negative selection within these genomic annotations mirrored all of these findings. Notably, the annotations with these excess heritability enrichments were jointly significant conditional on each other and on our baseline-LD model, which includes a broad set of coding, conserved, regulatory, and LD-related annotations.


Subject(s)
Enhancer Elements, Genetic , Genetic Diseases, Inborn/genetics , Promoter Regions, Genetic , Animals , Conserved Sequence , Genome-Wide Association Study , Genomics , Humans , Linkage Disequilibrium , Mammals/genetics , Marsupialia/genetics , Phenotype , Polymorphism, Single Nucleotide , Species Specificity
9.
Am J Hum Genet ; 105(3): 456-476, 2019 09 05.
Article in English | MEDLINE | ID: mdl-31402091

ABSTRACT

Complex traits and common diseases are extremely polygenic, their heritability spread across thousands of loci. One possible explanation is that thousands of genes and loci have similarly important biological effects when mutated. However, we hypothesize that for most complex traits, relatively few genes and loci are critical, and negative selection, by purging large-effect mutations in these regions, leaves behind common-variant associations in thousands of less critical regions instead. We refer to this phenomenon as flattening. To quantify its effects, we introduce a mathematical definition of polygenicity, the effective number of independently associated SNPs ($M_e$), which describes how evenly the heritability of a trait is spread across the genome. We developed a method, stratified LD fourth moments regression (S-LD4M), to estimate $M_e$, validating that it produces robust estimates in simulations. Analyzing 33 complex traits (average N = 361k), we determined that heritability is spread ∼4× more evenly among common SNPs than among low-frequency SNPs. This difference, together with evolutionary modeling of new mutations, suggests that complex traits would be orders of magnitude less polygenic if not for the influence of negative selection. We also determined that heritability is spread more evenly within functionally important regions in proportion to their heritability enrichment; functionally important regions do not harbor common SNPs with greatly increased causal effect sizes, due to selective constraint. Our results suggest that for most complex traits, the genes and loci with the most critical biological effects often differ from those with the strongest common-variant associations.
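
One way to make "how evenly heritability is spread" concrete is a kurtosis-based effective count. The sketch below assumes the form $M_e = 3M/\kappa$ with $\kappa = \mathbb{E}[\beta^4]/\mathbb{E}[\beta^2]^2$ and ignores LD entirely. The abstract does not state the definition, so both the formula and the function name are assumptions, and this is not the S-LD4M estimator.

```python
import numpy as np

def effective_num_snps(beta):
    """Kurtosis-based effective number of associated SNPs (assumed form, no LD)."""
    beta = np.asarray(beta, dtype=float)
    second_moment = np.mean(beta ** 2)
    if second_moment == 0:
        return 0.0  # no heritability anywhere
    kurtosis = np.mean(beta ** 4) / second_moment ** 2
    return 3.0 * beta.size / kurtosis

rng = np.random.default_rng(2)
n_snps = 100_000

# Every SNP carries a small normally distributed effect: evenly spread.
dense = rng.normal(size=n_snps)

# Only the first 10% of SNPs carry an effect: concentrated.
sparse = np.zeros(n_snps)
sparse[:10_000] = rng.normal(size=10_000)

me_dense = effective_num_snps(dense)
me_sparse = effective_num_snps(sparse)
print(me_dense, me_sparse)
```

Under this assumed definition, normally distributed effects on all SNPs give a value close to the total SNP count, while confining the same kind of effects to a tenth of the SNPs gives a value close to a tenth of it.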


Subject(s)
Multifactorial Inheritance , Selection, Genetic , Humans , Linkage Disequilibrium , Polymorphism, Single Nucleotide
10.
Am J Hum Genet ; 104(1): 65-75, 2019 01 03.
Article in English | MEDLINE | ID: mdl-30595370

ABSTRACT

Functional genomics data has the potential to increase GWAS power by identifying SNPs that have a higher prior probability of association. Here, we introduce a method that leverages polygenic functional enrichment to incorporate coding, conserved, regulatory, and LD-related genomic annotations into association analyses. We show via simulations with real genotypes that the method, functionally informed novel discovery of risk loci (FINDOR), correctly controls the false-positive rate at null loci and attains a 9%-38% increase in the number of independent associations detected at causal loci, depending on trait polygenicity and sample size. We applied FINDOR to 27 independent complex traits and diseases from the interim UK Biobank release (average N = 130K). Averaged across traits, we attained a 13% increase in genome-wide significant loci detected (including a 20% increase for disease traits) compared to unweighted raw p values that do not use functional data. We replicated the additional loci in independent UK Biobank and non-UK Biobank data, yielding a highly statistically significant replication slope (0.66-0.69) in each case. Finally, we applied FINDOR to the full UK Biobank release (average N = 416K), attaining smaller relative improvements (consistent with simulations) but larger absolute improvements, detecting an additional 583 GWAS loci. In conclusion, leveraging functional enrichment using our method robustly increases GWAS power.
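
The general idea of using prior information to reweight association tests can be sketched with weighted p-values: each p-value is divided by a weight, and the weights are normalized to average one so the overall testing burden is unchanged. This is a generic weighting scheme shown for illustration, not FINDOR's procedure; the function name, the weights and the p-values are hypothetical.

```python
import numpy as np

def reweight_pvalues(pvalues, raw_weights):
    """Divide p-values by weights normalized to mean one, capping results at 1."""
    weights = np.asarray(raw_weights, dtype=float)
    if np.any(weights <= 0):
        raise ValueError("weights must be strictly positive")
    weights = weights / weights.mean()  # keep the average weight equal to one
    return np.minimum(np.asarray(pvalues, dtype=float) / weights, 1.0)

# Four hypothetical SNPs; the second sits in a functionally enriched category.
pvalues = np.array([1e-8, 6e-8, 0.5, 0.9])
raw_weights = np.array([1.0, 3.0, 1.0, 1.0])
weighted = reweight_pvalues(pvalues, raw_weights)

threshold = 5e-8  # conventional genome-wide significance threshold
hits_before = int((pvalues < threshold).sum())
hits_after = int((weighted < threshold).sum())
print(hits_before, hits_after)
```

Up-weighting one SNP necessarily down-weights the others, which is why the weights need to track true enrichment for the approach to gain power.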


Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics , Calibration , Databases, Genetic , Datasets as Topic , False Positive Reactions , Humans , Probability , Reproducibility of Results , United Kingdom
11.
Am J Hum Genet ; 104(5): 879-895, 2019 05 02.
Article in English | MEDLINE | ID: mdl-31006511

ABSTRACT

Despite significant progress in annotating the genome with experimental methods, much of the regulatory noncoding genome remains poorly defined. Here we assert that regulatory elements may be characterized by leveraging local epigenomic signatures where specific transcription factors (TFs) are bound. To link these two features, we introduce IMPACT, a genome annotation strategy that identifies regulatory elements defined by cell-state-specific TF binding profiles, learned from 515 chromatin and sequence annotations. We validate IMPACT using multiple compelling applications. First, IMPACT distinguishes between bound and unbound TF motif sites with high accuracy (average AUPRC 0.81, SE 0.07; across 8 tested TFs) and outperforms state-of-the-art TF binding prediction methods, MocapG, MocapS, and Virtual ChIP-seq. Second, in eight tested cell types, RNA polymerase II IMPACT annotations capture more cis-eQTL variation than sequence-based annotations, such as promoters and TSS windows (25% average increase in enrichment). Third, integration with rheumatoid arthritis (RA) summary statistics from European (N = 38,242) and East Asian (N = 22,515) populations revealed that the top 5% of CD4+ Treg IMPACT regulatory elements capture 85.7% of RA $h^2$, the most comprehensive explanation for RA $h^2$ to date. In comparison, the average RA $h^2$ captured by compared CD4+ T histone marks is 42.3% and by CD4+ T specifically expressed gene sets is 36.4%. Lastly, we find that IMPACT may be used in many different cell types to identify complex trait associated regulatory elements.


Subject(s)
Arthritis, Rheumatoid/metabolism , Epigenome , Epigenomics/methods , Genome, Human , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid , Transcription Factors/metabolism , Arthritis, Rheumatoid/genetics , Chromatin/genetics , Chromatin/metabolism , Computational Biology/methods , Histones/genetics , Histones/metabolism , Humans , Promoter Regions, Genetic , Protein Binding , Transcription Factors/genetics
12.
Am J Hum Genet ; 104(5): 896-913, 2019 05 02.
Article in English | MEDLINE | ID: mdl-31051114

ABSTRACT

Recent studies have highlighted the role of gene networks in disease biology. To formally assess this, we constructed a broad set of pathway, network, and pathway+network annotations and applied stratified LD score regression to 42 diseases and complex traits (average N = 323K) to identify enriched annotations. First, we analyzed 18,119 biological pathways. We identified 156 pathway-trait pairs whose disease enrichment was statistically significant (FDR < 5%) after conditioning on all genes and 75 known functional annotations (from the baseline-LD model), a stringent step that greatly reduced the number of pathways detected; most significant pathway-trait pairs were previously unreported. Next, for each of four published gene networks, we constructed probabilistic annotations based on network connectivity. For each gene network, the network connectivity annotation was strongly significantly enriched. Surprisingly, the enrichments were fully explained by excess overlap between network annotations and regulatory annotations from the baseline-LD model, validating the informativeness of the baseline-LD model and emphasizing the importance of accounting for regulatory annotations in gene network analyses. Finally, for each of the 156 enriched pathway-trait pairs, for each of the four gene networks, we constructed pathway+network annotations by annotating genes with high network connectivity to the input pathway. For each gene network, these pathway+network annotations were strongly significantly enriched for the corresponding traits. Once again, the enrichments were largely explained by the baseline-LD model. In conclusion, gene network connectivity is highly informative for disease architectures, but the information in gene networks may be subsumed by regulatory annotations, emphasizing the importance of accounting for known annotations.


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Genes/genetics , Genetic Diseases, Inborn/genetics , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide , Quantitative Trait, Heritable , Humans , Molecular Sequence Annotation , Phenotype , Software
13.
Genet Epidemiol ; 43(2): 180-188, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30474154

ABSTRACT

Recent studies have examined the genetic correlations of single-nucleotide polymorphism (SNP) effect sizes across pairs of populations to better understand the genetic architectures of complex traits. These studies have estimated $\rho_g$, the cross-population correlation of joint-fit effect sizes at genotyped SNPs. However, the value of $\rho_g$ depends both on the cross-population correlation of true causal effect sizes ($\rho_b$) and on the similarity in linkage disequilibrium (LD) patterns in the two populations, which drive tagging effects. Here, we derive the value of the ratio $\rho_g / \rho_b$ as a function of LD in each population. By applying existing methods to obtain estimates of $\rho_g$, we can use this ratio to estimate $\rho_b$. Our estimates of $\rho_b$ were equal to 0.55 (SE = 0.14) between Europeans and East Asians averaged across nine traits in the Genetic Epidemiology Research on Adult Health and Aging data set, 0.54 (SE = 0.18) between Europeans and South Asians averaged across 13 traits in the UK Biobank data set, and 0.48 (SE = 0.06) and 0.65 (SE = 0.09) between Europeans and East Asians in summary statistic data sets for type 2 diabetes and rheumatoid arthritis, respectively. These results implicate substantially different causal genetic architectures across continental populations.


Subject(s)
Genetics, Population , Adult , Aging/genetics , Arthritis, Rheumatoid/genetics , Biological Specimen Banks , Databases, Genetic , Diabetes Mellitus, Type 2/genetics , Genotype , Humans , Phenotype , Quantitative Trait, Heritable , United Kingdom
14.
Am J Hum Genet ; 100(1): 31-39, 2017 Jan 05.
Article in English | MEDLINE | ID: mdl-28017371

ABSTRACT

Mixed models have become the tool of choice for genetic association studies; however, standard mixed model methods may be poorly calibrated or underpowered under family sampling bias and/or case-control ascertainment. Previously, we introduced a liability threshold-based mixed model association statistic (LTMLM) to address case-control ascertainment in unrelated samples. Here, we consider family-biased case-control ascertainment, where case and control subjects are ascertained non-randomly with respect to family relatedness. Previous work has shown that this type of ascertainment can severely bias heritability estimates; we show here that it also impacts mixed model association statistics. We introduce a family-based association statistic (LT-Fam) that is robust to this problem. Similar to LTMLM, LT-Fam is computed from posterior mean liabilities (PML) under a liability threshold model; however, LT-Fam uses published narrow-sense heritability estimates to avoid the problem of biased heritability estimation, enabling correct calibration. In simulations with family-biased case-control ascertainment, LT-Fam was correctly calibrated (average $\chi^2$ = 1.00-1.02 for null SNPs), whereas the Armitage trend test (ATT), standard mixed model association (MLM), and case-control retrospective association test (CARAT) were mis-calibrated (e.g., average $\chi^2$ = 0.50-1.22 for MLM, 0.89-2.65 for CARAT). LT-Fam also attained higher power than other methods in some settings. In 1,259 type 2 diabetes-affected case subjects and 5,765 control subjects from the CARe cohort, downsampled to induce family-biased ascertainment, LT-Fam was correctly calibrated whereas ATT, MLM, and CARAT were again mis-calibrated. Our results highlight the importance of modeling family sampling bias in case-control datasets with related samples.
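
The simplest instance of a posterior mean liability can be computed in closed form: with a standard normal liability and a threshold set by disease prevalence, the expected liability of a case or a control is the mean of a truncated normal. The sketch below covers only this single-individual case with no relatives and no genetic data, so it is far simpler than LT-Fam; the function name and the 10% prevalence are hypothetical.

```python
from statistics import NormalDist

def posterior_mean_liability(is_case, prevalence):
    """Mean of a standard normal liability given only case/control status."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    std_normal = NormalDist()
    # Individuals whose liability exceeds this threshold are cases.
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    density = std_normal.pdf(threshold)
    if is_case:
        return density / prevalence           # mean of the upper tail
    return -density / (1.0 - prevalence)      # mean of the lower portion

case_pml = posterior_mean_liability(True, prevalence=0.1)
control_pml = posterior_mean_liability(False, prevalence=0.1)
print(case_pml, control_pml)
```

Weighting the two values by prevalence recovers the population mean liability of zero, which is a quick sanity check on the formulas.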


Subject(s)
Family , Genetic Association Studies/methods , Models, Genetic , Bias , Calibration , Diabetes Mellitus, Type 2/genetics , Genotype , Humans , Phenotype , Polymorphism, Single Nucleotide/genetics , Retrospective Studies
15.
Am J Hum Genet ; 100(4): 605-616, 2017 Apr 06.
Article in English | MEDLINE | ID: mdl-28343628

ABSTRACT

Genetic variants that modulate gene expression levels play an important role in the etiology of human diseases and complex traits. Although large-scale eQTL mapping studies routinely identify many local eQTLs, the molecular mechanisms by which genetic variants regulate expression remain unclear, particularly for distal eQTLs, which these studies are not well powered to detect. Here, we leveraged all variants (not just those that pass stringent significance thresholds) to analyze the functional architecture of local and distal regulation of gene expression in 15 human tissues by employing an extension of stratified LD-score regression that produces robust results in simulations. The top enriched functional categories in local regulation of peripheral-blood gene expression included coding regions (11.41×), conserved regions (4.67×), and four histone marks ($p < 5 \times 10^{-5}$ for all enrichments); local enrichments were similar across the 15 tissues. We also observed substantial enrichments for distal regulation of peripheral-blood gene expression: coding regions (4.47×), conserved regions (4.51×), and two histone marks ($p < 3 \times 10^{-7}$ for all enrichments). Analyses of the genetic correlation of gene expression across tissues confirmed that local regulation of gene expression is largely shared across tissues but that distal regulation is highly tissue specific. Our results elucidate the functional components of the genetic architecture of local and distal regulation of gene expression.


Subject(s)
Gene Expression Regulation , Anxiety/genetics , Computer Simulation , Depression/genetics , Humans , Linkage Disequilibrium , Organ Specificity , Quantitative Trait Loci , Regression Analysis , Twins/genetics
16.
Am J Hum Genet ; 99(1): 76-88, 2016 07 07.
Article in English | MEDLINE | ID: mdl-27321947

ABSTRACT

The increasing number of genetic association studies conducted in multiple populations provides an unprecedented opportunity to study how the genetic architecture of complex phenotypes varies between populations, a problem important for both medical and population genetics. Here, we have developed a method for estimating the transethnic genetic correlation: the correlation of causal-variant effect sizes at SNPs common in populations. This method takes advantage of the entire spectrum of SNP associations and uses only summary-level data from genome-wide association studies. This avoids the computational costs and privacy concerns associated with genotype-level information while remaining scalable to hundreds of thousands of individuals and millions of SNPs. We applied our method to data on gene expression, rheumatoid arthritis, and type 2 diabetes and overwhelmingly found that the genetic correlation was significantly less than 1. Our method is implemented in a Python package called Popcorn.


Subject(s)
Arthritis, Rheumatoid/genetics , Diabetes Mellitus, Type 2/genetics , Ethnicity/genetics , Genome-Wide Association Study/methods , Software , Body Height , Body Mass Index , Genotype , Humans , Likelihood Functions , Models, Genetic , Phenotype , Polymorphism, Single Nucleotide/genetics , Sample Size
17.
Am J Hum Genet ; 99(5): 1130-1139, 2016 Nov 03.
Article in English | MEDLINE | ID: mdl-27773431

ABSTRACT

Analyzing genetic differences between closely related populations can be a powerful way to detect recent adaptation. The very large sample size of the UK Biobank is ideal for using population differentiation to detect selection and enables an analysis of the UK population structure at fine resolution. In this study, analyses of 113,851 UK Biobank samples showed that population structure in the UK is dominated by five principal components (PCs) spanning six clusters: Northern Ireland, Scotland, northern England, southern England, and two Welsh clusters. Analyses of ancient Eurasians revealed that populations in the northern UK have higher levels of Steppe ancestry and that UK population structure cannot be explained as a simple mixture of Celts and Saxons. A scan for unusual population differentiation along the top PCs identified a genome-wide-significant signal of selection at the coding variant rs601338 in FUT2 ($p = 9.16 \times 10^{-9}$). In addition, by combining evidence of unusual differentiation within the UK with evidence from ancient Eurasians, we identified genome-wide-significant ($p = 5 \times 10^{-8}$) signals of recent selection at two additional loci: CYP1A2-CSK and F12. We detected strong associations between diastolic blood pressure in the UK Biobank and both the variants with selection signals at CYP1A2-CSK ($p = 1.10 \times 10^{-19}$) and the variants with ancient Eurasian selection signals at the ATXN2-SH2B3 locus ($p = 8.00 \times 10^{-33}$), implicating recent adaptation related to blood pressure.


Subject(s)
Biological Specimen Banks/organization & administration , Blood Pressure/genetics , Adaptation, Physiological/genetics , Genetic Loci , Genetics, Population , Genome, Human , Humans , Multigene Family , Phylogeography , Selection, Genetic , United Kingdom , White People/genetics
18.
Am J Hum Genet ; 98(3): 456-472, 2016 Mar 03.
Article in English | MEDLINE | ID: mdl-26924531

ABSTRACT

Searching for genetic variants with unusual differentiation between subpopulations is an established approach for identifying signals of natural selection. However, existing methods generally require discrete subpopulations. We introduce a method that infers selection using principal components (PCs) by identifying variants whose differentiation along top PCs is significantly greater than the null distribution of genetic drift. To enable the application of this method to large datasets, we developed the FastPCA software, which employs recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using the PC-based test for natural selection, we replicate previously known selected loci and identify three new genome-wide significant signals of selection, including selection in Europeans at ADH1B. The coding variant rs1229984*T has previously been associated with a decreased risk of alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents. We also detect selection signals at IGFBP3 and IGH, which have also previously been associated with human disease.
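
The abstract's idea of approximating top PCs at a cost linear in the number of individuals can be illustrated with a generic randomized range-finder with power iterations. This is a minimal sketch under stated assumptions, not the FastPCA code: the function name `approx_top_pcs`, its parameters, and the use of NumPy are all hypothetical choices made here for illustration, and the genotype matrix is assumed to be individuals × SNPs with no missing values.

```python
import numpy as np

def approx_top_pcs(X, k, n_oversample=10, n_power_iter=4, seed=0):
    """Approximate the top-k PCs of X (individuals x SNPs).

    Illustrative randomized algorithm; names and defaults are assumptions.
    Requires k + n_oversample <= min(X.shape).
    """
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)              # center each SNP column
    n, m = X.shape
    width = k + n_oversample

    # Project the data onto a small random subspace: cost O(n * m * width).
    Y = X @ rng.standard_normal((m, width))

    # Power iterations sharpen the subspace toward the leading PCs;
    # QR re-orthonormalization keeps the iteration numerically stable.
    for _ in range(n_power_iter):
        Y, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(X.T @ Y)
        Y = X @ Z

    # Exact SVD of the small (width x m) projected matrix.
    Q, _ = np.linalg.qr(Y)
    U_small, s, _ = np.linalg.svd(Q.T @ X, full_matrices=False)
    U = Q @ U_small                     # lift back to individuals

    return U[:, :k], s[:k]              # PC scores (unit norm), singular values
```

Every matrix product here touches the n × m data a fixed number of times, so the cost grows linearly with the number of individuals n, in contrast to forming and decomposing an n × n relatedness matrix.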


Subject(s)
Alcohol Dehydrogenase/genetics , Asian People/genetics , Evolution, Molecular , Principal Component Analysis , White People/genetics , Computational Biology , Databases, Genetic , Europe , Asia, Eastern , Genetic Loci , Genetics, Population , Genome-Wide Association Study , Humans , Insulin-Like Growth Factor Binding Protein 3/genetics , Models, Genetic , Phylogeny , Polymorphism, Single Nucleotide , Selection, Genetic
19.
Nat Rev Genet ; 14(7): 507-15, 2013 07.
Article in English | MEDLINE | ID: mdl-23774735

ABSTRACT

The success of genome-wide association studies (GWASs) has led to increasing interest in making predictions of complex trait phenotypes, including disease, from genotype data. Rigorous assessment of the value of predictors is crucial before implementation. Here we discuss some of the limitations and pitfalls of prediction analysis and show how naive implementations can lead to severe bias and misinterpretation of results.


Subject(s)
Genome-Wide Association Study , Phenotype , Polymorphism, Single Nucleotide , Genetic Markers/genetics , Genetic Variation , Genomics , Genotype , Humans , Models, Genetic , Models, Statistical , Reproducibility of Results , Risk
20.
Genet Epidemiol ; 41(8): 811-823, 2017 12.
Article in English | MEDLINE | ID: mdl-29110330

ABSTRACT

Methods for genetic risk prediction have been widely investigated in recent years. However, most available training data involves European samples, and it is currently unclear how to accurately predict disease risk in other populations. Previous studies have used either training data from European samples in large sample size or training data from the target population in small sample size, but not both. Here, we introduce a multiethnic polygenic risk score that combines training data from European samples and training data from the target population. We applied this approach to predict type 2 diabetes (T2D) in a Latino cohort using both publicly available European summary statistics in large sample size (N_eff = 40k) and Latino training data in small sample size (N_eff = 8k). Here, we attained a >70% relative improvement in prediction accuracy (from R² = 0.027 to 0.047) compared to methods that use only one source of training data, consistent with large relative improvements in simulations. We observed a systematically lower load of T2D risk alleles in Latino individuals with more European ancestry, which could be explained by polygenic selection in ancestral European and/or Native American populations. We also predicted T2D in a South Asian UK Biobank cohort using European (N_eff = 40k) and South Asian (N_eff = 16k) training data and attained a >70% relative improvement in prediction accuracy; applying the approach to predict height in an African UK Biobank cohort using European (N = 113k) and African (N = 2k) training data attained a 30% relative improvement. Our work reduces the gap in polygenic risk prediction accuracy between European and non-European target populations.
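
The ">70% relative improvement" figure follows directly from the two R² values quoted in the abstract, and the short sketch below reproduces that arithmetic. It also shows one plausible way two ancestry-specific scores could be combined, namely a weighted sum. The abstract does not state the form of the combination, so `combined_prs` and its weights are assumptions made here for illustration only.

```python
def relative_improvement(r2_baseline, r2_new):
    """Relative gain in prediction R^2 over a baseline."""
    return (r2_new - r2_baseline) / r2_baseline

def combined_prs(prs_eur, prs_target, w_eur, w_target):
    """Hypothetical combination: per-individual weighted sum of two scores.

    The weighted-sum form and the weights are assumptions, not taken
    from the abstract.
    """
    return [w_eur * a + w_target * b for a, b in zip(prs_eur, prs_target)]

# R^2 values quoted in the abstract for the Latino T2D analysis.
gain = relative_improvement(0.027, 0.047)
print(f"relative improvement: {gain:.1%}")
```

With the quoted values the gain is (0.047 − 0.027) / 0.027 ≈ 0.74, which is consistent with the ">70%" stated in the abstract.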


Subject(s)
Diabetes Mellitus, Type 2/genetics , Models, Genetic , Alleles , Cohort Studies , Diabetes Mellitus, Type 2/pathology , Ethnicity/genetics , Genome-Wide Association Study , Genotype , Hispanic or Latino/genetics , Humans , Multifactorial Inheritance , Phenotype , Polymorphism, Single Nucleotide , Risk Factors