ABSTRACT
Both common and rare genetic variants influence complex traits and common diseases. Genome-wide association studies have identified thousands of common-variant associations, and more recently, large-scale exome sequencing studies have identified rare-variant associations in hundreds of genes [1-3]. However, rare-variant genetic architecture is not well characterized, and the relationship between common-variant and rare-variant architecture is unclear [4]. Here we quantify the heritability explained by the gene-wise burden of rare coding variants across 22 common traits and diseases in 394,783 UK Biobank exomes [5]. Rare coding variants (allele frequency < 1 × 10⁻³) explain 1.3% (s.e. = 0.03%) of phenotypic variance on average, much less than common variants, and most burden heritability is explained by ultrarare loss-of-function variants (allele frequency < 1 × 10⁻⁵). Common and rare variants implicate the same cell types, with similar enrichments, and they have pleiotropic effects on the same pairs of traits, with similar genetic correlations. They partially colocalize at individual genes and loci, but not to the same extent: burden heritability is strongly concentrated in significant genes, while common-variant heritability is more polygenic, and burden heritability is also more strongly concentrated in constrained genes. Finally, we find that burden heritability for schizophrenia and bipolar disorder [6,7] is approximately 2%. Our results indicate that rare coding variants will implicate a tractable number of large-effect genes, that common and rare associations are mechanistically convergent, and that rare coding variants will contribute only modestly to missing heritability and population risk stratification.
Subject(s)
Exome , Gene Frequency , Genetic Variation , Multifactorial Inheritance , Humans , Exome/genetics , Genetic Variation/genetics , Genome-Wide Association Study , Multifactorial Inheritance/genetics , Risk Factors , United Kingdom , Genetic Loci/genetics , Schizophrenia/genetics , Bipolar Disorder/genetics
ABSTRACT
Unknown SNP-to-gene regulatory architecture complicates efforts to link noncoding GWAS associations with genes implicated by sequencing or functional studies. eQTLs are often used to link SNPs to genes, but expression in bulk tissue explains a small fraction of disease heritability. A simple but successful approach has been to link SNPs with nearby genes via base pair windows, but genes may often be regulated by SNPs outside their window. We propose the abstract mediation model (AMM) to estimate (1) the fraction of heritability mediated by the closest or kth-closest gene to each SNP and (2) the mediated heritability enrichment of a gene set (e.g., genes with rare-variant associations). AMM jointly estimates these quantities by matching the decay in SNP enrichment with distance from genes in the gene set. Across 47 complex traits and diseases, we estimate that the closest gene to each SNP mediates 27% (SE: 6%) of heritability and that a substantial fraction is mediated by genes outside the ten closest. Mendelian disease genes are strongly enriched for common-variant heritability; for example, just 21 dyslipidemia genes mediate 25% of LDL heritability (211× enrichment, p = 0.01). Among brain-related traits, genes involved in neurodevelopmental disorders are only about 4× enriched, but gene expression patterns are highly informative, as they have detectable differences in per-gene heritability even among weakly brain-expressed genes.
Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Gene Expression Regulation/genetics , Humans , Phenotype , Polymorphism, Single Nucleotide/genetics
ABSTRACT
MOTIVATION: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. RESULTS: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven methods to generate polygenic risk scoring across multiple ancestry groups and different genetic architectures. AVAILABILITY AND IMPLEMENTATION: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
Subject(s)
Benchmarking , Data Accuracy , Humans , Genotype , Phenotype , Multifactorial Inheritance
ABSTRACT
Complex traits and common diseases are extremely polygenic, their heritability spread across thousands of loci. One possible explanation is that thousands of genes and loci have similarly important biological effects when mutated. However, we hypothesize that for most complex traits, relatively few genes and loci are critical, and negative selection, purging large-effect mutations in these regions, leaves behind common-variant associations in thousands of less critical regions instead. We refer to this phenomenon as flattening. To quantify its effects, we introduce a mathematical definition of polygenicity, the effective number of independently associated SNPs (M_e), which describes how evenly the heritability of a trait is spread across the genome. We developed a method, stratified LD fourth moments regression (S-LD4M), to estimate M_e, validating that it produces robust estimates in simulations. Analyzing 33 complex traits (average N = 361k), we determined that heritability is spread ~4× more evenly among common SNPs than among low-frequency SNPs. This difference, together with evolutionary modeling of new mutations, suggests that complex traits would be orders of magnitude less polygenic if not for the influence of negative selection. We also determined that heritability is spread more evenly within functionally important regions in proportion to their heritability enrichment; functionally important regions do not harbor common SNPs with greatly increased causal effect sizes, due to selective constraint. Our results suggest that for most complex traits, the genes and loci with the most critical biological effects often differ from those with the strongest common-variant associations.
Subject(s)
Multifactorial Inheritance , Selection, Genetic , Humans , Linkage Disequilibrium , Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: Clinical and epidemiologic studies have shown that obesity is associated with asthma and that these associations differ by asthma subtype. Little is known about the shared genetic components between obesity and asthma. OBJECTIVE: We sought to identify shared genetic associations between obesity-related traits and asthma subtypes in adults. METHODS: A cross-trait genome-wide association study (GWAS) was performed using 457,822 subjects of European ancestry from the UK Biobank. Experimental evidence to support the role of genes significantly associated with both obesity-related traits and asthma through a GWAS was sought by using results from obese versus lean mouse RNA sequencing and RT-PCR experiments. RESULTS: We found a substantial positive genetic correlation between body mass index and later-onset asthma defined by asthma age of onset at 16 years or greater (Rg = 0.25, P = 9.56 × 10⁻²²). Mendelian randomization analysis provided strong evidence in support of body mass index causally increasing asthma risk. Cross-trait meta-analysis identified 34 shared loci among 3 obesity-related traits and 2 asthma subtypes. GWAS functional analyses identified potential causal relationships between the shared loci and Genotype-Tissue Expression (GTEx) quantitative trait loci and shared immune- and cell differentiation-related pathways between obesity and asthma. Finally, RNA sequencing data from lungs of obese versus control mice found that 2 genes (acyl-coenzyme A oxidase-like [ACOXL] and myosin light chain 6 [MYL6]) from the cross-trait meta-analysis were differentially expressed, and these findings were validated by using RT-PCR in an independent set of mice. CONCLUSIONS: Our work identified shared genetic components between obesity-related traits and specific asthma subtypes, reinforcing the hypothesis that obesity causally increases the risk of asthma and identifying molecular pathways that might underlie both obesity and asthma.
Subject(s)
Asthma/genetics , Genetic Predisposition to Disease/genetics , Obesity/genetics , Adult , Animals , Biological Specimen Banks , Body Mass Index , Female , Genome-Wide Association Study , Humans , Male , Mice , United Kingdom
ABSTRACT
Single cell CRISPR screens such as Perturb-seq enable transcriptomic profiling of genetic perturbations at scale. However, the data produced by these screens are often noisy due to cost and technical constraints, limiting power to detect true effects with conventional differential expression analyses. Here, we introduce TRanscriptome-wide Analysis of Differential Expression (TRADE), a statistical framework which estimates the transcriptome-wide distribution of true differential expression effects from noisy gene-level measurements. Within TRADE, we derive multiple novel, interpretable statistical metrics, including the "transcriptome-wide impact", an estimator of the overall transcriptional effect of a perturbation which is stable across sampling depths. We analyze new and published large-scale Perturb-seq datasets to show that many true transcriptional effects are not statistically significant, but detectable in aggregate with TRADE. In a genome-scale Perturb-seq screen, we find that a typical gene perturbation affects an estimated 45 genes, whereas a typical essential gene perturbation affects over 500 genes. An advantage of our approach is its ability to compare the transcriptomic effects of genetic perturbations across contexts and dosages despite differences in power. We use this ability to identify perturbations with cell-type dependent effects and to find examples of perturbations where transcriptional responses are not only larger in magnitude, but also qualitatively different, as a function of dosage. Lastly, we expand our analysis to case/control comparison of gene expression for neuropsychiatric conditions, finding that transcriptomic effect correlations are greater than genetic correlations for these diagnoses. TRADE lays an analytic foundation for the systematic comparison of genetic perturbation atlases, as well as differential expression experiments more broadly.
ABSTRACT
Linkage disequilibrium (LD) is the correlation among nearby genetic variants. In genetic association studies, LD is often modeled using large correlation matrices, but this approach is inefficient, especially in ancestrally diverse studies. In the present study, we introduce LD graphical models (LDGMs), which are an extremely sparse and efficient representation of LD. LDGMs are derived from genome-wide genealogies; statistical relationships among alleles in the LDGM correspond to genealogical relationships among haplotypes. We published LDGMs and ancestry-specific LDGM precision matrices for 18 million common variants (minor allele frequency >1%) in five ancestry groups, validated their accuracy and demonstrated order-of-magnitude improvements in runtime for commonly used LD matrix computations. We implemented an extremely fast multiancestry polygenic prediction method, BLUPx-ldgm, which performs better than a similar method based on the reference LD correlation matrix. LDGMs will enable sophisticated methods that scale to ancestrally diverse genetic association data across millions of variants and individuals.
Subject(s)
Linkage Disequilibrium , Humans , Alleles , Gene Frequency/genetics , Genetic Association Studies , Haplotypes/genetics
ABSTRACT
Disease-associated single-nucleotide polymorphisms (SNPs) generally do not implicate target genes, as most disease SNPs are regulatory. Many SNP-to-gene (S2G) linking strategies have been developed to link regulatory SNPs to the genes that they regulate in cis. Here, we developed a heritability-based framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk. Our optimal combined S2G strategy (cS2G) included seven constituent S2G strategies and achieved a precision of 0.75 and a recall of 0.33, more than doubling the recall of any individual strategy. We applied cS2G to fine-mapping results for 49 UK Biobank diseases/traits to predict 5,095 causal SNP-gene-disease triplets (with S2G-derived functional interpretation) with high confidence. We further applied cS2G to provide an empirical assessment of disease omnigenicity; we determined that the top 1% of genes explained roughly half of the SNP heritability linked to all genes and that gene-level architectures vary with variant allele frequency.
Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Genome-Wide Association Study/methods , Phenotype , Polymorphism, Single Nucleotide/genetics
ABSTRACT
The canonical paradigm for converting genetic association to mechanism involves iteratively mapping individual associations to the proximal genes through which they act. In contrast, in the present study we demonstrate the feasibility of extracting biological insights from a very large region of the genome and leverage this strategy to study the genetic influences on autism. Using a new statistical approach, we identified the 33-Mb p-arm of chromosome 16 (16p) as harboring the greatest excess of autism's common polygenic influences. The region also includes the mechanistically cryptic and autism-associated 16p11.2 copy number variant. Analysis of RNA-sequencing data revealed that both the common polygenic influences within 16p and the 16p11.2 deletion were associated with decreased average gene expression across 16p. The transcriptional effects of the rare deletion and diffuse common variation were correlated at the level of individual genes and analysis of Hi-C data revealed patterns of chromatin contact that may explain this transcriptional convergence. These results reflect a new approach for extracting biological insight from genetic association data and suggest convergence of common and rare genetic influences on autism at 16p.
Subject(s)
Autistic Disorder , Humans , Autistic Disorder/genetics , DNA Copy Number Variations , Chromosomes , Chromosome Deletion , Chromosomes, Human, Pair 16/genetics
ABSTRACT
The genetic effect-size distribution of a disease describes the number of risk variants, the range of their effect sizes and sample sizes that will be required to discover them. Accurate estimation has been a challenge. Here I propose Fourier Mixture Regression (FMR), validating that it accurately estimates real and simulated effect-size distributions. Applied to summary statistics for ten diseases (average [Formula: see text]), FMR estimates that 100,000-1,000,000 cases will be required for genome-wide significant SNPs to explain 50% of SNP heritability. In such large studies, genome-wide significance becomes increasingly conservative, and less stringent thresholds achieve high true positive rates if confounding is controlled. Across traits, polygenicity varies, but the range of their effect sizes is similar. Compared with effect sizes in the top 10% of heritability, including most discovered thus far, those in the bottom 10-50% are orders of magnitude smaller and more numerous, spanning a large fraction of the genome.
Subject(s)
Genome-Wide Association Study , Models, Genetic , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide , Biological Specimen Banks , Fourier Analysis , Genetic Predisposition to Disease , Humans , Linkage Disequilibrium , Regression Analysis , United Kingdom
ABSTRACT
Disease variants identified by genome-wide association studies (GWAS) tend to overlap with expression quantitative trait loci (eQTLs), but it remains unclear whether this overlap is driven by gene expression levels 'mediating' genetic effects on disease. Here, we introduce a new method, mediated expression score regression (MESC), to estimate disease heritability mediated by the cis genetic component of gene expression levels. We applied MESC to GWAS summary statistics for 42 traits (average N = 323,000) and cis-eQTL summary statistics for 48 tissues from the Genotype-Tissue Expression (GTEx) consortium. Averaging across traits, only 11 ± 2% of heritability was mediated by assayed gene expression levels. Expression-mediated heritability was enriched in genes with evidence of selective constraint and genes with disease-appropriate annotations. Our results demonstrate that assayed bulk tissue eQTLs, although disease relevant, cannot explain the majority of disease heritability.
Subject(s)
Gene Expression , Genetic Predisposition to Disease , Genome-Wide Association Study/statistics & numerical data , Quantitative Trait Loci , Calibration , Genome-Wide Association Study/methods , Humans , Linkage Disequilibrium , Polymorphism, Single Nucleotide , Regression Analysis
ABSTRACT
Understanding the role of rare variants is important in elucidating the genetic basis of human disease. Negative selection can cause rare variants to have larger per-allele effect sizes than common variants. Here, we develop a method to estimate the minor allele frequency (MAF) dependence of SNP effect sizes. We use a model in which per-allele effect sizes have variance proportional to [p(1 − p)]^α, where p is the MAF and negative values of α imply larger effect sizes for rare variants. We estimate α for 25 UK Biobank diseases and complex traits. All traits produce negative α estimates, with best-fit mean of −0.38 (s.e. 0.02) across traits. Despite larger rare variant effect sizes, rare variants (MAF < 1%) explain less than 10% of total SNP-heritability for most traits analyzed. Using evolutionary modeling and forward simulations, we validate the α model of MAF-dependent trait effects and assess plausible values of relevant evolutionary parameters.
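As a rough illustration of the α model described in this abstract (a sketch, not the authors' estimation method), the snippet below evaluates how the relative per-allele effect-size variance, proportional to [p(1 − p)]^α, changes with MAF at the reported mean α = −0.38. The function names are hypothetical and the proportionality constant is set to 1. The per-SNP variance explained uses the standard 2p(1 − p)β² expression under Hardy-Weinberg equilibrium, which is an added assumption rather than something stated in the abstract.

```python
ALPHA_MEAN = -0.38  # best-fit mean alpha across traits, as reported in the abstract


def per_allele_variance(p: float, alpha: float) -> float:
    """Relative per-allele effect-size variance, [p(1 - p)]^alpha (constant set to 1)."""
    if not 0.0 < p <= 0.5:
        raise ValueError("p must be a minor allele frequency in (0, 0.5]")
    return (p * (1.0 - p)) ** alpha


def per_snp_variance_explained(p: float, alpha: float) -> float:
    """Relative variance explained by one SNP: 2p(1 - p) times per-allele variance.

    Assumes Hardy-Weinberg equilibrium (an assumption of this sketch).
    """
    return 2.0 * p * (1.0 - p) * per_allele_variance(p, alpha)


if __name__ == "__main__":
    common = 0.5
    for maf in (0.001, 0.01, 0.1, 0.5):
        effect_ratio = per_allele_variance(maf, ALPHA_MEAN) / per_allele_variance(common, ALPHA_MEAN)
        h2_ratio = per_snp_variance_explained(maf, ALPHA_MEAN) / per_snp_variance_explained(common, ALPHA_MEAN)
        print(f"MAF={maf:<6} per-allele variance ratio={effect_ratio:.2f} "
              f"per-SNP variance-explained ratio={h2_ratio:.4f}")
```

Because 1 + α is positive at α = −0.38, the per-SNP variance explained in this sketch still shrinks as MAF decreases even though per-allele effects grow. This is only one ingredient of the abstract's observation about rare variants and SNP-heritability, which also depends on how many variants fall in each frequency class.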
Subject(s)
Biological Specimen Banks , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Quantitative Trait, Heritable , Selection, Genetic , Algorithms , Alleles , Gene Frequency , Genotype , Humans , Models, Genetic , United Kingdom
ABSTRACT
Elevated serum urate levels cause gout and correlate with cardiometabolic diseases via poorly understood mechanisms. We performed a trans-ancestry genome-wide association study of serum urate in 457,690 individuals, identifying 183 loci (147 previously unknown) that improve the prediction of gout in an independent cohort of 334,880 individuals. Serum urate showed significant genetic correlations with many cardiometabolic traits, with genetic causality analyses supporting a substantial role for pleiotropy. Enrichment analysis, fine-mapping of urate-associated loci and colocalization with gene expression in 47 tissues implicated the kidney and liver as the main target organs and prioritized potentially causal genes and variants, including the transcriptional master regulators in the liver and kidney, HNF1A and HNF4A. Experimental validation showed that HNF4A transactivated the promoter of ABCG2, encoding a major urate transporter, in kidney cells, and that HNF4A p.Thr139Ile is a functional variant. Transcriptional coregulation within and across organs may be a general mechanism underlying the observed pleiotropy between urate and cardiometabolic traits.
Subject(s)
Cardiovascular Diseases/blood , Genetic Markers , Gout/blood , Metabolic Diseases/blood , Polymorphism, Single Nucleotide , Signal Transduction , Uric Acid/blood , ATP Binding Cassette Transporter, Subfamily G, Member 2/genetics , Cardiovascular Diseases/epidemiology , Cardiovascular Diseases/genetics , Cohort Studies , Genetic Loci , Genetic Predisposition to Disease , Genome-Wide Association Study , Gout/epidemiology , Gout/genetics , Hepatocyte Nuclear Factor 1-alpha/genetics , Hepatocyte Nuclear Factor 4/genetics , Humans , Kidney/metabolism , Kidney/pathology , Liver/metabolism , Liver/pathology , Metabolic Diseases/epidemiology , Metabolic Diseases/genetics , Neoplasm Proteins/genetics , Organ Specificity
ABSTRACT
In the version of this article originally published, there were errors in equations. In the HTML and PDF, the initial term of equation 10 was estimated GCP but should have been estimated standard error, while a 'hat' was missing from the first alpha in the second term of the expression at the end of the paragraph following equation (6) in the Methods. In addition, in the abstract in the PDF, a subscript 1 was used instead of a subscript 2 for the final term of the first fourth-moment expression. These errors have been corrected in the HTML, PDF and print versions of the paper.
ABSTRACT
Mendelian randomization, a method to infer causal relationships, is confounded by genetic correlations reflecting shared etiology. We developed a model in which a latent causal variable mediates the genetic correlation; trait 1 is partially genetically causal for trait 2 if it is strongly genetically correlated with the latent causal variable, quantified using the genetic causality proportion. We fit this model using mixed fourth moments [Formula: see text] and [Formula: see text] of marginal effect sizes for each trait; if trait 1 is causal for trait 2, then SNPs affecting trait 1 (large [Formula: see text]) will have correlated effects on trait 2 (large α₁α₂), but not vice versa. In simulations, our method avoided false positives due to genetic correlations, unlike Mendelian randomization. Across 52 traits (average n = 331,000), we identified 30 causal relationships with high genetic causality proportion estimates. Novel findings included a causal effect of low-density lipoprotein on bone mineral density, consistent with clinical trials of statins in osteoporosis.