RESUMEN
Many regions in the human genome vary in length among individuals due to variable numbers of tandem repeats (VNTRs). To assess the phenotypic impact of VNTRs genome-wide, we applied a statistical imputation approach to estimate the lengths of 9,561 autosomal VNTR loci in 418,136 unrelated UK Biobank participants and 838 GTEx participants. Association and statistical fine-mapping analyses identified 58 VNTRs that appeared to influence a complex trait in UK Biobank, 18 of which also appeared to modulate expression or splicing of a nearby gene. Non-coding VNTRs at TMCO1 and EIF3H appeared to generate the largest known contributions of common human genetic variation to risk of glaucoma and colorectal cancer, respectively. Each of these two VNTRs associated with a >2-fold range of risk across individuals. These results reveal a substantial and previously unappreciated role of non-coding VNTRs in human health and gene regulation.
Asunto(s)
Canales de Calcio , Neoplasias Colorrectales , Factor 3 de Iniciación Eucariótica , Glaucoma , Repeticiones de Minisatélite , Humanos , Canales de Calcio/genética , Neoplasias Colorrectales/genética , Genoma Humano , Glaucoma/genética , Polimorfismo Genético , Factor 3 de Iniciación Eucariótica/genéticaRESUMEN
The human genome contains hundreds of thousands of regions harboring copy-number variants (CNV). However, the phenotypic effects of most such polymorphisms are unknown because only larger CNVs have been ascertainable from SNP-array data generated by large biobanks. We developed a computational approach leveraging haplotype sharing in biobank cohorts to more sensitively detect CNVs. Applied to UK Biobank, this approach accounted for approximately half of all rare gene inactivation events produced by genomic structural variation. This CNV call set enabled a detailed analysis of associations between CNVs and 56 quantitative traits, identifying 269 independent associations (p < 5 × 10-8) likely to be causally driven by CNVs. Putative target genes were identifiable for nearly half of the loci, enabling insights into dosage sensitivity of these genes and uncovering several gene-trait relationships. These results demonstrate the ability of haplotype-informed analysis to provide insights into the genetic basis of human complex traits.
Asunto(s)
Herencia Multifactorial , Sitios de Carácter Cuantitativo , Humanos , Variaciones en el Número de Copia de ADN , Fenotipo , Genoma Humano , Polimorfismo de Nucleótido Simple/genéticaRESUMEN
Recent work has found increasing evidence of mitigated, incompletely penetrant phenotypes in heterozygous carriers of recessive Mendelian disease variants. We leveraged whole-exome imputation within the full UK Biobank cohort (n â¼ 500K) to extend such analyses to 3,475 rare variants curated from ClinVar and OMIM. Testing these variants for association with 58 quantitative traits yielded 102 significant associations involving variants previously implicated in 34 different diseases. Notable examples included a POR missense variant implicated in Antley-Bixler syndrome that associated with a 1.76 (SE 0.27) cm increase in height and an ABCA3 missense variant implicated in interstitial lung disease that associated with reduced FEV1/FVC ratio. Association analyses with 1,134 disease traits yielded five additional variant-disease associations. We also observed contrasting levels of recessiveness between two more-common, classical Mendelian diseases. Carriers of cystic fibrosis variants exhibited increased risk of several mitigated disease phenotypes, whereas carriers of spinal muscular atrophy alleles showed no evidence of altered phenotypes. Incomplete penetrance of cystic fibrosis carrier phenotypes did not appear to be mediated by common allelic variation on the functional haplotype. Our results show that many disease-associated recessive variants can produce mitigated phenotypes in heterozygous carriers and motivate further work exploring penetrance mechanisms.
Asunto(s)
Fenotipo del Síndrome de Antley-Bixler , Fibrosis Quística , Enfermedades Pulmonares Intersticiales , Alelos , Fenotipo del Síndrome de Antley-Bixler/genética , Fibrosis Quística/genética , Bases de Datos Factuales , Predisposición Genética a la Enfermedad , Humanos , Enfermedades Pulmonares Intersticiales/genética , Atrofia Muscular Espinal/genética , Penetrancia , Fenotipo , Reino UnidoRESUMEN
Single cell whole-genome sequencing (scWGS) is providing novel insights into the nature of genetic heterogeneity in normal and diseased cells. However, the whole-genome amplification process required for scWGS introduces biases into the resulting sequencing that can confound downstream analysis. Here, we present a statistical method, with an accompanying package PaSD-qc (Power Spectral Density-qc), that evaluates the properties and quality of single cell libraries. It uses a modified power spectral density to assess amplification uniformity, amplicon size distribution, autocovariance and inter-sample consistency as well as to identify chromosomes with aberrant read-density profiles due either to copy alterations or poor amplification. These metrics provide a standard way to compare the quality of single cell samples as well as yield information necessary to improve variant calling strategies. We demonstrate the usefulness of this tool in comparing the properties of scWGS protocols, identifying potential chromosomal copy number variation, determining chromosomal and subchromosomal regions of poor amplification, and selecting high-quality libraries from low-coverage data for deep sequencing. The software is available free and open-source at https://github.com/parklab/PaSDqc.
Asunto(s)
Secuenciación Completa del Genoma/normas , Variaciones en el Número de Copia de ADN , Humanos , Control de Calidad , Análisis de la Célula Individual/normas , Programas Informáticos , Secuenciación Completa del Genoma/métodosRESUMEN
We present a method for detecting evidence of natural selection in ancient DNA time-series data that leverages an opportunity not utilized in previous scans: testing for a consistent trend in allele frequency change over time. By applying this to 8433 West Eurasians who lived over the past 14000 years and 6510 contemporary people, we find an order of magnitude more genome-wide significant signals than previous studies: 347 independent loci with >99% probability of selection. Previous work showed that classic hard sweeps driving advantageous mutations to fixation have been rare over the broad span of human evolution, but in the last ten millennia, many hundreds of alleles have been affected by strong directional selection. Discoveries include an increase from ~0% to ~20% in 4000 years for the major risk factor for celiac disease at HLA-DQB1; a rise from ~0% to ~8% in 6000 years of blood type B; and fluctuating selection at the TYK2 tuberculosis risk allele rising from ~2% to ~9% from ~5500 to ~3000 years ago before dropping to ~3%. We identify instances of coordinated selection on alleles affecting the same trait, with the polygenic score today predictive of body fat percentage decreasing by around a standard deviation over ten millennia, consistent with the "Thrifty Gene" hypothesis that a genetic predisposition to store energy during food scarcity became disadvantageous after farming. We also identify selection for combinations of alleles that are today associated with lighter skin color, lower risk for schizophrenia and bipolar disease, slower health decline, and increased measures related to cognitive performance (scores on intelligence tests, household income, and years of schooling). These traits are measured in modern industrialized societies, so what phenotypes were adaptive in the past is unclear. We estimate selection coefficients at 9.9 million variants, enabling study of how Darwinian forces couple to allelic effects and shape the genetic architecture of complex traits.
RESUMEN
Copy number variants (CNVs) are among the largest genetic variants, yet CNVs have not been effectively ascertained in most genetic association studies. Here we ascertained protein-altering CNVs from UK Biobank whole-exome sequencing data (n = 468,570) using haplotype-informed methods capable of detecting subexonic CNVs and variation within segmental duplications. Incorporating CNVs into analyses of rare variants predicted to cause gene loss of function (LOF) identified 100 associations of predicted LOF variants with 41 quantitative traits. A low-frequency partial deletion of RGL3 exon 6 conferred one of the strongest protective effects of gene LOF on hypertension risk (odds ratio = 0.86 (0.82-0.90)). Protein-coding variation in rapidly evolving gene families within segmental duplications-previously invisible to most analysis methods-generated some of the human genome's largest contributions to variation in type 2 diabetes risk, chronotype and blood cell traits. These results illustrate the potential for new genetic insights from genomic variation that has escaped large-scale analysis to date.
Asunto(s)
Variaciones en el Número de Copia de ADN , Diabetes Mellitus Tipo 2 , Humanos , Variaciones en el Número de Copia de ADN/genética , Diabetes Mellitus Tipo 2/genética , Fenotipo , Estudios de Asociación Genética , ExonesRESUMEN
Klunk et al. analyzed ancient DNA data from individuals in London and Denmark before, during and after the Black Death [1], and argued that allele frequency changes at immune genes were too large to be produced by random genetic drift and thus must reflect natural selection. They also identified four specific variants that they claimed show evidence of selection including at ERAP2, for which they estimate a selection coefficient of 0.39-several times larger than any selection coefficient on a common human variant reported to date. Here we show that these claims are unsupported for four reasons. First, the signal of enrichment of large allele frequency changes in immune genes comparing people in London before and after the Black Death disappears after an appropriate randomization test is carried out: the P value increases by ten orders of magnitude and is no longer significant. Second, a technical error in the estimation of allele frequencies means that none of the four originally reported loci actually pass the filtering thresholds. Third, the filtering thresholds do not adequately correct for multiple testing. Finally, in the case of the ERAP2 variant rs2549794, which Klunk et al. show experimentally may be associated with a host interaction with Y. pestis, we find no evidence of significant frequency change either in the data that Klunk et al. report, or in published data spanning 2,000 years. While it remains plausible that immune genes were subject to natural selection during the Black Death, the magnitude of this selection and which specific genes may have been affected remains unknown.
RESUMEN
Structural variants (SVs) comprise the largest genetic variants, altering from 50 base pairs to megabases of DNA. However, SVs have not been effectively ascertained in most genetic association studies, leaving a key gap in our understanding of human complex trait genetics. We ascertained protein-altering SVs from UK Biobank whole-exome sequencing data (n=468,570) using haplotype-informed methods capable of detecting sub-exonic SVs and variation within segmental duplications. Incorporating SVs into analyses of rare variants predicted to cause gene loss-of-function (pLoF) identified 100 associations of pLoF variants with 41 quantitative traits. A low-frequency partial deletion of RGL3 exon 6 appeared to confer one of the strongest protective effects of gene LoF on hypertension risk (OR = 0.86 [0.82-0.90]). Protein-coding variation in rapidly-evolving gene families within segmental duplications-previously invisible to most analysis methods-appeared to generate some of the human genome's largest contributions to variation in type 2 diabetes risk, chronotype, and blood cell traits. These results illustrate the potential for new genetic insights from genomic variation that has escaped large-scale analysis to date.
RESUMEN
The genetic architecture of human diseases and complex traits has been extensively studied, but little is known about the relationship of causal disease effect sizes between proximal SNPs, which have largely been assumed to be independent. We introduce a new method, LD SNP-pair effect correlation regression (LDSPEC), to estimate the correlation of causal disease effect sizes of derived alleles between proximal SNPs, depending on their allele frequencies, LD, and functional annotations; LDSPEC produced robust estimates in simulations across various genetic architectures. We applied LDSPEC to 70 diseases and complex traits from the UK Biobank (average N=306K), meta-analyzing results across diseases/traits. We detected significantly nonzero effect correlations for proximal SNP pairs (e.g., -0.37±0.09 for low-frequency positive-LD 0-100bp SNP pairs) that decayed with distance (e.g., -0.07±0.01 for low-frequency positive-LD 1-10kb), varied with allele frequency (e.g., -0.15±0.04 for common positive-LD 0-100bp), and varied with LD between SNPs (e.g., +0.12±0.05 for common negative-LD 0-100bp) (because we consider derived alleles, positive-LD and negative-LD SNP pairs may yield very different results). We further determined that SNP pairs with shared functions had stronger effect correlations that spanned longer genomic distances, e.g., -0.37±0.08 for low-frequency positive-LD same-gene promoter SNP pairs (average genomic distance of 47kb (due to alternative splicing)) and -0.32±0.04 for low-frequency positive-LD H3K27ac 0-1kb SNP pairs. Consequently, SNP-heritability estimates were substantially smaller than estimates of the sum of causal effect size variances across all SNPs (ratio of 0.87±0.02 across diseases/traits), particularly for certain functional annotations (e.g., 0.78±0.01 for common Super enhancer SNPs)-even though these quantities are widely assumed to be equal. We recapitulated our findings via forward simulations with an evolutionary model involving stabilizing selection, implicating the action of linkage masking, whereby haplotypes containing linked SNPs with opposite effects on disease have reduced effects on fitness and escape negative selection.
RESUMEN
The genetic architecture of human diseases and complex traits has been extensively studied, but little is known about the relationship of causal disease effect sizes between proximal SNPs, which have largely been assumed to be independent. We introduce a new method, LD SNP-pair effect correlation regression (LDSPEC), to estimate the correlation of causal disease effect sizes of derived alleles between proximal SNPs, depending on their allele frequencies, LD, and functional annotations; LDSPEC produced robust estimates in simulations across various genetic architectures. We applied LDSPEC to 70 diseases and complex traits from the UK Biobank (average N=306K), meta-analyzing results across diseases/traits. We detected significantly nonzero effect correlations for proximal SNP pairs (e.g., -0.37±0.09 for low-frequency positive-LD 0-100bp SNP pairs) that decayed with distance (e.g., -0.07±0.01 for low-frequency positive-LD 1-10kb), varied with allele frequency (e.g., -0.15±0.04 for common positive-LD 0-100bp), and varied with LD between SNPs (e.g., +0.12±0.05 for common negative-LD 0-100bp) (because we consider derived alleles, positive-LD and negative-LD SNP pairs may yield very different results). We further determined that SNP pairs with shared functions had stronger effect correlations that spanned longer genomic distances, e.g., -0.37±0.08 for low-frequency positive-LD same-gene promoter SNP pairs (average genomic distance of 47kb (due to alternative splicing)) and -0.32±0.04 for low-frequency positive-LD H3K27ac 0-1kb SNP pairs. Consequently, SNP-heritability estimates were substantially smaller than estimates of the sum of causal effect size variances across all SNPs (ratio of 0.87±0.02 across diseases/traits), particularly for certain functional annotations (e.g., 0.78±0.01 for common Super enhancer SNPs)-even though these quantities are widely assumed to be equal. We recapitulated our findings via forward simulations with an evolutionary model involving stabilizing selection, implicating the action of linkage masking, whereby haplotypes containing linked SNPs with opposite effects on disease have reduced effects on fitness and escape negative selection.
RESUMEN
Exome association studies to date have generally been underpowered to systematically evaluate the phenotypic impact of very rare coding variants. We leveraged extensive haplotype sharing between 49,960 exome-sequenced UK Biobank participants and the remainder of the cohort (total n ≈ 500,000) to impute exome-wide variants with accuracy R2 > 0.5 down to minor allele frequency (MAF) ~0.00005. Association and fine-mapping analyses of 54 quantitative traits identified 1,189 significant associations (P < 5 × 10-8) involving 675 distinct rare protein-altering variants (MAF < 0.01) that passed stringent filters for likely causality. Across all traits, 49% of associations (578/1,189) occurred in genes with two or more hits; follow-up analyses of these genes identified allelic series containing up to 45 distinct 'likely-causal' variants. Our results demonstrate the utility of within-cohort imputation in population-scale genome-wide association studies, provide a catalog of likely-causal, large-effect coding variant associations and foreshadow the insights that will be revealed as genetic biobank studies continue to grow.
Asunto(s)
Bancos de Muestras Biológicas , Secuenciación del Exoma/estadística & datos numéricos , Frecuencia de los Genes , Proteínas/genética , Presión Sanguínea/genética , Mapeo Cromosómico/métodos , Mapeo Cromosómico/estadística & datos numéricos , Marcadores Genéticos , Estudio de Asociación del Genoma Completo/estadística & datos numéricos , Humanos , Desequilibrio de Ligamiento , Proteínas de la Membrana/genética , Modelos Genéticos , Fenotipo , Polimorfismo de Nucleótido Simple , Proteínas/metabolismo , Receptores del Factor Natriurético Atrial/genética , Reino Unido , Secuenciación del Exoma/métodosRESUMEN
Many human proteins contain domains that vary in size or copy number because of variable numbers of tandem repeats (VNTRs) in protein-coding exons. However, the relationships of VNTRs to most phenotypes are unknown because of difficulties in measuring such repetitive elements. We developed methods to estimate VNTR lengths from whole-exome sequencing data and impute VNTR alleles into single-nucleotide polymorphism haplotypes. Analyzing 118 protein-altering VNTRs in 415,280 UK Biobank participants for association with 786 phenotypes identified some of the strongest associations of common variants with human phenotypes, including height, hair morphology, and biomarkers of health. Accounting for large-effect VNTRs further enabled fine-mapping of associations to many more protein-coding mutations in the same genes. These results point to cryptic effects of highly polymorphic common structural variants that have eluded molecular analyses to date.
Asunto(s)
Genoma Humano , Repeticiones de Minisatélite/genética , Fenotipo , Polimorfismo Genético , Agrecanos/genética , Antígenos/genética , Población Negra , Estatura/genética , Estudios de Asociación Genética , Cabello , Haplotipos , Humanos , Proteínas de Filamentos Intermediarios/genética , Riñón/fisiología , Lipoproteína(a)/sangre , Lipoproteína(a)/genética , Mucina-1/genética , Polimorfismo de Nucleótido Simple , Polinucleotido Adenililtransferasa/genética , Población Blanca/genética , Secuenciación del ExomaRESUMEN
Hundreds of the proteins encoded in human genomes contain domains that vary in size or copy number due to variable numbers of tandem repeats (VNTRs) in protein-coding exons. VNTRs have eluded analysis by the molecular methods-SNP arrays and high-throughput sequencing-used in large-scale human genetic studies to date; thus, the relationships of VNTRs to most human phenotypes are unknown. We developed ways to estimate VNTR lengths from whole-exome sequencing data, identify the SNP haplotypes on which VNTR alleles reside, and use imputation to project these haplotypes into abundant SNP data. We analyzed 118 protein-altering VNTRs in 415,280 UK Biobank participants for association with 791 phenotypes. Analysis revealed some of the strongest associations of common variants with human phenotypes including height, hair morphology, and biomarkers of human health; for example, a VNTR encoding 13-44 copies of a 19-amino-acid repeat in the chondroitin sulfate domain of aggrecan (ACAN) associated with height variation of 3.4 centimeters (s.e. 0.3 cm). Incorporating large-effect VNTRs into analysis also made it possible to map many additional effects at the same loci: for the blood biomarker lipoprotein(a), for example, analysis of the kringle IV-2 VNTR within the LPA gene revealed that 18 coding SNPs and the VNTR in LPA explained 90% of lipoprotein(a) heritability in Europeans, enabling insights about population differences and epidemiological significance of this clinical biomarker. These results point to strong, cryptic effects of highly polymorphic common structural variants that have largely eluded molecular analyses to date.
RESUMEN
Although germline de novo copy number variants (CNVs) are known causes of autism spectrum disorder (ASD), the contribution of mosaic (early-developmental) copy number variants (mCNVs) has not been explored. In this study, we assessed the contribution of mCNVs to ASD by ascertaining mCNVs in genotype array intensity data from 12,077 probands with ASD and 5,500 unaffected siblings. We detected 46 mCNVs in probands and 19 mCNVs in siblings, affecting 2.8-73.8% of cells. Probands carried a significant burden of large (>4-Mb) mCNVs, which were detected in 25 probands but only one sibling (odds ratio = 11.4, 95% confidence interval = 1.5-84.2, P = 7.4 × 10-4). Event size positively correlated with severity of ASD symptoms (P = 0.016). Surprisingly, we did not observe mosaic analogues of the short de novo CNVs recurrently observed in ASD (eg, 16p11.2). We further experimentally validated two mCNVs in postmortem brain tissue from 59 additional probands. These results indicate that mCNVs contribute a previously unexplained component of ASD risk.
Asunto(s)
Trastorno del Espectro Autista/genética , Variaciones en el Número de Copia de ADN , Mosaicismo , Adulto , Trastorno del Espectro Autista/epidemiología , Autopsia , Química Encefálica/genética , Niño , Trastornos Generalizados del Desarrollo Infantil/genética , Estudios de Cohortes , Predisposición Genética a la Enfermedad , Genotipo , Mutación de Línea Germinal , Humanos , Medición de Riesgo , Bancos de TejidosRESUMEN
Whole-genome sequencing of DNA from single cells has the potential to reshape our understanding of mutational heterogeneity in normal and diseased tissues. However, a major difficulty is distinguishing amplification artifacts from biologically derived somatic mutations. Here, we describe linked-read analysis (LiRA), a method that accurately identifies somatic single-nucleotide variants (sSNVs) by using read-level phasing with nearby germline heterozygous polymorphisms, thereby enabling the characterization of mutational signatures and estimation of somatic mutation rates in single cells.
Asunto(s)
Mutación/genética , Análisis Mutacional de ADN/métodos , Heterocigoto , Humanos , Tasa de Mutación , Polimorfismo de Nucleótido Simple/genética , Análisis de Secuencia de ADN/métodos , Análisis de la Célula Individual/métodos , Secuenciación Completa del Genoma/métodosRESUMEN
It has long been hypothesized that aging and neurodegeneration are associated with somatic mutation in neurons; however, methodological hurdles have prevented testing this hypothesis directly. We used single-cell whole-genome sequencing to perform genome-wide somatic single-nucleotide variant (sSNV) identification on DNA from 161 single neurons from the prefrontal cortex and hippocampus of 15 normal individuals (aged 4 months to 82 years), as well as 9 individuals affected by early-onset neurodegeneration due to genetic disorders of DNA repair (Cockayne syndrome and xeroderma pigmentosum). sSNVs increased approximately linearly with age in both areas (with a higher rate in hippocampus) and were more abundant in neurodegenerative disease. The accumulation of somatic mutations with age-which we term genosenium-shows age-related, region-related, and disease-related molecular signatures and may be important in other human age-associated conditions.