ABSTRACT
Despite strong transethnic genetic correlations reported in the literature for many complex traits, the non-transferability of polygenic risk scores across populations suggests the presence of population-specific components of genetic architecture. We propose an approach that models GWAS summary data for one trait in two populations to estimate genome-wide proportions of population-specific/shared causal SNPs. In simulations across various genetic architectures, we show that our approach yields approximately unbiased estimates with in-sample LD and slight upward-bias with out-of-sample LD. We analyze nine complex traits in individuals of East Asian and European ancestry, restricting to common SNPs (MAF > 5%), and find that most common causal SNPs are shared by both populations. Using the genome-wide estimates as priors in an empirical Bayes framework, we perform fine-mapping and observe that high-posterior SNPs (for both the population-specific and shared causal configurations) have highly correlated effects in East Asians and Europeans. In population-specific GWAS risk regions, we observe a 2.8× enrichment of shared high-posterior SNPs, suggesting that population-specific GWAS risk regions harbor shared causal SNPs that are undetected in the other GWASs due to differences in LD, allele frequencies, and/or sample size. Finally, we report enrichments of shared high-posterior SNPs in 53 tissue-specific functional categories and find evidence that SNP-heritability enrichments are driven largely by many low-effect common SNPs.
Subject(s)
Ethnicity/genetics , Genome-Wide Association Study , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics , Bayes Theorem , Europe/ethnology , Asia, Eastern/ethnology , Gene Frequency , Humans , Linkage Disequilibrium , Organ Specificity/genetics
ABSTRACT
Regulatory variation plays a major role in complex disease, and cell type-specific binding of transcription factors (TF) is critical to gene regulation. However, assessing the contribution of genetic variation in TF-binding sites to disease heritability is challenging, as binding is often cell type-specific and annotations from directly measured TF binding are not currently available for most cell type-TF pairs. We investigate approaches to annotate TF binding, including directly measured chromatin data and sequence-based predictions. We find that TF-binding annotations constructed by intersecting sequence-based TF-binding predictions with cell type-specific chromatin data explain a large fraction of heritability across a broad set of diseases and corresponding cell types; this strategy of constructing annotations addresses both the limitation that identical sequences may be bound or unbound depending on surrounding chromatin context and the limitation that sequence-based predictions are generally not cell type-specific. We partitioned the heritability of 49 diseases and complex traits using stratified linkage disequilibrium (LD) score regression with the baseline-LD model (which is not cell type-specific) plus the new annotations. We determined that 100 bp windows around MotifMap sequence-based TF-binding predictions intersected with a union of six cell type-specific chromatin marks (imputed using ChromImpute) performed best, with a 58% increase in heritability enrichment compared to the chromatin marks alone (11.6× vs. 7.3×, P = 9 × 10^-14 for difference) and a 20% increase in cell type-specific signal conditional on annotations from the baseline-LD model (P = 8 × 10^-11 for difference). Our results show that TF-binding annotations explain substantial disease heritability and can help refine genome-wide association signals.
Subject(s)
Chromatin/genetics , Genetic Diseases, Inborn/genetics , Molecular Sequence Annotation , Transcription Factors/genetics , Binding Sites/genetics , Computational Biology , Gene Expression Regulation/genetics , Genetic Diseases, Inborn/classification , Genetic Diseases, Inborn/pathology , Humans , Linkage Disequilibrium/genetics , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics , Protein Binding/genetics
ABSTRACT
Functional genomics data has the potential to increase GWAS power by identifying SNPs that have a higher prior probability of association. Here, we introduce a method that leverages polygenic functional enrichment to incorporate coding, conserved, regulatory, and LD-related genomic annotations into association analyses. We show via simulations with real genotypes that the method, functionally informed novel discovery of risk loci (FINDOR), correctly controls the false-positive rate at null loci and attains a 9%-38% increase in the number of independent associations detected at causal loci, depending on trait polygenicity and sample size. We applied FINDOR to 27 independent complex traits and diseases from the interim UK Biobank release (average N = 130K). Averaged across traits, we attained a 13% increase in genome-wide significant loci detected (including a 20% increase for disease traits) compared to unweighted raw p values that do not use functional data. We replicated the additional loci in independent UK Biobank and non-UK Biobank data, yielding a highly statistically significant replication slope (0.66-0.69) in each case. Finally, we applied FINDOR to the full UK Biobank release (average N = 416K), attaining smaller relative improvements (consistent with simulations) but larger absolute improvements, detecting an additional 583 GWAS loci. In conclusion, leveraging functional enrichment using our method robustly increases GWAS power.
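The idea of reweighting association p values by a functional prior can be illustrated with a generic weighted-threshold sketch. This is not the FINDOR implementation: the function name `weighted_significant`, the per-SNP `weights` input, and the 5 × 10^-8 threshold are assumptions made for illustration, and the sketch says nothing about how weights would be estimated from annotations. It only shows the standard requirement that weights average to one, so that up-weighting some SNPs is paid for by down-weighting others.

```python
def weighted_significant(p_values, weights, alpha=5e-8):
    """Flag SNPs whose weighted p value (p / w) passes the threshold alpha.

    Hypothetical helper: `weights` are assumed prior weights per SNP,
    rescaled here to have mean 1 so the overall error budget is preserved.
    """
    if len(p_values) != len(weights):
        raise ValueError("p_values and weights must have the same length")
    mean_weight = sum(weights) / len(weights)
    if mean_weight <= 0:
        raise ValueError("weights must have a positive mean")
    # Rescale so that the weights average to exactly 1.
    scaled = [w / mean_weight for w in weights]
    # A SNP with zero weight can never be declared significant.
    return [w > 0 and p / w <= alpha for p, w in zip(p_values, scaled)]


# Two SNPs with the same raw p value: only the up-weighted one passes.
flags = weighted_significant([4e-8, 8e-8, 8e-8, 1e-3], [0.5, 2.0, 0.5, 1.0])
print(flags)
```

In this toy input, the first SNP would pass an unweighted threshold but is down-weighted below it, while the second SNP passes only because of its larger weight.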
Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics , Calibration , Databases, Genetic , Datasets as Topic , False Positive Reactions , Humans , Probability , Reproducibility of Results , United Kingdom
ABSTRACT
Although recent studies provide evidence for a common genetic basis between complex traits and Mendelian disorders, a thorough quantification of their overlap in a phenotype-specific manner remains elusive. Here, we have quantified the overlap of genes identified through large-scale genome-wide association studies (GWASs) for 62 complex traits and diseases with genes containing mutations known to cause 20 broad categories of Mendelian disorders. We identified a significant enrichment of genes linked to phenotypically matched Mendelian disorders in GWAS gene sets; of the total 1,240 comparisons, a higher proportion of phenotypically matched or related pairs (n = 50 of 92 [54%]) than phenotypically unmatched pairs (n = 27 of 1,148 [2%]) demonstrated significant overlap, confirming a phenotype-specific enrichment pattern. Further, we observed elevated GWAS effect sizes near genes linked to phenotypically matched Mendelian disorders. Finally, we report examples of GWAS variants localized at the transcription start site or physically interacting with the promoters of genes linked to phenotypically matched Mendelian disorders. Our results are consistent with the hypothesis that genes that are disrupted in Mendelian disorders are dysregulated by non-coding variants in complex traits and demonstrate how leveraging findings from related Mendelian disorders and functional genomic datasets can prioritize genes that are putatively dysregulated by local and distal non-coding GWAS variants.
Subject(s)
Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics , Female , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Humans , Male , Phenotype , Promoter Regions, Genetic/genetics , Transcription Initiation Site/physiology
ABSTRACT
Recent studies have identified thousands of regions in the genome associated with chromatin modifications, which may in turn be affecting gene expression. Existing works have used heuristic methods to investigate the relationships between genome, epigenome, and gene expression, but, to our knowledge, none have explicitly modeled the chain of causality whereby genetic variants impact chromatin, which impacts gene expression. In this work we introduce a new hierarchical fine-mapping framework that integrates information across all three levels of data to better identify the causal variant and chromatin mark that are concordantly influencing gene expression. In simulations we show that our method is more accurate than existing approaches at identifying the causal mark influencing expression. We analyze empirical genetic, chromatin, and gene expression data from 65 African-ancestry and 47 European-ancestry individuals and show that many of the paths prioritized by our method are consistent with the proposed causal model and often lie in likely functional regions.
Subject(s)
Chromatin/genetics , Chromosome Mapping/methods , Gene Expression , Black People/genetics , Chromatin/metabolism , Databases, Genetic , Genetic Loci , Genetic Markers , Genetic Predisposition to Disease , Genetic Variation , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Models, Genetic , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Statistics as Topic/methods , White People/genetics
ABSTRACT
Although genome-wide association studies (GWASs) have identified thousands of risk loci for many complex traits and diseases, the causal variants and genes at these loci remain largely unknown. Here, we introduce a method for estimating the local genetic correlation between gene expression and a complex trait and utilize it to estimate the genetic correlation due to predicted expression between pairs of traits. We integrated gene expression measurements from 45 expression panels with summary GWAS data to perform 30 multi-tissue transcriptome-wide association studies (TWASs). We identified 1,196 genes whose expression is associated with these traits; of these, 168 reside more than 0.5 Mb away from any previously reported GWAS significant variant. We then used our approach to find 43 pairs of traits with significant genetic correlation at the level of predicted expression; of these, eight were not found through genetic correlation at the SNP level. Finally, we used bi-directional regression to find evidence that BMI causally influences triglyceride levels and that triglyceride levels causally influence low-density lipoprotein. Together, our results provide insight into the role of gene expression in the susceptibility of complex traits and diseases.
Subject(s)
Gene Expression Profiling , Genome-Wide Association Study , Phenotype , Transcriptome , Body Mass Index , Genetic Predisposition to Disease , Humans , Linkage Disequilibrium , Lipoproteins, LDL/blood , Models, Theoretical , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Sequence Analysis, RNA , Triglycerides/blood
ABSTRACT
Recent successes in genome-wide association studies (GWASs) make it possible to address important questions about the genetic architecture of complex traits, such as allele frequency and effect size. One lesser-known aspect of complex traits is the extent of allelic heterogeneity (AH) arising from multiple causal variants at a locus. We developed a computational method to infer the probability of AH and applied it to three GWASs and four expression quantitative trait loci (eQTL) datasets. We identified a total of 4,152 loci with strong evidence of AH. The proportion of all loci with identified AH is 4%-23% in eQTLs, 35% in GWASs of high-density lipoprotein (HDL), and 23% in GWASs of schizophrenia. For eQTLs, we observed a strong correlation between sample size and the proportion of loci with AH (R^2 = 0.85, p = 2.2 × 10^-16), indicating that statistical power prevents identification of AH in other loci. Understanding the extent of AH may guide the development of new methods for fine mapping and association mapping of complex traits.
Subject(s)
Alleles , Gene Frequency , Quantitative Trait Loci , Databases, Genetic , Genetic Association Studies , Humans , Linkage Disequilibrium , Models, Molecular , Phenotype
ABSTRACT
Although genome-wide association studies (GWAS) have identified hundreds of risk loci for breast and prostate cancer, only a few studies have characterized the GWAS association signals across functional genomic annotations with a particular focus on single nucleotide polymorphisms (SNPs) located in DNA regulatory elements. In this study, we investigated the enrichment pattern of GWAS signals for breast and prostate cancer in genomic functional regions located in normal tissue and cancer cell lines. We quantified the overall enrichment of SNPs with breast and prostate cancer association p values < 1 × 10^-8 across regulatory categories. We then obtained annotations for DNaseI hypersensitive sites (DHS), typical enhancers, and super enhancers across multiple tissue types, to assess if significant GWAS signals were selectively enriched in annotations found in disease-related tissue. Finally, we quantified the enrichment of breast and prostate cancer SNP heritability in regulatory regions, and compared the enrichment pattern of SNP heritability with GWAS signals. DHS, typical enhancers, and super enhancers identified in the breast cancer cell line MCF-7 were observed with the highest enrichment of genome-wide significant variants for breast cancer. For prostate cancer, GWAS signals were mostly enriched in DHS and typical enhancers identified in the prostate cancer cell line LNCaP. With progressively stringent GWAS p value thresholds, an increasing trend of enrichment was observed for both diseases in DHS, typical enhancers, and super enhancers located in disease-related tissue. Results from heritability enrichment analysis supported the selective enrichment pattern of functional genomic regions in disease-related cell lines for both breast and prostate cancer.
Our results suggest the importance of studying functional annotations identified in disease-related tissues when characterizing GWAS results, and further demonstrate the role of germline DNA regulatory elements from disease-related tissue in breast and prostate carcinogenesis.
Subject(s)
Breast Neoplasms/genetics , Genetic Predisposition to Disease , Genetic Variation , Prostatic Neoplasms/genetics , Regulatory Sequences, Nucleic Acid , Biomarkers, Tumor , Cell Line, Tumor , Computational Biology/methods , Female , Genetic Association Studies , Genome-Wide Association Study , Humans , Male , Molecular Sequence Annotation , Organ Specificity
ABSTRACT
Variance-component methods that estimate the aggregate contribution of large sets of variants to the heritability of complex traits have yielded important insights into the genetic architecture of common diseases. Here, we introduce methods that estimate the total trait variance explained by the typed variants at a single locus in the genome (local SNP heritability) from genome-wide association study (GWAS) summary data while accounting for linkage disequilibrium among variants. We applied our estimator to ultra-large-scale GWAS summary data of 30 common traits and diseases to gain insights into their local genetic architecture. First, we found that common SNPs have a high contribution to the heritability of all studied traits. Second, we identified traits for which the majority of the SNP heritability can be confined to a small percentage of the genome. Third, we identified GWAS risk loci where the entire locus explains significantly more variance in the trait than the GWAS reported variants. Finally, we identified loci that explain a significant amount of heritability across multiple traits.
Subject(s)
Genome-Wide Association Study , Models, Genetic , Phenotype , Datasets as Topic , Genetic Predisposition to Disease/genetics , Humans , Linkage Disequilibrium/genetics , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics , Quantitative Trait Loci/genetics
ABSTRACT
Genome-wide association studies have identified over 150 loci associated with lipid traits; however, no large-scale studies exist for Hispanics and other minority populations. Additionally, the genetic architecture of lipid-influencing loci remains largely unknown. We performed one of the most racially/ethnically diverse fine-mapping genetic studies of HDL-C, LDL-C, and triglycerides to date using SNPs on the MetaboChip array on 54,119 individuals: 21,304 African Americans, 19,829 Hispanic Americans, 12,456 Asians, and 530 American Indians. The majority of signals found in these groups generalize to European Americans. While we uncovered signals unique to racial/ethnic populations, we also observed systematically consistent lipid associations across these groups. In African Americans, we identified three novel signals associated with HDL-C (LPL, APOA5, LCAT) and two associated with LDL-C (ABCG8, DHODH). In addition, using this population, we refined the location for 16 out of the 58 known MetaboChip lipid loci. These results can guide tailored screening efforts, reveal population-specific responses to lipid-lowering medications, and aid in the development of new targeted drug therapies.
Subject(s)
Cholesterol, HDL/genetics , Cholesterol, LDL/genetics , Genome-Wide Association Study , Lipids/genetics , ATP Binding Cassette Transporter, Subfamily G, Member 8/genetics , Black or African American/genetics , Apolipoprotein A-V/genetics , Asian People/genetics , Female , Hispanic or Latino/genetics , Humans , Indians, North American/genetics , Lipoprotein Lipase/genetics , Male , Triglycerides/genetics
ABSTRACT
Localization of causal variants underlying known risk loci is one of the main research challenges following genome-wide association studies. Risk loci are typically dissected through fine-mapping experiments in trans-ethnic cohorts for leveraging the variability in the local genetic structure across populations. More recent works have shown that genomic functional annotations (i.e., localization of tissue-specific regulatory marks) can be integrated for increasing fine-mapping performance within single-population studies. Here, we introduce methods that integrate the strength of association between genotype and phenotype, the variability in the genetic backgrounds across populations, and the genomic map of tissue-specific functional elements to increase trans-ethnic fine-mapping accuracy. Through extensive simulations and empirical data, we have demonstrated that our approach increases fine-mapping resolution over existing methods. We analyzed empirical data from a large-scale trans-ethnic rheumatoid arthritis (RA) study and showed that the functional genetic architecture of RA is consistent across European and Asian ancestries. In these data, we used our proposed methods to reduce the average size of the 90% credible set from 29 variants per locus for standard non-integrative approaches to 22 variants.
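For readers unfamiliar with the credible-set metric reported here, the following is a minimal sketch of how a 90% credible set can be read off per-variant posterior probabilities. It assumes exactly one causal variant at the locus, which is a simplification and not the model described in this abstract; the function name `credible_set` and the example posteriors are hypothetical.

```python
def credible_set(posteriors, level=0.9):
    """Return indices of the smallest set of variants whose posterior
    probabilities sum to at least `level`.

    Simplifying assumption: a single causal variant per locus, so the
    posteriors can be normalized to sum to 1 across the locus.
    """
    total = sum(posteriors)
    if total <= 0:
        raise ValueError("posteriors must have a positive sum")
    normalized = [p / total for p in posteriors]
    # Visit variants from most to least probable.
    order = sorted(range(len(normalized)), key=lambda i: normalized[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += normalized[i]
        if cumulative >= level:
            break
    return chosen


# Hypothetical locus with four variants.
print(credible_set([0.5, 0.3, 0.15, 0.05]))
```

Under this simplification, a sharper posterior (for example, one informed by functional annotations or by a second population with different LD) concentrates probability on fewer variants and therefore yields a smaller set at the same level.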
Subject(s)
Arthritis, Rheumatoid/genetics , Chromosome Mapping/methods , Genetic Variation , Genetics, Population/methods , Models, Genetic , Molecular Sequence Annotation/methods , Risk Assessment/methods , Bayes Theorem , Ethnicity/genetics , Humans
ABSTRACT
MOTIVATION: Expression quantitative trait loci (eQTLs), genetic variants associated with gene expression levels, are identified in eQTL mapping studies. Such studies typically test for an association between single nucleotide polymorphisms (SNPs) and expression under an additive model, which ignores interaction and haplotypic effects. Mismatches between the model tested and the underlying genetic architecture can lead to a loss of association power. Here we introduce a new haplotype-based test for eQTL studies that looks for haplotypic effects on expression levels. Our test is motivated by compound heterozygous architectures, a common disease model for recessive monogenic disorders, where two different alleles can have the same effect on a gene's function. RESULTS: When the underlying true causal architecture for a simulated gene is a compound heterozygote, our method is better able to capture the signal than the marginal SNP method. When the underlying model is a single SNP, there is no difference in the power of our method relative to the marginal SNP method. We apply our method to empirical gene expression data measured in 373 European individuals from the GEUVADIS study and find 29 more eGenes (genes with at least one association) than the standard marginal SNP method. Furthermore, in 974 of the 3529 total eGenes, our haplotype-based method results in a stronger association signal than the standard marginal SNP method. This demonstrates our method both increases power over the standard method and provides evidence of haplotypic architectures regulating gene expression. AVAILABILITY AND IMPLEMENTATION: http://bogdan.bioinformatics.ucla.edu/software/. CONTACT: rob.brown@ucla.edu or pasaniuc@ucla.edu.
Subject(s)
Gene Expression Regulation , Haplotypes , Models, Genetic , Quantitative Trait Loci , Statistics as Topic , Genetic Association Studies/methods , Humans , Polymorphism, Single Nucleotide
ABSTRACT
MOTIVATION: Genome-wide association studies (GWAS) have identified thousands of regions in the genome that contain genetic variants that increase risk for complex traits and diseases. However, the variants uncovered in GWAS are typically not biologically causal, but rather, correlated to the true causal variant through linkage disequilibrium (LD). To discern the true causal variant(s), a variety of statistical fine-mapping methods have been proposed to prioritize variants for functional validation. RESULTS: In this work we introduce a new approach, fastPAINTOR, that leverages evidence across correlated traits, as well as functional annotation data, to improve fine-mapping accuracy at pleiotropic risk loci. To improve computational efficiency, we describe a new importance sampling scheme to perform model inference. First, we demonstrate in simulations that by leveraging functional annotation data, fastPAINTOR increases fine-mapping resolution relative to existing methods. Next, we show that jointly modeling pleiotropic risk regions improves fine-mapping resolution compared to standard single trait and pleiotropic fine mapping strategies. We report a reduction in the number of SNPs required for follow-up in order to capture 90% of the causal variants from 23 SNPs per locus using a single trait to 12 SNPs when fine-mapping two traits simultaneously. Finally, we analyze summary association data from a large-scale GWAS of lipids and show that these improvements are largely sustained in real data. AVAILABILITY AND IMPLEMENTATION: The fastPAINTOR framework is implemented in the PAINTOR v3.0 package, which is publicly available to the research community at http://bogdan.bioinformatics.ucla.edu/software/paintor. CONTACT: gkichaev@ucla.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Chromosome Mapping/methods , Genetic Loci , Genetic Pleiotropy , Genomics/methods , Polymorphism, Single Nucleotide , Software , Genetic Diseases, Inborn/genetics , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Lipid Metabolism/genetics , Models, Genetic
ABSTRACT
Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
Subject(s)
Algorithms , Chromosome Mapping/methods , Genome-Wide Association Study/methods , Humans , Linkage Disequilibrium , Models, Theoretical , Polymorphism, Single Nucleotide/genetics
ABSTRACT
MOTIVATION: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider 'causal variants' as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations. RESULTS: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability ρ. Through extensive simulations, we demonstrate that our method not only speeds up computation, but also has an average of 10% higher recall rate compared with the existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2. AVAILABILITY AND IMPLEMENTATION: Software is freely available for download at genetics.cs.ucla.edu/caviar.
Subject(s)
Genes , Genome-Wide Association Study , Algorithms , Animals , Apolipoprotein A-II/genetics , Data Interpretation, Statistical , Humans , Linkage Disequilibrium , Mice , Polymorphism, Single Nucleotide
ABSTRACT
Transcriptome-wide association studies using predicted expression have identified thousands of genes whose locally regulated expression is associated with complex traits and diseases. In this work, we show that linkage disequilibrium induces significant gene-trait associations at non-causal genes as a function of the expression quantitative trait loci weights used in expression prediction. We introduce a probabilistic framework that models correlation among transcriptome-wide association study signals to assign a probability for every gene in the risk region to explain the observed association signal. Importantly, our approach remains accurate when expression data for causal genes are not available in the causal tissue by leveraging expression prediction from other tissues. Our approach yields credible sets of genes containing the causal gene at a nominal confidence level (for example, 90%) that can be used to prioritize genes for functional assays. We illustrate our approach by using an integrative analysis of lipid traits, where our approach prioritizes genes with strong evidence for causality.
Subject(s)
Genetic Predisposition to Disease/genetics , Transcriptome/genetics , Chromosome Mapping/methods , Genome-Wide Association Study/methods , Humans , Linkage Disequilibrium/genetics , Models, Genetic , Phenotype , Polymorphism, Single Nucleotide/genetics , Probability , Quantitative Trait Loci/genetics
ABSTRACT
Nearly 100 loci have been identified for pulmonary function, almost exclusively in studies of European ancestry populations. We extend previous research by meta-analyzing genome-wide association studies of 1000 Genomes imputed variants in relation to pulmonary function in a multiethnic population of 90,715 individuals of European (N = 60,552), African (N = 8429), Asian (N = 9959), and Hispanic/Latino (N = 11,775) ethnicities. We identify over 50 additional loci at genome-wide significance in ancestry-specific or multiethnic meta-analyses. Using recent fine-mapping methods incorporating functional annotation, gene expression, and differences in linkage disequilibrium between ethnicities, we further shed light on potential causal variants and genes at known and newly identified loci. Several of the novel genes encode proteins with predicted or established drug targets, including KCNK2 and CDK12. Our study highlights the utility of multiethnic and integrative genomics approaches to extend existing knowledge of the genetics of lung function and clinical relevance of implicated loci.
Subject(s)
Genome-Wide Association Study , Linkage Disequilibrium , Lung Diseases/ethnology , Lung Diseases/genetics , Lung/physiology , Polymorphism, Single Nucleotide , Asian , Black People/genetics , Female , Forced Expiratory Volume , Genetic Predisposition to Disease , Genomics , Hispanic or Latino , Humans , Male , Pulmonary Disease, Chronic Obstructive , Quantitative Trait Loci , Regression Analysis , Sample Size , Smoking , Vital Capacity , White People/genetics
ABSTRACT
Recent breakthroughs in exome-sequencing technology have made possible the identification of many causal variants of monogenic disorders. Although extremely powerful when closely related individuals (eg, child and parents) are simultaneously sequenced, sequencing of a single case is often unsuccessful due to the large number of variants that need to be followed up for functional validation. Many approaches filter out common variants above a given frequency threshold (eg, 1%), and then prioritize the remaining variants according to their functional, structural and conservation properties. Here we present methods that leverage the genetic structure across different populations to improve filtering performance while accounting for the finite sample size of the reference panels. We show that leveraging genetic structure reduces the number of variants that need to be followed up by 16% in simulations and by up to 38% in empirical data of 20 exomes from individuals with monogenic disorders for which the causal variants are known.
Subject(s)
Algorithms , Computational Biology/methods , Exome , Genetic Diseases, Inborn/genetics , Models, Statistical , Polymorphism, Single Nucleotide , Computer Simulation , Female , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/ethnology , Genetic Variation , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Inheritance Patterns , Male , Pedigree , Racial Groups , Sequence Analysis, DNA
ABSTRACT
Although genome-wide association studies have identified over 100 risk loci that explain ~33% of familial risk for prostate cancer (PrCa), their functional effects on risk remain largely unknown. Here we use genotype data from 59,089 men of European and African American ancestries combined with cell-type-specific epigenetic data to build a genomic atlas of single-nucleotide polymorphism (SNP) heritability in PrCa. We find significant differences in heritability between variants in prostate-relevant epigenetic marks defined in normal versus tumour tissue as well as between tissue and cell lines. The majority of SNP heritability lies in regions marked by H3K27 acetylation in prostate adenocarcinoma cell line (LNCaP) or by DNaseI hypersensitive sites in cancer cell lines. We find a high degree of similarity between European and African American ancestries suggesting a similar genetic architecture from common variation underlying PrCa risk. Our findings showcase the power of integrating functional annotation with genetic data to understand the genetic basis of PrCa.