ABSTRACT
Genomic deletions provide a powerful loss-of-function model in noncoding regions to assess the role of purifying selection on genetic variation. Regulatory element function is characterized by nonuniform tissue and cell type activity, necessarily linking the study of fitness consequences from regulatory variants to their corresponding cellular activity. We generated a callset of deletions from genomes in the Alzheimer's Disease Neuroimaging Initiative (ADNI) and used deletions from The 1000 Genomes Project Consortium (1000GP) in order to examine whether purifying selection preserves noncoding sites of chromatin accessibility marked by DNase I hypersensitivity (DHS), histone modification (enhancer, transcribed, Polycomb-repressed, heterochromatin), and chromatin loop anchors. To examine this in a cellular activity-aware manner, we developed a statistical method, pleiotropy ratio score (PlyRS), which calculates a correlation-adjusted count of "cellular pleiotropy" for each noncoding base pair by analyzing shared regulatory annotations across tissues and cell types. By comparing real deletion PlyRS values to simulations in a length-matched framework and by using genomic covariates in analyses, we found that purifying selection acts to preserve both DHS and enhancer noncoding sites. However, we did not find evidence of purifying selection for noncoding transcribed, Polycomb-repressed, or heterochromatin sites beyond that of the noncoding background. Additionally, we found evidence that purifying selection is acting on chromatin loop integrity by preserving colocalized CTCF binding sites. At regions of DHS, enhancer, and CTCF within chromatin loop anchors, we found evidence that both sites of activity specific to a particular tissue or cell type and sites of cellularly pleiotropic activity are preserved by selection.
Subjects
Chromatin, Genomics, Binding Sites, Chromatin/genetics, Humans, Polycomb-Group Proteins/metabolism
ABSTRACT
The rapid decrease in sequencing cost has enabled genetic studies to discover rare variants associated with complex diseases and traits. Once such an association is identified, the next step is to understand the mechanism by which the rare variants influence disease. As hypothesized for common variants, rare variants may affect diseases by regulating gene expression, and several recent studies have identified effects of rare variants on gene expression using heritability and expression outlier analyses. However, identifying individual genes whose expression is regulated by rare variants has been challenging due to the relatively small sample size of expression quantitative trait loci studies and statistical approaches not optimized to detect the effects of rare variants. In this study, we analyze whole-genome sequencing and RNA-seq data of 681 European individuals collected for the Genotype-Tissue Expression (GTEx) project (v8) to identify individual genes in 49 human tissues whose expression is regulated by rare variants. To improve statistical power, we develop an approach based on a likelihood ratio test that combines effects of multiple rare variants in a nonlinear manner and has higher power than previous approaches. Using GTEx data, we identify many genes regulated by rare variants, some of which are regulated only by rare variants and not by common variants. We also find that genes regulated by rare variants are enriched for expression outliers and disease-causing genes. These results highlight the regulatory effects of rare variants, which are important for interpreting associations of rare variants with complex traits.
Subjects
Gene Expression Regulation, Quantitative Trait Loci, Humans, Multifactorial Inheritance
ABSTRACT
Late-onset Alzheimer's disease (LOAD) is the most common type of dementia, causing irreversible brain damage in the elderly and presenting a major public health challenge. Clinical research and genome-wide association studies have suggested a potential contribution of the endocytic pathway to AD, with an emphasis on common loci. However, the contribution of rare variants in this pathway to AD has not been thoroughly investigated. In this study, we focused on the effect of rare variants on AD by first applying a rare-variant gene-set burden analysis using genes in the endocytic pathway on over 3,000 individuals with European ancestry from three large whole-genome sequencing (WGS) studies. We identified significant associations of rare-variant burden within the endocytic pathway with AD, which were successfully replicated in independent datasets. We further demonstrated that this endocytic rare-variant enrichment is associated with neurofibrillary tangles (NFTs) and age-related phenotypes, increasing the risk of more severe brain damage, earlier age at onset, and earlier age at death. Next, by aggregating rare variants within each gene, we sought to identify single endocytic genes associated with AD and NFTs. Careful examination using NFTs revealed one significantly associated gene, ANKRD13D. To identify functional associations, we integrated bulk RNA-Seq data from over 600 brain tissues and found two endocytic expression genes (eGenes), HLA-A and SLC26A7, whose expression was significantly influenced by rare variants. Differential expression of these three identified genes between AD patients and controls was further examined by incorporating scRNA-Seq data from 48 post-mortem brain samples, which demonstrated distinct expression patterns across cell types.
Taken together, our results demonstrated a strong rare-variant effect in the endocytic pathway on AD risk and progression, as well as a functional effect of altered gene expression at both bulk and single-cell resolution, which may bring more insight and serve as a valuable resource for future AD genetic studies, clinical research, and therapeutic targeting.
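The burden analysis described in this abstract rests on collapsing many rare variants into one score per individual. The following is a minimal sketch of that collapsing step only, not the authors' pipeline: the function name `burden_scores`, the genotype coding (minor-allele counts 0, 1, 2), and the 1% frequency cutoff are all assumptions made for illustration.

```python
def burden_scores(genotypes, mafs, maf_threshold=0.01):
    """Collapse rare variants into one burden score per individual.

    genotypes: list of rows, one per individual; each row holds the
               minor-allele count (0, 1 or 2) at every variant.
    mafs:      minor-allele frequency of each variant (same column order).
    maf_threshold: hypothetical cutoff below which a variant counts as rare.
    """
    # Keep only the columns whose frequency marks them as rare.
    rare_columns = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    # The burden score is the total count of rare minor alleles carried.
    return [sum(row[j] for j in rare_columns) for row in genotypes]


# Illustrative data: 3 individuals, 3 variants; the middle variant is common.
toy_genotypes = [[0, 1, 2], [1, 0, 0], [0, 0, 1]]
toy_mafs = [0.005, 0.2, 0.001]
toy_scores = burden_scores(toy_genotypes, toy_mafs)
```

In a gene-set analysis the columns would span every gene in the set; in a single-gene analysis they would be restricted to one gene. The resulting scores would then enter an association test against the phenotype, which this sketch does not cover.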
Subjects
Alzheimer Disease/pathology, Endocytosis, Phenotype, Alzheimer Disease/genetics, Genome-Wide Association Study, Humans, Single Nucleotide Polymorphism, Whole Genome Sequencing
ABSTRACT
MOTIVATION: Since the first human genome was sequenced in 2001, there has been a rapid growth in the number of bioinformatic methods to process and analyze next-generation sequencing (NGS) data for research and clinical studies that aim to identify genetic variants influencing diseases and traits. To achieve this goal, one first needs to call genetic variants from NGS data, which requires multiple computationally intensive analysis steps. Unfortunately, there is a lack of an open-source pipeline that can perform all these steps on NGS data in a manner that is fully automated, efficient, rapid, scalable, modular, user-friendly and fault tolerant. To address this, we introduce xGAP, an extensible Genome Analysis Pipeline, which implements modified GATK best practice to analyze DNA-seq data with the aforementioned functionalities. RESULTS: xGAP implements massive parallelization of the modified GATK best practice pipeline by splitting a genome into many smaller regions with efficient load-balancing to achieve high scalability. It can process 30× coverage whole-genome sequencing (WGS) data in ~90 min. In terms of accuracy of discovered variants, xGAP achieves average F1 scores of 99.37% for single nucleotide variants and 99.20% for insertion/deletions across seven benchmark WGS datasets. We achieve highly consistent results across multiple on-premises (SGE & SLURM) high-performance clusters. Compared to the Churchill pipeline, with similar parallelization, xGAP is 20% faster when analyzing 50× coverage WGS on Amazon Web Services. Finally, xGAP is user-friendly and fault tolerant: it can automatically re-initiate failed processes to minimize required user intervention. AVAILABILITY AND IMPLEMENTATION: xGAP is available at https://github.com/Adigorla/xgap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
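The F1 scores quoted in this abstract combine the precision and recall of a callset measured against a truth set. A minimal sketch of that arithmetic follows; the function name `f1_score` and the counts are hypothetical and are not xGAP's benchmark results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for a variant callset.

    tp: called variants that are in the truth set
    fp: called variants that are not in the truth set
    fn: truth-set variants that were not called
    """
    if tp == 0:
        # No correct calls: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only.
example_f1 = f1_score(tp=9900, fp=50, fn=100)
```

The same value can be written as 2·TP / (2·TP + FP + FN), which makes clear that false positives and false negatives are penalized equally.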
ABSTRACT
Many disease risk loci identified in genome-wide association studies are present in non-coding regions of the genome. Previous studies have found enrichment of expression quantitative trait loci (eQTLs) in disease risk loci, indicating that identifying causal variants for gene expression is important for elucidating the genetic basis of not only gene expression but also complex traits. However, detecting causal variants is challenging due to complex genetic correlation among variants known as linkage disequilibrium (LD) and the presence of multiple causal variants within a locus. Although several fine-mapping approaches have been developed to overcome these challenges, they may produce large sets of putative causal variants when true causal variants are in high LD with many non-causal variants. In eQTL studies, there is an additional source of information that can be used to improve fine-mapping called allelic imbalance (AIM) that measures imbalance in gene expression on two chromosomes of a diploid organism. In this work, we develop a novel statistical method that leverages both AIM and total expression data to detect causal variants that regulate gene expression. We illustrate through simulations and application to 10 tissues of the Genotype-Tissue Expression (GTEx) dataset that our method identifies the true causal variants with higher specificity than an approach that uses only eQTL information. Across all tissues and genes, our method achieves a median reduction rate of 11% in the number of putative causal variants. We use chromatin state data from the Roadmap Epigenomics Consortium to show that the putative causal variants identified by our method are enriched for active regions of the genome, providing orthogonal support that our method identifies causal variants with increased specificity.
Subjects
Allelic Imbalance, Chromatin/genetics, Chromosome Mapping/methods, Quantitative Trait Loci, Genetic Predisposition to Disease, Genome-Wide Association Study, Humans, Linkage Disequilibrium, Multifactorial Inheritance, Single Nucleotide Polymorphism
ABSTRACT
BACE1 is the rate-limiting enzyme for the generation of amyloid-β peptides (Aβ), a key event in the pathogenesis of Alzheimer's disease (AD). By an unknown mechanism, levels of BACE1 and a BACE1 mRNA-stabilizing antisense RNA (BACE1-AS) are elevated in the brains of AD patients, implying that dysregulation of BACE1 expression plays an important role in AD pathogenesis. We found that nuclear factor erythroid-derived 2-related factor 2 (NRF2/NFE2L2) represses the expression of BACE1 and BACE1-AS through binding to antioxidant response elements (AREs) in their mouse and human promoters. NRF2-mediated inhibition of BACE1 and BACE1-AS expression is independent of redox regulation. NRF2 activation decreases production of BACE1 and BACE1-AS transcripts and Aβ production and ameliorates cognitive deficits in animal models of AD. Depletion of NRF2 increases BACE1 and BACE1-AS expression and Aβ production and worsens cognitive deficits. Our findings suggest that activation of NRF2 can prevent a key early pathogenic process in AD.
Subjects
Alzheimer Disease/metabolism, Amyloid Precursor Protein Secretases/metabolism, Aspartic Acid Endopeptidases/metabolism, Cognition Disorders/metabolism, NF-E2-Related Factor 2/metabolism, Alzheimer Disease/pathology, Amyloid Precursor Protein Secretases/genetics, Amyloid beta-Peptides/metabolism, Animals, Aspartic Acid Endopeptidases/genetics, Cognition Disorders/pathology, Animal Disease Models, Gene Expression Regulation, Humans, Isothiocyanates/pharmacology, Mice, Transgenic Mice, NF-E2-Related Factor 2/biosynthesis, Genetic Promoter Regions, Protein Binding, Reactive Oxygen Species/metabolism, Sulfoxides, Genetic Transcription
ABSTRACT
Most population isolates examined to date were founded from a single ancestral population. Consequently, there is limited knowledge about the demographic history of admixed population isolates. Here we investigate genomic diversity of recently admixed population isolates from Costa Rica and Colombia and compare their diversity to a benchmark population isolate, the Finnish. These Latin American isolates originated during the 16th century from admixture between a few hundred European males and Amerindian females, with a limited contribution from African founders. We examine whole-genome sequence data from 449 individuals, ascertained as families to build multigenerational pedigrees, with a mean sequencing depth of coverage of approximately 36×. We find that Latin American isolates have increased genetic diversity relative to the Finnish. However, there is an increase in the amount of identity by descent (IBD) segments in the Latin American isolates relative to the Finnish. The increase in IBD segments is likely a consequence of a very recent and severe population bottleneck during the founding of the admixed population isolates. Furthermore, the proportion of the genome that falls within a long run of homozygosity (ROH) in Costa Rican and Colombian individuals is significantly greater than that in the Finnish, suggesting more recent consanguinity in the Latin American isolates relative to that seen in the Finnish. Lastly, we find that recent consanguinity increased the number of deleterious variants found in the homozygous state, which is relevant if deleterious variants are recessive. Our study suggests that there is no single genetic signature of a population isolate.
Subjects
Human Genome/genetics, Colombia, Consanguinity, Costa Rica, Female, Population Genetics/methods, Genomics/methods, Homozygote, Humans, Male, Pedigree, White People/genetics, Whole Genome Sequencing/methods
ABSTRACT
A genome-wide association study (GWAS) seeks to identify genetic variants that contribute to the development and progression of a specific disease. Over the past 10 years, new approaches using mixed models have emerged to mitigate the deleterious effects of population structure and relatedness in association studies. However, developing GWAS techniques to accurately test for association while correcting for population structure is a computational and statistical challenge. Using laboratory mouse strains as an example, our review characterizes the problem of population structure in association studies and describes how it can cause false positive associations. We then motivate mixed models in the context of unmodeled factors.
Subjects
Population Genetics, Genome-Wide Association Study/methods, Genetic Models, Animals, Bias, Disease/genetics, Female, Genome-Wide Association Study/statistics & numerical data, Humans, Linear Models, Male, Mice, Statistical Models, Pedigree, Phenotype, Phylogeny, Single Nucleotide Polymorphism
ABSTRACT
Next-generation sequencing technology (NGS) enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses the information on sequencing quality, such as sequencing depth, genotyping quality, and GC contents, to predict whether a particular variant is likely to be false-positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is a practical approach to perform quality control on genetic variants from sequencing data.
Subjects
Genetic Variation, High-Throughput Nucleotide Sequencing/statistics & numerical data, Software, Algorithms, Computational Biology, Genetic Databases/statistics & numerical data, High-Throughput Nucleotide Sequencing/standards, Humans, Machine Learning, Single Nucleotide Polymorphism, Quality Control, Whole Genome Sequencing/standards, Whole Genome Sequencing/statistics & numerical data
ABSTRACT
Recently, multiple studies have performed whole-exome or whole-genome sequencing to identify groups of rare variants associated with complex traits and diseases. They have primarily utilized case-control study designs that often require thousands of individuals to reach acceptable statistical power. Family-based studies can be more powerful because a rare variant can be enriched in an extended pedigree and segregate with the phenotype. Although many methods have been proposed for using family data to discover rare variants involved in a disease, a majority of them focus on a specific pedigree structure and are designed to analyze either binary or continuously measured outcomes. In this article, we propose RareIBD, a general and powerful approach to identifying rare variants involved in disease susceptibility. Our method can be applied to large extended families of arbitrary structure, including pedigrees with only affected individuals. The method accommodates both binary and quantitative traits. A series of simulation experiments suggest that RareIBD is a powerful test that outperforms existing approaches. In addition, our method accounts for individuals in top generations, which are not usually genotyped in extended families. In contrast to available statistical tests, RareIBD generates accurate p values even when genetic data from these individuals are missing. We applied RareIBD, as well as other methods, to two extended family datasets generated by different genotyping technologies and representing different ethnicities. The analysis of real data confirmed that RareIBD is the only method that properly controls type I error.
Subjects
Family, Genetic Predisposition to Disease/genetics, Genetic Variation/genetics, Pedigree, Datasets as Topic, Ethnicity/genetics, Female, Genotype, Humans, Male, Genetic Models, Phenotype, Research Design
ABSTRACT
The vast majority of genome-wide association study (GWAS) risk loci fall in non-coding regions of the genome. One possible hypothesis is that these GWAS risk loci alter the individual's disease risk through their effect on gene expression in different tissues. In order to understand the mechanisms driving a GWAS risk locus, it is helpful to determine which gene is affected in specific tissue types. For example, the relevant gene and tissue could play a role in the disease mechanism if the same variant responsible for a GWAS locus also affects gene expression. Identifying whether or not the same variant is causal in both GWASs and expression quantitative trait locus (eQTL) studies is challenging because of the uncertainty induced by linkage disequilibrium and the fact that some loci harbor multiple causal variants. However, current methods that address this problem assume that each locus contains a single causal variant. In this paper, we present eCAVIAR, a probabilistic method that has several key advantages over existing methods. First, our method can account for more than one causal variant in any given locus. Second, it can leverage summary statistics without accessing the individual genotype data. We use both simulated and real datasets to demonstrate the utility of our method. Using publicly available eQTL data on 45 different tissues, we demonstrate that eCAVIAR can prioritize likely relevant tissues and target genes for a set of glucose- and insulin-related trait loci.
Subjects
Genetic Predisposition to Disease/genetics, Genome-Wide Association Study/methods, Genetic Models, Statistical Models, Quantitative Trait Loci/genetics, Datasets as Topic, Gene Expression Regulation/genetics, Genotype, Glucose/metabolism, Humans, Insulin/metabolism, Linkage Disequilibrium, Organ Specificity, Probability, Sample Size
ABSTRACT
Although genome-wide association studies (GWASs) have discovered numerous novel genetic variants associated with many complex traits and diseases, those genetic variants typically explain only a small fraction of phenotypic variance. Factors that account for phenotypic variance include environmental factors and gene-by-environment interactions (GEIs). Recently, several studies have conducted genome-wide gene-by-environment association analyses and demonstrated important roles of GEIs in complex traits. One of the main challenges in these association studies is to control effects of population structure that may cause spurious associations. Many studies have analyzed how population structure influences statistics of genetic variants and developed several statistical approaches to correct for population structure. However, the impact of population structure on GEI statistics in GWASs has not been extensively studied, nor have methods been designed to correct for population structure on GEI statistics. In this paper, we show both analytically and empirically that population structure may cause spurious GEIs and use both simulation and two GWAS datasets to support our finding. We propose a statistical approach based on mixed models to account for population structure on GEI statistics. We find that our approach effectively controls population structure on statistics for GEIs as well as for genetic variants.
Subjects
Gene-Environment Interaction, Population Genetics, Human Genome, Genome-Wide Association Study/methods, Computer Simulation, Humans, Genetic Models, Phenotype, Single Nucleotide Polymorphism/genetics
ABSTRACT
Obstructive sleep apnea (OSA) is a common heritable disorder displaying marked sexual dimorphism in disease prevalence and progression. Previous genetic association studies have identified a few genetic loci associated with OSA and related quantitative traits, but they have only focused on single ethnic groups, and a large proportion of the heritability remains unexplained. The apnea-hypopnea index (AHI) is a commonly used quantitative measure characterizing OSA severity. Because OSA differs by sex, and the pathophysiology of obstructive events differs in rapid eye movement (REM) and non-REM (NREM) sleep, we hypothesized that additional genetic association signals would be identified by analyzing the NREM/REM-specific AHI and by conducting sex-specific analyses in multiethnic samples. We performed genome-wide association tests for up to 19,733 participants of African, Asian, European, and Hispanic/Latino American ancestry in 7 studies. We identified rs12936587 on chromosome 17 as a possible quantitative trait locus for NREM AHI in men (N = 6,737; P = 1.7 × 10^-8) but not in women (P = 0.77). The association with NREM AHI was replicated in a physiological research study (N = 67; P = 0.047). This locus, overlapping the RAI1 gene and encompassing genes PEMT1, SREBF1, and RASD1, was previously reported to be associated with coronary artery disease and lipid metabolism and implicated in Potocki-Lupski syndrome and Smith-Magenis syndrome, which are characterized by abnormal sleep phenotypes. We also identified gene-by-sex interactions in suggestive association regions, suggesting that genetic variants for AHI appear to vary by sex, consistent with the clinical observations of strong sexual dimorphism.
Subjects
Genome-Wide Association Study, Quantitative Trait Loci/genetics, Obstructive Sleep Apnea/genetics, REM Sleep/physiology, Transcription Factors/genetics, Adult, Aged, Female, Humans, Male, Middle Aged, Phosphatidylethanolamine N-Methyltransferase/genetics, Sex Characteristics, Sterol Regulatory Element Binding Protein 1/genetics, Trans-Activators, ras Proteins/genetics
ABSTRACT
Meta-analysis strategies have become critical to augment power of genome-wide association studies (GWAS). To reduce genotyping or sequencing cost, many studies today utilize shared controls, and these individuals can inadvertently overlap among multiple studies. If these overlapping individuals are not taken into account in meta-analysis, they can induce spurious associations. In this article, we propose a general framework for adjusting association statistics to account for overlapping subjects within a meta-analysis. The key idea of our method is to transform the covariance structure of the data, so it can be used in downstream analyses. As a result, the strategy is very flexible and allows a wide range of meta-analysis methods, such as the random effects model, to account for overlapping subjects. Using simulations and real datasets, we demonstrate that our method has utility in meta-analyses of GWAS, as well as in a multi-tissue mouse expression quantitative trait loci (eQTL) study where our method increases the number of discovered eQTL by up to 19% compared with existing methods.
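To illustrate why overlapping subjects matter, the sketch below combines two study estimates with a generalized least squares (GLS) fixed-effects estimator that uses the covariance between them. This is a standard textbook estimator and not the transformation proposed in the abstract. The function name and the assumption that the correlation `r` induced by shared subjects is already known are both hypothetical.

```python
import math


def gls_meta_two_studies(beta1, se1, beta2, se2, r):
    """GLS fixed-effects estimate from two correlated study estimates.

    r is the correlation between the two effect estimates caused by
    shared subjects (assumed known here). With r = 0 this reduces to
    ordinary inverse-variance weighting.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    cov = r * se1 * se2
    det = se1 ** 2 * se2 ** 2 - cov ** 2
    # Row sums of the inverse 2x2 covariance matrix give the GLS weights.
    w1 = (se2 ** 2 - cov) / det
    w2 = (se1 ** 2 - cov) / det
    total = w1 + w2
    beta = (w1 * beta1 + w2 * beta2) / total
    se = math.sqrt(1.0 / total)
    return beta, se


# Illustrative numbers: the same two estimates, without and with overlap.
naive_beta, naive_se = gls_meta_two_studies(1.0, 1.0, 3.0, 1.0, r=0.0)
adj_beta, adj_se = gls_meta_two_studies(1.0, 1.0, 3.0, 1.0, r=0.5)
```

With equal standard errors the point estimate is unchanged, but the standard error grows once the positive correlation is acknowledged. Ignoring the overlap would therefore overstate the evidence for association, which is the spurious-association risk the abstract describes.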
Subjects
Disease/genetics, Genome-Wide Association Study/methods, Meta-Analysis as Topic, Single Nucleotide Polymorphism/genetics, Quantitative Trait Loci/genetics, Animals, Case-Control Studies, Gene Expression Profiling, Humans, Mice, Theoretical Models
ABSTRACT
In studies of expression quantitative trait loci (eQTLs), it is of increasing interest to identify eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Detecting eGenes is important for follow-up analyses and prioritization because genes are the main entities in biological processes. To detect eGenes, one typically focuses on the genetic variant with the minimum p value among all variants in cis with a gene and corrects for multiple testing to obtain a gene-level p value. For performing multiple-testing correction, a permutation test is widely used. Because of growing sample sizes of eQTL studies, however, the permutation test has become a computational bottleneck in eQTL studies. In this paper, we propose an efficient approach for correcting for multiple testing and assess eGene p values by utilizing a multivariate normal distribution. Our approach properly takes into account the linkage-disequilibrium structure among variants, and its time complexity is independent of sample size. By applying our small-sample correction techniques, our method achieves high accuracy in both small and large studies. We have shown that our method consistently produces extremely accurate p values (accuracy > 98%) for three human eQTL datasets with different sample sizes and SNP densities: the Genotype-Tissue Expression pilot dataset, the multi-region brain dataset, and the HapMap 3 dataset.
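The idea of replacing permutations with a multivariate normal (MVN) model can be sketched as follows. Under the null, the association z-scores of the cis variants are approximately MVN with the LD correlation matrix as covariance, so the gene-level p value is the probability that the largest absolute z-score exceeds the observed one. The code below estimates that probability by plain Monte Carlo sampling. It is an illustration under stated assumptions, not the authors' algorithm, and it omits their small-sample correction; the function names and the toy LD matrices are hypothetical.

```python
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = (matrix[i][i] - s) ** 0.5
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return low


def mvn_corrected_p(min_p, ld, n_samples=20000, seed=0):
    """Monte Carlo estimate of the gene-level p value for the minimum p.

    min_p: smallest two-sided p value among the variants of the gene.
    ld:    correlation (LD) matrix of those variants.
    """
    if not 0.0 < min_p < 1.0:
        raise ValueError("min_p must lie strictly between 0 and 1")
    # Two-sided z threshold corresponding to the observed minimum p value.
    z_thresh = NormalDist().inv_cdf(1.0 - min_p / 2.0)
    low = cholesky(ld)
    m = len(ld)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # z = L e has covariance L L^T, i.e. the LD matrix.
        if any(abs(sum(low[i][k] * e[k] for k in range(i + 1))) >= z_thresh
               for i in range(m)):
            hits += 1
    return hits / n_samples


m = 5
independent = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
high_ld = [[1.0 if i == j else 0.9 for j in range(m)] for i in range(m)]
p_independent = mvn_corrected_p(0.05, independent)
p_high_ld = mvn_corrected_p(0.05, high_ld)
```

The sampling works on summary-level z-scores rather than on individuals, so its cost depends on the number of variants and draws, not on the study's sample size. With independent variants the estimate approaches 1 − (1 − p)^m, while strong LD yields a smaller correction because the tests are largely redundant.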
Subjects
Statistical Data Interpretation, Gene Expression Regulation/genetics, Genes/genetics, Genetic Variation, Quantitative Trait Loci/genetics, Humans, Multivariate Analysis, Normal Distribution, Single Nucleotide Polymorphism/genetics, Probability, Sample Size, Nonparametric Statistics
ABSTRACT
MOTIVATION: There is recent interest in using gene expression data to contextualize findings from traditional genome-wide association studies (GWAS). Conditioned on a tissue, expression quantitative trait loci (eQTLs) are genetic variants associated with gene expression, and eGenes are genes whose expression levels are associated with genetic variants. eQTLs and eGenes provide great supporting evidence for GWAS hits and important insights into the regulatory pathways involved in many diseases. When a significant variant or a candidate gene identified by GWAS is also an eQTL or eGene, there is strong evidence to further study this variant or gene. Multi-tissue gene expression datasets like the Genotype-Tissue Expression (GTEx) data are used to find eQTLs and eGenes. Unfortunately, these datasets often have small sample sizes in some tissues. For this reason, there have been many meta-analysis methods designed to combine gene expression data across many tissues to increase power for finding eQTLs and eGenes. However, these existing techniques are not scalable to datasets containing many tissues, like the GTEx data. Furthermore, these methods ignore a biological insight that the same variant may be associated with the same gene across similar tissues. RESULTS: We introduce a meta-analysis model that addresses these problems in existing methods. We focus on the problem of finding eGenes in gene expression data from many tissues, and show that our model is better than other types of meta-analyses. AVAILABILITY AND IMPLEMENTATION: Source code is at https://github.com/datduong/RECOV. CONTACT: eeskin@cs.ucla.edu or datdb@cs.ucla.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Computational Biology/methods, Genetic Variation, Quantitative Trait Loci, Software, Gene Expression Profiling/methods, Genome-Wide Association Study/methods, Humans, Meta-Analysis as Topic, Genetic Models
ABSTRACT
MOTIVATION: Expression quantitative trait loci (eQTLs) are genetic variants that affect gene expression. In eQTL studies, one important task is to find eGenes, or genes whose expression is associated with at least one eQTL. The standard statistical method to determine whether a gene is an eGene requires association testing at all nearby variants and the permutation test to correct for multiple testing. The standard method, however, does not consider genomic annotation of the variants. In practice, variants near gene transcription start sites (TSSs) or certain histone modifications are likely to regulate gene expression. In this article, we introduce a novel eGene detection method that considers this empirical evidence and thereby increases the statistical power. RESULTS: We applied our method to the liver Genotype-Tissue Expression (GTEx) data using distance from TSSs, DNase hypersensitivity sites, and six histone modifications as the genomic annotations for the variants. Each of these annotations helped us detect more candidate eGenes. Distance from TSS appears to be the most important annotation; specifically, using this annotation, our method discovered 50% more candidate eGenes than the standard permutation method. CONTACT: buhm.han@amc.seoul.kr or eeskin@cs.ucla.edu.
Subjects
Genomics, Genetic Variation, Genotype, Single Nucleotide Polymorphism, Quantitative Trait Loci
ABSTRACT
Gene expression data, in conjunction with information on genetic variants, have enabled studies to identify expression quantitative trait loci (eQTLs) or polymorphic locations in the genome that are associated with expression levels. Moreover, recent technological developments and cost decreases have further enabled studies to collect expression data in multiple tissues. One advantage of multiple tissue datasets is that studies can combine results from different tissues to identify eQTLs more accurately than examining each tissue separately. The idea of aggregating results of multiple tissues is closely related to the idea of meta-analysis which aggregates results of multiple genome-wide association studies to improve the power to detect associations. In principle, meta-analysis methods can be used to combine results from multiple tissues. However, eQTLs may have effects in only a single tissue, in all tissues, or in a subset of tissues with possibly different effect sizes. This heterogeneity in terms of effects across multiple tissues presents a key challenge to detect eQTLs. In this paper, we develop a framework that leverages two popular meta-analysis methods that address effect size heterogeneity to detect eQTLs across multiple tissues. We show by using simulations and multiple tissue data from mouse that our approach detects many eQTLs undetected by traditional eQTL methods. Additionally, our method provides an interpretation framework that accurately predicts whether an eQTL has an effect in a particular tissue.
Subjects
Gene Expression, Genome-Wide Association Study, Quantitative Trait Loci/genetics, Animals, Gene Expression Profiling, Genome, Mice, Theoretical Models, Organ Specificity
ABSTRACT
Phenome-wide association studies (PheWAS) facilitate the discovery of associations between a single genetic variant and multiple phenotypes. For variants that impact a specific protein, this can help identify additional therapeutic indications or on-target side effects of intervening on that protein. However, PheWAS is restricted by an inability to distinguish confounding due to linkage disequilibrium (LD) from true pleiotropy. Here we describe CoPheScan (Coloc adapted Phenome-wide Scan), a Bayesian approach that enables an intuitive and systematic exploration of causal associations while simultaneously addressing LD confounding. We demonstrate its performance through simulation, showing considerably better control of false positive rates than a conventional approach not accounting for LD. We used CoPheScan to perform PheWAS of protein-truncating variants and fine-mapped variants from disease and pQTL studies, in 2275 disease phenotypes from the UK Biobank. Our results identify the complexity of known pleiotropic genes such as APOE, and suggest a new causal role for TGM3 in skin cancer.