RESUMEN
Long non-coding RNA (lncRNA) genes have well-established and important impacts on molecular and cellular functions. However, among the thousands of lncRNA genes, it is still a major challenge to identify the subset with disease or trait relevance. To systematically characterize these lncRNA genes, we used Genotype Tissue Expression (GTEx) project v8 genetic and multi-tissue transcriptomic data to profile the expression, genetic regulation, cellular contexts, and trait associations of 14,100 lncRNA genes across 49 tissues for 101 distinct complex genetic traits. Using these approaches, we identified 1,432 lncRNA gene-trait associations, 800 of which were not explained by stronger effects of neighboring protein-coding genes. This included associations between lncRNA quantitative trait loci and inflammatory bowel disease, type 1 and type 2 diabetes, and coronary artery disease, as well as rare variant associations to body mass index.
Asunto(s)
Enfermedad/genética , Herencia Multifactorial/genética , Población/genética , ARN Largo no Codificante/genética , Transcriptoma , Enfermedad de la Arteria Coronaria/genética , Diabetes Mellitus Tipo 1/genética , Diabetes Mellitus Tipo 2/genética , Perfilación de la Expresión Génica , Variación Genética , Humanos , Enfermedades Inflamatorias del Intestino/genética , Especificidad de Órganos/genética , Sitios de Carácter CuantitativoRESUMEN
Replicability is the cornerstone of modern scientific research. Reliable identifications of genotype-phenotype associations that are significant in multiple genome-wide association studies (GWASs) provide stronger evidence for the findings. Current replicability analysis relies on the independence assumption among single-nucleotide polymorphisms (SNPs) and ignores the linkage disequilibrium (LD) structure. We show that such a strategy may produce either overly liberal or overly conservative results in practice. We develop an efficient method, ReAD, to detect replicable SNPs associated with the phenotype from two GWASs accounting for the LD structure. The local dependence structure of SNPs across two heterogeneous studies is captured by a four-state hidden Markov model (HMM) built on two sequences of p values. By incorporating information from adjacent locations via the HMM, our approach provides more accurate SNP significance rankings. ReAD is scalable, platform independent, and more powerful than existing replicability analysis methods with effective false discovery rate control. Through analysis of datasets from two asthma GWASs and two ulcerative colitis GWASs, we show that ReAD can identify replicable genetic loci that existing methods might otherwise miss.
Asunto(s)
Asma , Estudio de Asociación del Genoma Completo , Desequilibrio de Ligamiento , Polimorfismo de Nucleótido Simple , Estudio de Asociación del Genoma Completo/métodos , Humanos , Asma/genética , Cadenas de Markov , Colitis Ulcerosa/genética , Reproducibilidad de los Resultados , Fenotipo , GenotipoRESUMEN
Integrative genetic association methods have shown great promise in post-GWAS (genome-wide association study) analyses, in which one of the most challenging tasks is identifying putative causal genes and uncovering molecular mechanisms of complex traits. Recent studies suggest that prevailing computational approaches, including transcriptome-wide association studies (TWASs) and colocalization analysis, are individually imperfect, but their joint usage can yield robust and powerful inference results. This paper presents INTACT, a computational framework to integrate probabilistic evidence from these distinct types of analyses and implicate putative causal genes. This procedure is flexible and can work with a wide range of existing integrative analysis approaches. It has the unique ability to quantify the uncertainty of implicated genes, enabling rigorous control of false-positive discoveries. Taking advantage of this highly desirable feature, we further propose an efficient algorithm, INTACT-GSE, for gene set enrichment analysis based on the integrated probabilistic evidence. We examine the proposed computational methods and illustrate their improved performance over the existing approaches through simulation studies. We apply the proposed methods to analyze the multi-tissue eQTL data from the GTEx project and eight large-scale complex- and molecular-trait GWAS datasets from multiple consortia and the UK Biobank. Overall, we find that the proposed methods markedly improve the existing putative gene implication methods and are particularly advantageous in evaluating and identifying key gene sets and biological pathways underlying complex traits.
Asunto(s)
Estudio de Asociación del Genoma Completo , Transcriptoma , Humanos , Transcriptoma/genética , Estudio de Asociación del Genoma Completo/métodos , Herencia Multifactorial/genética , Sitios de Carácter Cuantitativo/genética , Simulación por Computador , Polimorfismo de Nucleótido Simple/genética , Predisposición Genética a la EnfermedadRESUMEN
Synthetic glucocorticoids, such as dexamethasone, have been used as a treatment for many immune conditions, such as asthma and, more recently, severe COVID-19. Single-cell data can capture more fine-grained details on transcriptional variability and dynamics to gain a better understanding of the molecular underpinnings of inter-individual variation in drug response. Here, we used single-cell RNA-seq to study the dynamics of the transcriptional response to glucocorticoids in activated peripheral blood mononuclear cells from 96 African American children. We used novel statistical approaches to calculate a mean-independent measure of gene expression variability and a measure of transcriptional response pseudotime. Using these approaches, we showed that glucocorticoids reverse the effects of immune stimulation on both gene expression mean and variability. Our novel measure of gene expression response dynamics, based on the diagonal linear discriminant analysis, separated individual cells by response status on the basis of their transcriptional profiles and allowed us to identify different dynamic patterns of gene expression along the response pseudotime. We identified genetic variants regulating gene expression mean and variability, including treatment-specific effects, and showed widespread genetic regulation of the transcriptional dynamics of the gene expression response.
Asunto(s)
COVID-19 , Glucocorticoides , Niño , Humanos , Glucocorticoides/farmacología , Glucocorticoides/metabolismo , Leucocitos Mononucleares/metabolismo , COVID-19/genética , Regulación de la Expresión GénicaRESUMEN
Transcriptome-wide association studies and colocalization analysis are popular computational approaches for integrating genetic-association data from molecular and complex traits. They show the unique ability to go beyond variant-level genetic-association evidence and implicate critical functional units, e.g., genes, in disease etiology. However, in practice, when the two approaches are applied to the same molecular and complex-trait data, the inference results can be markedly different. This paper systematically investigates the inferential reproducibility between the two approaches through theoretical derivation, numerical experiments, and analyses of four complex trait GWAS and GTEx eQTL data. We identify two classes of inconsistent inference results. We find that the first class of inconsistent results (i.e., genes with strong colocalization but weak transcriptome-wide association study [TWAS] signals) might suggest an interesting biological phenomenon, i.e., horizontal pleiotropy; thus, the two approaches are truly complementary. The inconsistency in the second class (i.e., genes with weak colocalization but strong TWAS signals) can be understood and effectively reconciled. To this end, we propose a computational approach for locus-level colocalization analysis. We demonstrate that the joint TWAS and locus-level colocalization analysis improves specificity and sensitivity for implicating biologically relevant genes.
Asunto(s)
Estudio de Asociación del Genoma Completo , Transcriptoma , Predisposición Genética a la Enfermedad , Estudio de Asociación del Genoma Completo/métodos , Humanos , Polimorfismo de Nucleótido Simple/genética , Sitios de Carácter Cuantitativo/genética , Reproducibilidad de los Resultados , Transcriptoma/genéticaRESUMEN
Transcriptomics data have been integrated with genome-wide association studies (GWASs) to help understand disease/trait molecular mechanisms. The utility of metabolomics, integrated with transcriptomics and disease GWASs, to understand molecular mechanisms for metabolite levels or diseases has not been thoroughly evaluated. We performed probabilistic transcriptome-wide association and locus-level colocalization analyses to integrate transcriptomics results for 49 tissues in 706 individuals from the GTEx project, metabolomics results for 1,391 plasma metabolites in 6,136 Finnish men from the METSIM study, and GWAS results for 2,861 disease traits in 260,405 Finnish individuals from the FinnGen study. We found that genetic variants that regulate metabolite levels were more likely to influence gene expression and disease risk compared to the ones that do not. Integrating transcriptomics with metabolomics results prioritized 397 genes for 521 metabolites, including 496 previously identified gene-metabolite pairs with strong functional connections and suggested 33.3% of such gene-metabolite pairs shared the same causal variants with genetic associations of gene expression. Integrating transcriptomics and metabolomics individually with FinnGen GWAS results identified 1,597 genes for 790 disease traits. Integrating transcriptomics and metabolomics jointly with FinnGen GWAS results helped pinpoint metabolic pathways from genes to diseases. We identified putative causal effects of UGT1A1/UGT1A4 expression on gallbladder disorders through regulating plasma (E,E)-bilirubin levels, of SLC22A5 expression on nasal polyps and plasma carnitine levels through distinct pathways, and of LIPC expression on age-related macular degeneration through glycerophospholipid metabolic pathways. Our study highlights the power of integrating multiple sets of molecular traits and GWAS results to deepen understanding of disease pathophysiology.
Asunto(s)
Estudio de Asociación del Genoma Completo , Transcriptoma , Bilirrubina , Carnitina , Glicerofosfolípidos , Humanos , Masculino , Metabolómica , Sitios de Carácter Cuantitativo/genética , Miembro 5 de la Familia 22 de Transportadores de Solutos/genética , Transcriptoma/genéticaRESUMEN
Colocalization analysis has emerged as a powerful tool to uncover the overlapping of causal variants responsible for both molecular and complex disease phenotypes. The findings from colocalization analysis yield insights into the molecular pathways of complex diseases. In this paper, we conduct an in-depth investigation of the promise and limitations of the available colocalization analysis approaches. Focusing on variant-level colocalization approaches, we first establish the connections between various existing methods. We proceed to discuss the impacts of various controllable analytical factors and uncontrollable practical factors on outcomes of colocalization analysis through realistic simulations and real data examples. We identify a single analytical factor, the specification of prior enrichment levels, which can lead to severe inflation of false-positive colocalization findings. Meanwhile, the combination of many other analytical and practical factors all lead to diminished power. Consequently, we recommend the following strategies for the best practice of colocalization analysis: (1) estimating prior enrichment level from the observed data and (2) separating fine-mapping and colocalization analysis. Our analysis of 4,091 complex traits and the multi-tissue expression quantitative trait loci (eQTL) data from the GTEx (v.8) suggests that colocalizations of molecular QTLs and causal complex trait associations are widespread. However, only a small proportion can be confidently identified from currently available data due to a lack of power. Our findings set a benchmark for current and future integrative genetic association analysis applications.
Asunto(s)
Estudio de Asociación del Genoma Completo/métodos , Herencia Multifactorial/genética , Polimorfismo de Nucleótido Simple/genética , Sitios de Carácter Cuantitativo/genética , Predisposición Genética a la Enfermedad/genética , Humanos , Desequilibrio de Ligamiento/genética , FenotipoRESUMEN
MOTIVATION: Replicability is the cornerstone of scientific research. The current statistical method for high-dimensional replicability analysis either cannot control the false discovery rate (FDR) or is too conservative. RESULTS: We propose a statistical method, JUMP, for the high-dimensional replicability analysis of two studies. The input is a high-dimensional paired sequence of p-values from two studies and the test statistic is the maximum of p-values of the pair. JUMP uses four states of the p-value pairs to indicate whether they are null or non-null. Conditional on the hidden states, JUMP computes the cumulative distribution function of the maximum of p-values for each state to conservatively approximate the probability of rejection under the composite null of replicability. JUMP estimates unknown parameters and uses a step-up procedure to control FDR. By incorporating different states of composite null, JUMP achieves a substantial power gain over existing methods while controlling the FDR. Analyzing two pairs of spatially resolved transcriptomic datasets, JUMP makes biological discoveries that otherwise cannot be obtained by using existing methods. AVAILABILITY AND IMPLEMENTATION: An R package JUMP implementing the JUMP method is available on CRAN (https://CRAN.R-project.org/package=JUMP).
Asunto(s)
Perfilación de la Expresión Génica , Transcriptoma , Perfilación de la Expresión Génica/métodosRESUMEN
Ensuring reproducibility of results in high-throughput experiments is crucial for biomedical research. Here, we propose a set of computational methods, INTRIGUE, to evaluate and control reproducibility in high-throughput settings. Our approaches are built on a new definition of reproducibility that emphasizes directional consistency when experimental units are assessed with signed effect size estimates. The proposed methods are designed to (1) assess the overall reproducible quality of multiple studies and (2) evaluate reproducibility at the individual experimental unit levels. We demonstrate the proposed methods in detecting unobserved batch effects via simulations. We further illustrate the versatility of the proposed methods in transcriptome-wide association studies: in addition to reproducible quality control, they are also suited to investigating genuine biological heterogeneity. Finally, we discuss the potential extensions of the proposed methods in other vital areas of reproducible research (for example, publication bias and conceptual replications).
Asunto(s)
Estudio de Asociación del Genoma Completo/métodos , Ensayos Analíticos de Alto Rendimiento/métodos , Reproducibilidad de los Resultados , Transcriptoma/genética , Algoritmos , Simulación por Computador , Perfilación de la Expresión Génica , Humanos , Programas InformáticosRESUMEN
Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, use eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance identifying causal genes.
Asunto(s)
Estudio de Asociación del Genoma Completo/métodos , Anotación de Secuencia Molecular/métodos , Algoritmos , Estudio de Asociación del Genoma Completo/normas , Humanos , Anotación de Secuencia Molecular/normas , Polimorfismo Genético , Sitios de Carácter Cuantitativo , Reproducibilidad de los ResultadosRESUMEN
The integration of transcriptomic studies and genome-wide association studies (GWAS) via imputed expression has seen extensive application in recent years, enabling the functional characterization and causal gene prioritization of GWAS loci. However, the techniques for imputing transcriptomic traits from DNA variation remain underdeveloped. Furthermore, associations found when linking eQTL studies to complex traits through methods like PrediXcan can lead to false positives due to linkage disequilibrium between distinct causal variants. Therefore, the best prediction performance models may not necessarily lead to more reliable causal gene discovery. With the goal of improving discoveries without increasing false positives, we develop and compare multiple transcriptomic imputation approaches using the most recent GTEx release of expression and splicing data on 17,382 RNA-sequencing samples from 948 post-mortem donors in 54 tissues. We find that informing prediction models with posterior causal probability from fine-mapping (dap-g) and borrowing information across tissues (mashr) can lead to better performance in terms of number and proportion of significant associations that are colocalized and the proportion of silver standard genes identified as indicated by precision-recall and receiver operating characteristic curves. All prediction models are made publicly available at predictdb.org.
RESUMEN
Expression quantitative trait loci (eQTL) studies illuminate the genetics of gene expression and, in disease research, can be particularly illuminating when using the tissues directly impacted by the condition. In nephrology, there is a paucity of eQTL studies of human kidney. Here, we used whole-genome sequencing (WGS) and microdissected glomerular (GLOM) and tubulointerstitial (TI) transcriptomes from 187 individuals with nephrotic syndrome (NS) to describe the eQTL landscape in these functionally distinct kidney structures. Using MatrixEQTL, we performed cis-eQTL analysis on GLOM (n = 136) and TI (n = 166). We used the Bayesian "Deterministic Approximation of Posteriors" (DAP) to fine-map these signals, eQTLBMA to discover GLOM- or TI-specific eQTLs, and single-cell RNA-seq data of control kidney tissue to identify the cell type specificity of significant eQTLs. We integrated eQTL data with an IgA Nephropathy (IgAN) GWAS to perform a transcriptome-wide association study (TWAS). We discovered 894 GLOM eQTLs and 1,767 TI eQTLs at FDR < 0.05. 14% and 19% of GLOM and TI eQTLs, respectively, had >1 independent signal associated with its expression. 12% and 26% of eQTLs were GLOM specific and TI specific, respectively. GLOM eQTLs were most significantly enriched in podocyte transcripts and TI eQTLs in proximal tubules. The IgAN TWAS identified significant GLOM and TI genes, primarily at the HLA region. In this study, we discovered GLOM and TI eQTLs, identified those that were tissue specific, deconvoluted them into cell-specific signals, and used them to characterize known GWAS alleles. These data are available for browsing and download via our eQTL browser, "nephQTL."
Asunto(s)
Riñón/patología , Síndrome Nefrótico/genética , Sitios de Carácter Cuantitativo/genética , Adolescente , Adulto , Alelos , Teorema de Bayes , Femenino , Perfilación de la Expresión Génica/métodos , Estudio de Asociación del Genoma Completo/métodos , Humanos , Masculino , Persona de Mediana Edad , Polimorfismo de Nucleótido Simple/genética , Transcriptoma/genética , Adulto JovenRESUMEN
Many variants associated with complex traits are in noncoding regions and contribute to phenotypes by disrupting regulatory sequences. To characterize these variants, we developed a streamlined protocol for a high-throughput reporter assay, Biallelic Targeted STARR-seq (BiT-STARR-seq), that identifies allele-specific expression (ASE) while accounting for PCR duplicates through unique molecular identifiers. We tested 75,501 oligos (43,500 SNPs) and identified 2720 SNPs with significant ASE (FDR < 10%). To validate disruption of binding as one of the mechanisms underlying ASE, we developed a new high-throughput allele-specific binding assay for NFKB1. We identified 2684 SNPs with allele-specific binding (ASB) (FDR < 10%); 256 of these SNPs also had ASE (OR = 1.97, P-value = 0.0006). Of variants associated with complex traits, 1531 resulted in ASE, and 1662 showed ASB. For example, we characterized that the Crohn's disease risk variant for rs3810936 increases NFKB1 binding and results in altered gene expression.
Asunto(s)
Alelos , Subunidad p50 de NF-kappa B/metabolismo , Secuencias Reguladoras de Ácidos Nucleicos , Activación Transcripcional , Línea Celular , Ensayos Analíticos de Alto Rendimiento/métodos , Humanos , Subunidad p50 de NF-kappa B/genética , Polimorfismo de Nucleótido Simple , Unión ProteicaRESUMEN
MOTIVATION: Gene set enrichment analysis has been shown to be effective in identifying relevant biological pathways underlying complex diseases. Existing approaches lack the ability to quantify the enrichment levels accurately, hence preventing the enrichment information to be further utilized in both upstream and downstream analyses. A modernized and rigorous approach for gene set enrichment analysis that emphasizes both hypothesis testing and enrichment estimation is much needed. RESULTS: We propose a novel computational method, Bayesian Analysis of Gene Set Enrichment (BAGSE), for gene set enrichment analysis. BAGSE is built on a Bayesian hierarchical model and fully accounts for the uncertainty embedded in the association evidence of individual genes. We adopt an empirical Bayes inference framework to fit the proposed hierarchical model by implementing an efficient EM algorithm. Through simulation studies, we illustrate that BAGSE yields accurate enrichment quantification while achieving similar power as the state-of-the-art methods. Further simulation studies show that BAGSE can effectively utilize the enrichment information to improve the power in gene discovery. Finally, we demonstrate the application of BAGSE in analyzing real data from a differential expression experiment and a transcriptome-wide association study. Our results indicate that the proposed statistical framework is effective in aiding the discovery of potentially causal pathways and gene networks. AVAILABILITY AND IMPLEMENTATION: BAGSE is implemented using the C++ programing language and is freely available from https://github.com/xqwen/bagse/. Simulated and real data used in this paper are also available at the Github repository for reproducibility purposes. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Algoritmos , Transcriptoma , Teorema de Bayes , Probabilidad , Reproducibilidad de los ResultadosRESUMEN
Directed acyclic mixed graphs (DAMGs) provide a useful representation of network topology with both directed and undirected edges subject to the restriction of no directed cycles in the graph. This graphical framework may arise in many biomedical studies, for example, when a directed acyclic graph (DAG) of interest is contaminated with undirected edges induced by some unobserved confounding factors (eg, unmeasured environmental factors). Directed edges in a DAG are widely used to evaluate causal relationships among variables in a network, but detecting them is challenging when the underlying causality is obscured by some shared latent factors. The objective of this paper is to develop an effective structural equation model (SEM) method to extract reliable causal relationships from a DAMG. The proposed approach, termed structural factor equation model (SFEM), uses the SEM to capture the network topology of the DAG while accounting for the undirected edges in the graph with a factor analysis model. The latent factors in the SFEM enable the identification and removal of undirected edges, leading to a simpler and more interpretable causal network. The proposed method is evaluated and compared to existing methods through extensive simulation studies, and illustrated through the construction of gene regulatory networks related to breast cancer.
Asunto(s)
Modelos Teóricos , Proyectos de Investigación , Causalidad , Análisis FactorialRESUMEN
We propose a novel statistical framework for integrating the result from molecular quantitative trait loci (QTL) mapping into genome-wide genetic association analysis of complex traits, with the primary objectives of quantitatively assessing the enrichment of the molecular QTLs in complex trait-associated genetic variants and the colocalizations of the two types of association signals. We introduce a natural Bayesian hierarchical model that treats the latent association status of molecular QTLs as SNP-level annotations for candidate SNPs of complex traits. We detail a computational procedure to seamlessly perform enrichment, fine-mapping and colocalization analyses, which is a distinct feature compared to the existing colocalization analysis procedures in the literature. The proposed approach is computationally efficient and requires only summary-level statistics. We evaluate and demonstrate the proposed computational approach through extensive simulation studies and analyses of blood lipid data and the whole blood eQTL data from the GTEx project. In addition, a useful utility from our proposed method enables the computation of expected colocalization signals using simple characteristics of the association data. Using this utility, we further illustrate the importance of enrichment analysis on the ability to discover colocalized signals and the potential limitations of currently available molecular QTL data. The software pipeline that implements the proposed computation procedures, enloc, is freely available at https://github.com/xqwen/integrative.
Asunto(s)
Algoritmos , Mapeo Cromosómico/métodos , Estudio de Asociación del Genoma Completo/métodos , Polimorfismo de Nucleótido Simple , Sitios de Carácter Cuantitativo/genética , Teorema de Bayes , Simulación por Computador , Genotipo , Humanos , Modelos Lineales , Desequilibrio de Ligamiento , ProbabilidadRESUMEN
Environmental perturbations have large effects on both organismal and cellular traits, including gene expression, but the extent to which the environment affects RNA processing remains largely uncharacterized. Recent studies have identified a large number of genetic variants associated with variation in RNA processing that also have an important role in complex traits; yet we do not know in which contexts the different underlying isoforms are used. Here, we comprehensively characterized changes in RNA processing events across 89 environments in five human cell types and identified 15,300 event shifts (FDR = 15%) comprised of eight event types in over 4,000 genes. Many of these changes occur consistently in the same direction across conditions, indicative of global regulation by trans factors. Accordingly, we demonstrate that environmental modulation of splicing factor binding predicts shifts in intron retention, and that binding of transcription factors predicts shifts in alternative first exon (AFE) usage in response to specific treatments. We validated the mechanism hypothesized for AFE in two independent datasets. Using ATAC-seq, we found altered binding of 64 factors in response to selenium at sites of AFE shift, including ELF2 and other factors in the ETS family. We also performed AFE QTL mapping in 373 individuals and found an enrichment for SNPs predicted to disrupt binding of the ELF2 factor. Together, these results demonstrate that RNA processing is dramatically changed in response to environmental perturbations through specific mechanisms regulated by trans factors.
Asunto(s)
Ambiente , Procesamiento Postranscripcional del ARN , Línea Celular , Exones , Regulación de la Expresión Génica , Variación Genética , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Intrones , Sitios de Carácter Cuantitativo , Alineación de Secuencia , Análisis de Secuencia de ARN , Factores de Transcripción/genética , Factores de Transcripción/metabolismoRESUMEN
With the increasing availability of functional genomic data, incorporating genomic annotations into genetic association analysis has become a standard procedure. However, the existing methods often lack rigor and/or computational efficiency and consequently do not maximize the utility of functional annotations. In this paper, we propose a rigorous inference procedure to perform integrative association analysis incorporating genomic annotations for both traditional GWASs and emerging molecular QTL mapping studies. In particular, we propose an algorithm, named deterministic approximation of posteriors (DAP), which enables highly efficient and accurate joint enrichment analysis and identification of multiple causal variants. We use a series of simulation studies to highlight the power and computational efficiency of our proposed approach and further demonstrate it by analyzing the cross-population eQTL data from the GEUVADIS project and the multi-tissue eQTL data from the GTEx project. In particular, we find that genetic variants predicted to disrupt transcription factor binding sites are enriched in cis-eQTLs across all tissues. Moreover, the enrichment estimates obtained across the tissues are correlated with the cell types for which the annotations are derived.
Asunto(s)
Simulación por Computador , Estudio de Asociación del Genoma Completo , Genómica/métodos , Polimorfismo de Nucleótido Simple/genética , Sitios de Carácter Cuantitativo , Biología de Sistemas/métodos , Genotipo , HumanosRESUMEN
Gene-by-environment (GxE) interactions determine common disease risk factors and biomedically relevant complex traits. However, quantifying how the environment modulates genetic effects on human quantitative phenotypes presents unique challenges. Environmental covariates are complex and difficult to measure and control at the organismal level, as found in GWAS and epidemiological studies. An alternative approach focuses on the cellular environment using in vitro treatments as a proxy for the organismal environment. These cellular environments simplify the organism-level environmental exposures to provide a tractable influence on subcellular phenotypes, such as gene expression. Expression quantitative trait loci (eQTL) mapping studies identified GxE interactions in response to drug treatment and pathogen exposure. However, eQTL mapping approaches are infeasible for large-scale analysis of multiple cellular environments. Recently, allele-specific expression (ASE) analysis emerged as a powerful tool to identify GxE interactions in gene expression patterns by exploiting naturally occurring environmental exposures. Here we characterized genetic effects on the transcriptional response to 50 treatments in five cell types. We discovered 1455 genes with ASE (FDR < 10%) and 215 genes with GxE interactions. We demonstrated a major role for GxE interactions in complex traits. Genes with a transcriptional response to environmental perturbations showed sevenfold higher odds of being found in GWAS. Additionally, 105 genes that indicated GxE interactions (49%) were identified by GWAS as associated with complex traits. Examples include GIPR-caffeine interaction and obesity and include LAMP3-selenium interaction and Parkinson disease. Our results demonstrate that comprehensive catalogs of GxE interactions are indispensable to thoroughly annotate genes and bridge epidemiological and genome-wide association studies.
Asunto(s)
Perfilación de la Expresión Génica/métodos , Estudio de Asociación del Genoma Completo/métodos , Sitios de Carácter Cuantitativo/efectos de los fármacos , Alelos , Cafeína/farmacología , Línea Celular , Regulación de la Expresión Génica/efectos de los fármacos , Interacción Gen-Ambiente , Células Endoteliales de la Vena Umbilical Humana , Humanos , Melanocitos/citología , Melanocitos/efectos de los fármacos , Selenio/farmacología , Tunicamicina/farmacologíaRESUMEN
Motivation: The majority of the human genome is composed of non-coding regions containing regulatory elements such as enhancers, which are crucial for controlling gene expression. Many variants associated with complex traits are in these regions, and may disrupt gene regulatory sequences. Consequently, it is important to not only identify true enhancers but also to test if a variant within an enhancer affects gene regulation. Recently, allele-specific analysis in high-throughput reporter assays, such as massively parallel reporter assays (MPRAs), have been used to functionally validate non-coding variants. However, we are still missing high-quality and robust data analysis tools for these datasets. Results: We have further developed our method for allele-specific analysis QuASAR (quantitative allele-specific analysis of reads) to analyze allele-specific signals in barcoded read counts data from MPRA. Using this approach, we can take into account the uncertainty on the original plasmid proportions, over-dispersion, and sequencing errors. The provided allelic skew estimate and its standard error also simplifies meta-analysis of replicate experiments. Additionally, we show that a beta-binomial distribution better models the variability present in the allelic imbalance of these synthetic reporters and results in a test that is statistically well calibrated under the null. Applying this approach to the MPRA data, we found 602 SNPs with significant (false discovery rate 10%) allele-specific regulatory function in LCLs. We also show that we can combine MPRA with QuASAR estimates to validate existing experimental and computational annotations of regulatory variants. Our study shows that with appropriate data analysis tools, we can improve the power to detect allelic effects in high-throughput reporter assays. Availability and implementation: http://github.com/piquelab/QuASAR/tree/master/mpra. Contact: fluca@wayne.edu or rpique@wayne.edu. Supplementary information: Supplementary data are available online at Bioinformatics.