ABSTRACT
Large-scale gene sequencing studies for complex traits have the potential to identify causal genes with therapeutic implications. We performed gene-based association testing of blood lipid levels with rare (minor allele frequency < 1%) predicted damaging coding variation by using sequence data from >170,000 individuals from multiple ancestries: 97,493 European, 30,025 South Asian, 16,507 African, 16,440 Hispanic/Latino, 10,420 East Asian, and 1,182 Samoan. We identified 35 genes associated with circulating lipid levels; some of these genes have not been previously associated with lipid levels when using rare coding variation from population-based samples. We prioritized 32 genes in array-based genome-wide association study (GWAS) loci based on aggregations of rare coding variants; three (EVI5, SH2B3, and PLIN1) had no prior association of rare coding variants with lipid levels. Most of our associated genes showed evidence of association among multiple ancestries. Finally, we observed an enrichment of gene-based associations for low-density lipoprotein cholesterol drug target genes and for genes closest to GWAS index single-nucleotide polymorphisms (SNPs). Our results demonstrate that gene-based associations can be beneficial for drug target development and provide evidence that the gene closest to the array-based GWAS index SNP is often the functional gene for blood lipid levels.
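As a rough illustration of what aggregating rare coding variants within a gene can look like, the sketch below collapses variants under a minor allele frequency cutoff into a per-person burden score and regresses a lipid level on that score. It is not the study's pipeline: the function names, the toy genotypes, the allele frequencies, and the LDL values are all made up for the example, and a real analysis would also adjust for covariates, ancestry, and relatedness.

```python
from statistics import mean

def burden_scores(genotypes, mafs, maf_cutoff=0.01):
    """Sum minor-allele counts over variants whose MAF is below the cutoff."""
    keep = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    return [sum(row[j] for j in keep) for row in genotypes]

def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

# Hypothetical toy data: 6 people x 3 variants in one gene (allele counts 0/1/2).
genotypes = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 2],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]
mafs = [0.004, 0.008, 0.30]  # the third variant is common and is excluded
ldl = [100.0, 130.0, 100.0, 130.0, 100.0, 160.0]  # made-up lipid levels

scores = burden_scores(genotypes, mafs)
beta = ols_slope(scores, ldl)  # estimated change in LDL per rare allele carried
```

Collapsing to a single score is only one of several gene-based strategies; it trades per-variant resolution for power when individual variants are too rare to test on their own.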
Subject(s)
Exome , Genetic Variation , Genome-Wide Association Study , Lipids/blood , Open Reading Frames , Alleles , Blood Glucose/genetics , Case-Control Studies , Computational Biology/methods , Genetic Databases , Type 2 Diabetes Mellitus/genetics , Type 2 Diabetes Mellitus/metabolism , Genetic Predisposition to Disease , Population Genetics , Genome-Wide Association Study/methods , Humans , Lipid Metabolism/genetics , Liver/metabolism , Liver/pathology , Molecular Sequence Annotation , Multifactorial Inheritance , Phenotype , Single Nucleotide Polymorphism
ABSTRACT
The X Chromosome plays an important role in human development and disease. However, functional genomic and disease association studies of X genes greatly lag behind autosomal gene studies, in part owing to the unique biology of X-Chromosome inactivation (XCI). Because of XCI, most genes are only expressed from one allele. Yet, ~30% of X genes "escape" XCI and are transcribed from both alleles, many only in a proportion of the population. Such interindividual differences are likely to be disease relevant, particularly for sex-biased disorders. To understand the functional biology for X-linked genes, we developed X-Chromosome inactivation for RNA-seq (XCIR), a novel approach to identify escape genes using bulk RNA-seq data. Our method, available as an R package, is more powerful than alternative approaches and is computationally efficient to handle large population-scale data sets. Using annotated XCI states, we examined the contribution of X-linked genes to the disease heritability in the United Kingdom Biobank data set. We show that escape and variable escape genes explain the largest proportion of X heritability, which is in large part attributable to X genes with Y homology. Finally, we investigated the role of each XCI state in sex-biased diseases and found that although XY homologous gene pairs have a larger overall effect size, enrichment for variable escape genes is significantly increased in female-biased diseases. Our results, for the first time, quantitate the importance of variable escape genes for the etiology of sex-biased disease, and our pipeline allows analysis of larger data sets for a broad range of phenotypes.
Subject(s)
X-Linked Genes , X Chromosome Inactivation , Alleles , Animals , Female , Genomics , X Chromosome/genetics
ABSTRACT
BACKGROUND: Substance use occurs at a high rate in persons with a psychiatric disorder. Genetically informative studies have the potential to elucidate the etiology of these phenomena. Recent developments in genome-wide association studies (GWAS) allow new avenues of investigation. METHOD: Using results of GWAS meta-analyses, we performed a factor analysis of the genetic correlation structure, a genome-wide search of shared loci, and causally informative tests for six substance use phenotypes (four smoking, one alcohol, and one cannabis use) and five psychiatric disorders (ADHD, anorexia, depression, bipolar disorder, and schizophrenia). RESULTS: Two correlated factors, externalizing and internalizing/psychosis, were found, although model fit was beneath conventional standards. Of 458 loci reported in previous univariate GWAS of substance use and psychiatric disorders, about 50% (230 loci) were pleiotropic, with an additional 111 pleiotropic loci not reported in past GWAS. Of the 341 pleiotropic loci, 152 were associated with both substance use and psychiatric disorders, implicating neurodevelopment, cell morphogenesis, biological adhesion pathways, and enrichment in 13 different brain tissues. Seventy-five and 114 pleiotropic loci were specific to either psychiatric disorders or substance use phenotypes, implicating neuronal signaling pathways and clathrin-binding functions/structures, respectively. No consistent evidence for phenotypic causation was found across different Mendelian randomization methods. CONCLUSIONS: The genetic etiology of substance use and psychiatric disorders is highly pleiotropic and involves shared neurodevelopmental pathways, neurotransmission, and intracellular trafficking. In aggregate, the patterns are not consistent with vertical pleiotropy, more likely reflecting horizontal pleiotropy or more complex forms of phenotypic causation.
Subject(s)
Mental Disorders , Schizophrenia , Substance-Related Disorders , Genetic Pleiotropy , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Mental Disorders/epidemiology , Mental Disorders/genetics , Phenotype , Single Nucleotide Polymorphism , Schizophrenia/epidemiology , Schizophrenia/genetics , Substance-Related Disorders/epidemiology , Substance-Related Disorders/genetics
ABSTRACT
SUMMARY: Here, we present seqminer2, a highly efficient R package for querying and retrieving sequence variants from biobank-scale datasets of millions of individuals and hundreds of millions of genetic variants. Seqminer2 implements a novel variant-based index for querying VCF/BCF files. It improves the speed of query and retrieval by several orders of magnitude compared to the state-of-the-art tools based upon tabix. It also reimplements support for the BGEN and PLINK formats, which improves speed over alternative implementations. The improved efficiency and comprehensive support for popular file formats will facilitate method development, software prototyping and data analysis of biobank-scale sequence datasets in R. AVAILABILITY AND IMPLEMENTATION: The seqminer2 R package is available from https://github.com/zhanxw/seqminer. Scripts used for the benchmarks are available in https://github.com/yang-lina/seqminer/blob/master/seqminer2%20benchmark%20script.txt. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
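To make the idea of a per-variant index concrete, here is a minimal sketch that records the byte offset of every variant line in an uncompressed VCF and then seeks straight to a requested variant. It is an assumption-laden toy, not seqminer2's index format or API: the function names are hypothetical, the file is an in-memory buffer, and real VCF/BCF files are block-compressed, which this example ignores.

```python
import io

def build_variant_index(handle):
    """Map (chrom, pos) to the byte offset where that variant's line starts."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#"):  # skip meta-information and header lines
            continue
        chrom, pos = line.split(b"\t", 2)[:2]
        index[(chrom.decode(), int(pos))] = offset
    return index

def fetch_variant(handle, index, chrom, pos):
    """Seek directly to one variant and return its tab-separated fields."""
    handle.seek(index[(chrom, pos)])
    return handle.readline().decode().rstrip("\n").split("\t")

# Hypothetical two-variant, two-sample VCF held in memory.
vcf = io.BytesIO(
    b"##fileformat=VCFv4.2\n"
    b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    b"1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\n"
    b"1\t2000\trs2\tC\tT\t.\tPASS\t.\tGT\t1/1\t0/1\n"
)
index = build_variant_index(vcf)
record = fetch_variant(vcf, index, "1", 2000)
```

A lookup keyed on the exact variant avoids scanning a whole genomic bin to find one record, which is the general motivation for indexing by variant rather than by region.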
Subject(s)
Biological Specimen Banks , Software , Genotype , Humans
ABSTRACT
MOTIVATION: Large-scale genome-wide association studies (GWAS) have resulted in the identification of a wide range of genetic variants related to a host of complex traits and disorders. Despite their success, the individual single-nucleotide polymorphism (SNP) analysis approach adopted in most current GWAS can be limited in that it is usually biologically too simplistic to elucidate a comprehensive genetic architecture of phenotypes and statistically underpowered due to the heavy multiple-testing correction burden. On the other hand, multiple-SNP analyses (e.g. gene-based or region-based SNP-set analysis) are usually more powerful for examining the joint effects of a set of SNPs on the phenotype of interest. However, current multiple-SNP approaches can only draw an overall conclusion at the SNP-set level and do not directly inform which SNPs in the SNP-set are driving the overall genotype-phenotype association. RESULTS: In this article, we propose a new permutation-assisted tuning procedure in lasso (plasso) to identify phenotype-associated SNPs in a joint multiple-SNP regression model in GWAS. The tuning parameter of lasso determines the amount of shrinkage and is essential to the performance of variable selection. In the proposed plasso procedure, we first generate permutations as pseudo-SNPs that are not associated with the phenotype. Then, the lasso tuning parameter is carefully chosen to separate true signal SNPs and non-informative pseudo-SNPs. We illustrate plasso using simulations to demonstrate its superior performance over existing methods, and application of plasso to a real GWAS dataset yields additional insights into the genetic control of complex traits. AVAILABILITY AND IMPLEMENTATION: R code to implement the proposed methodology is available at https://github.com/xyz5074/plasso. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genome-Wide Association Study , Single Nucleotide Polymorphism , Genetic Association Studies , Phenotype
ABSTRACT
Meta-analysis of genetic association studies increases sample size and the power for mapping complex traits. Existing methods are mostly developed for datasets without missing values, i.e. the summary association statistics are measured for all variants in contributing studies. In practice, genotype imputation is not always effective. This may be the case when targeted genotyping/sequencing assays are used or when the un-typed genetic variant is rare. Therefore, contributed summary statistics often contain missing values. Existing methods for imputing missing summary association statistics and using imputed values in meta-analysis, approximate conditional analysis, or simple strategies such as complete case analysis all have theoretical limitations. Applying these approaches can bias genetic effect estimates and lead to seriously inflated type-I or type-II errors in conditional analysis, which is a critical tool for identifying independently associated variants. To address this challenge and complement imputation methods, we developed a method to combine summary statistics across participating studies and consistently estimate joint effects, even when the contributed summary statistics contain large amounts of missing values. Based on this estimator, we proposed a score statistic called PCBS (partial correlation based score statistic) for conditional analysis of single-variant and gene-level associations. Through extensive analysis of simulated and real data, we showed that the new method produces well-calibrated type-I errors and is substantially more powerful than existing approaches. We applied the proposed approach to one of the largest meta-analyses to date for the cigarettes-per-day phenotype. Using the new method, we identified multiple novel independently associated variants at known loci for tobacco use, which were otherwise missed by alternative methods. 
Together, the phenotypic variance explained by these variants was 1.1%, improving that of previously reported associations by 71%. These findings illustrate the extent of locus allelic heterogeneity and can help pinpoint causal variants.
Subject(s)
Data Analysis , Tobacco Products/statistics & numerical data , Tobacco Use/genetics , Alleles , Statistical Data Interpretation , Datasets as Topic , Genetic Loci/genetics , Genome-Wide Association Study , Genotype , Humans , Phenotype , Single Nucleotide Polymorphism
ABSTRACT
Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10⁻⁶) was observed with CCDC62 (SKAT-O [p = 6.89 × 10⁻⁷], combined multivariate collapsing [p = 1.48 × 10⁻⁶], and burden of rare variants [p = 1.48 × 10⁻⁶]). SEQSpark was also used to analyze 50,000 simulated exomes and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.
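The exome-wide threshold quoted above matches a Bonferroni correction of a 0.05 significance level for roughly 20,000 gene-level tests. The abstract does not state the number of genes tested, so the 20,000 figure below is an assumption used only to show the arithmetic; the three p-values are the ones reported for CCDC62.

```python
alpha = 0.05
n_genes = 20_000  # assumed number of gene-level tests; not stated in the abstract

# Bonferroni: divide the family-wise error rate by the number of tests.
threshold = alpha / n_genes

reported = {
    "SKAT-O": 6.89e-7,
    "combined multivariate collapsing": 1.48e-6,
    "burden of rare variants": 1.48e-6,
}
significant = {name: p < threshold for name, p in reported.items()}
```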
Subject(s)
Nucleic Acid Databases , Exome/genetics , Genetic Variation , Genome-Wide Association Study/methods , DNA Sequence Analysis/methods , Software , Humans , Principal Component Analysis , Waist-Hip Ratio
ABSTRACT
Platelet production, maintenance, and clearance are tightly controlled processes indicative of platelets' important roles in hemostasis and thrombosis. Platelets are common targets for primary and secondary prevention of several conditions. They are monitored clinically by complete blood counts, specifically with measurements of platelet count (PLT) and mean platelet volume (MPV). Identifying genetic effects on PLT and MPV can provide mechanistic insights into platelet biology and their role in disease. Therefore, we formed the Blood Cell Consortium (BCX) to perform a large-scale meta-analysis of Exomechip association results for PLT and MPV in 157,293 and 57,617 individuals, respectively. Using the low-frequency/rare coding variant-enriched Exomechip genotyping array, we sought to identify genetic variants associated with PLT and MPV. In addition to confirming 47 known PLT and 20 known MPV associations, we identified 32 PLT and 18 MPV associations not previously observed in the literature across the allele frequency spectrum, including rare large effect (FCER1A), low-frequency (IQGAP2, MAP1A, LY75), and common (ZMIZ2, SMG6, PEAR1, ARFGAP3/PACSIN2) variants. Several variants associated with PLT/MPV (PEAR1, MRVI1, PTGES3) were also associated with platelet reactivity. In concurrent BCX analyses, there was overlap of platelet-associated variants with red (MAP1A, TMPRSS6, ZMIZ2) and white (PEAR1, ZMIZ2, LY75) blood cell traits, suggesting common regulatory pathways with shared genetic architecture among these hematopoietic lineages. Our large-scale Exomechip analyses identified previously undocumented associations with platelet traits and further indicate that several complex quantitative hematological, lipid, and cardiovascular traits share genetic factors.
Subject(s)
Blood Platelets/metabolism , Exome/genetics , Genetic Variation/genetics , Female , Genome-Wide Association Study , Humans , Male , Mean Platelet Volume , Platelet Count
ABSTRACT
Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms (sequence gap-filled gene feature annotation, bit-block encoded genotypes, and sectional fast access to text lines) to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, for integrating downstream analysis functions for whole genome sequencing data. KGGSeq has been equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In the tests with whole genome sequencing data from the 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access genotypes of ~60 million variants of 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles and was over 1000 times faster at calculating genotypic correlation.
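For readers unfamiliar with bit-level genotype storage, the sketch below packs unphased biallelic genotype codes into two bits each, four per byte. This is a generic illustration of the space saving that bit encoding gives over text, not KGGSeq's bit-block format, which the abstract does not specify; the function names and the code 3 for "missing" are assumptions of the example.

```python
def pack_genotypes(codes):
    """Pack genotype codes (0, 1, 2, or 3 for missing) into 2 bits each."""
    packed = bytearray((len(codes) + 3) // 4)  # four genotypes per byte
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("genotype code must be in 0..3")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)

def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]

codes = [0, 1, 2, 3, 2, 0, 1]  # hypothetical genotypes for seven subjects
packed = pack_genotypes(codes)
restored = unpack_genotypes(packed, len(codes))
```

Seven genotypes occupy two bytes here, whereas the same genotypes written as VCF text fields such as `0/1` take several bytes each.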
Subject(s)
Algorithms , Human Genome , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Humans
ABSTRACT
MOTIVATION: Next-generation sequencing technologies have enabled the large-scale assessment of the impact of rare and low-frequency genetic variants for complex human diseases. Gene-level association tests are often performed to analyze rare variants, where multiple rare variants in a gene region are analyzed jointly. Applying gene-level association tests to analyze sequence data often requires integrating multiple heterogeneous sources of information (e.g. annotations, functional prediction scores, allele frequencies, genotypes and phenotypes) to determine the optimal analysis unit and prioritize causal variants. Given the complexity and scale of current sequence datasets and bioinformatics databases, there is a compelling need for more efficient software tools to facilitate these analyses. To answer this challenge, we developed RVTESTS, which implements a broad set of rare variant association statistics and supports the analysis of autosomal and X-linked variants for both unrelated and related individuals. RVTESTS also provides useful companion features for annotating sequence variants, integrating bioinformatics databases, performing data quality control and sample selection. We illustrate the advantages of RVTESTS in functionality and efficiency using the 1000 Genomes Project data. AVAILABILITY AND IMPLEMENTATION: RVTESTS is available on Linux, MacOS and Windows. Source code and executable files can be obtained at https://github.com/zhanxw/rvtests CONTACT: zhanxw@gmail.com; goncalo@umich.edu; dajiang.liu@outlook.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genetic Variation , Software , Animals , Genotype , High-Throughput Nucleotide Sequencing , Humans , Programming Languages
ABSTRACT
OBJECTIVE: In observational epidemiologic studies, higher plasma high-density lipoprotein cholesterol (HDL-C) has been associated with increased risk of intracerebral hemorrhage (ICH). DNA sequence variants that decrease cholesteryl ester transfer protein (CETP) gene activity increase plasma HDL-C; as such, medicines that inhibit CETP and raise HDL-C are in clinical development. Here, we test the hypothesis that CETP DNA sequence variants associated with higher HDL-C also increase risk for ICH. METHODS: We performed 2 candidate-gene analyses of CETP. First, we tested individual CETP variants in a discovery cohort of 1,149 ICH cases and 1,238 controls from 3 studies, followed by replication in 1,625 cases and 1,845 controls from 5 studies. Second, we constructed a genetic risk score composed of 7 independent variants at the CETP locus and tested this score for association with HDL-C as well as ICH risk. RESULTS: Twelve variants within CETP demonstrated nominal association with ICH, with the strongest association at the rs173539 locus (odds ratio [OR] = 1.25, standard error [SE] = 0.06, p = 6.0 × 10⁻⁴) with no heterogeneity across studies (I² = 0%). This association was replicated in patients of European ancestry (p = 0.03). A genetic score of CETP variants found to increase HDL-C by ~2.85 mg/dl in the Global Lipids Genetics Consortium was strongly associated with ICH risk (OR = 1.86, SE = 0.13, p = 1.39 × 10⁻⁶). INTERPRETATION: Genetic variants in CETP associated with increased HDL-C raise the risk of ICH. Given ongoing therapeutic development in CETP inhibition and other HDL-raising strategies, further exploration of potential adverse cerebrovascular outcomes may be warranted. Ann Neurol 2016;80:730-740.
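A genetic risk score of this kind is commonly computed as a weighted sum of effect-allele counts. The sketch below shows that calculation for seven variants; the per-allele weights and the genotype are invented for illustration and are not the CETP estimates used in the study, and the function name is hypothetical.

```python
def genetic_risk_score(allele_counts, weights):
    """Weighted sum of effect-allele counts (0, 1, or 2 per variant)."""
    if len(allele_counts) != len(weights):
        raise ValueError("need one weight per variant")
    return sum(count * weight for count, weight in zip(allele_counts, weights))

# Made-up per-allele HDL-C effects in mg/dl for seven independent variants.
weights = [0.5, 0.4, 0.3, 0.6, 0.2, 0.45, 0.4]
# Made-up effect-allele counts for one person at the same seven variants.
allele_counts = [1, 0, 2, 1, 0, 1, 2]

score = genetic_risk_score(allele_counts, weights)  # predicted HDL-C shift, mg/dl
```

Weighting by the effect on HDL-C puts the score on the scale of the exposure, so an association between the score and ICH can be read per unit of genetically raised HDL-C.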
Subject(s)
Cerebral Hemorrhage/genetics , Cholesterol Ester Transfer Proteins/genetics , Genetic Predisposition to Disease/genetics , Adult , Aged , HDL Cholesterol/blood , HDL Cholesterol/genetics , Female , Genotype , Humans , Male , Middle Aged , Single Nucleotide Polymorphism
ABSTRACT
Next-generation sequencing has enabled the study of a comprehensive catalogue of genetic variants for their impact on various complex diseases. Numerous consortium studies of complex traits have publicly released their summary association statistics, which have become an invaluable resource for learning the underlying biology, understanding the genetic architecture, and guiding clinical translations. There is great interest in the field in developing novel statistical methods for analyzing and interpreting results from these genotype-phenotype association studies. One popular platform for method development and data analysis is R. In order to enable these analyses in R, it is necessary to develop packages that can efficiently query files of summary association statistics, explore the linkage disequilibrium structure between variants, and integrate various bioinformatics databases. The complexity and scale of sequence datasets and databases pose significant computational challenges for method developers. To address these challenges and facilitate method development, we developed the R package SEQMINER for annotating and querying files of sequence variants (e.g., VCF/BCF files) and summary association statistics (e.g., METAL/RAREMETAL files), and for integrating bioinformatics databases. SEQMINER provides an infrastructure where novel methods can be distributed and applied to analyzing sequence datasets in practice. We illustrate the performance of SEQMINER using datasets from the 1000 Genomes Project. We show that SEQMINER is highly efficient and easy to use. It will greatly accelerate the process of applying statistical innovations to analyze and interpret sequence-based associations. The R package, its source code and documentation are available from http://cran.r-project.org/web/packages/seqminer and http://seqminer.genomic.codes/.
Subject(s)
Computational Biology/methods , Genetic Association Studies/methods , High-Throughput Nucleotide Sequencing/methods , Programming Languages , Base Sequence , Statistical Data Interpretation , Factual Databases , Genetic Variation/genetics , Human Genome , Humans , DNA Sequence Analysis , Software
ABSTRACT
Advances in exome sequencing and the development of exome genotyping arrays are enabling explorations of association between rare coding variants and complex traits. To ensure power for these rare variant analyses, a variety of association tests that group variants by gene or functional unit have been proposed. Here, we extend these tests to family-based studies. We develop family-based burden tests, variable frequency threshold tests and sequence kernel association tests. Through simulations, we compare the performance of different tests. We describe situations where family-based studies provide greater power than studies of unrelated individuals to detect rare variants associated with moderate to large changes in trait values. Broadly speaking, we find that when sample sizes are limited and only a modest fraction of all trait-associated variants can be identified, family samples are more powerful. Finally, we illustrate our approach by analyzing the relationship between coding variants and levels of high-density lipoprotein (HDL) cholesterol in 11,556 individuals from the HUNT and SardiNIA studies, demonstrating association for coding variants in the APOC3, CETP, LIPC, LIPG, and LPL genes and illustrating the value of family samples, meta-analysis, and gene-level tests. Our methods are implemented in freely available C++ code.
Subject(s)
Genetic Association Studies/methods , Genetic Variation/genetics , Genetic Models , Software , Apolipoprotein C-III/genetics , Cholesterol Ester Transfer Proteins/genetics , HDL Cholesterol/genetics , Computer Simulation , Exome/genetics , Family , Genotype , Humans , Lipase/genetics , Lipoprotein Lipase/genetics , Phenotype
ABSTRACT
Next-generation sequencing has led to many complex-trait rare-variant (RV) association studies. Although single-variant association analysis can be performed, it is grossly underpowered. Therefore, researchers have developed many RV association tests that aggregate multiple variant sites across a genetic region (e.g., gene), and test for the association between the trait and the aggregated genotype. After these aggregate tests detect an association, it is only possible to estimate the average genetic effect for a group of RVs. As a result of the "winner's curse," such an estimate can be biased. Although for common variants one can obtain unbiased estimates of genetic parameters by analyzing a replication sample, for RVs it is desirable to obtain unbiased genetic estimates for the study where the association is identified. This is because there can be substantial heterogeneity of RV sites and frequencies even among closely related populations. In order to obtain an unbiased estimate for aggregated RV analysis, we developed bootstrap-sample-split algorithms to reduce the bias of the winner's curse. The unbiased estimates are greatly important for understanding the population-specific contribution of RVs to the heritability of complex traits. We also demonstrate both theoretically and via simulations that for aggregate RV analysis the genetic variance for a gene or region will always be underestimated, sometimes substantially, because of the presence of noncausal variants or because of the presence of causal variants with effects of different magnitudes or directions. Therefore, even if RVs play a major role in the complex-trait etiologies, a portion of the heritability will remain missing, and the contribution of RVs to the complex-trait etiologies will be underestimated.
Subject(s)
Genome-Wide Association Study/methods , Genetic Models , Algorithms , Gene Frequency , Genetic Loci , Genetic Predisposition to Disease , Genetic Variation , Genotype , Humans , Statistical Models
ABSTRACT
Next-generation sequencing has made possible the detection of rare variant (RV) associations with quantitative traits (QT). Due to high sequencing cost, many studies can only sequence a modest number of selected samples with extreme QT. Therefore, association testing in individual studies can be underpowered. Besides the primary trait, many clinically important secondary traits are often measured. It is highly beneficial if multiple studies can be jointly analyzed for detecting associations with commonly measured traits. However, analyzing secondary traits in selected samples can be biased if sample ascertainment is not properly modeled. Some methods exist for analyzing secondary traits in selected samples, where some burden tests can be implemented. However, p-values can only be evaluated analytically via asymptotic approximations, which may not be accurate. Additionally, potentially more powerful sequence kernel association tests, variable selection-based methods, and burden tests that require permutations cannot be incorporated. To overcome these limitations, we developed a unified method for analyzing secondary trait associations with RVs (STAR) in selected samples, incorporating all RV tests. Statistical significance can be evaluated either through permutations or analytically. STAR makes it possible to apply more powerful RV tests to analyze secondary trait associations. It also enables jointly analyzing multiple cohorts ascertained under different study designs, which greatly boosts power. The performance of STAR and commonly used RV association tests was comprehensively evaluated using simulation studies. STAR was also used to analyze a dataset from the SardiNIA project where samples with extreme low-density lipoprotein levels were sequenced. A significant association between LDLR and systolic blood pressure was identified, which is supported by pharmacogenetic studies.
In summary, for sequencing studies, STAR is an important tool for detecting secondary-trait RV associations.
Subject(s)
Genetic Predisposition to Disease , Genetic Variation , Genome-Wide Association Study , Quantitative Trait Loci/genetics , Blood Pressure/genetics , Computer Simulation , High-Throughput Nucleotide Sequencing , Humans , Italy , LDL Lipoproteins/genetics , Genetic Models , Phenotype , Software
ABSTRACT
Bridging the gap between genotype and phenotype in GWAS studies is challenging. A multitude of genetic variants have been associated with immune-related diseases, including cancer, yet the interpretability of most variants remains low. Here, we investigate the quantitative components in the T cell receptor (TCR) repertoire, the frequency of clusters of TCR sequences predicted to have common antigen specificity, to interpret the genetic associations of diverse human diseases. We first developed a statistical model to predict the TCR components using variants in the TRB and HLA loci. Applying this model to over 300,000 individuals in the UK Biobank data, we identified 2309 associations between TCR abundances and various immune diseases. TCR clusters predicted to be pathogenic for autoimmune diseases were significantly enriched for predicted autoantigen-specificity. Moreover, four TCR clusters were associated with better outcomes in distinct cancers, where conventional GWAS cannot identify any significant locus. Collectively, our results highlight the integral role of adaptive immune responses in explaining the associations between genotype and phenotype.
Subject(s)
Genome-Wide Association Study , Phenotype , T-Cell Antigen Receptors , Humans , T-Cell Antigen Receptors/genetics , T-Cell Antigen Receptors/immunology , Autoimmune Diseases/genetics , Autoimmune Diseases/immunology , Genotype , Neoplasms/genetics , Neoplasms/immunology , Genetic Predisposition to Disease
ABSTRACT
Large national-level electronic health record (EHR) datasets offer new opportunities for disentangling the role of genes and environment through deep phenotype information and approximate pedigree structures. Here we use the approximate geographical locations of patients as a proxy for spatially correlated community-level environmental risk factors. We develop a spatial mixed linear effect (SMILE) model that incorporates both genetics and environmental contribution. We extract EHR and geographical locations from 257,620 nuclear families and compile 1083 disease outcome measurements from the MarketScan dataset. We augment the EHR with publicly available environmental data, including levels of particulate matter 2.5 (PM2.5), nitrogen dioxide (NO2), climate, and sociodemographic data. We refine the estimates of genetic heritability and quantify community-level environmental contributions. We also use wind speed and direction as instrumental variables to assess the causal effects of air pollution. In total, we find PM2.5 or NO2 have statistically significant causal effects on 135 diseases, including respiratory, musculoskeletal, digestive, metabolic, and sleep disorders, where PM2.5 and NO2 tend to affect biologically distinct disease categories. These analyses showcase several robust strategies for jointly modeling genetic and environmental effects on disease risk using large EHR datasets and will benefit upcoming biobank studies in the era of precision medicine.
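The instrumental-variable idea can be illustrated with the simplest single-instrument estimator, the ratio of the instrument-outcome covariance to the instrument-exposure covariance. The sketch below is a toy under stated assumptions, not the SMILE model or the study's analysis: the variable names are hypothetical, the data are constructed by hand so that an unobserved confounder biases the naive regression, and the true effect is set to 0.5.

```python
from statistics import mean

def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def wald_iv_estimate(instrument, exposure, outcome):
    """Single-instrument IV estimate: cov(Z, Y) / cov(Z, X)."""
    return covariance(instrument, outcome) / covariance(instrument, exposure)

# Hand-built toy data. The confounder is uncorrelated with the instrument.
wind = [-1.0, 1.0, -1.0, 1.0]          # instrument Z
confounder = [1.0, 1.0, -1.0, -1.0]    # unobserved U
pollution = [2 * z + u for z, u in zip(wind, confounder)]           # X = 2Z + U
disease = [0.5 * x + 3 * u for x, u in zip(pollution, confounder)]  # Y = 0.5X + 3U

naive = covariance(pollution, disease) / covariance(pollution, pollution)
iv = wald_iv_estimate(wind, pollution, disease)
```

In this construction the naive slope is inflated by the confounder while the instrument-based ratio recovers the effect built into the data, which is the reason for using an exposure driver such as wind that is plausibly unrelated to individual-level risk factors.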
Subject(s)
Air Pollution , Nitrogen Dioxide , Particulate Matter , Humans , Air Pollution/adverse effects , Particulate Matter/adverse effects , Nitrogen Dioxide/adverse effects , Nitrogen Dioxide/analysis , Risk Factors , Environmental Exposure/adverse effects , Male , Female , Electronic Health Records , Air Pollutants/adverse effects , Air Pollutants/analysis , Air Pollutants/toxicity , Genetic Predisposition to Disease , Gene-Environment Interaction , Middle Aged , Adult
ABSTRACT
Transcriptome-wide association study (TWAS) is a popular approach to dissect the functional consequence of disease-associated non-coding variants. Most existing TWAS use bulk tissues and may not have the resolution to reveal cell-type-specific target genes. Single-cell expression quantitative trait loci (sc-eQTL) datasets are emerging. The largest bulk- and sc-eQTL datasets are most conveniently available as summary statistics, but have not been broadly utilized in TWAS. Here, we present a new method, EXPRESSO (EXpression PREdiction with Summary Statistics Only), to analyze sc-eQTL summary statistics, which also integrates 3D genomic data and epigenomic annotation to prioritize causal variants. EXPRESSO substantially improves on existing methods. We apply EXPRESSO to analyze multi-ancestry GWAS datasets for 14 autoimmune diseases. EXPRESSO uniquely identifies 958 novel gene × trait associations, which is 26% more than the second-best method. Among them, 492 are unique to cell-type-level analysis and missed by TWAS using whole blood. We also develop a cell-type-aware drug repurposing pipeline, which leverages EXPRESSO results to identify drug compounds that can reverse disease-associated gene expression in relevant cell types. Our results point to multiple drugs with therapeutic potential, including metformin for type 1 diabetes and vitamin K for ulcerative colitis.
Subject(s)
Genome-Wide Association Study , Quantitative Trait Loci , Single-Cell Analysis , Humans , Single-Cell Analysis/methods , Genome-Wide Association Study/methods , Genetic Predisposition to Disease/genetics , Transcriptome/genetics , Autoimmune Diseases/genetics , Polymorphism, Single Nucleotide , Multifactorial Inheritance/genetics , Gene Expression Profiling/methods
ABSTRACT
Genetic mechanisms of blood pressure (BP) regulation remain poorly defined. Using kidney-specific epigenomic annotations and 3D genome information, we generated and validated gene expression prediction models for the purpose of transcriptome-wide association studies in 700 human kidneys. We identified 889 kidney genes associated with BP, of which 399 were prioritised as contributors to BP regulation. Imputation of kidney proteome and microRNAome uncovered 97 renal proteins and 11 miRNAs associated with BP. Integration with plasma proteomics and metabolomics illuminated circulating levels of myo-inositol, 4-guanidinobutanoate and angiotensinogen as downstream effectors of several kidney BP genes (SLC5A11, AGMAT, AGT, respectively). We showed that genetically determined reduction in renal expression may mimic the effects of rare loss-of-function variants on kidney mRNA/protein and lead to an increase in BP (e.g., ENPEP). We demonstrated a strong correlation (r = 0.81) in expression of protein-coding genes between cells harvested from urine and the kidney, highlighting a diagnostic potential of urinary cell transcriptomics. We uncovered adenylyl cyclase activators as a repurposing opportunity for hypertension and illustrated examples of BP-elevating effects of anticancer drugs (e.g., tubulin polymerisation inhibitors). Collectively, our studies provide new biological insights into genetic regulation of BP with potential to drive clinical translation in hypertension.
Subject(s)
Hypertension , Proteome , Humans , Blood Pressure/genetics , Proteome/genetics , Proteome/metabolism , Transcriptome/genetics , Multiomics , Hypertension/metabolism , Kidney/metabolism , Sodium-Glucose Transport Proteins/genetics , Sodium-Glucose Transport Proteins/metabolism
ABSTRACT
There is solid evidence that complex traits can be caused by rare variants. Next-generation sequencing technologies are powerful tools for mapping rare variants. Confirmation of significant findings in stage 1 through replication in an independent stage 2 sample is necessary for association studies. For gene-based mapping of rare variants, two replication strategies are possible: (1) variant-based replication, wherein only variants from nucleotide sites uncovered in stage 1 are genotyped and followed up, and (2) sequence-based replication, wherein the gene region is sequenced in the replication sample and both known and novel variants are tested. The efficiency of the two strategies depends on the proportion of causative variants discovered in stage 1 and on sequencing/genotyping errors. With rigorous population genetic and phenotypic models, it is demonstrated that sequence-based replication is consistently more powerful. However, the power gain is small (1) for large-scale studies with thousands of individuals, because a large fraction of causative variant sites can be observed, and (2) for small- to medium-scale studies with a few hundred samples, because a large proportion of the locus population attributable risk can be explained by the uncovered variants. Therefore, genotyping can be a temporary solution for replicating genetic studies if stage 1 and 2 samples are drawn from the same population. However, sequence-based replication is advantageous if the stage 1 sample is small or if discovery of novel variants is also of interest. It is shown that currently attainable levels of sequencing error only minimally affect the comparison, and the advantage of sequence-based replication remains.
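
The difference between the two replication strategies can be illustrated with a minimal sketch. It is not the authors' method or power model: the genotype matrices are made up, the helper names (`discovered_sites`, `carrier_flags`) are hypothetical, and collapsing each individual to a carrier flag is only one simple way a gene-based test might aggregate rare variants.

```python
# Toy genotypes for one gene: rows are individuals, columns are variant sites.
# 1 = carries a rare allele at that site, 0 = does not. All values are invented.
stage1 = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
stage2 = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]


def discovered_sites(genotypes):
    """Return the indices of sites where at least one individual carries a variant."""
    return {j for row in genotypes for j, g in enumerate(row) if g}


def carrier_flags(genotypes, sites):
    """Flag each individual who carries a rare allele at any of the given sites."""
    return [int(any(row[j] for j in sites)) for row in genotypes]


# Variant-based replication: only sites uncovered in stage 1 are genotyped in stage 2.
stage1_sites = discovered_sites(stage1)
variant_based = carrier_flags(stage2, stage1_sites)

# Sequence-based replication: the whole gene region is sequenced in stage 2,
# so variants at sites not seen in stage 1 are also counted.
all_sites = set(range(len(stage2[0])))
sequence_based = carrier_flags(stage2, all_sites)

print("stage 1 sites:", sorted(stage1_sites))
print("variant-based carriers:", variant_based)
print("sequence-based carriers:", sequence_based)
```

In this toy data, two stage 2 individuals carry variants at sites that stage 1 never uncovered, so they count as carriers only under sequence-based replication. As stage 1 uncovers a larger share of the sites, the two carrier vectors converge, which is the intuition behind the small power gain in large studies.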