ABSTRACT
Both common and rare genetic variants influence complex traits and common diseases. Genome-wide association studies have identified thousands of common-variant associations, and more recently, large-scale exome sequencing studies have identified rare-variant associations in hundreds of genes [1-3]. However, rare-variant genetic architecture is not well characterized, and the relationship between common-variant and rare-variant architecture is unclear [4]. Here we quantify the heritability explained by the gene-wise burden of rare coding variants across 22 common traits and diseases in 394,783 UK Biobank exomes [5]. Rare coding variants (allele frequency < 1 × 10⁻³) explain 1.3% (s.e. = 0.03%) of phenotypic variance on average, much less than common variants, and most burden heritability is explained by ultrarare loss-of-function variants (allele frequency < 1 × 10⁻⁵). Common and rare variants implicate the same cell types, with similar enrichments, and they have pleiotropic effects on the same pairs of traits, with similar genetic correlations. They partially colocalize at individual genes and loci, but not to the same extent: burden heritability is strongly concentrated in significant genes, while common-variant heritability is more polygenic, and burden heritability is also more strongly concentrated in constrained genes. Finally, we find that burden heritability for schizophrenia and bipolar disorder [6,7] is approximately 2%. Our results indicate that rare coding variants will implicate a tractable number of large-effect genes, that common and rare associations are mechanistically convergent, and that rare coding variants will contribute only modestly to missing heritability and population risk stratification.
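As a rough illustration of what a gene-wise burden is, the sketch below sums, for each individual, the minor-allele counts of the rare variants in one gene, keeping only variants below an allele-frequency cutoff. The function name, the data layout and the toy values are assumptions made for illustration; this is not the study's pipeline, and it does not estimate heritability.

```python
from typing import List


def gene_burden(genotypes: List[List[int]], allele_freqs: List[float],
                max_af: float = 1e-3) -> List[int]:
    """Per-individual burden for one gene.

    genotypes[i][j] is the minor-allele count (0, 1 or 2) of individual i
    at variant j; allele_freqs[j] is that variant's allele frequency.
    Only variants with frequency below max_af contribute to the sum.
    """
    rare = [j for j, af in enumerate(allele_freqs) if af < max_af]
    return [sum(row[j] for j in rare) for row in genotypes]


# Toy data (hypothetical): 3 individuals, 4 variants in one gene.
genotypes = [
    [0, 1, 0, 2],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
allele_freqs = [5e-6, 2e-4, 8e-4, 0.05]  # the last variant is common

burden_rare = gene_burden(genotypes, allele_freqs, max_af=1e-3)
burden_ultrarare = gene_burden(genotypes, allele_freqs, max_af=1e-5)
```

Changing `max_af` from the rare to the ultrarare threshold quoted in the abstract changes which variants enter the sum, which is the only difference between the two calls.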
Subject(s)
Exome; Gene Frequency; Genetic Variation; Multifactorial Inheritance; Humans; Exome/genetics; Genetic Variation/genetics; Genome-Wide Association Study; Multifactorial Inheritance/genetics; Risk Factors; United Kingdom; Genetic Loci/genetics; Schizophrenia/genetics; Bipolar Disorder/genetics
ABSTRACT
COVID-19, which is caused by SARS-CoV-2, can result in acute respiratory distress syndrome and multiple organ failure [1-4], but little is known about its pathophysiology. Here we generated single-cell atlases of 24 lung, 16 kidney, 16 liver and 19 heart autopsy tissue samples and spatial atlases of 14 lung samples from donors who died of COVID-19. Integrated computational analysis uncovered substantial remodelling in the lung epithelial, immune and stromal compartments, with evidence of multiple paths of failed tissue regeneration, including defective alveolar type 2 differentiation and expansion of fibroblasts and putative TP63+ intrapulmonary basal-like progenitor cells. Viral RNAs were enriched in mononuclear phagocytic and endothelial lung cells, which induced specific host programs. Spatial analysis in lung distinguished inflammatory host responses in lung regions with and without viral RNA. Analysis of the other tissue atlases showed transcriptional alterations in multiple cell types in heart tissue from donors with COVID-19, and mapped cell types and genes implicated with disease severity based on COVID-19 genome-wide association studies. Our foundational dataset elucidates the biological effect of severe SARS-CoV-2 infection across the body, a key step towards new treatments.
Subject(s)
COVID-19/pathology; COVID-19/virology; Kidney/pathology; Liver/pathology; Lung/pathology; Myocardium/pathology; SARS-CoV-2/pathogenicity; Adult; Aged; Aged, 80 and over; Atlases as Topic; Autopsy; Biological Specimen Banks; COVID-19/genetics; COVID-19/immunology; Endothelial Cells; Epithelial Cells/pathology; Epithelial Cells/virology; Female; Fibroblasts; Genome-Wide Association Study; Heart/virology; Humans; Inflammation/pathology; Inflammation/virology; Kidney/virology; Liver/virology; Lung/virology; Male; Middle Aged; Organ Specificity; Phagocytes; Pulmonary Alveoli/pathology; Pulmonary Alveoli/virology; RNA, Viral/analysis; Regeneration; SARS-CoV-2/immunology; Single-Cell Analysis; Viral Load
ABSTRACT
Cellular heterogeneity in gene expression is driven by cellular processes, such as cell cycle and cell-type identity, and cellular environment such as spatial location. The cell cycle, in particular, is thought to be a key driver of cell-to-cell heterogeneity in gene expression, even in otherwise homogeneous cell populations. Recent advances in single-cell RNA-sequencing (scRNA-seq) facilitate detailed characterization of gene expression heterogeneity and can thus shed new light on the processes driving heterogeneity. Here, we combined fluorescence imaging with scRNA-seq to measure cell cycle phase and gene expression levels in human induced pluripotent stem cells (iPSCs). By using these data, we developed a novel approach to characterize cell cycle progression. Although standard methods assign cells to discrete cell cycle stages, our method goes beyond this and quantifies cell cycle progression on a continuum. We found that, on average, scRNA-seq data from only five genes predicted a cell's position on the cell cycle continuum to within 14% of the entire cycle and that using more genes did not improve this accuracy. Our data and predictor of cell cycle phase can directly help future studies to account for cell cycle-related heterogeneity in iPSCs. Our results and methods also provide a foundation for future work to characterize the effects of the cell cycle on expression heterogeneity in other cell types.
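To make "within 14% of the entire cycle" concrete, the sketch below measures the error between a predicted and an observed cell cycle position as the shorter arc between them, expressed as a fraction of the full cycle. Representing positions as angles in radians, and the function names, are assumptions for illustration; the abstract does not state how the authors compute their error.

```python
import math
from typing import List


def circular_error_fraction(predicted: float, observed: float) -> float:
    """Distance between two cell cycle positions as a fraction of the cycle.

    Positions are angles in radians. The distance is the shorter arc
    between them, so the result always lies in [0, 0.5].
    """
    two_pi = 2 * math.pi
    diff = abs(predicted - observed) % two_pi
    arc = min(diff, two_pi - diff)
    return arc / two_pi


def mean_error(predicted: List[float], observed: List[float]) -> float:
    """Average circular error over paired lists of positions."""
    errors = [circular_error_fraction(p, o) for p, o in zip(predicted, observed)]
    return sum(errors) / len(errors)
```

Because the cycle wraps around, positions just before and just after the origin are close to each other; a plain absolute difference would wrongly report them as almost a full cycle apart.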
Subject(s)
Cell Cycle/genetics; Computational Biology/methods; High-Throughput Nucleotide Sequencing; Sequence Analysis, RNA; Single-Cell Analysis/methods; Cell Line; Gene Expression Profiling; Genes, Reporter; High-Throughput Nucleotide Sequencing/methods; Humans; Induced Pluripotent Stem Cells/metabolism; Sequence Analysis, RNA/methods
ABSTRACT
AIM: Biogeographical regions (realms) reflect patterns of co-distributed species (biotas) across space. Their boundaries are set by dispersal barriers and difficulties of establishment in new locations. We extend new methods to assess these two contributions by quantifying the degree to which realms intergrade across geographical space and the contributions of individual species to the delineation of those realms. As our example, we focus on Wallace's Line, the most enigmatic partitioning of the world's faunas, where climate is thought to have little effect and the majority of dispersal barriers are short water gaps. LOCATION: Indo-Pacific. TIME PERIOD: Present day. MAJOR TAXA STUDIED: Birds and mammals. METHODS: Terrestrial bird and mammal assemblages were established in 1-degree map cells using range maps. Assemblage structure was modelled using latent Dirichlet allocation, a continuous clustering method that simultaneously establishes the likely partitioning of species into biotas and the contribution of biotas to each map cell. Phylogenetic trees were used to assess the contribution of deep historical processes. Spatial segregation between biotas was evaluated across time and space in comparison with numerous hard realm boundaries drawn by various workers. RESULTS: We demonstrate that the strong turnover between biotas coincides with the north-western extent of the region not connected to the mainland during the Pleistocene, although the Philippines contains mixed contributions. At deeper taxonomic levels, Sulawesi and the Philippines shift to primarily Asian affinities, resulting from transgressions of a few Asian-derived lineages across the line. The partitioning of biotas sometimes produces fragmented regions that reflect habitat. Differences in partitions between birds and mammals reflect differences in dispersal ability. 
MAIN CONCLUSIONS: Permanent water barriers have selected for a dispersive archipelago fauna, excluded by an incumbent continental fauna on the Sunda shelf. Deep history, such as plate movements, is relatively unimportant in setting boundaries. The analysis implies a temporally dynamic interaction between a species' intrinsic dispersal ability, physiographic barriers, and recent climate change in the genesis of Earth's biotas.
ABSTRACT
MOTIVATION: Quality control plays a major role in the analysis of ancient DNA (aDNA). One key step in this quality control is assessment of DNA damage: aDNA contains unique signatures of DNA damage that distinguish it from modern DNA, and so analyses of damage patterns can help confirm that DNA sequences obtained are from endogenous aDNA rather than from modern contamination. Predominant signatures of DNA damage include a high frequency of cytosine to thymine substitutions (C-to-T) at the ends of fragments, and elevated rates of purines (A & G) before the 5' strand-breaks. Existing QC procedures help assess damage by simply plotting, for each sample, the C-to-T mismatch rate along the read and the composition of bases before the 5' strand-breaks. Here we present a more flexible and comprehensive model-based approach to infer and visualize damage patterns in aDNA, implemented in an R package aRchaic. This approach is based on a 'grade of membership' model (also known as 'admixture' or 'topic' model) in which each sample has an estimated grade of membership in each of K damage profiles that are estimated from the data. RESULTS: We illustrate aRchaic on data from several aDNA studies and modern individuals from 1000 Genomes Project Consortium (2012). Here, aRchaic clearly distinguishes modern from ancient samples irrespective of DNA extraction, lab and sequencing protocols. Additionally, through an in-silico contamination experiment, we show that the aRchaic grades of membership reflect relative levels of exogenous modern contamination. Together, the outputs of aRchaic provide a concise visual summary of DNA damage patterns, as well as other processes generating mismatches in the data. AVAILABILITY AND IMPLEMENTATION: aRchaic is available for download from https://www.github.com/kkdey/aRchaic. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
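The conventional QC summary mentioned above, the C-to-T mismatch rate along the read, can be sketched as follows. This is a minimal Python illustration of that baseline plot's underlying numbers, not the aRchaic model and not aRchaic's R interface; the input format (gap-free reference/read string pairs oriented from the 5' end) and all names are assumptions.

```python
from typing import List, Tuple


def c_to_t_rate_by_position(pairs: List[Tuple[str, str]], n_pos: int) -> List[float]:
    """C-to-T mismatch rate at each of the first n_pos read positions.

    Each pair is (reference, read): two equal-length, gap-free strings
    oriented from the 5' end of the read. At position i the rate is
    (reads showing T where the reference has C) / (reads where the
    reference has C). Positions with no reference C get a rate of 0.0.
    """
    ref_c = [0] * n_pos
    c_to_t = [0] * n_pos
    for ref, read in pairs:
        for i in range(min(n_pos, len(ref), len(read))):
            if ref[i] == "C":
                ref_c[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [t / c if c else 0.0 for t, c in zip(c_to_t, ref_c)]


# Toy alignments (hypothetical): damage concentrated at the first base.
pairs = [
    ("CACGT", "TACGT"),
    ("CCAGT", "TCAGT"),
    ("CGCAT", "CGCAT"),
    ("ACCGT", "ACCGT"),
]
rates = c_to_t_rate_by_position(pairs, n_pos=3)
```

In the toy data the rate is elevated only at the first position, which is the fragment-end pattern the abstract describes for ancient DNA.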
Subject(s)
DNA Damage; Genome; Cytosine; DNA, Ancient; Humans; Sequence Analysis, DNA
ABSTRACT
Grade of membership models, also known as "admixture models", "topic models" or "Latent Dirichlet Allocation", are a generalization of cluster models that allow each sample to have membership in multiple clusters. These models are widely used in population genetics to model admixed individuals who have ancestry from multiple "populations", and in natural language processing to model documents having words from multiple "topics". Here we illustrate the potential for these models to cluster samples of RNA-seq gene expression data, measured on either bulk samples or single cells. We also provide methods to help interpret the clusters, by identifying genes that are distinctively expressed in each cluster. By applying these methods to several example RNA-seq applications we demonstrate their utility in identifying and summarizing structure and heterogeneity. Applied to data from the GTEx project on 53 human tissues, the approach highlights similarities among biologically-related tissues and identifies distinctively-expressed genes that recapitulate known biology. Applied to single-cell expression data from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and highlights genes involved in a variety of relevant processes, from germ cell development, through compaction and morula formation, to the formation of inner cell mass and trophoblast at the blastocyst stage. The methods are implemented in the Bioconductor package CountClust.
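The core idea, that a sample can belong partly to several clusters, can be sketched as a membership-weighted average of cluster expression profiles. The snippet below is an assumed, minimal Python illustration of that mixture only; it does not fit a model and is not the CountClust interface. All names and numbers are hypothetical.

```python
from typing import List


def expected_profile(memberships: List[float],
                     cluster_profiles: List[List[float]]) -> List[float]:
    """Expected relative expression of each gene in one sample.

    memberships[k] is the sample's grade of membership in cluster k
    (non-negative, summing to 1). cluster_profiles[k][g] is the relative
    expression of gene g in cluster k (each row sums to 1). The sample's
    profile is the membership-weighted average of the cluster profiles.
    """
    if abs(sum(memberships) - 1.0) > 1e-9:
        raise ValueError("memberships must sum to 1")
    n_genes = len(cluster_profiles[0])
    return [
        sum(q * profile[g] for q, profile in zip(memberships, cluster_profiles))
        for g in range(n_genes)
    ]


# Two hypothetical clusters over three genes.
profiles = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
pure = expected_profile([1.0, 0.0], profiles)   # ordinary cluster assignment
mixed = expected_profile([0.5, 0.5], profiles)  # partial membership in both
```

A conventional cluster model corresponds to the special case where one membership equals 1 and the rest are 0, as in `pure`.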
Subject(s)
Algorithms; Computational Biology/methods; Gene Expression Profiling/methods; RNA/genetics; Animals; Blastocyst/metabolism; Brain/metabolism; Cluster Analysis; Gene Expression Regulation, Developmental; Humans; Mice; RNA/metabolism; Reproducibility of Results; Sequence Analysis, RNA; Transcriptome/genetics
ABSTRACT
[This corrects the article DOI: 10.1371/journal.pgen.1006599.].
ABSTRACT
BACKGROUND: Sequence logo plots have become a standard graphical tool for visualizing sequence motifs in DNA, RNA or protein sequences. However, standard logo plots primarily highlight enrichment of symbols, and may fail to highlight interesting depletions. Current alternatives that try to highlight depletion often produce visually cluttered logos. RESULTS: We introduce a new sequence logo plot, the EDLogo plot, that highlights both enrichment and depletion, while minimizing visual clutter. We provide an easy-to-use and highly customizable R package Logolas to produce a range of logo plots, including EDLogo plots. This software also allows elements in the logo plot to be strings of characters, rather than a single character, extending the range of applications beyond the usual DNA, RNA or protein sequences. The software also includes new Empirical Bayes methods to stabilize estimates of enrichment and depletion, and thus better highlight the most significant patterns in data. We illustrate our methods and software on applications to transcription factor binding site motifs, protein sequence alignments and cancer mutation signature profiles. CONCLUSIONS: Our new EDLogo plots and flexible software implementation can help data analysts visualize both enrichment and depletion of characters (DNA sequence bases, amino acids, etc.) across a wide range of applications.
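One simple way to give each symbol a signed score, positive when enriched and negative when depleted, is a log ratio against a background frequency, centered so that typical symbols sit near zero. The sketch below illustrates that idea in Python; the median centering, the pseudocount and every name are assumptions, and this is not the Logolas API or necessarily the exact EDLogo scoring rule.

```python
import math
from statistics import median
from typing import Dict


def enrichment_depletion_scores(freqs: Dict[str, float],
                                background: Dict[str, float],
                                pseudocount: float = 1e-3) -> Dict[str, float]:
    """Signed score per symbol at one motif position.

    Each score is log2(observed / background), shifted by the median
    log ratio across symbols (an assumed centering choice). A small
    pseudocount keeps the ratio finite when a symbol is never observed.
    """
    log_ratios = {
        s: math.log2((freqs.get(s, 0.0) + pseudocount) / (background[s] + pseudocount))
        for s in background
    }
    center = median(log_ratios.values())
    return {s: r - center for s, r in log_ratios.items()}


# Hypothetical position where T is strongly depleted and nothing is enriched.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
position = {"A": 0.33, "C": 0.33, "G": 0.33, "T": 0.01}
scores = enrichment_depletion_scores(position, uniform)
```

In this toy position the three common bases score zero and only T receives a large negative score, which is the kind of depletion signal that an enrichment-only logo would tend to hide.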
Subject(s)
Sequence Alignment; Amino Acid Sequence; Base Sequence; Bayes Theorem; DNA/chemistry; Humans; Software
ABSTRACT
We present a gene-level regulatory model, single-cell ATAC + RNA linking (SCARlink), which predicts single-cell gene expression and links enhancers to target genes using multi-ome (scRNA-seq and scATAC-seq co-assay) sequencing data. The approach uses regularized Poisson regression on tile-level accessibility data to jointly model all regulatory effects at a gene locus, avoiding the limitations of pairwise gene-peak correlations and dependence on peak calling. SCARlink outperformed existing gene scoring methods for imputing gene expression from chromatin accessibility across high-coverage multi-ome datasets while giving comparable to improved performance on low-coverage datasets. Shapley value analysis on trained models identified cell-type-specific gene enhancers that are validated by promoter capture Hi-C and are 11× to 15× and 5× to 12× enriched in fine-mapped eQTLs and fine-mapped genome-wide association study (GWAS) variants, respectively. We further show that SCARlink-predicted and observed gene expression vectors provide a robust way to compute a chromatin potential vector field to enable developmental trajectory analysis.
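For readers unfamiliar with regularized Poisson regression, the sketch below fits expression counts for one gene from tile-level accessibility using a log link and a ridge (L2) penalty, optimized by plain gradient descent. It is a minimal Python illustration under stated assumptions: the penalty form, the optimizer, the absence of any coefficient constraints, and the toy data are all choices made here, and nothing in it reproduces the SCARlink implementation.

```python
import math
from typing import List


def penalized_loss(w: List[float], X: List[List[float]], y: List[float],
                   l2: float) -> float:
    """Mean Poisson negative log-likelihood (without the log(y!) term)
    plus a ridge penalty 0.5 * l2 * ||w||^2."""
    nll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(wj * xj for wj, xj in zip(w, xi))
        nll += math.exp(eta) - yi * eta
    return nll / len(X) + 0.5 * l2 * sum(wj * wj for wj in w)


def fit_poisson_ridge(X: List[List[float]], y: List[float], l2: float = 0.1,
                      lr: float = 0.05, n_iter: int = 500) -> List[float]:
    """Gradient descent on penalized_loss, starting from all-zero weights."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        grad = [l2 * wj for wj in w]  # gradient of the ridge penalty
        for xi, yi in zip(X, y):
            mu = math.exp(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(p):
                grad[j] += (mu - yi) * xi[j] / n  # gradient of the mean NLL
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy data (hypothetical): 5 cells x 3 tiles of accessibility, and counts.
X = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
y = [3, 1, 4, 1, 5]
w_fit = fit_poisson_ridge(X, y)
predicted = [math.exp(sum(wj * xj for wj, xj in zip(w_fit, xi))) for xi in X]
```

Fitting all tiles at a locus in one model, as here, is what lets the coefficients be interpreted jointly rather than as separate pairwise correlations.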
Subject(s)
Chromatin; Genome-Wide Association Study; Chromatin/genetics; Regulatory Sequences, Nucleic Acid; Gene Expression Regulation; Promoter Regions, Genetic/genetics; RNA; Single-Cell Analysis/methods
ABSTRACT
Prioritizing disease-critical cell types by integrating genome-wide association studies (GWAS) with functional data is a fundamental goal. Single-cell chromatin accessibility (scATAC-seq) and gene expression (scRNA-seq) have characterized cell types at high resolution, and studies integrating GWAS with scRNA-seq have shown promise, but studies integrating GWAS with scATAC-seq have been limited. Here, we identify disease-critical fetal and adult brain cell types by integrating GWAS summary statistics from 28 brain-related diseases/traits (average N = 298 K) with 3.2 million scATAC-seq and scRNA-seq profiles from 83 cell types. We identified disease-critical fetal (respectively adult) brain cell types for 22 (respectively 23) of 28 traits using scATAC-seq, and for 8 (respectively 17) of 28 traits using scRNA-seq. Significant scATAC-seq enrichments included fetal photoreceptor cells for major depressive disorder, fetal ganglion cells for BMI, fetal astrocytes for ADHD, and adult VGLUT2 excitatory neurons for schizophrenia. Our findings improve our understanding of brain-related diseases/traits and inform future analyses.
Subject(s)
Chromatin Immunoprecipitation Sequencing; Depressive Disorder, Major; Humans; RNA-Seq; Genome-Wide Association Study; Chromatin/genetics; Brain; Single-Cell Analysis
ABSTRACT
Functional enhancer annotation is critical for understanding tissue-specific transcriptional regulation and prioritizing disease-associated non-coding variants. However, unbiased enhancer discovery in disease-relevant contexts remains challenging. To identify enhancers pertinent to diabetes, we conducted a CRISPR interference (CRISPRi) screen in the human pluripotent stem cell (hPSC) pancreatic differentiation system. Among the enhancers identified, we focused on an enhancer we named ONECUT1e-664kb, ~664 kb from the ONECUT1 promoter. Previous studies have linked ONECUT1 coding mutations to pancreatic hypoplasia and neonatal diabetes. We found that homozygous deletion of ONECUT1e-664kb in hPSCs leads to a near-complete loss of ONECUT1 expression and impaired pancreatic differentiation. ONECUT1e-664kb contains a type 2 diabetes-associated variant (rs528350911) disrupting a GATA motif. Introducing the risk variant into hPSCs reduced binding of key pancreatic transcription factors (GATA4, GATA6, and FOXA2), supporting its causal role in diabetes. This work highlights the utility of unbiased enhancer discovery in disease-relevant settings for understanding monogenic and complex disease.
Subject(s)
Cell Differentiation; Enhancer Elements, Genetic; Pancreas; Humans; Enhancer Elements, Genetic/genetics; Cell Differentiation/genetics; Pancreas/metabolism; Pancreas/pathology; Diabetes Mellitus, Type 2/genetics; Diabetes Mellitus, Type 2/metabolism; Diabetes Mellitus, Type 2/pathology; Clustered Regularly Interspaced Short Palindromic Repeats/genetics; Pluripotent Stem Cells/metabolism; CRISPR-Cas Systems/genetics; GATA6 Transcription Factor/metabolism; GATA6 Transcription Factor/genetics
ABSTRACT
Functional enhancer annotation is a valuable first step for understanding tissue-specific transcriptional regulation and prioritizing disease-associated non-coding variants for investigation. However, unbiased enhancer discovery in physiologically relevant contexts remains a major challenge. To discover regulatory elements pertinent to diabetes, we conducted a CRISPR interference screen in the human pluripotent stem cell (hPSC) pancreatic differentiation system. Among the enhancers uncovered, we focused on a long-range enhancer ~664 kb from the ONECUT1 promoter, since coding mutations in ONECUT1 cause pancreatic hypoplasia and neonatal diabetes. Homozygous enhancer deletion in hPSCs was associated with a near-complete loss of ONECUT1 gene expression and compromised pancreatic differentiation. This enhancer contains a confidently fine-mapped type 2 diabetes-associated variant (rs528350911) which disrupts a GATA motif. Introduction of the risk variant into hPSCs revealed substantially reduced binding of key pancreatic transcription factors (GATA4, GATA6 and FOXA2) on the edited allele, accompanied by a slight reduction of ONECUT1 transcription, supporting a causal role for this risk variant in metabolic disease. This work expands our knowledge about transcriptional regulation in pancreatic development through the characterization of a long-range enhancer and highlights the utility of enhancer discovery in disease-relevant settings for understanding monogenic and complex disease.
ABSTRACT
Translating genome-wide association study (GWAS) loci into causal variants and genes requires accurate cell-type-specific enhancer-gene maps from disease-relevant tissues. Building enhancer-gene maps is essential but challenging with current experimental methods in primary human tissues. Here we developed a nonparametric statistical method, SCENT (single-cell enhancer target gene mapping), that models association between enhancer chromatin accessibility and gene expression in single-cell or nucleus multimodal RNA sequencing and ATAC sequencing data. We applied SCENT to 9 multimodal datasets including >120,000 single cells or nuclei and created 23 cell-type-specific enhancer-gene maps. These maps were highly enriched for causal variants in expression quantitative trait loci and GWAS for 1,143 diseases and traits. We identified likely causal genes for both common and rare diseases and linked somatic mutation hotspots to target genes. We demonstrate that application of SCENT to multimodal data from disease-relevant human tissue enables the scalable construction of accurate cell-type-specific enhancer-gene maps, essential for defining noncoding variant function.
Subject(s)
Genome-Wide Association Study; Regulatory Sequences, Nucleic Acid; Humans; Alleles; Genome-Wide Association Study/methods; Chromosome Mapping; Phenotype; Chromatin/genetics; Polymorphism, Single Nucleotide; Genetic Predisposition to Disease/genetics
ABSTRACT
E3 ligases regulate key processes, but many of their roles remain unknown. Using Perturb-seq, we interrogated the function of 1,130 E3 ligases, partners and substrates in the inflammatory response in primary dendritic cells (DCs). Dozens impacted the balance of DC1, DC2, migratory DC and macrophage states and a gradient of DC maturation. Family members grouped into co-functional modules that were enriched for physical interactions and impacted specific programs through substrate transcription factors. E3s and their adaptors co-regulated the same processes, but partnered with different substrate recognition adaptors to impact distinct aspects of the DC life cycle. Genetic interactions were more prevalent within than between modules, and a deep learning model, comβVAE, predicts the outcome of new combinations by leveraging modularity. The E3 regulatory network was associated with heritable variation and aberrant gene expression in immune cells in human inflammatory diseases. Our study provides a general approach to dissect gene function.
ABSTRACT
Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease [1-6]. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium [7]. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.
ABSTRACT
Several biobanks, including UK Biobank (UKBB), are generating large-scale sequencing data. An existing method, SAIGE-GENE, performs well when testing variants with minor allele frequency (MAF) ≤ 1%, but inflation is observed in variance component set-based tests when restricting to variants with MAF ≤ 0.1% or 0.01%. Here, we propose SAIGE-GENE+ with greatly improved type I error control and computational efficiency to facilitate rare variant tests in large-scale data. We further show that incorporating multiple MAF cutoffs and functional annotations can improve power and thus uncover new gene-phenotype associations. In the analysis of UKBB whole exome sequencing data for 30 quantitative and 141 binary traits, SAIGE-GENE+ identified 551 gene-phenotype associations.
Subject(s)
Genome-Wide Association Study; Gene Frequency/genetics; Genome-Wide Association Study/methods; Phenotype; Exome Sequencing
ABSTRACT
We assess contributions to autoimmune disease of genes whose regulation is driven by enhancer regions (enhancer-related) and genes that regulate other genes in trans (candidate master-regulator). We link these genes to SNPs using several SNP-to-gene (S2G) strategies and apply heritability analyses to draw three conclusions about 11 autoimmune/blood-related diseases/traits. First, several characterizations of enhancer-related genes using functional genomics data are informative for autoimmune disease heritability after conditioning on a broad set of regulatory annotations. Second, candidate master-regulator genes defined using trans-eQTL in blood are also conditionally informative for autoimmune disease heritability. Third, integrating enhancer-related and master-regulator gene sets with protein-protein interaction (PPI) network information magnified their disease signal. The resulting PPI-enhancer gene score produced >2-fold stronger heritability signal and >2-fold stronger enrichment for drug targets, compared with the recently proposed enhancer domain score. In each case, functionally informed S2G strategies produced 4.1- to 13-fold stronger disease signals than conventional window-based strategies.
ABSTRACT
Single-cell RNA sequencing (scRNA-seq) provides unique insights into the pathology and cellular origin of disease. We introduce single-cell disease relevance score (scDRS), an approach that links scRNA-seq with polygenic disease risk at single-cell resolution, independent of annotated cell types. scDRS identifies cells exhibiting excess expression across disease-associated genes implicated by genome-wide association studies (GWASs). We applied scDRS to 74 diseases/traits and 1.3 million single-cell gene-expression profiles across 31 tissues/organs. Cell-type-level results broadly recapitulated known cell-type-disease associations. Individual-cell-level results identified subpopulations of disease-associated cells not captured by existing cell-type labels, including T cell subpopulations associated with inflammatory bowel disease, partially characterized by their effector-like states; neuron subpopulations associated with schizophrenia, partially characterized by their spatial locations; and hepatocyte subpopulations associated with triglyceride levels, partially characterized by their higher ploidy levels. Genes whose expression was correlated with the scDRS score across cells (reflecting coexpression with GWAS disease-associated genes) were strongly enriched for gold-standard drug target and Mendelian disease genes.
Subject(s)
Genome-Wide Association Study; Single-Cell Analysis; Gene Expression Profiling/methods; Multifactorial Inheritance/genetics; RNA-Seq; Single-Cell Analysis/methods; Triglycerides
ABSTRACT
Genome-wide association studies provide a powerful means of identifying loci and genes contributing to disease, but in many cases, the related cell types/states through which genes confer disease risk remain unknown. Deciphering such relationships is important for identifying pathogenic processes and developing therapeutics. In the present study, we introduce sc-linker, a framework for integrating single-cell RNA-sequencing, epigenomic SNP-to-gene maps and genome-wide association study summary statistics to infer the underlying cell types and processes by which genetic variants influence disease. The inferred disease enrichments recapitulated known biology and highlighted notable cell-disease relationships, including γ-aminobutyric acid-ergic neurons in major depressive disorder, a disease-dependent M-cell program in ulcerative colitis and a disease-specific complement cascade process in multiple sclerosis. In autoimmune disease, both healthy and disease-dependent immune cell-type programs were associated, whereas only disease-dependent epithelial cell programs were prominent, suggesting a role in disease response rather than initiation. Our framework provides a powerful approach for identifying the cell types and cellular processes by which genetic variants influence disease.
Subject(s)
Depressive Disorder, Major; Genome-Wide Association Study; Depressive Disorder, Major/genetics; Genetic Predisposition to Disease; Human Genetics; Humans; Polymorphism, Single Nucleotide/genetics; RNA; gamma-Aminobutyric Acid
ABSTRACT
Disease-associated single-nucleotide polymorphisms (SNPs) generally do not implicate target genes, as most disease SNPs are regulatory. Many SNP-to-gene (S2G) linking strategies have been developed to link regulatory SNPs to the genes that they regulate in cis. Here, we developed a heritability-based framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk. Our optimal combined S2G strategy (cS2G) included seven constituent S2G strategies and achieved a precision of 0.75 and a recall of 0.33, more than doubling the recall of any individual strategy. We applied cS2G to fine-mapping results for 49 UK Biobank diseases/traits to predict 5,095 causal SNP-gene-disease triplets (with S2G-derived functional interpretation) with high confidence. We further applied cS2G to provide an empirical assessment of disease omnigenicity; we determined that the top 1% of genes explained roughly half of the SNP heritability linked to all genes and that gene-level architectures vary with variant allele frequency.
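The omnigenicity statement above ("the top 1% of genes explained roughly half of the SNP heritability linked to all genes") is a concentration measure that can be computed from per-gene heritability estimates. The sketch below shows the arithmetic in Python on made-up numbers; the function name, the rounding-up of the gene count and the toy values are assumptions, and the paper's actual estimation procedure is not reproduced here.

```python
import math
from typing import List


def top_fraction_share(gene_h2: List[float], top_fraction: float = 0.01) -> float:
    """Share of total gene-linked heritability carried by the top genes.

    gene_h2 holds one non-negative heritability estimate per gene. The
    top ceil(top_fraction * n_genes) genes (at least one) are summed and
    divided by the total over all genes.
    """
    total = sum(gene_h2)
    if total <= 0:
        raise ValueError("total heritability must be positive")
    n_top = max(1, math.ceil(top_fraction * len(gene_h2)))
    top = sorted(gene_h2, reverse=True)[:n_top]
    return sum(top) / total


# Toy example: 200 hypothetical genes, two of which carry large effects.
gene_h2 = [50.0, 50.0] + [0.5] * 198
share = top_fraction_share(gene_h2, top_fraction=0.01)
```

With these toy values the top 1% of genes (2 of 200) carry about half of the total, mirroring the scale of concentration described in the abstract.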