ABSTRACT
Genome-wide association studies (GWASs) have identified more than 200 prostate cancer (PrCa) risk regions, which provide potential insights into causal mechanisms. Multiple lines of evidence show that a significant proportion of PrCa risk can be explained by germline causal variants that dysregulate nearby target genes in prostate-relevant tissues, thus altering disease risk. The traditional approach to explore this hypothesis has been correlating GWAS variants with steady-state transcript levels, referred to as expression quantitative trait loci (eQTLs). In this work, we assess the utility of chromosome conformation capture (3C) coupled with immunoprecipitation (HiChIP) to identify target genes for PrCa GWAS risk loci. We find that interactome data confirm previously reported PrCa target genes identified through GWAS/eQTL overlap (e.g., MLPH). Interestingly, HiChIP identifies links between PrCa GWAS variants and genes well-known to play a role in prostate cancer biology (e.g., AR) that are not detected by eQTL-based methods. HiChIP predicted enhancer elements at the AR and NKX3-1 prostate cancer risk loci, and both were experimentally confirmed to regulate expression of the corresponding genes through CRISPR interference (CRISPRi) perturbation in LNCaP cells. Our results demonstrate that looping data harbor additional information beyond eQTLs and expand the number of PrCa GWAS loci that can be linked to candidate susceptibility genes.
Subjects
Chromatin Immunoprecipitation Sequencing, Genetic Predisposition to Disease, Genome-Wide Association Study, Histone Code/genetics, Prostatic Neoplasms/genetics, Tumor Cell Line, Human Chromosomes, Clustered Regularly Interspaced Short Palindromic Repeats, Genetic Techniques, Humans, Male, Quantitative Trait Loci
ABSTRACT
Despite strong transethnic genetic correlations reported in the literature for many complex traits, the non-transferability of polygenic risk scores across populations suggests the presence of population-specific components of genetic architecture. We propose an approach that models GWAS summary data for one trait in two populations to estimate genome-wide proportions of population-specific/shared causal SNPs. In simulations across various genetic architectures, we show that our approach yields approximately unbiased estimates with in-sample LD and a slight upward bias with out-of-sample LD. We analyze nine complex traits in individuals of East Asian and European ancestry, restricting to common SNPs (MAF > 5%), and find that most common causal SNPs are shared by both populations. Using the genome-wide estimates as priors in an empirical Bayes framework, we perform fine-mapping and observe that high-posterior SNPs (for both the population-specific and shared causal configurations) have highly correlated effects in East Asians and Europeans. In population-specific GWAS risk regions, we observe a 2.8× enrichment of shared high-posterior SNPs, suggesting that population-specific GWAS risk regions harbor shared causal SNPs that are undetected in the other GWASs due to differences in LD, allele frequencies, and/or sample size. Finally, we report enrichments of shared high-posterior SNPs in 53 tissue-specific functional categories and find evidence that SNP-heritability enrichments are driven largely by many low-effect common SNPs.
Subjects
Ethnicity/genetics, Genome-Wide Association Study, Multifactorial Inheritance/genetics, Single Nucleotide Polymorphism/genetics, Bayes Theorem, Europe/ethnology, East Asia/ethnology, Gene Frequency, Humans, Linkage Disequilibrium, Organ Specificity/genetics
ABSTRACT
Functional genomics data has the potential to increase GWAS power by identifying SNPs that have a higher prior probability of association. Here, we introduce a method that leverages polygenic functional enrichment to incorporate coding, conserved, regulatory, and LD-related genomic annotations into association analyses. We show via simulations with real genotypes that the method, functionally informed novel discovery of risk loci (FINDOR), correctly controls the false-positive rate at null loci and attains a 9%-38% increase in the number of independent associations detected at causal loci, depending on trait polygenicity and sample size. We applied FINDOR to 27 independent complex traits and diseases from the interim UK Biobank release (average N = 130K). Averaged across traits, we attained a 13% increase in genome-wide significant loci detected (including a 20% increase for disease traits) compared to unweighted raw p values that do not use functional data. We replicated the additional loci in independent UK Biobank and non-UK Biobank data, yielding a highly statistically significant replication slope (0.66-0.69) in each case. Finally, we applied FINDOR to the full UK Biobank release (average N = 416K), attaining smaller relative improvements (consistent with simulations) but larger absolute improvements, detecting an additional 583 GWAS loci. In conclusion, leveraging functional enrichment using our method robustly increases GWAS power.
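The abstract does not say how the functional priors enter the association test. One standard way to use per-SNP prior weights is weighted hypothesis testing: each p value is divided by a weight, and the weights are rescaled to average 1 so the overall false-positive budget is unchanged. The sketch below illustrates that general idea only; it is not FINDOR's actual implementation, and the function name, weights, and p values are all hypothetical.

```python
GENOME_WIDE_THRESHOLD = 5e-8  # conventional GWAS significance threshold


def reweight_p_values(p_values, weights):
    """Return p / w for each SNP, after rescaling the weights to average 1."""
    if len(p_values) != len(weights):
        raise ValueError("p_values and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    mean_weight = sum(weights) / len(weights)
    normalized = [w / mean_weight for w in weights]
    # Cap at 1.0 so each result is still a valid p value.
    return [min(1.0, p / w) for p, w in zip(p_values, normalized)]


# Hypothetical inputs: four SNPs with raw p values and annotation-based weights.
raw_p = [3e-8, 8e-8, 2e-7, 0.5]
prior_weights = [0.5, 2.0, 1.0, 0.5]

weighted_p = reweight_p_values(raw_p, prior_weights)
raw_hits = [p < GENOME_WIDE_THRESHOLD for p in raw_p]
weighted_hits = [p < GENOME_WIDE_THRESHOLD for p in weighted_p]

print(raw_hits)       # [True, False, False, False]
print(weighted_hits)  # [False, True, False, False]
```

In this toy input, the up-weighted second SNP crosses the threshold while the down-weighted first SNP no longer does. Weighting therefore redistributes power rather than adding it for free, which is why informative weights matter.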
Subjects
Genome-Wide Association Study, Multifactorial Inheritance/genetics, Single Nucleotide Polymorphism/genetics, Calibration, Genetic Databases, Datasets as Topic, False Positive Reactions, Humans, Probability, Reproducibility of Results, United Kingdom
ABSTRACT
Although recent studies provide evidence for a common genetic basis between complex traits and Mendelian disorders, a thorough quantification of their overlap in a phenotype-specific manner remains elusive. Here, we have quantified the overlap of genes identified through large-scale genome-wide association studies (GWASs) for 62 complex traits and diseases with genes containing mutations known to cause 20 broad categories of Mendelian disorders. We identified a significant enrichment of genes linked to phenotypically matched Mendelian disorders in GWAS gene sets; of the total 1,240 comparisons, a higher proportion of phenotypically matched or related pairs (n = 50 of 92 [54%]) than phenotypically unmatched pairs (n = 27 of 1,148 [2%]) demonstrated significant overlap, confirming a phenotype-specific enrichment pattern. Further, we observed elevated GWAS effect sizes near genes linked to phenotypically matched Mendelian disorders. Finally, we report examples of GWAS variants localized at the transcription start site or physically interacting with the promoters of genes linked to phenotypically matched Mendelian disorders. Our results are consistent with the hypothesis that genes that are disrupted in Mendelian disorders are dysregulated by non-coding variants in complex traits and demonstrate how leveraging findings from related Mendelian disorders and functional genomic datasets can prioritize genes that are putatively dysregulated by local and distal non-coding GWAS variants.
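The matched-versus-unmatched comparison reported above can be checked with a few lines of arithmetic. The sketch below recomputes the two proportions from the counts in the abstract and, purely as an illustration, contrasts them with a one-sided Fisher exact test. The abstract does not state which test the authors applied to this contrast, so the choice of test and the function name are assumptions.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    Tests whether the top-left cell is larger than expected by chance,
    using the hypergeometric distribution with all margins fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denominator = comb(total, row1)
    upper = min(row1, col1)
    return sum(
        comb(col1, k) * comb(total - col1, row1 - k) / denominator
        for k in range(a, upper + 1)
    )


# Counts taken from the abstract.
matched_sig, matched_total = 50, 92
unmatched_sig, unmatched_total = 27, 1148

print(f"{matched_sig / matched_total:.0%}")      # 54%
print(f"{unmatched_sig / unmatched_total:.0%}")  # 2%

a, b = matched_sig, matched_total - matched_sig
c, d = unmatched_sig, unmatched_total - unmatched_sig
odds_ratio = (a * d) / (b * c)
p_value = fisher_exact_greater(a, b, c, d)
print(f"odds ratio = {odds_ratio:.1f}, one-sided p = {p_value:.2e}")
```

The two totals sum to the 1,240 comparisons mentioned in the abstract (92 + 1,148), which is a quick consistency check on the reported numbers.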
Subjects
Multifactorial Inheritance/genetics, Single Nucleotide Polymorphism/genetics, Quantitative Trait Loci/genetics, Female, Genetic Predisposition to Disease/genetics, Genome-Wide Association Study/methods, Humans, Male, Phenotype, Genetic Promoter Regions/genetics, Transcription Initiation Site/physiology
ABSTRACT
Dupuytren's disease is a common inherited tissue-specific fibrotic disorder, characterized by progressive and irreversible fibroblastic proliferation affecting the palmar fascia of the hand. Although genome-wide association studies (GWASs) have identified 24 genomic regions associated with Dupuytren's risk, the biological mechanisms driving the signal at these regions remain elusive. We identify potential biological mechanisms for Dupuytren's disease by integrating the most recent, largest GWAS (3,871 cases and 4,686 controls) with eQTLs (47 tissue panels from five consortia, total n = 3,975) to perform a transcriptome-wide association study. We identify 43 tissue-specific gene associations with Dupuytren's risk, including one in a novel risk region. We also estimate the genome-wide genetic correlation between Dupuytren's disease and 45 complex traits and find significant genetic correlations between Dupuytren's disease and body mass index (BMI), type II diabetes, triglycerides, and high-density lipoprotein (HDL), suggesting a shared genetic etiology between these traits. We further examine local genetic correlation to identify 8 and 3 novel regions significantly correlated with BMI and HDL, respectively. Our results are consistent with previous epidemiological findings showing that lower BMI increases risk for Dupuytren's disease. These 12 novel risk regions provide new insight into the biological mechanisms of Dupuytren's disease and serve as a starting point for functional validation.
Subjects
Body Mass Index, Type 2 Diabetes Mellitus/genetics, Dupuytren Contracture/etiology, Genetic Markers, Genome-Wide Association Study, Single Nucleotide Polymorphism, Quantitative Trait Loci, Case-Control Studies, Human Chromosome Pair 17/genetics, Dupuytren Contracture/pathology, Humans, Risk Factors
ABSTRACT
Human inborn errors of immunity include rare disorders entailing functional and quantitative antibody deficiencies due to impaired B cells, collectively called the common variable immunodeficiency (CVID) phenotype. Patients with CVID face delayed diagnoses and treatments for 5 to 15 years after symptom onset because the disorders are rare (prevalence of ~1/25,000), and there is extensive heterogeneity in CVID phenotypes, ranging from infections to autoimmunity to inflammatory conditions, overlapping with other more common disorders. The prolonged diagnostic odyssey drives excessive system-wide costs before diagnosis. Because there is no single causal mechanism, there are no genetic tests to definitively diagnose CVID. Here, we present PheNet, a machine learning algorithm that identifies patients with CVID from their electronic health records (EHRs). PheNet learns phenotypic patterns from verified CVID cases and uses this knowledge to rank patients by likelihood of having CVID. PheNet could have diagnosed more than half of our patients with CVID 1 or more years earlier than they had been diagnosed. When applied to a large EHR dataset, followed by blinded chart review of the top 100 patients ranked by PheNet, we found that 74% were highly probable to have CVID. We externally validated PheNet using >6 million records from disparate medical systems in California and Tennessee. As artificial intelligence and machine learning make their way into health care, we show that algorithms such as PheNet can offer clinical benefits by expediting the diagnosis of rare diseases.
Subjects
Common Variable Immunodeficiency, Electronic Health Records, Humans, Common Variable Immunodeficiency/diagnosis, Machine Learning, Algorithms, Male, Female, Phenotype, Adult, Undiagnosed Diseases/diagnosis
ABSTRACT
Despite the profound impacts of scientific research, few scientists have received the necessary training to productively discuss the ethical and societal implications of their work. To address this critical gap, we, a group of predominantly human genetics trainees, developed a course on genetics, ethics, and society. We intend for this course to serve as a template for other institutions and scientific disciplines. Our curriculum positions human genetics within its historical and societal context and encourages students to evaluate how societal norms and structures impact the conduct of scientific research. We demonstrate the utility of this course via surveys of enrolled students and provide resources and strategies for others hoping to teach a similar course. We conclude by arguing that if we are to work toward rectifying the inequities and injustices produced by our field, we must first learn to view our own research as impacting and being impacted by society.
Subjects
Curriculum, Science, Humans, Science/education, Science/ethics, Biomedical Research, Genetics
ABSTRACT
BACKGROUND: Large medical centers in urban areas, like Los Angeles, care for a diverse patient population and offer the potential to study the interplay between genetic ancestry and social determinants of health. Here, we explore the implications of genetic ancestry within the University of California, Los Angeles (UCLA) ATLAS Community Health Initiative, an ancestrally diverse biobank of genomic data linked with de-identified electronic health records (EHRs) of UCLA Health patients (N = 36,736). METHODS: We quantify the extensive continental and subcontinental genetic diversity within the ATLAS data through principal component analysis, identity-by-descent, and genetic admixture. We assess the relationship between genetically inferred ancestry (GIA) and >1500 EHR-derived phenotypes (phecodes). Finally, we demonstrate the utility of genetic data linked with EHR to perform ancestry-specific and multi-ancestry genome- and phenome-wide scans across a broad set of disease phenotypes. RESULTS: We identify 5 continental-scale GIA clusters including European American (EA), African American (AA), Hispanic Latino American (HL), South Asian American (SAA) and East Asian American (EAA) individuals and 7 subcontinental GIA clusters within the EAA GIA corresponding to Chinese American, Vietnamese American, and Japanese American individuals. Although we broadly find that self-identified race/ethnicity (SIRE) is highly correlated with GIA, we still observe marked differences between the two, emphasizing that the populations defined by these two criteria are not analogous. We find a total of 259 significant associations between continental GIA and phecodes even after accounting for individuals' SIRE, demonstrating that for some phenotypes, GIA provides information not already captured by SIRE. GWAS identifies significant associations for liver disease in the 22q13.31 locus across the HL and EAA GIA groups (HL p-value = 2.32×10^-16, EAA p-value = 6.73×10^-11). A subsequent PheWAS at the top SNP reveals significant associations with neurologic and neoplastic phenotypes specifically within the HL GIA group. CONCLUSIONS: Overall, our results explore the interplay between SIRE and GIA within a disease context and underscore the utility of studying the genomes of diverse individuals through biobank-scale genotyping linked with EHR-based phenotyping.
Subjects
Electronic Health Records, Public Health, Asian People, Biological Specimen Banks, Genomics, Humans
ABSTRACT
Coronavirus disease 2019 (COVID-19) has exposed health care disparities in minority groups including Hispanics/Latinxs (HL). Studies of COVID-19 risk factors for HL have relied on county-level data. We investigated COVID-19 risk factors in HL using individual-level, electronic health records in a Los Angeles health system between March 9, 2020, and August 31, 2020. Of 9,287 HL tested for SARS-CoV-2, 562 were positive. HL constituted an increasing percentage of all COVID-19 positive individuals as disease severity escalated. Multiple risk factors identified in Non-Hispanic/Latinx whites (NHL-W), like renal disease, also conveyed risk in HL. Pre-existing nonrheumatic mitral valve disorder was a risk factor for HL hospitalization but not for NHL-W COVID-19 or HL influenza hospitalization, suggesting it may be a specific HL COVID-19 risk. Admission laboratory values also suggested that HL presented with a greater inflammatory response. COVID-19 risk factors for HL can help guide equitable government policies and identify at-risk populations.
ABSTRACT
Single-cell RNA-sequencing (scRNA-seq) is a set of technologies used to profile gene expression at the level of individual cells. Although the throughput of scRNA-seq experiments is steadily growing in terms of the number of cells, large datasets are not yet commonly generated owing to prohibitively high costs. Integrating multiple datasets into one can improve power in scRNA-seq experiments, and efficient integration is very important for downstream analyses such as identifying cell-type-specific eQTLs. State-of-the-art scRNA-seq integration methods are based on the mutual nearest neighbor paradigm and fail to both correct for batch effects and maintain the local structure of the datasets. In this paper, we propose a novel scRNA-seq dataset integration method called BATMAN (BATch integration via minimum-weight MAtchiNg). Across multiple simulations and real datasets, we show that our method significantly outperforms state-of-the-art tools with respect to existing metrics for batch effects by up to 80% while retaining cell-to-cell relationships.
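The abstract names minimum-weight matching as the core idea but gives no algorithmic detail. As a minimal sketch of that general technique only (not BATMAN's actual algorithm), the code below pairs each cell in one batch with a distinct cell in another batch so that the total squared distance is as small as possible. It uses brute force over all permutations, which is only workable for toy inputs. The function names and the two-dimensional "expression" coordinates are hypothetical.

```python
from itertools import permutations


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def min_weight_matching(batch_a, batch_b):
    """Brute-force minimum-weight perfect matching between two batches.

    Returns (matching, cost), where matching[i] is the index in batch_b
    paired with cell i of batch_a. Runtime grows factorially with the
    number of cells, so this is for illustration only.
    """
    if len(batch_a) != len(batch_b):
        raise ValueError("this sketch assumes batches of equal size")
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(batch_b))):
        cost = sum(
            squared_distance(batch_a[i], batch_b[j]) for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost


# Hypothetical toy data: the same three cell states seen in two batches,
# listed in a different order and with a small batch-specific shift.
batch_a = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
batch_b = [(5.5, 5.2), (9.8, 0.3), (0.2, -0.1)]

matching, cost = min_weight_matching(batch_a, batch_b)
print(matching)  # (2, 0, 1)
```

Unlike a nearest-neighbor search, a matching forces every cell to be paired exactly once, so no single cell in the second batch can serve as the anchor for many cells in the first. At realistic scale, a polynomial-time assignment solver would replace the brute-force loop.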
ABSTRACT
With the continuing coronavirus disease 2019 (COVID-19) pandemic coupled with phased reopening, it is critical to identify risk factors associated with susceptibility and severity of disease in a diverse population to help shape government policies, guide clinical decision making, and prioritize future COVID-19 research. In this retrospective case-control study, we used de-identified electronic health records (EHR) from the University of California Los Angeles (UCLA) Health System between March 9th, 2020 and June 14th, 2020 to identify risk factors for COVID-19 susceptibility (severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) PCR test positive), inpatient admission, and severe outcomes (treatment in an intensive care unit or intubation). Of the 26,602 individuals tested by PCR for SARS-CoV-2, 992 were COVID-19 positive (3.7% of those tested), 220 were admitted to the hospital (22% of COVID-19 positive), and 77 had a severe outcome (35% of inpatients). Consistent with previous studies, males and individuals older than 65 years had increased risk of inpatient admission. Notably, individuals self-identifying as Hispanic or Latino constituted an increasing percentage of COVID-19 patients as disease severity escalated, comprising 24% of those testing positive, but 40% of those with a severe outcome, a disparity that remained after correcting for medical comorbidities. Cardiovascular disease, hypertension, and renal disease were premorbid risk factors present before SARS-CoV-2 PCR testing associated with COVID-19 susceptibility. Less well-established risk factors for COVID-19 susceptibility included pre-existing dementia (odds ratio (OR) 5.2 [3.2-8.3], p = 2.6×10^-10), mental health conditions (depression OR 2.1 [1.6-2.8], p = 1.1×10^-6) and vitamin D deficiency (OR 1.8 [1.4-2.2], p = 5.7×10^-6). Renal diseases including end-stage renal disease and anemia due to chronic renal disease were the predominant premorbid risk factors for COVID-19 inpatient admission. Other less established risk factors for COVID-19 inpatient admission included previous renal transplant (OR 9.7 [2.8-39], p = 3.2×10^-4) and disorders of the immune system (OR 6.0 [2.3-16], p = 2.7×10^-4). Prior use of oral steroid medications was associated with decreased COVID-19 positive testing risk (OR 0.61 [0.45-0.81], p = 4.3×10^-4), but increased inpatient admission risk (OR 4.5 [2.3-8.9], p = 1.8×10^-5). We did not observe that prior use of angiotensin converting enzyme inhibitors or angiotensin receptor blockers increased the risk of testing positive for SARS-CoV-2, being admitted to the hospital, or having a severe outcome. This study involving direct EHR extraction identified known and less well-established demographics, and prior diagnoses and medications as risk factors for COVID-19 susceptibility and inpatient admission. Knowledge of these risk factors including marked ethnic disparities observed in disease severity should guide government policies, identify at-risk populations, inform clinical decision making, and prioritize future COVID-19 research.
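The reported percentages follow directly from the counts in the abstract, and an odds ratio with a bracketed interval has a simple unadjusted form. The sketch below first recomputes the testing cascade from the stated counts. It then shows how a crude odds ratio and a Wald 95% confidence interval are obtained from a 2×2 table. The 2×2 counts are hypothetical, and the abstract does not say how its odds ratios were estimated or adjusted, so the second part illustrates the textbook calculation only.

```python
from math import exp, log, sqrt

# Counts taken from the abstract.
tested, positive, admitted, severe = 26602, 992, 220, 77

print(f"{positive / tested:.1%}")    # 3.7%
print(f"{admitted / positive:.0%}")  # 22%
print(f"{severe / admitted:.0%}")    # 35%


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    z=1.96 corresponds to a 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    estimate = (a * d) / (b * c)
    standard_error = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(estimate) - z * standard_error)
    upper = exp(log(estimate) + z * standard_error)
    return estimate, lower, upper


# Hypothetical counts, chosen only to demonstrate the calculation.
estimate, lower, upper = odds_ratio_wald_ci(a=20, b=80, c=100, d=900)
print(f"OR {estimate:.2f} [{lower:.2f}-{upper:.2f}]")
```

An interval that excludes 1 corresponds to a nominally significant association at the matching level, which is how the bracketed ranges in the abstract can be read.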
ABSTRACT
Transcriptome-wide association studies using predicted expression have identified thousands of genes whose locally regulated expression is associated with complex traits and diseases. In this work, we show that linkage disequilibrium induces significant gene-trait associations at non-causal genes as a function of the expression quantitative trait loci weights used in expression prediction. We introduce a probabilistic framework that models correlation among transcriptome-wide association study signals to assign a probability for every gene in the risk region to explain the observed association signal. Importantly, our approach remains accurate when expression data for causal genes are not available in the causal tissue by leveraging expression prediction from other tissues. Our approach yields credible sets of genes containing the causal gene at a nominal confidence level (for example, 90%) that can be used to prioritize genes for functional assays. We illustrate our approach by using an integrative analysis of lipid traits, where our approach prioritizes genes with strong evidence for causality.
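The final step described above, turning per-gene probabilities into a credible set, can be sketched in a few lines. The sketch assumes each gene in a risk region already has a posterior probability of being causal. It ranks the genes and keeps adding them until the cumulative probability reaches the chosen level. How those probabilities are computed is the substance of the method and is not reproduced here; the gene names and values are hypothetical.

```python
def credible_set(posteriors, level=0.90):
    """Smallest set of genes whose posterior probabilities sum to >= level.

    posteriors: dict mapping gene name -> posterior probability of causality.
    Returns (genes_in_set, cumulative_probability).
    """
    if not 0 < level <= 1:
        raise ValueError("level must be in (0, 1]")
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for gene, probability in ranked:
        selected.append(gene)
        cumulative += probability
        if cumulative >= level:
            break
    return selected, cumulative


# Hypothetical posterior probabilities for five genes in one risk region.
region_posteriors = {
    "GENE_A": 0.55,
    "GENE_B": 0.25,
    "GENE_C": 0.12,
    "GENE_D": 0.05,
    "GENE_E": 0.03,
}

genes, mass = credible_set(region_posteriors, level=0.90)
print(genes)  # ['GENE_A', 'GENE_B', 'GENE_C']
```

A smaller credible set at the same level means the association signal is better resolved, so fewer genes need to go forward to functional assays.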