ABSTRACT
Rare coding variants that substantially affect function provide insights into the biology of a gene [1-3]. However, ascertaining the frequency of such variants requires large sample sizes [4-8]. Here we present a catalogue of human protein-coding variation, derived from exome sequencing of 983,578 individuals across diverse populations. In total, 23% of the Regeneron Genetics Center Million Exome (RGC-ME) data come from individuals of African, East Asian, Indigenous American, Middle Eastern and South Asian ancestry. The catalogue includes more than 10.4 million missense and 1.1 million predicted loss-of-function (pLOF) variants. We identify individuals with rare biallelic pLOF variants in 4,848 genes, 1,751 of which have not been previously reported. From precise quantitative estimates of selection against heterozygous loss of function (LOF), we identify 3,988 LOF-intolerant genes, including 86 that were previously assessed as tolerant and 1,153 that lack established disease annotation. We also define regions of missense depletion at high resolution. Notably, 1,482 genes have regions that are depleted of missense variants despite being tolerant of pLOF variants. Finally, we estimate that 3% of individuals have a clinically actionable genetic variant, and that 11,773 variants reported in ClinVar with unknown significance are likely to be deleterious cryptic splice sites. To facilitate variant interpretation and genetics-informed precision medicine, we make this resource of coding variation from the RGC-ME dataset publicly accessible through a variant allele frequency browser.
Subjects
Exome , Genetic Variation , Proteins , Humans , Alleles , Exome/genetics , Exome Sequencing , Gene Frequency , Genetic Variation/genetics , Heterozygote , Loss of Function Mutation/genetics , Missense Mutation/genetics , Open Reading Frames/genetics , Proteins/genetics , RNA Splice Sites/genetics , Precision Medicine
ABSTRACT
The Mexico City Prospective Study is a prospective cohort of more than 150,000 adults recruited two decades ago from the urban districts of Coyoacán and Iztapalapa in Mexico City [1]. Here we generated genotype and exome-sequencing data for all individuals and whole-genome sequencing data for 9,950 selected individuals. We describe high levels of relatedness and substantial heterogeneity in ancestry composition across individuals. Most sequenced individuals had admixed Indigenous American, European and African ancestry, with extensive admixture from Indigenous populations in central, southern and southeastern Mexico. Indigenous Mexican segments of the genome had lower levels of coding variation but an excess of homozygous loss-of-function variants compared with segments of African and European origin. We estimated ancestry-specific allele frequencies at 142 million genomic variants, with an effective sample size of 91,856 for Indigenous Mexican ancestry at exome variants, all available through a public browser. Using whole-genome sequencing, we developed an imputation reference panel that outperforms existing panels at common variants in individuals with high proportions of central, southern and southeastern Indigenous Mexican ancestry. Our work illustrates the value of genetic studies in diverse populations and provides foundational imputation and allele frequency resources for future genetic studies in Mexico and in the United States, where the Hispanic/Latino population is predominantly of Mexican descent.
Subjects
Exome Sequencing , Human Genome , Genotype , Hispanic or Latino , Adult , Humans , Africa/ethnology , Americas/ethnology , Europe/ethnology , Gene Frequency/genetics , Population Genetics , Human Genome/genetics , Genotyping Techniques , Hispanic or Latino/genetics , Homozygote , Loss of Function Mutation/genetics , Mexico , Prospective Studies
ABSTRACT
The complex network of specialized cells and molecules in the immune system has evolved to defend against pathogens, but inadvertent immune system attacks on "self" result in autoimmune disease. Both genetic regulation of immune cell levels and their relationships with autoimmunity are largely undetermined. Here, we report genetic contributions to quantitative levels of 95 cell types encompassing 272 immune traits, in a cohort of 1,629 individuals from four clustered Sardinian villages. We first estimated trait heritability, showing that it can be substantial, accounting for up to 87% of the variance (mean 41%). Next, by assessing ~8.2 million variants, we identified, and confirmed in an extended set of 2,870 individuals, 23 independent variants at 13 loci associated with at least one trait. Notably, variants at three loci (HLA, IL2RA, and SH2B3/ATXN2) overlap with known autoimmune disease associations. These results connect specific cellular phenotypes to specific genetic variants, helping to explicate their involvement in disease.
Subjects
Flow Cytometry/methods , Genetic Predisposition to Disease , Genome-Wide Association Study , Immune System Diseases/genetics , Single Nucleotide Polymorphism , Humans , Phenotype
ABSTRACT
Clonal haematopoiesis involves the expansion of certain blood cell lineages and has been associated with ageing and adverse health outcomes [1-5]. Here we use exome sequence data on 628,388 individuals to identify 40,208 carriers of clonal haematopoiesis of indeterminate potential (CHIP). Using genome-wide and exome-wide association analyses, we identify 24 loci (21 of which are novel) where germline genetic variation influences predisposition to CHIP, including missense variants in the lymphocytic antigen coding gene LY75, which are associated with reduced incidence of CHIP. We also identify novel rare variant associations with clonal haematopoiesis and telomere length. Analysis of 5,041 health traits from the UK Biobank (UKB) found relationships between CHIP and severe COVID-19 outcomes, cardiovascular disease, haematologic traits, malignancy, smoking, obesity, infection and all-cause mortality. Longitudinal and Mendelian randomization analyses revealed that CHIP is associated with solid cancers, including non-melanoma skin cancer and lung cancer, and that CHIP linked to DNMT3A is associated with the subsequent development of myeloid but not lymphoid leukaemias. Additionally, contrary to previous findings from the initial 50,000 UKB exomes [6], our results in the full sample do not support a role for IL-6 inhibition in reducing the risk of cardiovascular disease among CHIP carriers. Our findings demonstrate that CHIP represents a complex set of heterogeneous phenotypes with shared and unique germline genetic causes and varied clinical implications.
Subjects
COVID-19 , Cardiovascular Diseases , Humans , Clonal Hematopoiesis/genetics , Cardiovascular Diseases/epidemiology , Cardiovascular Diseases/genetics
ABSTRACT
Critical COVID-19 is caused by immune-mediated inflammatory lung injury. Host genetic variation influences the development of illness requiring critical care [1] or hospitalization [2-4] after infection with SARS-CoV-2. The GenOMICC (Genetics of Mortality in Critical Care) study enables the comparison of genomes from individuals who are critically ill with those of population controls to find underlying disease mechanisms. Here we use whole-genome sequencing in 7,491 critically ill individuals compared with 48,400 controls to discover and replicate 23 independent variants that significantly predispose to critical COVID-19. We identify 16 new independent associations, including variants within genes that are involved in interferon signalling (IL10RB and PLSCR1), leucocyte differentiation (BCL11A) and blood-type antigen secretor status (FUT2). Using transcriptome-wide association and colocalization to infer the effect of gene expression on disease severity, we find evidence that implicates multiple genes in critical disease, including reduced expression of a membrane flippase (ATP11A) and increased expression of a mucin (MUC1). Mendelian randomization provides evidence in support of causal roles for myeloid cell adhesion molecules (SELE, ICAM5 and CD209) and the coagulation factor F8, all of which are potentially druggable targets. Our results are broadly consistent with a multi-component model of COVID-19 pathophysiology, in which at least two distinct mechanisms can predispose to life-threatening disease: failure to control viral replication; or an enhanced tendency towards pulmonary inflammation and intravascular coagulation. We show that comparison between cases of critical illness and population controls is highly efficient for the detection of therapeutically relevant mechanisms of disease.
Subjects
COVID-19 , Critical Illness , Human Genome , Host-Pathogen Interactions , Whole Genome Sequencing , ATP-Binding Cassette Transporters , COVID-19/genetics , COVID-19/mortality , COVID-19/pathology , COVID-19/virology , Cell Adhesion Molecules , Critical Care , Critical Illness/mortality , E-Selectin , Factor VIII , Fucosyltransferases , Human Genome/genetics , Genome-Wide Association Study , Host-Pathogen Interactions/genetics , Humans , Interleukin-10 Receptor beta Subunit , C-Type Lectins , Mucin-1 , Nerve Tissue Proteins , Phospholipid Transfer Proteins , Cell Surface Receptors , Repressor Proteins , SARS-CoV-2/pathogenicity , Galactoside 2-alpha-L-fucosyltransferase
ABSTRACT
Gene-based burden tests are a popular and powerful approach for analysis of exome-wide association studies. These approaches combine sets of variants within a gene into a single burden score that is then tested for association. Typically, a range of burden scores are calculated and tested across a range of annotation classes and frequency bins. Correlation between these tests can complicate the multiple testing correction and hamper interpretation of the results. We introduce a method called the sparse burden association test (SBAT) that tests the joint set of burden scores under the assumption that causal burden scores act in the same effect direction. The method simultaneously assesses the significance of the model fit and selects the set of burden scores that best explain the association at the same time. Using simulated data, we show that the method is well calibrated and highlight scenarios where the test outperforms existing gene-based tests. We apply the method to 73 quantitative traits from the UK Biobank, showing that SBAT is a valuable additional gene-based test when combined with other existing approaches. This test is implemented in the REGENIE software.
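The idea of collapsing variants into burden scores and then fitting them jointly under a same-direction constraint can be illustrated with a small sketch. It assumes that a non-negative least squares fit is an acceptable stand-in for that constraint and solves it with plain projected gradient descent. The function names, the toy data and the omission of an intercept, covariates and any significance test are all simplifications made here; this is not the REGENIE implementation.

```python
def burden_scores(genotypes, masks):
    """Sum alternate-allele counts over the variants in each mask.

    genotypes: one list of per-variant allele counts (0/1/2) per individual.
    masks: list of variant-index lists, one per annotation/frequency class.
    """
    return [[sum(person[j] for j in mask) for mask in masks]
            for person in genotypes]


def nnls(X, y, steps=500):
    """Non-negative least squares via projected gradient descent.

    Minimises 0.5 * ||X beta - y||^2 subject to beta >= 0. The step size
    1 / trace(X'X) is a conservative bound that guarantees convergence.
    """
    p = len(X[0])
    beta = [0.0] * p
    trace = sum(v * v for row in X for v in row)
    if trace == 0:
        return beta
    lr = 1.0 / trace
    for _ in range(steps):
        resid = [sum(b * v for b, v in zip(beta, row)) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[k] for r, row in zip(resid, X)) for k in range(p)]
        # Gradient step, then projection back onto the non-negative orthant.
        beta = [max(0.0, b - lr * g) for b, g in zip(beta, grad)]
    return beta


# Toy data (hypothetical): 6 individuals, 3 variants, 2 masks.
genotypes = [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 2]]
masks = [[0, 1], [2]]
phenotype = [-1.0, 2.0, -1.0, 2.0, 4.0, -2.0]

scores = burden_scores(genotypes, masks)
print(nnls(scores, phenotype))
```

In this toy example the second burden score pulls the phenotype in the opposite direction, so the constraint sets its coefficient to zero and only the first score is selected. A test for effects in the negative direction could be obtained by refitting with the phenotype negated.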
Subjects
Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Least-Squares Analysis , Software , Genetic Models , Exome/genetics , Genetic Variation , Computer Simulation
ABSTRACT
A major goal in human genetics is to use natural variation to understand the phenotypic consequences of altering each protein-coding gene in the genome. Here we used exome sequencing [1] to explore protein-altering variants and their consequences in 454,787 participants in the UK Biobank study [2]. We identified 12 million coding variants, including around 1 million loss-of-function and around 1.8 million deleterious missense variants. When these were tested for association with 3,994 health-related traits, we found 564 genes with trait associations at P ≤ 2.18 × 10⁻¹¹. Rare variant associations were enriched in loci from genome-wide association studies (GWAS), but most (91%) were independent of common variant signals. We discovered several risk-increasing associations with traits related to liver disease, eye disease and cancer, among others, as well as risk-lowering associations for hypertension (SLC9A3R2), diabetes (MAP3K15, FAM234A) and asthma (SLC27A3). Six genes were associated with brain imaging phenotypes, including two involved in neural development (GBE1, PLD1). Of the signals available and powered for replication in an independent cohort, 81% were confirmed; furthermore, association signals were generally consistent across individuals of European, Asian and African ancestry. We illustrate the ability of exome sequencing to identify gene-trait associations, elucidate gene function and pinpoint effector genes that underlie GWAS signals at scale.
Subjects
Biological Specimen Banks , Genetic Databases , Exome Sequencing , Exome/genetics , Africa/ethnology , Asia/ethnology , Asthma/genetics , Diabetes Mellitus/genetics , Europe/ethnology , Eye Diseases/genetics , Female , Genetic Predisposition to Disease/genetics , Genetic Variation , Genome-Wide Association Study , Humans , Hypertension/genetics , Liver Diseases/genetics , Male , Mutation , Neoplasms/genetics , Heritable Quantitative Trait , United Kingdom
ABSTRACT
Genome-wide association studies have contributed extensively to the discovery of disease-associated common variants. However, the genetic contribution to complex traits is still largely difficult to interpret. We report a genome-wide association study of 2394 cases and 2393 controls for age-related macular degeneration (AMD) via whole-genome sequencing, with 46.9 million genetic variants. Our study reveals significant single-variant association signals at four loci and independent gene-based signals in CFH, C2, C3, and NRTN. Using data from the Exome Aggregation Consortium (ExAC) for a gene-based test, we demonstrate an enrichment of predicted rare loss-of-function variants in CFH, CFI, and an as-yet unreported gene in AMD, ORMDL2. Our method of using a large variant list without individual-level genotypes as an external reference provides a flexible and convenient approach to leverage the publicly available variant datasets to augment the search for rare variant associations, which can explain additional disease risk in AMD.
Subjects
Genome-Wide Association Study , Macular Degeneration , Humans , Genome-Wide Association Study/methods , Macular Degeneration/genetics , Genotype , Genetic Testing , Whole Genome Sequencing , Single Nucleotide Polymorphism/genetics , Genetic Predisposition to Disease , Complement Factor H/genetics
ABSTRACT
The UK Biobank is a prospective study of 502,543 individuals, combining extensive phenotypic and genotypic data with streamlined access for researchers around the world [1]. Here we describe the release of exome-sequence data for the first 49,960 study participants, revealing approximately 4 million coding variants (of which around 98.6% have a frequency of less than 1%). The data include 198,269 autosomal predicted loss-of-function (LOF) variants, a more than 14-fold increase compared to the imputed sequence. Nearly all genes (more than 97%) had at least one carrier with a LOF variant, and most genes (more than 69%) had at least ten carriers with a LOF variant. We illustrate the power of characterizing LOF variants in this population through association analyses across 1,730 phenotypes. In addition to replicating established associations, we found novel LOF variants with large effects on disease traits, including PIEZO1 on varicose veins, COL6A1 on corneal resistance, MEPE on bone density, and IQGAP2 and GMPR on blood cell traits. We further demonstrate the value of exome sequencing by surveying the prevalence of pathogenic variants of clinical importance, and show that 2% of this population has a medically actionable variant. Furthermore, we characterize the penetrance of cancer in carriers of pathogenic BRCA1 and BRCA2 variants. Exome sequences from the first 49,960 participants highlight the promise of genome sequencing in large population-based studies and are now accessible to the scientific community.
Subjects
Genetic Databases , Exome Sequencing , Exome/genetics , Loss of Function Mutation/genetics , Phenotype , Aged , Bone Density/genetics , Collagen Type VI/genetics , Demography , Female , BRCA1 Genes , BRCA2 Genes , Genotype , Humans , Ion Channels/genetics , Male , Middle Aged , Neoplasms/genetics , Penetrance , Peptide Fragments/genetics , United Kingdom , Varicose Veins/genetics , ras GTPase-Activating Proteins/genetics
ABSTRACT
Persistent opioid use after surgery is a common morbidity outcome associated with subsequent opioid use disorder, overdose, and death. While phenotypic associations have been described, genetic associations remain unidentified. Here, we conducted the largest genetic study of persistent opioid use after surgery, comprising ~40,000 non-Hispanic, European-ancestry Michigan Genomics Initiative participants (3198 cases and 36,321 surgically exposed controls). Our study primarily focused on the reproducibility and reliability of 72 genetic studies of opioid use disorder phenotypes. Nominal associations (p < 0.05) occurred at 12 of 80 unique (r² < 0.8) signals from these studies. Six occurred in OPRM1 (most significant: rs79704991-T, OR = 1.17, p = 8.7 × 10⁻⁵), with two surviving multiple testing correction. Other associations were rs640561-LRRIQ3 (p = 0.015), rs4680-COMT (p = 0.016), rs9478495 (p = 0.017, intergenic), rs10886472-GRK5 (p = 0.028), rs9291211-SLC30A9/BEND4 (p = 0.043), and rs112068658-KCNN1 (p = 0.048). Two highly referenced genes, OPRD1 and DRD2/ANKK1, had no signals in MGI. Associations at previously identified OPRM1 variants suggest common biology between persistent opioid use and opioid use disorder, further demonstrating connections between opioid dependence and addiction phenotypes. Lack of significant associations at other variants challenges previous studies' reliability.
ABSTRACT
Genotype imputation is an integral tool in genome-wide association studies, in which it facilitates meta-analysis, increases power, and enables fine-mapping. With the increasing availability of whole-genome-sequence datasets, investigators have access to a multitude of reference-panel choices for genotype imputation. In principle, combining all sequenced whole genomes into a single large panel would provide the best imputation performance, but this is often cumbersome or impossible due to privacy restrictions. Here, we describe meta-imputation, a method that allows imputation results generated using different reference panels to be combined into a consensus imputed dataset. Our meta-imputation method requires small changes to the output of existing imputation tools to produce necessary inputs, which are then combined using dynamically estimated weights that are tailored to each individual and genome segment. In the scenarios we examined, the method consistently outperforms imputation using a single reference panel and achieves accuracy comparable to imputation using a combined reference panel.
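The combination step described above, merging per-panel imputed dosages with weights that vary by individual and genome segment, can be sketched as a weighted average. The sketch treats the weights as given inputs, because the abstract does not spell out how they are estimated. The function names, the half-open segment intervals and the toy numbers are assumptions made for illustration, not the published tool's interface.

```python
def combine_dosages(panel_dosages, weights):
    """Weighted average of one marker's imputed dosages across panels."""
    if len(panel_dosages) != len(weights):
        raise ValueError("need one weight per reference panel")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * d for w, d in zip(weights, panel_dosages)) / total


def meta_impute(markers, segment_weights):
    """Combine dosages for one individual, segment by segment.

    markers: list of (position, [dosage from each panel]).
    segment_weights: list of (start, end, [weight for each panel]) with
        half-open intervals [start, end) covering the markers.
    """
    combined = []
    for pos, dosages in markers:
        for start, end, weights in segment_weights:
            if start <= pos < end:
                combined.append((pos, combine_dosages(dosages, weights)))
                break
        else:
            raise ValueError(f"no weights cover position {pos}")
    return combined


# Toy input (hypothetical): two panels, two segments for one individual.
markers = [(100, [0.2, 0.8]), (250, [1.0, 2.0])]
segments = [(0, 200, [3.0, 1.0]), (200, 400, [0.0, 1.0])]
print(meta_impute(markers, segments))
```

In the first segment the first panel is trusted three times as much as the second, while in the second segment only the second panel contributes, which is how a consensus dataset can lean on whichever panel fits a given stretch of a given genome best.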
Subjects
Genome-Wide Association Study , Single Nucleotide Polymorphism , Genome , Genome-Wide Association Study/methods , Genotype , Humans , Single Nucleotide Polymorphism/genetics , Research Design
ABSTRACT
BACKGROUND: Exome sequencing in hundreds of thousands of persons may enable the identification of rare protein-coding genetic variants associated with protection from human diseases like liver cirrhosis, providing a strategy for the discovery of new therapeutic targets. METHODS: We performed a multistage exome sequencing and genetic association analysis to identify genes in which rare protein-coding variants were associated with liver phenotypes. We conducted in vitro experiments to further characterize associations. RESULTS: The multistage analysis involved 542,904 persons with available data on liver aminotransferase levels, 24,944 patients with various types of liver disease, and 490,636 controls without liver disease. We found that rare coding variants in APOB, ABCB4, SLC30A10, and TM6SF2 were associated with increased aminotransferase levels and an increased risk of liver disease. We also found that variants in CIDEB, which encodes a structural protein found in hepatic lipid droplets, had a protective effect. The burden of rare predicted loss-of-function variants plus missense variants in CIDEB (combined carrier frequency, 0.7%) was associated with decreased alanine aminotransferase levels (beta per allele, -1.24 U per liter; 95% confidence interval [CI], -1.66 to -0.83; P = 4.8×10⁻⁹) and with 33% lower odds of liver disease of any cause (odds ratio per allele, 0.67; 95% CI, 0.57 to 0.79; P = 9.9×10⁻⁷). Rare coding variants in CIDEB were associated with a decreased risk of liver disease across different underlying causes and different degrees of severity, including cirrhosis of any cause (odds ratio per allele, 0.50; 95% CI, 0.36 to 0.70). Among 3599 patients who had undergone bariatric surgery, rare coding variants in CIDEB were associated with a decreased nonalcoholic fatty liver disease activity score (beta per allele in score units, -0.98; 95% CI, -1.54 to -0.41 [scores range from 0 to 8, with higher scores indicating more severe disease]).
In human hepatoma cell lines challenged with oleate, CIDEB small interfering RNA knockdown prevented the buildup of large lipid droplets. CONCLUSIONS: Rare germline mutations in CIDEB conferred substantial protection from liver disease. (Funded by Regeneron Pharmaceuticals.).
Subjects
Apoptosis Regulatory Proteins , Germ-Line Mutation , Liver Diseases , Apoptosis Regulatory Proteins/genetics , Apoptosis Regulatory Proteins/metabolism , Genetic Predisposition to Disease/genetics , Genetic Predisposition to Disease/prevention & control , Humans , Liver/metabolism , Liver Diseases/genetics , Liver Diseases/metabolism , Liver Diseases/prevention & control , Transaminases/genetics , Exome Sequencing
ABSTRACT
Linkage analysis, a class of methods for detecting co-segregation of genomic segments and traits in families, was used to map disease-causing genes for decades before genotyping arrays and dense SNP genotyping enabled genome-wide association studies in population samples. Population samples often contain related individuals, but the segregation of alleles within families is rarely used because traditional linkage methods are computationally inefficient for larger datasets. Here, we describe Population Linkage, a novel application of Haseman-Elston regression as a method of moments estimator of variance components and their standard errors. We achieve additional computational efficiency by using modern methods for detection of IBD segments and variance component estimation, efficient preprocessing of input data, and minimizing redundant numerical calculations. We also refined variance component models to account for the biases in population-scale methods for IBD segment detection. We ran Population Linkage on four blood lipid traits in over 70,000 individuals from the HUNT and SardiNIA studies, successfully detecting 25 known genetic signals. One notable linkage signal that appeared in both was for low-density lipoprotein (LDL) cholesterol levels in the region near the gene APOE (LOD = 29.3, variance explained = 4.1%). This is the region where the missense variants rs7412 and rs429358, which together make up the ε2, ε3, and ε4 alleles, account for 2.4% and 0.8%, respectively, of the variation in circulating LDL cholesterol. Our results show the potential for linkage analysis and other large-scale applications of method of moments variance components estimation.
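A minimal sketch of the Haseman-Elston idea may help: for every pair of individuals, the product of their standardized phenotypes is regressed on the pair's IBD sharing at a locus, and the slope serves as a method of moments estimate of the locus variance component. The single-regressor model, the function names and the toy data are assumptions made here for illustration. The method described above also models additional variance components, estimates standard errors and corrects for IBD detection biases, none of which is attempted in this sketch.

```python
from itertools import combinations


def standardize(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]


def haseman_elston_slope(phenotypes, ibd):
    """OLS slope of pairwise phenotype cross-products on IBD sharing.

    phenotypes: one trait value per individual.
    ibd: symmetric matrix, ibd[i][j] = proportion of the locus that
         individuals i and j share identical by descent.
    """
    z = standardize(phenotypes)
    pairs = list(combinations(range(len(z)), 2))
    x = [ibd[i][j] for i, j in pairs]
    y = [z[i] * z[j] for i, j in pairs]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("IBD sharing does not vary across pairs")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy data (hypothetical): two sib-like pairs sharing the locus fully.
phenotypes = [1.0, 1.0, -1.0, -1.0]
ibd = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
print(haseman_elston_slope(phenotypes, ibd))
```

A positive slope means that pairs who share more of the locus have more similar phenotypes, which is the linkage signal. Because only pairwise products and a regression are needed, the approach avoids the likelihood computations that make traditional linkage methods slow on large samples.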
Subjects
Genome-Wide Association Study , Genetic Models , Humans , Phenotype , LDL Cholesterol/genetics , Genetic Linkage , Apolipoproteins E/genetics
ABSTRACT
Platelets play a key role in thrombosis and hemostasis. Platelet count (PLT) and mean platelet volume (MPV) are highly heritable quantitative traits, with hundreds of genetic signals previously identified, mostly in European ancestry populations. We here utilize whole genome sequencing (WGS) from NHLBI's Trans-Omics for Precision Medicine initiative (TOPMed) in a large multi-ethnic sample to further explore common and rare variation contributing to PLT (n = 61 200) and MPV (n = 23 485). We identified and replicated secondary signals at MPL (rs532784633) and PECAM1 (rs73345162), both more common in African ancestry populations. We also observed rare variation in Mendelian platelet-related disorder genes influencing variation in platelet traits in TOPMed cohorts (not enriched for blood disorders). For example, association of GP9 with lower PLT and higher MPV was partly driven by a pathogenic Bernard-Soulier syndrome variant (rs5030764, p.Asn61Ser), and the signals at TUBB1 and CD36 were partly driven by loss of function variants not annotated as pathogenic in ClinVar (rs199948010 and rs571975065). However, residual signal remained for these gene-based signals after adjusting for lead variants, suggesting that additional variants in Mendelian genes with impacts in general population cohorts remain to be identified. Gene-based signals were also identified at several genome-wide association study identified loci for genes not annotated for Mendelian platelet disorders (PTPRH, TET2, CHEK2), with somatic variation driving the result at TET2. These results highlight the value of WGS in populations of diverse genetic ancestry to identify novel regulatory and coding signals, even for well-studied traits like platelet traits.
Subjects
Genome-Wide Association Study , Precision Medicine , Blood Platelets , Humans , National Heart, Lung, and Blood Institute (U.S.) , Phenotype , Single Nucleotide Polymorphism , Precision Medicine/methods , United States
ABSTRACT
Whole-genome sequencing (WGS), a powerful tool for detecting novel coding and non-coding disease-causing variants, has largely been applied to clinical diagnosis of inherited disorders. Here we leveraged WGS data in up to 62,653 ethnically diverse participants from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program and assessed statistical association of variants with seven red blood cell (RBC) quantitative traits. We discovered 14 single variant-RBC trait associations at 12 genomic loci, which have not been reported previously. Several of the RBC trait-variant associations (RPN1, ELL2, MIDN, HBB, HBA1, PIEZO1, and G6PD) were replicated in independent GWAS datasets imputed to the TOPMed reference panel. Most of these discovered variants are rare/low frequency, and several are observed disproportionately among non-European Ancestry (African, Hispanic/Latino, or East Asian) populations. We identified a 3 bp indel p.Lys2169del (g.88717175_88717177TCT[4]) (common only in the Ashkenazi Jewish population) of PIEZO1, a gene responsible for the Mendelian red cell disorder hereditary xerocytosis (MIM: 194380), associated with higher mean corpuscular hemoglobin concentration (MCHC). In stepwise conditional analysis and in gene-based rare variant aggregated association analysis, we identified several of the variants in HBB, HBA1, TMPRSS6, and G6PD that represent the carrier state for known coding, promoter, or splice site loss-of-function variants that cause inherited RBC disorders. Finally, we applied base and nuclease editing to demonstrate that the sentinel variant rs112097551 (nearest gene RPN1) acts through a cis-regulatory element that exerts long-range control of the gene RUVBL1 which is essential for hematopoiesis. 
Together, these results demonstrate the utility of WGS in ethnically diverse population-based samples and gene editing for expanding knowledge of the genetic architecture of quantitative hematologic traits and suggest a continuum between complex trait and Mendelian red cell disorders.
Subjects
Erythrocytes/metabolism , Erythrocytes/pathology , Genome-Wide Association Study , National Heart, Lung, and Blood Institute (U.S.)/organization & administration , Phenotype , Adult , Aged , Human Chromosome Pair 16/genetics , Datasets as Topic , Female , Gene Editing , Genetic Variation/genetics , HEK293 Cells , Humans , Male , Middle Aged , Quality Control , Reproducibility of Results , United States
ABSTRACT
Many common and rare variants associated with hematologic traits have been discovered through imputation on large-scale reference panels. However, the majority of genome-wide association studies (GWASs) have been conducted in Europeans, and determining causal variants has proved challenging. We performed a GWAS of total leukocyte, neutrophil, lymphocyte, monocyte, eosinophil, and basophil counts generated from 109,563,748 variants in the autosomes and the X chromosome in the Trans-Omics for Precision Medicine (TOPMed) program, which included data from 61,802 individuals of diverse ancestry. We discovered and replicated 7 leukocyte trait associations, including (1) the association between a chromosome X, pseudo-autosomal region (PAR), noncoding variant located between cytokine receptor genes (CSF2RA and CRLF2) and lower eosinophil count; and (2) associations between single variants found predominantly among African Americans at the S1PR3 (9q22.1) and HBB (11p15.4) loci and monocyte and lymphocyte counts, respectively. We further provide evidence indicating that the newly discovered eosinophil-lowering chromosome X PAR variant might be associated with reduced susceptibility to common allergic diseases such as atopic dermatitis and asthma. Additionally, we found a burden of very rare FLT3 (13q12.2) variants associated with monocyte counts. Together, these results emphasize the utility of whole-genome sequencing in diverse samples in identifying associations missed by European-ancestry-driven GWASs.
Subjects
Asthma/epidemiology , Biomarkers/metabolism , Atopic Dermatitis/epidemiology , Leukocytes/pathology , Single Nucleotide Polymorphism , Chronic Obstructive Pulmonary Disease/epidemiology , Quantitative Trait Loci , Asthma/genetics , Asthma/metabolism , Asthma/pathology , Atopic Dermatitis/genetics , Atopic Dermatitis/metabolism , Atopic Dermatitis/pathology , Genetic Predisposition to Disease , Human Genome , Genome-Wide Association Study , Humans , National Heart, Lung, and Blood Institute (U.S.) , Phenotype , Prognosis , Proteome/analysis , Proteome/metabolism , Chronic Obstructive Pulmonary Disease/genetics , Chronic Obstructive Pulmonary Disease/metabolism , Chronic Obstructive Pulmonary Disease/pathology , United Kingdom/epidemiology , United States/epidemiology , Whole Genome Sequencing
ABSTRACT
Detecting and estimating DNA sample contamination are important steps to ensure high-quality genotype calls and reliable downstream analysis. Existing methods rely on population allele frequency information for accurate estimation of contamination rates. Correctly specifying population allele frequencies for each individual at an early stage of sequence analysis is impractical or even impossible for large-scale sequencing centers that simultaneously process samples from multiple studies across diverse populations. On the other hand, incorrectly specified allele frequencies may result in substantial bias in estimated contamination rates. For example, we observed that existing methods often fail to identify 10% contaminated samples at a typical 3% contamination exclusion threshold when genetic ancestry is misspecified. Such incomplete screening of contaminated samples substantially inflates the estimated rate of genotyping errors even in deeply sequenced genomes and exomes. We propose a robust statistical method that accurately estimates DNA contamination and is agnostic to genetic ancestry of the intended or contaminating sample. Our method integrates the estimation of genetic ancestry and DNA contamination in a unified likelihood framework by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. Our method can also be used for estimating genetic ancestries, similar to LASER or TRACE, but simultaneously accounting for potential contamination. We demonstrate that our method robustly estimates contamination rates and genetic ancestries across populations and contamination scenarios. We further demonstrate that, in the presence of contamination, genetic ancestry inference can be substantially biased with existing methods that ignore contamination, while our method corrects for such biases.
Subjects
DNA Contamination , DNA/genetics , Genotype , Genotyping Techniques/standards , Alleles , Exome/genetics , Gene Frequency/genetics , Population Genetics , Humans , Single Nucleotide Polymorphism/genetics , DNA Sequence Analysis
ABSTRACT
SUMMARY: Expression quantitative trait loci (eQTLs) characterize the associations between genetic variation and gene expression to provide insights into tissue-specific gene regulation. Interactive visualization of tissue-specific eQTLs or splice QTLs (sQTLs) can facilitate our understanding of functional variants relevant to disease-related traits. However, combining the multi-dimensional nature of eQTLs/sQTLs into a concise and informative visualization is challenging. Existing QTL visualization tools provide useful ways to summarize the unprecedented scale of transcriptomic data but are not necessarily tailored to answer questions about the functional interpretations of trait-associated variants or other variants of interest. We developed FIVEx, an interactive eQTL/sQTL browser with an intuitive interface tailored to the functional interpretation of associated variants. It features the ability to navigate seamlessly between different data views while providing relevant tissue- and locus-specific information to offer users a better understanding of population-scale multi-tissue transcriptomic profiles. Our implementation of the FIVEx browser on the EBI eQTL catalogue, encompassing 16 publicly available RNA-seq studies, provides important insights for understanding potential tissue-specific regulatory mechanisms underlying trait-associated signals. AVAILABILITY AND IMPLEMENTATION: A FIVEx instance visualizing EBI eQTL catalogue data can be found at https://fivex.sph.umich.edu. Its source code is open source under an MIT license at https://github.com/statgen/fivex. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.