ABSTRACT
Population isolates such as those in Finland benefit genetic research because deleterious alleles are often concentrated on a small number of low-frequency variants (0.1% ≤ minor allele frequency < 5%). These variants survived the founding bottleneck rather than being distributed over a large number of ultrarare variants. Although this effect is well established in Mendelian genetics, its value in common disease genetics is less explored¹,². FinnGen aims to study the genome and national health register data of 500,000 Finnish individuals. Given the relatively high median age of participants (63 years) and the substantial fraction of hospital-based recruitment, FinnGen is enriched for disease end points. Here we analyse data from 224,737 participants from FinnGen and study 15 diseases that have previously been investigated in large genome-wide association studies (GWASs). We also include meta-analyses of biobank data from Estonia and the United Kingdom. We identified 30 new associations, primarily low-frequency variants, enriched in the Finnish population. A GWAS of 1,932 diseases also identified 2,733 genome-wide significant associations (893 phenome-wide significant (PWS), P < 2.6 × 10⁻¹¹) at 2,496 (771 PWS) independent loci with 807 (247 PWS) end points. Among these, fine-mapping implicated 148 (73 PWS) coding variants associated with 83 (42 PWS) end points. Moreover, 91 (47 PWS) had an allele frequency of <5% in non-Finnish European individuals, of which 62 (32 PWS) were enriched by more than twofold in Finland. These findings demonstrate the power of bottlenecked populations to find entry points into the biology of common diseases through low-frequency, high impact variants.
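The phenome-wide significance threshold quoted in this abstract is consistent with dividing the conventional genome-wide threshold (5 × 10⁻⁸) by the 1,932 end points tested. The abstract does not state this derivation, so the sketch below treats it as an assumption; the names GENOME_WIDE_ALPHA, phenome_wide_threshold and is_phenome_wide_significant are illustrative only.

```python
# Sketch of a Bonferroni-style phenome-wide threshold.
# Assumption: the conventional genome-wide threshold (5e-8) is divided by
# the number of end points tested. The abstract does not spell this out.
GENOME_WIDE_ALPHA = 5e-8  # conventional GWAS significance threshold
N_ENDPOINTS = 1932        # number of disease end points given in the abstract


def phenome_wide_threshold(alpha: float, n_tests: int) -> float:
    """Divide a per-GWAS threshold by the number of GWASs that were run."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def is_phenome_wide_significant(p_value: float) -> bool:
    """Return True when p_value passes the stricter phenome-wide threshold."""
    return p_value < phenome_wide_threshold(GENOME_WIDE_ALPHA, N_ENDPOINTS)


threshold = phenome_wide_threshold(GENOME_WIDE_ALPHA, N_ENDPOINTS)
print(f"{threshold:.1e}")  # 2.6e-11
```

Under this assumption, an association with P = 1 × 10⁻⁹ would be genome-wide significant but not phenome-wide significant.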
Subject(s)
Disease , Gene Frequency , Phenotype , Humans , Middle Aged , Disease/genetics , Estonia , Finland , Gene Frequency/genetics , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study , Meta-Analysis as Topic , United Kingdom , White People/genetics
ABSTRACT
The eQTL Catalogue is an open database of uniformly processed human molecular quantitative trait loci (QTLs). We are continuously updating the resource to further increase its utility for interpreting genetic associations with complex traits. Over the past two years, we have increased the number of uniformly processed studies from 21 to 31 and added X chromosome QTLs for 19 compatible studies. We have also implemented Leafcutter to directly identify splice-junction usage QTLs in all RNA sequencing datasets. Finally, to improve the interpretability of transcript-level QTLs, we have developed static QTL coverage plots that visualise the association between the genotype and average RNA sequencing read coverage in the region for all 1.7 million fine-mapped associations. To illustrate the utility of these updates to the eQTL Catalogue, we performed colocalisation analysis between vitamin D levels in the UK Biobank and all molecular QTLs in the eQTL Catalogue. Although most GWAS loci colocalised with both eQTLs and transcript-level QTLs, we found that visual inspection could sometimes be used to distinguish primary splicing QTLs from those that appear to be secondary consequences of large-effect gene expression QTLs. While these visually confirmed primary splicing QTLs explain just 6/53 of the colocalising signals, they are significantly less pleiotropic than eQTLs and identify a prioritised causal gene in 4/6 cases.
Subject(s)
Multifactorial Inheritance , Quantitative Trait Loci , Humans , Quantitative Trait Loci/genetics , Genotype , Base Sequence , Genome-Wide Association Study , Polymorphism, Single Nucleotide
ABSTRACT
This corrects the article DOI: 10.1038/nature22403.
ABSTRACT
Technology utilizing human induced pluripotent stem cells (iPS cells) has enormous potential to provide improved cellular models of human disease. However, variable genetic and phenotypic characterization of many existing iPS cell lines limits their potential use for research and therapy. Here we describe the systematic generation, genotyping and phenotyping of 711 iPS cell lines derived from 301 healthy individuals by the Human Induced Pluripotent Stem Cells Initiative. Our study outlines the major sources of genetic and phenotypic variation in iPS cells and establishes their suitability as models of complex human traits and cancer. Through genome-wide profiling we find that 5-46% of the variation in different iPS cell phenotypes, including differentiation capacity and cellular morphology, arises from differences between individuals. Additionally, we assess the phenotypic consequences of genomic copy-number alterations that are repeatedly observed in iPS cells. Finally, we present a comprehensive map of common regulatory variants affecting the transcriptome of human pluripotent cells.
Subject(s)
Genetic Variation/genetics , Induced Pluripotent Stem Cells/metabolism , Cells, Cultured , Cellular Reprogramming/genetics , DNA Copy Number Variations/genetics , Gene Expression Regulation/genetics , Genotype , Humans , Organ Specificity , Phenotype , Quality Control , Quantitative Trait Loci/genetics , Transcriptome/genetics
ABSTRACT
The action of benzoic acid in the food and beverage industries is compromised by the ability of spoilage yeasts to cope with this food preservative. Benzoic acid occurs naturally in many plants and is an intermediate compound in the biosynthesis of many secondary metabolites. Understanding the mechanisms underlying the response and resistance to benzoic acid stress in the eukaryotic model yeast is thus crucial for designing more suitable strategies to deal with this toxic lipophilic weak acid. In this study, the Saccharomyces cerevisiae multidrug transporter Tpo1 was demonstrated to confer resistance to benzoic acid. TPO1 transcript levels were shown to be up-regulated in yeast cells suddenly exposed to this stress agent. This up-regulation is under the control of the Gcn4 and Stp1 transcription factors, involved in the response to amino acid availability, but not under the regulation of the multidrug resistance transcription factors Pdr1 and Pdr3, which have binding sites in the TPO1 promoter region. Benzoic acid stress was further shown to affect the intracellular pool of amino acids and polyamines. The observed decrease in the concentration of these nitrogenous compounds upon exposure to benzoic acid stress was not found to depend on Tpo1, although limitation of nitrogenous compounds in yeast cells was found to activate Tpo1 expression. Altogether, the results described in this study suggest that Tpo1 is one of the key players at the crossroads between benzoic acid stress response and tolerance and the control of the intracellular concentration of nitrogenous compounds. The results may also help guide the design of more efficient preservation strategies and the biotechnological synthesis of benzoic acid or benzoic acid-derived compounds.
Subject(s)
Antiporters/metabolism , Basic-Leucine Zipper Transcription Factors/metabolism , Benzoic Acid/pharmacology , Nuclear Proteins/metabolism , Organic Cation Transport Proteins/metabolism , RNA-Binding Proteins/metabolism , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/drug effects , Saccharomyces cerevisiae/genetics , Transcription Factors/metabolism , Amino Acids , Antiporters/genetics , Basic-Leucine Zipper Transcription Factors/genetics , Binding Sites , Drug Resistance, Multiple, Fungal/genetics , Drug Tolerance , Food Preservatives , Gene Expression Regulation, Fungal , Nuclear Proteins/genetics , Organic Cation Transport Proteins/genetics , Polyamines , RNA-Binding Proteins/genetics , Saccharomyces cerevisiae/growth & development , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Trans-Activators , Transcription Factors/genetics , Transcriptional Activation , Up-Regulation
ABSTRACT
Identifying causal genes underlying genome-wide association studies (GWASs) is a fundamental problem in human genetics. Although colocalization with gene expression quantitative trait loci (eQTLs) is often used to prioritize GWAS target genes, systematic benchmarking has been limited by the unavailability of large ground truth datasets. Here, we re-analyzed plasma protein QTL data from 3,301 individuals of the INTERVAL cohort together with 131 eQTL Catalogue datasets. Focusing on variants located within or close to the gene encoding the affected protein, we identified 793 proteins with at least one cis-pQTL where we could assume that the most likely causal gene was the gene coding for the protein. We then benchmarked the ability of cis-eQTLs to recover these causal genes by comparing three Bayesian colocalization methods (coloc.susie, coloc.abf, and CLPP) and five Mendelian randomization (MR) approaches (three varieties of inverse-variance weighted MR, MR-RAPS, and MRLocus). We found that assigning fine-mapped pQTLs to their closest protein coding genes outperformed all colocalization methods regarding both precision (71.9%) and recall (76.9%). Furthermore, the colocalization method with the highest recall (coloc.susie, 46.3%) also had the lowest precision (45.1%). Combining evidence from multiple conditionally distinct colocalizing QTLs with MR increased precision to 81%, but this was accompanied by a large reduction in recall to 7.1%. Furthermore, the choice of the MR method greatly affected performance, with the standard inverse-variance-weighted MR often producing many false positives. Our results highlight that linking GWAS variants to target genes remains challenging with eQTL evidence alone, and prioritizing novel targets requires triangulation of evidence from multiple sources.
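The precision and recall figures in this abstract can be read against the following minimal sketch. It assumes that each method prioritises at most one gene per cis-pQTL signal and that the gene coding for the protein is the ground truth. The function name, signal identifiers, gene labels and toy data are hypothetical, and the study's exact definitions may differ.

```python
# Sketch of precision and recall for gene prioritisation.
# Assumptions: one prioritised gene per pQTL signal, and the protein-coding
# gene is treated as ground truth. All identifiers here are hypothetical.


def precision_recall(predicted: dict, truth: dict) -> tuple:
    """predicted: signal id -> prioritised gene; truth: signal id -> true gene."""
    true_positives = sum(
        1 for signal, gene in predicted.items() if truth.get(signal) == gene
    )
    # Precision: fraction of prioritised genes that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of ground-truth signals whose gene was recovered.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


truth = {"pqtl1": "GENE_A", "pqtl2": "GENE_B", "pqtl3": "GENE_C", "pqtl4": "GENE_D"}
predicted = {"pqtl1": "GENE_A", "pqtl2": "GENE_X", "pqtl3": "GENE_C"}

precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

A method that makes few but well-supported calls raises precision while lowering recall, which is the trade-off the abstract reports when colocalization evidence is combined with MR.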
Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Humans , Polymorphism, Single Nucleotide/genetics , Bayes Theorem , Mendelian Randomization Analysis/methods , Gene Expression Regulation
ABSTRACT
The proteome holds great potential as an intermediate layer between the genome and phenome. Previous protein quantitative trait locus studies have focused mainly on describing the effects of common genetic variation on the proteome. Here, we assessed the impact of common and rare genetic variation as well as copy number variants (CNVs) on 326 plasma proteins measured in up to 500 individuals. We identified 184 cis and 94 trans signals for 157 protein traits, which were further fine-mapped to credible sets for 101 cis and 87 trans signals for 151 proteins. Rare genetic variation contributed to the levels of 7 proteins, with 5 cis and 14 trans associations. CNVs were associated with the levels of 11 proteins (7 cis and 5 trans). Examples include a 3q12.1 deletion acting as a hub for multiple trans associations, and a CNV overlapping NAIP, a sensor component of the NAIP-NLRC4 inflammasome, that affects levels of the pro-inflammatory cytokine interleukin 18. In summary, this work presents a comprehensive resource of genetic variation affecting plasma protein levels and provides an interpretation of the identified effects.
Subject(s)
Genome-Wide Association Study , Proteome , Humans , Proteome/genetics , Estonia , Polymorphism, Single Nucleotide , Quantitative Trait Loci/genetics , Blood Proteins/genetics , DNA Copy Number Variations/genetics
ABSTRACT
Genome sequencing efforts have led to the discovery of tens of millions of protein missense variants in the human population. Most of these have no annotated role, and some likely contribute to trait variation and disease. Sequence-based artificial intelligence approaches have become highly accurate at predicting variants that are detrimental to the function of proteins, but they do not inform on mechanisms of disruption. Here we combined sequence and structure-based methods to perform proteome-wide prediction of deleterious variants with information on their impact on protein stability, protein-protein interactions and small-molecule binding pockets. AlphaFold2 structures were used to predict approximately 100,000 small-molecule binding pockets and stability changes for over 200 million variants. To inform on protein-protein interfaces we used AlphaFold2 to predict structures for nearly 500,000 protein complexes. We illustrate the value of mechanism-aware variant effect predictions to study the relation between protein stability and abundance and the structural properties of interfaces underlying trans protein quantitative trait loci (pQTLs). We characterised the distribution of mechanistic impacts of protein variants found in patients and experimentally studied example disease-linked variants in FGFR1.
ABSTRACT
In this issue of Cell Genomics, Garcia-Perez et al.¹ report a comprehensive and careful association analysis between gene expression and splicing measured by the GTEx Consortium² in 46 human tissues and 21 demographic and clinical traits.
ABSTRACT
BACKGROUND: Ischemic stroke (IS) is a major health risk for which generally usable, effective measures of primary prevention are lacking. Early warning signals that are easy to detect and widely available can save lives. Estonia has one nation-wide Electronic Health Record (EHR) database for the storage of medical information of patients from hospitals and primary care providers. METHODS: We extracted structured and unstructured data from the EHRs of participants of the Estonian Biobank (EstBB) and evaluated different formats of input data to understand how this continuously growing dataset should be prepared for best prediction. The utility of the EHR database for finding blood- and urine-based biomarkers for IS was demonstrated by applying different analytical and machine learning (ML) methods. RESULTS: Several early trends in common clinical laboratory parameter changes (set of red blood indices, lymphocyte/neutrophil ratio, etc.) were established for IS prediction. The developed ML models predicted the future occurrence of IS with very high accuracy, and Random Forests proved to be the most applicable method for EHR data. CONCLUSIONS: We conclude that the EHR database and the risk factors uncovered are valuable resources for screening the population for risk of IS, as well as for constructing disease risk scores and refining ML prediction models for IS.
Subject(s)
Electronic Health Records , Ischemic Stroke , Humans , Estonia/epidemiology , Risk Factors , Biomarkers
ABSTRACT
The role of NLRP1 inflammasome activation and subsequent production of IL-1 family cytokines in the development of atopic dermatitis (AD) is not clearly understood. Staphylococcus aureus is known to be associated with increased mRNA levels of IL-1 family cytokines in the skin and more severe AD. In this study, the altered expression of IL-1 family cytokines and inflammasome-related genes was confirmed, and a positive relationship between mRNA levels of inflammasome sensor NLRP1 and IL1B or IL18 was determined. Enhanced expression of the NLRP1 and PYCARD proteins and increased caspase-1 activity were detected in the skin of patients with AD. The genetic association of IL18R1 and IL18RAP with AD was confirmed, and the involvement of various immune cell types was predicted using published GWAS and expression quantitative trait loci datasets. In keratinocytes, the inoculation with S. aureus led to the increased secretion of IL-1β and IL-18, whereas small interfering RNA silencing of NLRP1 inhibited the production of these cytokines. Our results suggest that skin colonization with S. aureus may cause the activation of the NLRP1 inflammasome in keratinocytes, which leads to the secretion of IL-1β and IL-18 and thereby may contribute to the pathogenesis of AD, particularly in the presence of genetic variations in the IL-18 pathway.
Subject(s)
Dermatitis, Atopic , Methicillin-Resistant Staphylococcus aureus , Humans , Inflammasomes/metabolism , Dermatitis, Atopic/genetics , Dermatitis, Atopic/metabolism , Interleukin-18/genetics , Staphylococcus aureus/metabolism , Cytokines/metabolism , RNA, Messenger , NLR Proteins
ABSTRACT
Perturbing expression is a powerful way to understand the role of individual genes, but can be challenging in important models. CRISPR-Cas screens in human induced pluripotent stem cells (iPSCs) are of limited efficiency due to DNA break-induced stress, while the less stressful silencing with an inactive Cas9 has been considered less effective so far. Here, we developed the dCas9-KRAB-MeCP2 fusion protein for screening in iPSCs from multiple donors. We found silencing in a 200 bp window around the transcription start site in polyclonal pools to be as effective as using wild-type Cas9 for identifying essential genes, but with much reduced cell numbers. Whole-genome screens to identify ARID1A-dependent dosage sensitivity revealed the PSMB2 gene, and enrichment of proteasome genes among the hits. This selective dependency was replicated with a proteasome inhibitor, indicating a targetable drug-gene interaction. Many more plausible targets in challenging cell models can be efficiently identified with our approach.
Subject(s)
Induced Pluripotent Stem Cells , Humans , Induced Pluripotent Stem Cells/metabolism , CRISPR-Cas Systems/genetics , Genome , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
Splicing quantitative trait loci (QTLs) have been implicated as a common mechanism underlying complex trait associations. However, utilising splicing QTLs in target discovery and prioritisation has been challenging due to extensive data normalisation which often renders the direction of the genetic effect as well as its magnitude difficult to interpret. This is further complicated by the fact that strong expression QTLs often manifest as weak splicing QTLs and vice versa, making it difficult to uniquely identify the underlying molecular mechanism at each locus. We find that these ambiguities can be mitigated by visualising the association between the genotype and average RNA sequencing read coverage in the region. Here, we generate these QTL coverage plots for 1.7 million molecular QTL associations in the eQTL Catalogue identified with five quantification methods. We illustrate the utility of these QTL coverage plots by performing colocalisation between vitamin D levels in the UK Biobank and all molecular QTLs in the eQTL Catalogue. We find that while visually confirmed splicing QTLs explain just 6/53 of the colocalising signals, they are significantly less pleiotropic than eQTLs and identify a prioritised causal gene in 4/6 cases. All our association summary statistics and QTL coverage plots are freely available at https://www.ebi.ac.uk/eqtl/.
ABSTRACT
Many eukaryotic genes can give rise to different alternative transcripts depending on stage of development, cell type, and physiological cues. Current transcriptome-wide sequencing technologies highlight the remarkable extent of this regulation in metazoans and allow RNA isoforms to be profiled in increasingly small biological samples and with growing confidence. Understanding the biological functions of sample-specific transcripts is a major challenge in the genomics and RNA processing fields. Here we describe simple bioinformatics workflows that facilitate this task by streamlining reference-guided annotation of novel transcripts. A key part of our protocol is the R package factR, which rapidly matches custom-assembled transcripts to their likely host genes, deduces the sequence and domain structure of novel protein products, and predicts the sensitivity of newly identified RNA isoforms to nonsense-mediated decay.
Subject(s)
RNA Isoforms , Transcriptome , Alternative Splicing , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Nonsense Mediated mRNA Decay , RNA Isoforms/genetics , Sequence Analysis, RNA
ABSTRACT
Identifying cellular functions dysregulated by disease-associated variants could implicate novel pathways for drug targeting or modulation in cell therapies. However, follow-up studies can be challenging if disease-relevant cell types are difficult to sample. Variants associated with immune diseases point toward the role of CD4+ regulatory T cells (Treg cells). We mapped genetic regulation (quantitative trait loci [QTL]) of gene expression and chromatin activity in Treg cells, and we identified 133 colocalizing loci with immune disease variants. Colocalizations of immune disease genome-wide association study (GWAS) variants with expression QTLs (eQTLs) controlling the expression of CD28 and STAT5A, involved in Treg cell activation and interleukin-2 (IL-2) signaling, support the contribution of Treg cells to the pathobiology of immune diseases. Finally, we identified seven known drug targets suitable for drug repurposing and suggested 63 targets with drug tractability evidence among the GWAS signals that colocalized with Treg cell QTLs. Our study is the first in-depth characterization of immune disease variant effects on Treg cell gene expression modulation and dysregulation of Treg cell function.
ABSTRACT
Invasive bacterial disease is a major cause of morbidity and mortality in African children. Although sepsis is caused by diverse pathogens, affected children are clinically indistinguishable from one another. Nevertheless, most genetic susceptibility loci for invasive infection that have been discovered to date are pathogen specific and therefore do not suggest a shared genetic architecture of bacterial sepsis. Here, we utilise probabilistic diagnostic models to identify children with a high probability of invasive bacterial disease among critically unwell Kenyan children with Plasmodium falciparum parasitaemia. We construct a joint dataset including 1445 bacteraemia cases and 1143 severe malaria cases, and population controls, among critically unwell Kenyan children that have previously been genotyped for human genetic variation. Using these data, we perform a cross-trait genome-wide association study of invasive bacterial infection, weighting cases according to their probability of bacterial disease. In doing so, we identify and validate a novel risk locus for invasive infection secondary to multiple bacterial pathogens that has no apparent effect on malaria risk. The locus identified modifies splicing of BIRC6 in stimulated monocytes, implicating regulation of apoptosis and autophagy in the pathogenesis of sepsis in Kenyan children.
Bacterial infections are a major cause of severe illness and death in African children. Understanding which children are at risk of life-threatening infection, and why, is key to designing new tools to help protect them. Some risk is likely inherited, but scientists do not know which genes are responsible. Genome-wide association studies (GWAS) may be one way to identify bacterial infection risk genes. GWAS look for genetic differences associated with a particular disease. But previous GWAS have failed to find genes linked with bacterial infections in African children because they were too small. Malaria is another frequent cause of life-threatening illness in African children. It can be hard for clinicians to determine if a child's illness is caused by malaria, a bacterial infection, or both. Many children in Africa have malaria parasites in their blood, but the parasites do not always cause disease. Most children with suspected severe malaria are treated with antibiotics in case of bacterial infection. Clinicians may then conduct further testing to determine the illness's actual cause. Scientists may be able to use these data on children with suspected malaria to study bacterial infections. Gilchrist et al. show that children with an unusual alteration in the BIRC6 gene are at increased risk of bacterial infections. In the experiments, Gilchrist et al. used computer modeling to identify a subset of children with likely bacterial infections among 2,200 children admitted to a hospital in Kenya with a high fever and malaria parasites. By combining information on this subset of children with data on children with confirmed bacterial infections and healthy children, Gilchrist et al. created a sample of 5,400 children for a GWAS. The analyses found that children with a variation in the BIRC6 gene on chromosome 2 had a higher risk of bacterial infections.
This genetic change is linked with the production of a modified form of BIRC6 in infection-fighting immune cells called monocytes. More studies will help scientists understand how this change might contribute to severe bacterial infections. Learning more may help scientists develop new treatment strategies and identify children most at risk.
Subject(s)
Bacteremia , Bacterial Infections , Malaria , Bacteremia/microbiology , Child , Genome-Wide Association Study , Humans , Inhibitor of Apoptosis Proteins , Kenya/epidemiology , Malaria/complications , Malaria/epidemiology
ABSTRACT
Many gene expression quantitative trait locus (eQTL) studies have published their summary statistics, which can be used to gain insight into complex human traits by downstream analyses, such as fine mapping and co-localization. However, technical differences between these datasets are a barrier to their widespread use. Consequently, target genes for most genome-wide association study (GWAS) signals have still not been identified. Here, we present the eQTL Catalogue (https://www.ebi.ac.uk/eqtl), a resource of quality-controlled, uniformly re-computed gene expression and splicing QTLs from 21 studies. We find that, for matching cell types and tissues, the eQTL effect sizes are highly reproducible between studies. Although most QTLs were shared between most bulk tissues, we identified a greater diversity of cell-type-specific QTLs from purified cell types, a subset of which also manifested as new disease co-localizations. Our summary statistics are freely available to enable the systematic interpretation of human GWAS associations across many cell types and tissues.
Subject(s)
Databases, Genetic , Gene Expression Regulation/genetics , Quantitative Trait Loci/genetics , Quantitative Trait, Heritable , CD4-Positive T-Lymphocytes/cytology , Datasets as Topic , Genome-Wide Association Study , Humans , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide/genetics
ABSTRACT
Understanding the causal processes that contribute to disease onset and progression is essential for developing novel therapies. Although trans-acting expression quantitative trait loci (trans-eQTLs) can directly reveal cellular processes modulated by disease variants, detecting trans-eQTLs remains challenging due to their small effect sizes. Here, we analysed gene expression and genotype data from six blood cell types from 226 to 710 individuals. We used co-expression modules inferred from gene expression data with five methods as traits in trans-eQTL analysis to limit multiple testing and improve interpretability. In addition to replicating three established associations, we discovered a novel trans-eQTL near SLC39A8 regulating a module of metallothionein genes in LPS-stimulated monocytes. Interestingly, this effect was mediated by a transient cis-eQTL present only in early LPS response and lost before the trans effect appeared. Our analyses highlight how co-expression combined with functional enrichment analysis improves the identification and prioritisation of trans-eQTLs when applied to emerging cell-type-specific datasets.
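The idea of using a co-expression module as a single trait can be illustrated with a minimal sketch. It assumes the module is summarised per individual as the mean of z-scored gene expression and tested against genotype dosage with an ordinary least-squares slope. The abstract does not specify how modules were summarised or tested, so this summary, the function names, the gene labels and the toy data are all hypothetical.

```python
# Sketch: collapse a co-expression module into one trait, then regress it on
# genotype dosage. Assumptions: mean of z-scored expression as the module
# summary and a plain least-squares slope. Gene labels and data are made up.
from statistics import mean, pstdev


def zscore(values):
    """Standardise one gene's expression across individuals."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("expression is constant across individuals")
    return [(v - mu) / sd for v in values]


def module_trait(expression_by_gene):
    """Average the z-scored expression of all module genes per individual."""
    standardised = [zscore(values) for values in expression_by_gene.values()]
    return [mean(per_individual) for per_individual in zip(*standardised)]


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


dosage = [0, 0, 1, 1, 2, 2]  # alternative-allele counts for six individuals
module = {"GENE_1": [1, 1, 2, 2, 3, 3], "GENE_2": [2, 2, 4, 4, 6, 6]}

trait = module_trait(module)
beta = ols_slope(dosage, trait)
print(f"{beta:.3f}")  # 1.225
```

Testing one trait per module instead of one per gene reduces the number of variant-trait tests, which is the multiple-testing benefit the abstract refers to.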