ABSTRACT
Human humoral immune responses to SARS-CoV-2 vaccines exhibit substantial inter-individual variability and have been linked to vaccine efficacy. To elucidate the mechanism underlying this variability, we conducted a genome-wide association study (GWAS) on the anti-spike IgG serostatus of UK Biobank participants who were previously uninfected by SARS-CoV-2 and had received either the first dose (n = 54,066) or the second dose (n = 46,232) of COVID-19 vaccines. Our analysis revealed genome-wide significant associations between IgG antibody serostatus following the initial vaccine dose and human leukocyte antigen (HLA) class II alleles. Specifically, the HLA-DRB1*13:02 allele (MAF = 4.0%, OR = 0.75, p = 2.34e-16) demonstrated the most statistically significant protective effect against IgG seronegativity. This protective effect was driven by an alteration from arginine (Arg) to glutamic acid (Glu) at position 71 on HLA-DRβ1 (p = 1.88e-25), leading to a change in the electrostatic potential of pocket 4 of the peptide binding groove. Notably, the impact of HLA alleles on IgG responses was cell type specific, and we observed a shared genetic predisposition between IgG status and susceptibility/severity of COVID-19. These results were replicated within independent cohorts where IgG serostatus was assayed by two different antibody serology tests. Our findings provide insights into the biological mechanism underlying individual variation in responses to COVID-19 vaccines and highlight the need to consider the influence of constitutive genetics when designing vaccination strategies for optimizing protection and control of infectious disease across diverse populations.
Subject(s)
COVID-19 , Immunoglobulin G , Humans , Antibody Formation/genetics , COVID-19 Vaccines , Genome-Wide Association Study , COVID-19/genetics , COVID-19/prevention & control , SARS-CoV-2 , Vaccination
ABSTRACT
Explicitly sharing individual-level data in genomics studies has many merits compared with sharing summary statistics, including stricter quality control (QC), common statistical analyses, identification of relatives and improved statistical power in GWAS, but it is hampered by privacy or ethical constraints. In this study, we developed encG-reg, a regression approach that can detect relatives of various degrees based on encrypted genomic data and is therefore not subject to those ethical constraints. The encryption properties of encG-reg are based on random matrix theory: the original genotype matrix is masked without sacrificing the precision of individual-level genotype data. We established a connection between the dimension of the random matrix that masks the genotype matrices and the precision required of a study using encrypted genotype data. encG-reg has false positive and false negative rates equivalent to those obtained by sharing original individual-level data, and is computationally efficient when searching for relatives. We split the UK Biobank by assessment center and then encrypted the genotype data. The relatives identified using encG-reg were as accurate as those estimated by KING, a widely used software tool that requires the original genotype data. In a more complex application, we launched a carefully designed multi-center collaboration across 5 research institutes in China, covering 9 cohorts of 54,092 GWAS samples. encG-reg again identified true relatives across the cohorts, even with different ethnic backgrounds and genotyping qualities. Our study demonstrates that encrypted genomic data can be used for data sharing without loss of information or data-sharing barriers.
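The masking idea described above can be illustrated with a small sketch. It is not the encG-reg implementation: the function names, the choice of standard normal entries for the random matrix, the allele frequency of 0.5, and the toy dimensions are all assumptions made for illustration. The sketch only shows that when two cohorts multiply their standardized genotypes by the same random matrix, the inner products of the masked rows still approximate the original relatedness.

```python
import random

def simulate_genotypes(n, m, rng):
    """Draw n individuals x m SNPs (assumed allele frequency 0.5), standardized."""
    sd = 0.5 ** 0.5  # sqrt(2 * p * (1 - p)) with p = 0.5
    return [[(rng.randint(0, 1) + rng.randint(0, 1) - 1.0) / sd for _ in range(m)]
            for _ in range(n)]

def random_matrix(m, k, rng):
    """m x k matrix of independent standard normal entries (the shared mask)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(m)]

def mask(genotypes, key):
    """Mask genotypes by right-multiplying the (n x m) matrix with the (m x k) key."""
    k = len(key[0])
    return [[sum(g * key_row[c] for g, key_row in zip(row, key)) for c in range(k)]
            for row in genotypes]

def encrypted_relatedness(a, b, m):
    """Relatedness estimated from two masked rows; E[S S^T] = k * I motivates the 1/k."""
    k = len(a)
    return sum(x * y for x, y in zip(a, b)) / (k * m)

rng = random.Random(1)
m, k = 200, 1000                      # hypothetical marker count and mask dimension
key = random_matrix(m, k, rng)
cohort_a = simulate_genotypes(3, m, rng)
cohort_b = simulate_genotypes(3, m, rng)
cohort_b[0] = list(cohort_a[0])       # plant one duplicated individual across cohorts

enc_a, enc_b = mask(cohort_a, key), mask(cohort_b, key)
same = encrypted_relatedness(enc_a[0], enc_b[0], m)        # expected to be near 1
unrelated = encrypted_relatedness(enc_a[1], enc_b[1], m)   # expected to be near 0
print(round(same, 2), round(unrelated, 2))
```

A larger mask dimension k reduces the noise added by the masking, which is the kind of trade-off between random-matrix dimension and precision that the abstract refers to.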
Subject(s)
Genome-Wide Association Study , Privacy , Humans , Genome-Wide Association Study/methods , Genotype , Software , Genomics
ABSTRACT
ABSTRACT: Platelet count reduction occurs throughout pregnancy, with 5% to 12% of pregnant women being diagnosed with gestational thrombocytopenia (GT), characterized by a more marked decrease in platelet count during pregnancy. However, the underlying biological mechanism behind these phenomena remains unclear. Here, we used sequencing data from noninvasive prenatal testing of 100 186 Chinese pregnant individuals and conducted, to our knowledge, the hitherto largest-scale genome-wide association studies on platelet counts during 5 periods of pregnancy (the first, second, and third trimesters, delivery, and the postpartum period) as well as 2 GT statuses (GT platelet count < 150 × 10⁹/L and severe GT platelet count < 100 × 10⁹/L). Our analysis revealed 138 genome-wide significant loci, explaining 10.4% to 12.1% of the observed variation. Interestingly, we identified previously unknown changes in genetic effects on platelet counts during pregnancy for variants present in PEAR1 and CBL, with PEAR1 variants specifically associated with a faster decline in platelet counts. Furthermore, we found that variants present in PEAR1 and TUBB1 increased susceptibility to GT and severe GT. Our study provides insight into the genetic basis of platelet counts and GT in pregnancy, highlighting the critical role of PEAR1 in decreasing platelet counts during pregnancy and the occurrence of GT. Those with pregnancies carrying specific variants associated with declining platelet counts may experience a more pronounced decrease, thereby elevating the risk of GT. These findings lay the groundwork for further investigation into the biological mechanisms and causal implications of GT.
Subject(s)
Pregnancy Complications, Hematologic , Thrombocytopenia , Pregnancy , Female , Humans , Platelet Count , Genome-Wide Association Study , Pregnancy Complications, Hematologic/genetics , Pregnancy Complications, Hematologic/diagnosis , Thrombocytopenia/complications , Postpartum Period , Receptors, Cell Surface
ABSTRACT
Haseman-Elston regression (HE-reg) has been known as a classic tool for detecting an additive genetic variance component. However, in this study we find that HE-reg can capture GxE under certain conditions, so we derive and reinterpret the analytical solution of HE-reg. In the presence of GxE, it leads to a natural discrepancy between linkage and association results, the latter of which is not able to capture GxE if the environment is unknown. Considering linkage and association as symmetric designs, we investigate how the symmetry can and cannot hold in the absence and presence of GxE, and consequently we propose a pair of statistical tests, Symmetry Test I and Symmetry Test II, both of which can be tested using summary statistics. Test statistics, and their statistical power issues are also investigated for Symmetry Tests I and II. Increasing the number of sib pairs is important to improve statistical power for detecting GxE.
Subject(s)
Gene-Environment Interaction , Genotype , Models, Genetic , Humans , Genetic Linkage , Regression Analysis , Computer Simulation , Models, Statistical
ABSTRACT
BACKGROUND: Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disorder that is accompanied by muscle weakness and muscle atrophy, typically resulting in death within 3-5 years from the disease occurrence. Though the cause of ALS remains unclear, increasing evidence has suggested that inflammation is involved in the pathogenesis of ALS. Thus, we performed two-sample Mendelian randomization (MR) analyses to estimate the associations of circulating levels of cytokines and growth factors with the risk of ALS. METHODS: Genetic instrumental variables for circulating cytokines and growth factors were identified from a genome-wide association study (GWAS) of 8293 European participants. Summary statistics of ALS were obtained from a GWAS including 20,806 ALS cases and 59,804 controls of European ancestry. We used the inverse-variance weighted (IVW) method as the primary analysis. To test the robustness of our results, we further performed the simple-median method, weighted-median method, MR-Egger regression, and MR pleiotropy residual sum and outlier test. Finally, a reverse MR analysis was performed to assess the possibility of reverse causation between ALS and the cytokines that we identified. RESULTS: After Bonferroni correction, genetically predicted circulating level of basic fibroblast growth factor (FGF-basic) was suggestively associated with a lower risk of ALS [odds ratio (OR): 0.74, 95% confidence interval (95% CI): 0.60-0.92, P = 0.007]. We also observed suggestive evidence that interferon gamma-induced protein 10 (IP-10) was associated with a 10% higher risk of ALS (OR: 1.10, 95% CI: 1.03-1.17, P = 0.005) in the primary study. The results of sensitivity analyses were consistent. CONCLUSIONS: Our systematic MR analyses provided suggestive evidence to support causal associations of circulating FGF-basic and IP-10 with the risk of ALS. More studies are warranted to explore how these cytokines may affect the development of ALS.
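As a rough illustration of the primary analysis named in the METHODS, the following sketch computes a fixed-effect inverse-variance weighted (IVW) estimate from per-variant summary statistics. The function name and the input numbers are hypothetical and are not taken from the study; the formula is the standard fixed-effect IVW ratio estimator, and the abstract does not state which IVW variant the authors used.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect and its standard error.

    Each variant contributes the ratio beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2.
    """
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1.0 / denominator)

# Hypothetical summary statistics for three instruments (illustration only).
bx = [0.1, 0.2, 0.4]        # SNP effects on the exposure (e.g. a cytokine level)
by = [0.05, 0.10, 0.20]     # SNP effects on the outcome (log odds of disease)
se = [0.05, 0.05, 0.10]     # standard errors of the outcome effects

beta, beta_se = ivw_estimate(bx, by, se)
odds_ratio = math.exp(beta)  # effect expressed as an odds ratio per unit exposure
print(beta, beta_se, odds_ratio)
```

Sensitivity methods such as the weighted median or MR-Egger replace this weighted average with estimators that tolerate some invalid instruments; they are not shown here.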
Subject(s)
Amyotrophic Lateral Sclerosis , Cytokines , Humans , Cytokines/genetics , Amyotrophic Lateral Sclerosis/epidemiology , Amyotrophic Lateral Sclerosis/genetics , Chemokine CXCL10 , Genome-Wide Association Study , Mendelian Randomization Analysis , Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: Birth weight is thought not only to affect later growth but also to influence lifelong disease risk; the aim of this study is to explore the relationship between birth weight and adult bone mass. METHODS: We performed multivariable regression analyses to assess the association of birth weight with bone parameters measured by dual-energy X-ray absorptiometry (DXA) and by quantitative ultrasound (QUS), independently. We also implemented a systemic Mendelian randomization (MR) analysis to explore the causal association between them with both fetal-specific and maternal-specific instrumental variables. RESULTS: In the observational analyses, we found that higher birth weight could increase the adult bone area (lumbar spine, β-coefficient = 0.17, P < 2.00 × 10⁻¹⁶; lateral spine, β-coefficient = 0.02, P = 0.04), decrease bone mineral content-adjusted bone area (BMCadjArea) (lumbar spine, β-coefficient = -0.01, P = 2.27 × 10⁻¹⁴; lateral spine, β-coefficient = -0.05, P = 0.001), and decrease adult bone mineral density (BMD) (lumbar spine, β-coefficient = -0.04, P = 0.007; lateral spine, β-coefficient = -0.03, P = 0.02; heel, β-coefficient = -0.06, P < 2.00 × 10⁻¹⁶), and we observed that the effect of birth weight on bone size was larger than that on BMC. In MR analyses, higher fetal-specific genetically determined birth weight was associated with higher bone area (lumbar spine, β-coefficient = 0.15, P = 1.26 × 10⁻⁶; total hip, β-coefficient = 0.15, P = 0.005; intertrochanteric area, β-coefficient = 0.13, P = 0.0009; trochanter area, β-coefficient = 0.11, P = 0.03) but lower BMD (lumbar spine, β-coefficient = -0.10, P = 0.01; lateral spine, β-coefficient = -0.12, P = 0.0003; heel, β-coefficient = -0.11, P = 3.33 × 10⁻¹³). In addition, we found that higher maternal-specific genetically determined offspring birth weight was associated with lower offspring adult heel BMD (β-coefficient = -0.001, P = 0.04).
CONCLUSIONS: The observational analyses suggested that higher birth weight was associated with the increased adult bone area but decreased BMD. By leveraging the genetic instrumental variables with maternal- and fetal-specific effects on birth weight, the observed relationship could be reflected by both the direct fetal and indirect maternal genetic effects.
Subject(s)
Bone Density , Lumbar Vertebrae , Absorptiometry, Photon , Adult , Birth Weight , Bone Density/genetics , Humans , Lumbar Vertebrae/diagnostic imaging , Mendelian Randomization Analysis
ABSTRACT
SUMMARY: The rapid progress of high-throughput sequencing-based omics and mass spectrometry-based proteomics, such as data-independent acquisition, and their penetration into clinical studies have generated an increasing number of proteomic datasets containing hundreds to thousands of samples. To analyze these quantitative proteomic datasets and other omics (e.g. transcriptomics and metabolomics) datasets more efficiently and conveniently, we present ProteomeExpert, a web server-based software tool implemented in Docker, which offers various analysis tools for experimental design, data mining, interpretation and visualization of quantitative proteomic datasets. ProteomeExpert can be deployed on any operating system with Docker installed or with an R language environment. AVAILABILITY AND IMPLEMENTATION: The Docker image of ProteomeExpert is freely available from https://hub.docker.com/r/lifeinfo/proteomeexpert. The source code of ProteomeExpert is also openly accessible at http://www.github.com/guomics-lab/ProteomeExpert/. In addition, a demo server is provided at https://proteomic.shinyapps.io/peserver/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
Educational attainment is strongly influenced by social and other environmental factors, but genetic factors are estimated to account for at least 20% of the variation across individuals. Here we report the results of a genome-wide association study (GWAS) for educational attainment that extends our earlier discovery sample of 101,069 individuals to 293,723 individuals, and a replication study in an independent sample of 111,349 individuals from the UK Biobank. We identify 74 genome-wide significant loci associated with the number of years of schooling completed. Single-nucleotide polymorphisms associated with educational attainment are disproportionately found in genomic regions regulating gene expression in the fetal brain. Candidate genes are preferentially expressed in neural tissue, especially during the prenatal period, and enriched for biological pathways involved in neural development. Our findings demonstrate that, even for a behavioural phenotype that is mostly environmentally determined, a well-powered GWAS identifies replicable associated genetic variants that suggest biologically relevant pathways. Because educational attainment is measured in large numbers of individuals, it will continue to be useful as a proxy phenotype in efforts to characterize the genetic influences of related phenotypes, including cognition and neuropsychiatric diseases.
Subject(s)
Brain/metabolism , Educational Status , Fetus/metabolism , Gene Expression Regulation/genetics , Genome-Wide Association Study , Polymorphism, Single Nucleotide/genetics , Alzheimer Disease/genetics , Bipolar Disorder/genetics , Cognition , Computational Biology , Gene-Environment Interaction , Humans , Molecular Sequence Annotation , Schizophrenia/genetics , United Kingdom
ABSTRACT
Batch effects are unwanted data variations that may obscure biological signals, leading to bias or errors in subsequent data analyses. Effective evaluation and elimination of batch effects are necessary for omics data analysis. To facilitate the evaluation and correction of batch effects, here we present BatchServer, an open-source, user-friendly, interactive graphical web platform based on R/Shiny for batch effects analysis. In BatchServer, we introduced autoComBat, a modified version of ComBat, which is the most widely adopted tool for batch effect correction. BatchServer uses PVCA (Principal Variance Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) for evaluation and visualization of batch effects. We demonstrate its applications in multiple proteomic and transcriptomic data sets. BatchServer is provided at https://lifeinfor.shinyapps.io/batchserver/ as a web server. The source code is freely available at https://github.com/guomics-lab/batch_server.
Subject(s)
Computational Biology , Software
ABSTRACT
BACKGROUND: Early and accurate diagnosis of pediatric pneumonia in primary health care can reduce the chance of long-term respiratory diseases, related hospitalizations and mortality while lowering medical costs. The aim of this study was to assess the value of blood biomarkers, clinical symptoms and their combination in discriminating pneumonia from upper respiratory tract infection (URTI) in children. METHODS: Both univariate and multivariate logistic regressions were used to build the pneumonia screening model based on a retrospective cohort of 5211 children (age ≤ 18 years). The electronic health records of patients who had inpatient admissions or outpatient visits between February 15, 2012 and September 30, 2018 were extracted from the hospital information system of Zhejiang Provincial People's Hospital, Hangzhou, Zhejiang Province, China. Children diagnosed with pneumonia or URTI were enrolled, and their clinical features and levels of blood biomarkers were compared. Both screening models were evaluated for accuracy using the area under the ROC curve (AUC) under an 80% (training) versus 20% (test) cross-validation data split. RESULTS: In the retrospective cohort, 2548 of 5211 children were diagnosed with the defined pneumonia. The univariate screening model reached predicted AUCs of 0.76 for lymphocyte/monocyte ratio (LMR) and 0.71 for neutrophil/lymphocyte ratio (NLR) when distinguishing overall pneumonia from URTI, the best performance among the biomarker candidates. In subgroup analysis, LMR and NLR attained AUCs of 0.80 and 0.86 in differentiating viral pneumonia from URTI, and AUCs of 0.77 and 0.71 in discriminating bacterial pneumonia from URTI, respectively.
After integrating LMR and NLR with three clinical symptoms (fever, cough and rhinorrhea), the multivariate screening model obtained increased predictive values, reaching validated AUCs of 0.84, 0.95 and 0.86 for distinguishing pneumonia, viral pneumonia and bacterial pneumonia from URTI, respectively. CONCLUSIONS: Our study demonstrated that combining LMR and NLR with critical clinical characteristics reached promising accuracy in differentiating pneumonia from URTI, and the combined model could therefore be considered a useful screening tool to assist the diagnosis of pneumonia, particularly in community healthcare centers. Further research could evaluate the model's clinical utility and cost-effectiveness in primary care scenarios to facilitate pneumonia diagnosis, especially in rural settings.
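The AUC values reported above summarize how well a score separates the two diagnostic groups. A minimal sketch of that quantity is given below, computed as the proportion of (case, control) pairs in which the case has the higher score, with ties counted as one half. The function name and the toy scores are hypothetical; the study's own software and scoring direction are not stated in the abstract.

```python
def auc(scores_cases, scores_controls):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    wins = 0.0
    for case_score in scores_cases:
        for control_score in scores_controls:
            if case_score > control_score:
                wins += 1.0          # case ranked above control
            elif case_score == control_score:
                wins += 0.5          # ties count as half
    return wins / (len(scores_cases) * len(scores_controls))

# Hypothetical model scores on a held-out test split (illustration only).
pneumonia_scores = [0.9, 0.7, 0.6, 0.4]
urti_scores = [0.5, 0.3, 0.2, 0.1]
print(auc(pneumonia_scores, urti_scores))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation, which gives a scale for reading the 0.71 to 0.95 range reported in the RESULTS.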
Subject(s)
Neutrophils , Pneumonia, Bacterial , Adolescent , Child , Cross-Sectional Studies , Humans , Lymphocytes , Monocytes , Prognosis , Retrospective Studies
ABSTRACT
Understanding the genomic basis of adaptation in maize is important for gene discovery and the improvement of breeding germplasm, but much remains a mystery in spite of significant population genetics and archaeological research. Identifying the signals underpinning adaptation is challenging, as adaptation often coincided with genetic drift and the base genomic diversity of the species is massive. In this study, tGBS technology was used to genotype 1,143 diverse maize accessions, including landraces collected from 20 countries and elite breeding lines from tropical lowland, highland, subtropical/midaltitude and temperate ecological zones. Based on 355,442 high-quality single nucleotide polymorphisms, 13 genomic regions were detected as being under selection using the bottom-up searching strategy, EigenGWAS. Of the 13 selection regions, 10 are reported here for the first time, two were associated with environmental parameters via EnvGWAS, and 146 genes were enriched. Combining large-scale genomic and ecological data in this diverse maize panel, our study supports a polygenic adaptation model of maize and offers a framework to enhance our understanding of both the mechanistic basis and the evolutionary consequences of maize domestication and adaptation. The regions identified here are promising candidates for further, targeted exploration to identify beneficial alleles and haplotypes for deployment in maize breeding.
Subject(s)
Adaptation, Physiological/genetics , Breeding , Environment , Genetic Loci , Genome-Wide Association Study , Databases, Genetic , Ecotype , Genotype , Geography , Models, Genetic , Molecular Sequence Annotation , Phylogeny , Polymorphism, Single Nucleotide/genetics , Principal Component Analysis , Sequence Analysis, DNA , Zea mays/genetics
ABSTRACT
Genetic risk prediction has several potential applications in medical research and clinical practice and could be used, for example, to stratify a heterogeneous population of patients by their predicted genetic risk. However, for polygenic traits, such as psychiatric disorders, the accuracy of risk prediction is low. Here we use a multivariate linear mixed model and apply multi-trait genomic best linear unbiased prediction for genetic risk prediction. This method exploits correlations between disorders and simultaneously evaluates individual risk for each disorder. We show that the multivariate approach significantly increases the prediction accuracy for schizophrenia, bipolar disorder, and major depressive disorder in the discovery as well as in independent validation datasets. By grouping SNPs based on genome annotation and fitting multiple random effects, we show that the prediction accuracy could be further improved. The gain in prediction accuracy of the multivariate approach is equivalent to an increase in sample size of 34% for schizophrenia, 68% for bipolar disorder, and 76% for major depressive disorders using single trait models. Because our approach can be readily applied to any number of GWAS datasets of correlated traits, it is a flexible and powerful tool to maximize prediction accuracy. With current sample size, risk predictors are not useful in a clinical setting but already are a valuable research tool, for example in experimental designs comparing cases with high and low polygenic risk.
Subject(s)
Genetics, Medical/methods , Mental Disorders/genetics , Multifactorial Inheritance/genetics , Risk Assessment/methods , Bipolar Disorder/genetics , Depressive Disorder, Major/genetics , Genetic Testing/methods , Humans , Linear Models , Multivariate Analysis , Polymorphism, Single Nucleotide/genetics , Schizophrenia/genetics
ABSTRACT
We introduce a liability-threshold mixed linear model (LTMLM) association statistic for case-control studies and show that it has a well-controlled false-positive rate and more power than existing mixed-model methods for diseases with low prevalence. Existing mixed-model methods suffer a loss in power under case-control ascertainment, but no solution has been proposed. Here, we solve this problem by using a χ² score statistic computed from posterior mean liabilities (PMLs) under the liability-threshold model. Each individual's PML is conditional not only on that individual's case-control status but also on every individual's case-control status and the genetic relationship matrix (GRM) obtained from the data. The PMLs are estimated with a multivariate Gibbs sampler; the liability-scale phenotypic covariance matrix is based on the GRM, and a heritability parameter is estimated via Haseman-Elston regression on case-control phenotypes and then transformed to the liability scale. In simulations of unrelated individuals, the LTMLM statistic was correctly calibrated and achieved higher power than existing mixed-model methods for diseases with low prevalence, and the magnitude of the improvement depended on sample size and severity of case-control ascertainment. In a Wellcome Trust Case Control Consortium 2 multiple sclerosis dataset with >10,000 samples, LTMLM was correctly calibrated and attained a 4.3% improvement (p = 0.005) in χ² statistics over existing mixed-model methods at 75 known associated SNPs, consistent with simulations. Larger increases in power are expected at larger sample sizes. In conclusion, case-control studies of diseases with low prevalence can achieve power higher than that in existing mixed-model methods.
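The step in which a heritability estimated on case-control phenotypes is "transformed to the liability scale" can be sketched as follows. The abstract does not give the formula, so this uses a commonly cited transformation that corrects for both population prevalence and the case fraction in the sample; treating it as the one used by LTMLM is an assumption, and the function name and example values are hypothetical.

```python
from statistics import NormalDist

def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Convert observed-scale (0/1 phenotype) heritability to the liability scale.

    prevalence    : disease prevalence K in the population
    case_fraction : proportion of cases P in the ascertained sample
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - prevalence)   # liability threshold t
    z = std_normal.pdf(threshold)                      # normal density at t
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))

# Hypothetical inputs: rare disease (K = 0.1%) in a balanced case-control sample.
print(observed_to_liability(0.3, 0.001, 0.5))
```

The conversion depends strongly on the assumed prevalence, so in practice the chosen prevalence would need to be reported alongside any liability-scale estimate.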
Subject(s)
Genetic Association Studies , Models, Genetic , Models, Theoretical , Case-Control Studies , Chromosome Mapping , Computer Simulation , Humans , Multiple Sclerosis/genetics , Multiple Sclerosis/pathology , Phenotype , Polymorphism, Single Nucleotide , Sample Size
ABSTRACT
In our previous work, we proposed a genomic prediction method combining identical-by-state-based Haseman-Elston regression and best linear prediction with the additive variance component only (HEBLP|A herein), the most essential component of genetic variation. Since dominance effects contribute significantly to heterosis, it is desirable to extend HEBLP with a dominance variance component, which is expected to enhance predictive accuracy. We therefore developed HEBLP|AD, an implementation of genomic prediction parallel to genomic best linear unbiased prediction (GBLUP). Simulation results indicated that when dominance effects contributed a large proportion of the genetic variation, HEBLP|AD and GBLUP|AD had similar accuracy and both outperformed HEBLP|A; when dominance variation was absent or small, HEBLP|A, HEBLP|AD, and GBLUP|AD had similar predictive ability. The analysis of real data from an Arabidopsis thaliana F2 population was consistent with the latter situation. In summary, HEBLP|AD performed stably whether or not a trait was controlled by dominance effects.
Subject(s)
Arabidopsis/genetics , Genes, Dominant , Genomics/methods , Quantitative Trait Loci , Computer Simulation , Genetic Markers , Least-Squares Analysis , Models, Genetic
ABSTRACT
BACKGROUND: Lithium is a first-line treatment in bipolar disorder, but individual response is variable. Previous studies have suggested that lithium response is a heritable trait. However, no genetic markers of treatment response have been reproducibly identified. METHODS: Here, we report the results of a genome-wide association study of lithium response in 2563 patients collected by 22 participating sites from the International Consortium on Lithium Genetics (ConLiGen). Data from common single nucleotide polymorphisms (SNPs) were tested for association with categorical and continuous ratings of lithium response. Lithium response was measured using a well-established scale (Alda scale). Genotyped SNPs were used to generate data at more than 6 million sites, using standard genomic imputation methods. Traits were regressed against genotype dosage. Results were combined across two batches by meta-analysis. FINDINGS: A single locus of four linked SNPs on chromosome 21 met genome-wide significance criteria for association with lithium response (rs79663003, p=1·37 × 10⁻⁸; rs78015114, p=1·31 × 10⁻⁸; rs74795342, p=3·31 × 10⁻⁹; and rs75222709, p=3·50 × 10⁻⁹). In an independent, prospective study of 73 patients treated with lithium monotherapy for a period of up to 2 years, carriers of the response-associated alleles had a significantly lower rate of relapse than carriers of the alternate alleles (p=0·03268, hazard ratio 3·8, 95% CI 1·1-13·0). INTERPRETATION: The response-associated region contains two genes for long, non-coding RNAs (lncRNAs), AL157359.3 and AL157359.4. LncRNAs are increasingly appreciated as important regulators of gene expression, particularly in the CNS. Confirmed biomarkers of lithium response would constitute an important step forward in the clinical management of bipolar disorder. Further studies are needed to establish the biological context and potential clinical utility of these findings.
FUNDING: Deutsche Forschungsgemeinschaft, National Institute of Mental Health Intramural Research Program.
Subject(s)
Bipolar Disorder/genetics , Lithium Compounds/therapeutic use , Polymorphism, Single Nucleotide/genetics , Bipolar Disorder/drug therapy , Female , Genetic Variation , Genome-Wide Association Study , Genotype , Glial Cell Line-Derived Neurotrophic Factor Receptors/genetics , Humans , Male , Middle Aged , Phenotype , Prospective Studies , Treatment Outcome
ABSTRACT
BACKGROUND: Predicting risk of disease from genotypes is being increasingly proposed for a variety of diagnostic and prognostic purposes. Genome-wide association studies (GWAS) have identified a large number of genome-wide significant susceptibility loci for Crohn's disease (CD) and ulcerative colitis (UC), two subtypes of inflammatory bowel disease (IBD). Recent studies have demonstrated that including only loci that are significantly associated with disease in the prediction model has low predictive power and that power can substantially be improved using a polygenic approach. METHODS: We performed a comprehensive analysis of risk prediction models using large case-control cohorts genotyped for 909,763 GWAS SNPs or 123,437 SNPs on the custom designed Immunochip using four prediction methods (polygenic score, best linear genomic prediction, elastic-net regularization and a Bayesian mixture model). We used the area under the curve (AUC) to assess prediction performance for discovery populations with different sample sizes and number of SNPs within cross-validation. RESULTS: On average, the Bayesian mixture approach had the best prediction performance. Using cross-validation we found little differences in prediction performance between GWAS and Immunochip, despite the GWAS array providing a 10 times larger effective genome-wide coverage. The prediction performance using Immunochip is largely due to the power of the initial GWAS for its marker selection and its low cost that enabled larger sample sizes. The predictive ability of the genomic risk score based on Immunochip was replicated in external data, with AUC of 0.75 for CD and 0.70 for UC. CD patients with higher risk scores demonstrated clinical characteristics typically associated with a more severe disease course including ileal location and earlier age at diagnosis. 
CONCLUSIONS: Our analyses demonstrate that the power of genomic risk prediction for IBD is mainly due to strongly associated SNPs with considerable effect sizes. Additional SNPs that are only tagged by high-density GWAS arrays, and low-frequency or rare variants over-represented in the high-density regions on the Immunochip, contribute little to prediction accuracy. Although a quantitative assessment of IBD risk for an individual is not currently possible, we show that genomic risk scores have sufficient power to stratify IBD risk among individuals at diagnosis.
Subject(s)
Colitis, Ulcerative/genetics , Crohn Disease/genetics , Genetic Predisposition to Disease , Genotype , Risk Assessment/methods , Bayes Theorem , Case-Control Studies , Cohort Studies , Humans , Models, Genetic , Polymorphism, Single Nucleotide , Predictive Value of Tests
ABSTRACT
KEY MESSAGE: We propose a novel computational method for genomic selection that combines identical-by-state (IBS)-based Haseman-Elston (HE) regression and best linear prediction (BLP), called HE-BLP. Genomic best linear unbiased prediction (GBLUP) has been widely used in whole-genome prediction for breeding programs. To determine the total genetic variance of a training population, a linear mixed model (LMM) must be solved via restricted maximum likelihood (REML), whose computational complexity scales with the cube of the sample size. With HE-BLP, the total genetic variance can instead be estimated by solving a simple HE linear regression, whose computational complexity scales with the square of the sample size; it is therefore suitable for large-scale genomic data, except where environmental effects need to be estimated simultaneously, which the method does not allow. In Monte Carlo simulation studies, the estimated heritability based on HE was identical to that based on REML, and the prediction accuracy via HE-BLP and traditional GBLUP was also quite similar when quantitative trait loci (QTLs) were randomly distributed along the genome and their effects followed a normal distribution. In addition, the kernel row number (KRN) trait in a maize IBM population was used to evaluate the performance of the two methods; the results showed similar prediction accuracy of breeding values despite slightly different estimated heritability via HE and REML, probably due to the underlying genetic architecture. HE-BLP can be a future genomic selection method of choice for even larger sets of genomic data in certain special cases where environmental effects can be ignored.
The software for HE regression and the simulation program is available online in the Genetic Analysis Repository (GEAR; https://github.com/gc5k/GEAR/wiki).
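As a rough illustration of why the HE step only needs a pass over pairs of individuals, the sketch below estimates heritability by regressing phenotype cross-products on pairwise genetic relationships. It is a minimal sketch, not the GEAR implementation: the function names `grm` and `he_heritability` are hypothetical, the relationship matrix is built from standardized genotypes rather than the IBS scores used in the paper, and the cross-product form of HE regression is assumed.

```python
import numpy as np


def grm(genotypes):
    """Genetic relationship matrix from an (individuals x SNPs) matrix of 0/1/2 counts.

    Assumption: relationships are computed from standardized genotypes,
    not from the IBS scores described in the abstract.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0
    keep = (freq > 0) & (freq < 1)  # drop monomorphic SNPs
    g, freq = g[:, keep], freq[keep]
    z = (g - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    return z @ z.T / z.shape[1]


def he_heritability(phenotype, relationship):
    """Slope of the regression of y_i * y_j on A_ij over all pairs i < j.

    With the phenotype standardized to unit variance, this slope is an
    estimate of the heritability captured by the markers.
    """
    y = np.asarray(phenotype, dtype=float)
    y = (y - y.mean()) / y.std()
    i, j = np.triu_indices(len(y), k=1)  # each unordered pair once
    x = relationship[i, j]
    cross = y[i] * y[j]
    x_centered = x - x.mean()
    return float(x_centered @ (cross - cross.mean()) / (x_centered @ x_centered))
```

The regression itself touches each of the n(n-1)/2 pairs once, which is where the quadratic cost comes from, and no matrix inversion is needed to obtain the variance estimate.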
Subject(s)
Genomics/methods , Genetic Models , Plant Breeding , Zea mays/genetics , Computer Simulation , Genetic Variation , Likelihood Functions , Linear Models , Monte Carlo Method , Quantitative Trait Loci , Software
ABSTRACT
We have recently developed analysis methods (GREML) to estimate the genetic variance of a complex trait/disease and the genetic correlation between two complex traits/diseases using genome-wide single nucleotide polymorphism (SNP) data in unrelated individuals. Here we use analytical derivations and simulations to quantify the sampling variance of the estimate of the proportion of phenotypic variance captured by all SNPs for quantitative traits and case-control studies. We also derive the approximate sampling variance of the estimate of a genetic correlation in a bivariate analysis, when the two complex traits are measured on either the same or different individuals. We show that the sampling variance is inversely proportional to the number of pairwise contrasts in the analysis and to the variance in SNP-derived genetic relationships. For bivariate analysis, the sampling variance of the genetic correlation additionally depends on the harmonic mean of the proportion of variance explained by the SNPs for the two traits and on the genetic correlation between the traits, and it depends on the phenotypic correlation when the traits are measured on the same individuals. We provide an online tool for calculating the power to detect genetic (co)variation using genome-wide SNP data. The new theory and online tool will help in planning experimental designs to estimate the missing heritability that has not yet been fully revealed through genome-wide association studies, and to estimate the genetic overlap between complex traits (diseases), in particular when the traits (diseases) are not measured on the same samples.
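The proportionality stated above can be turned into a quick planning calculation. The sketch below is an illustration under stated assumptions, not the authors' online tool: the function name `approx_se_h2_snp` is hypothetical, the proportionality constant is assumed to be one, and the default variance of relationships (2e-5) is an assumed typical value for unrelated individuals that is not given in the abstract.

```python
import math


def approx_se_h2_snp(n_individuals, var_relationship=2e-5):
    """Approximate standard error of a SNP-heritability estimate.

    Assumptions (not stated in the abstract): proportionality constant of 1,
    so var(h2) ~= 1 / (n_pairs * var_relationship), and a default variance of
    off-diagonal genetic relationships of 2e-5.
    """
    n_pairs = n_individuals * (n_individuals - 1) / 2  # pairwise contrasts
    return math.sqrt(1.0 / (n_pairs * var_relationship))


# Doubling the sample size roughly halves the standard error.
se_10k = approx_se_h2_snp(10_000)
se_20k = approx_se_h2_snp(20_000)
```

Because the number of pairs grows with the square of the sample size, the standard error under these assumptions shrinks roughly in proportion to 1/N rather than 1/sqrt(N).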
Subject(s)
Single Nucleotide Polymorphism/genetics , Case-Control Studies , Genome-Wide Association Study/methods , Humans , Genetic Models , Phenotype , Quantitative Trait Loci/genetics , Software
ABSTRACT
As custom arrays are cheaper than generic GWAS arrays, larger sample sizes are achievable for gene discovery. Custom arrays can tag more variants through denser genotyping of SNPs at associated loci, but at the cost of losing genome-wide coverage. Balancing this trade-off is important for optimizing experimental designs. We quantified both the gain in captured SNP-heritability at known candidate regions and the loss due to imperfect genome-wide coverage for inflammatory bowel disease using immunochip (iChip) and imputed GWAS data on 61,251 and 38,550 samples, respectively. For Crohn's disease (CD), the iChip and GWAS data explained 19% and 26% of variation in liability, respectively, and SNPs in the densely genotyped iChip regions explained 13% of the SNP-heritability for both the iChip and GWAS data. For ulcerative colitis (UC), the iChip and GWAS data explained 15% and 19% of variation in liability, respectively, and the dense iChip regions explained 10% and 9% of the SNP-heritability in the iChip and the GWAS data. From bivariate analyses, estimates of the genetic correlation in risk between CD and UC were 0.75 (SE 0.017) and 0.62 (SE 0.042) for the iChip and GWAS data, respectively. We also quantified the SNP-heritability of genomic regions that did or did not contain the 163 previously reported GWAS hits for CD and UC, and the SNP-heritability of the loci shared between the densely genotyped iChip regions and the 163 GWAS hits. For both diseases, across the different genomic partitions, the densely genotyped regions on the iChip tagged at least as much variation in liability as the corresponding regions in the GWAS data; however, some of the SNP-heritability tagged in the GWAS data was lost with the iChip because of its low coverage of unselected regions. These results imply that custom arrays with a GWAS backbone will facilitate more gene discovery, both at associated and novel loci.
Subject(s)
Genetic Predisposition to Disease , Genome-Wide Association Study , Inflammatory Bowel Diseases/genetics , Inheritance Patterns/genetics , Oligonucleotide Array Sequence Analysis , Human Chromosomes/genetics , Ulcerative Colitis/genetics , Crohn Disease/genetics , Female , Gene Frequency/genetics , Humans , Male , Single Nucleotide Polymorphism/genetics , Sample Size
ABSTRACT
Identification of multifactor gene-gene (G×G) and gene-environment (G×E) interactions underlying complex traits is one of the great challenges in genetic studies today. The generalized multifactor dimensionality reduction (GMDR) method provides a practicable solution to the problem of detecting interactions. To exploit the opportunities brought by the availability of diverse data, there is a strong need for GMDR software that can handle a breadth of phenotypes, such as continuous, count, dichotomous, polytomous nominal, ordinal, survival and multivariate, and various kinds of study designs, such as unrelated case-control, family-based, and pooled unrelated and family samples, and that also allows adjustment for covariates. We developed a versatile GMDR package to implement this series of GMDR analyses for various scenarios (e.g., unified analysis of unrelated and family samples) and large-scale (e.g., genome-wide) data. The package includes other desirable features such as data management and preprocessing. Permutation testing strategies are also built in to evaluate thresholds or empirical p values. In addition, its performance scales with the available computational resources. The software is available at http://www.soph.uab.edu/ssg/software or http://ibi.zju.edu.cn/software.