ABSTRACT
Large-scale, multi-ethnic whole-genome sequencing (WGS) studies, such as the National Human Genome Research Institute Genome Sequencing Program's Centers for Common Disease Genomics (CCDG), play an important role in increasing diversity for genetic research. Before performing association analyses, assessing Hardy-Weinberg equilibrium (HWE) is a crucial step in quality control procedures to remove low quality variants and ensure valid downstream analyses. Diverse WGS studies contain ancestrally heterogeneous samples; however, commonly used HWE methods assume that the samples are homogeneous. Therefore, directly applying these to the whole dataset can yield statistically invalid results. To account for this heterogeneity, HWE can be tested on subsets of samples that have genetically homogeneous ancestries and the results aggregated at each variant. To facilitate valid HWE subset testing, we developed a semi-supervised learning approach that predicts homogeneous ancestries based on the genotype. This method provides a convenient tool for estimating HWE in the presence of population structure and missing self-reported race and ethnicities in diverse WGS studies. In addition, assessing HWE within the homogeneous ancestries provides reliable HWE estimates that will directly benefit downstream analyses, including association analyses in WGS studies. We applied our proposed method on the CCDG dataset, predicting homogeneous genetic ancestry groups for 60,545 multi-ethnic WGS samples to assess HWE within each group.
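The effect of pooling ancestrally distinct samples can be illustrated with a minimal sketch. It is not the authors' method: it applies the standard one-degree-of-freedom chi-square HWE test to hypothetical genotype counts, and the function name `hwe_chisq` and the two toy groups are assumptions made for illustration. How per-group results are aggregated at each variant is not specified above, so that step is omitted.

```python
import math

def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square HWE test (1 df) from genotype counts; returns (stat, p)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: each group is in HWE on its own (allele A at 0.1 and 0.9).
group_1 = (10, 180, 810)
group_2 = (810, 180, 10)
pooled = tuple(a + b for a, b in zip(group_1, group_2))

for name, counts in [("group 1", group_1), ("group 2", group_2), ("pooled", pooled)]:
    stat, pval = hwe_chisq(*counts)
    print(f"{name}: chi2 = {stat:.3f}, p = {pval:.3g}")
```

Each group gives a statistic near zero, while the pooled counts show a strong heterozygote deficit (the Wahlund effect). This is the kind of spurious HWE failure that testing within homogeneous groups is meant to avoid.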
Subject(s)
Supervised Machine Learning , Whole Genome Sequencing , Humans , Whole Genome Sequencing/methods , Genome, Human , Genetics, Population/methods , Ethnicity/genetics , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Genotype
ABSTRACT
Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability in analyzing the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.
Subject(s)
Genome-Wide Association Study , Genome , Humans , Genome-Wide Association Study/methods , Whole Genome Sequencing/methods , Phenotype , Genetic Variation
ABSTRACT
In genome-wide association studies, ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, because of the lack of analysis tools, methods designed for binary or quantitative traits are commonly used inappropriately to analyze categorical phenotypes. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, proportional odds logistic mixed model (POLMM). POLMM is computationally efficient to analyze large datasets with hundreds of thousands of samples, can control type I error rates at a stringent significance level regardless of the phenotypic distribution, and is more powerful than alternative methods. In contrast, the standard linear mixed model approaches cannot control type I error rates for rare variants when the phenotypic distribution is unbalanced, although they performed well when testing common variants. We applied POLMM to 258 ordinal categorical phenotypes on array genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which, 424 variants (7.2%) are rare variants with MAF < 0.01.
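For readers unfamiliar with the proportional odds link, the following is a minimal sketch of the fixed-effects part only. It assumes the common parameterization P(Y <= j) = logistic(c_j - eta); the sign convention, the function name `category_probs`, and the cutpoint values are illustrative assumptions, and the random effect that makes POLMM a mixed model is omitted.

```python
import math

def category_probs(cutpoints, eta):
    """Category probabilities under a proportional odds model.

    cutpoints: increasing thresholds c_1 < ... < c_{J-1}
    eta: linear predictor (covariate and genotype effects combined)
    """
    # Cumulative probabilities P(Y <= j) = 1 / (1 + exp(-(c_j - eta))).
    cum = [1.0 / (1.0 + math.exp(-(c - eta))) for c in cutpoints] + [1.0]
    # Differences of consecutive cumulative probabilities give P(Y = j).
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

cutpoints = [-1.0, 0.5, 2.0]          # hypothetical thresholds, 4 categories
for eta in (0.0, 1.0):
    probs = category_probs(cutpoints, eta)
    print(eta, [round(p, 3) for p in probs])
```

Under this convention a larger linear predictor shifts probability toward the higher categories while the thresholds stay fixed, so one effect size describes the whole ordinal scale rather than a single binary split.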
Subject(s)
Computer Simulation , Genome-Wide Association Study , Models, Genetic , Phenotype , Biological Specimen Banks , Child , Female , Humans , Male , Research Design , United Kingdom
ABSTRACT
SUMMARY: Amidst the continuing spread of coronavirus disease-19 (COVID-19), real-time data analysis and visualization remain critical for the general public to track the pandemic's impact and for officials to inform policy making. Multiple metrics permit the evaluation of the spread, infection and mortality of infectious diseases. For example, numbers of new cases and deaths provide easily interpretable measures of absolute impact within a given population and time frame, while the effective reproduction rate provides an epidemiological measure of the rate of spread. By evaluating multiple metrics concurrently, users can leverage complementary insights into the impact and current state of the pandemic when formulating prevention and safety plans for themselves and others. We describe COVID-19 Spread Mapper, a unified framework for estimating and quantifying the uncertainty in the smoothed daily effective reproduction number, case rate and death rate in a region using log-linear models. We apply this framework to characterize COVID-19 impact at multiple geographic resolutions, including by US county and state as well as by country, demonstrating the variation across resolutions and the need for harmonized efforts to control the pandemic. We provide an open-source online dashboard for real-time analysis and visualization of multiple key metrics, which are critical to evaluate the impact of COVID-19 and make informed policy decisions. AVAILABILITY AND IMPLEMENTATION: Our model and tool are publicly available as implemented in R and hosted at https://metrics.covid19-analysis.org/. The source code is freely available from https://github.com/lin-lab/COVID19-Rt and https://github.com/lin-lab/COVID19-Viz. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
COVID-19 , Humans , COVID-19/epidemiology , SARS-CoV-2 , Pandemics/prevention & control , Software
ABSTRACT
The etiology of most complex diseases involves genetic variants, environmental factors, and gene-environment interaction (G × E) effects. Compared with marginal genetic association studies, G × E analysis requires more samples and detailed measurements of environmental exposures, which limits the possible discoveries. Large-scale population-based biobanks with detailed phenotypic and environmental information, such as UK-Biobank, can be ideal resources for identifying G × E effects. However, due to the large computation cost and the presence of case-control imbalance, existing methods often fail. Here we propose a scalable and accurate method, SPAGE (SaddlePoint Approximation implementation of G × E analysis), that is applicable for genome-wide scale phenome-wide G × E studies. SPAGE fits a genotype-independent logistic model only once across the genome-wide analysis in order to reduce computation cost, and SPAGE uses a saddlepoint approximation (SPA) to calibrate the test statistics for analysis of phenotypes with unbalanced case-control ratios. Simulation studies show that SPAGE is 33-79 times faster than the Wald test and 72-439 times faster than Firth's test, and SPAGE can control type I error rates at the genome-wide significance level even when case-control ratios are extremely unbalanced. Through the analysis of UK-Biobank data of 344,341 white British European-ancestry samples, we show that SPAGE can efficiently analyze large samples while controlling for unbalanced case-control ratios.
Subject(s)
Biological Specimen Banks , Gene-Environment Interaction , Genetic Diseases, Inborn/genetics , Genome-Wide Association Study , Polymorphism, Single Nucleotide , Quantitative Trait, Heritable , Case-Control Studies , Female , Genetic Diseases, Inborn/epidemiology , Humans , Logistic Models , Male , Phenomics , Phenotype , United Kingdom/epidemiology
ABSTRACT
MOTIVATION: Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false-positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loadings and the recently developed data augmentation, decomposition and Procrustes (ADP) transformation, such as LASER and TRACE, are popular methods for predicting PC scores. However, the predicted PC scores from SP can be biased toward NULL. On the other hand, ADP has a high computation cost because it requires running PCA separately for each study sample on the augmented dataset. RESULTS: We develop and propose two alternative approaches: bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses a computationally efficient online singular value decomposition algorithm, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation speed can be 16-16 000 times faster than ADP. We applied our approaches to the UK Biobank data of 488 366 study samples with 2492 samples from the 1000 Genomes data as the reference. AP and OADP required 0.82 and 21 CPU hours, respectively, while the projected computation time of ADP was 1628 CPU hours. Furthermore, when inferring sub-European ancestry, SP clearly showed bias, unlike the proposed approaches. AVAILABILITY AND IMPLEMENTATION: The OADP and AP methods, as well as SP and ADP, have been implemented in the open-source Python software FRAPOSA, available at github.com/daviddaiweizhang/fraposa. CONTACT: leeshawn@umich.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genome-Wide Association Study , Software , Algorithms , Principal Component Analysis
ABSTRACT
With the availability of large-scale biobanks, genome-wide scale phenome-wide association studies have become instrumental in discovering novel genetic variants associated with clinical phenotypes. As an increasing number of such association results from different biobanks becomes available, methods to meta-analyse those association results are of great interest. Because the binary phenotypes in biobank-based studies are mostly unbalanced in their case-control ratios, very few methods can provide well-calibrated tests for associations. For example, traditional Z-score-based meta-analysis often results in conservative or anticonservative Type I error rates in such unbalanced scenarios. We propose two meta-analysis strategies that can efficiently combine association results from biobank-based studies with such unbalanced phenotypes, using the saddlepoint approximation-based score test method. Our first method involves sharing the overall genotype counts from each study, and the second method involves sharing an approximation of the distribution of the score test statistic from each study using cubic Hermite splines. We compare our proposed methods with a traditional Z-score-based meta-analysis strategy using numerical simulations and real data applications, and demonstrate the superior performance of our proposed methods in terms of Type I error control.
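As a point of reference for the baseline, here is a minimal sketch of a sample-size-weighted Z-score (Stouffer) meta-analysis. The square-root-of-sample-size weights are one common choice and an assumption here, as are the function name `zscore_meta` and the toy inputs. This is the traditional approach, not either of the two proposed strategies.

```python
import math

def zscore_meta(z_scores, sample_sizes):
    """Combine per-study Z-scores with weights w_i = sqrt(n_i)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(weights, z_scores)) / math.sqrt(
        sum(w * w for w in weights)
    )
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return z, math.erfc(abs(z) / math.sqrt(2))

# Two hypothetical studies of equal size, each with Z = 2.
z, p = zscore_meta([2.0, 2.0], [100, 100])
print(f"combined Z = {z:.3f}, p = {p:.3g}")
```

The combined p-value is only as good as the assumption that each study's Z-score is approximately standard normal, which is the assumption that unbalanced case-control ratios put under strain.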
Subject(s)
Biological Specimen Banks , Genome-Wide Association Study , Computer Simulation , Genotype , Humans , Models, Genetic , Numerical Analysis, Computer-Assisted , United Kingdom
ABSTRACT
The availability of electronic health record (EHR)-based phenotypes allows for genome-wide association analyses in thousands of traits and has great potential to enable identification of genetic variants associated with clinical phenotypes. We can interpret the phenome-wide association study (PheWAS) result for a single genetic variant by observing its association across a landscape of phenotypes. Because a PheWAS can test thousands of binary phenotypes, and most of them have unbalanced or often extremely unbalanced case-control ratios (1:10 or 1:600, respectively), existing methods cannot provide an accurate and scalable way to test for associations. Here, we propose a computationally fast score-test-based method that estimates the distribution of the test statistic by using the saddlepoint approximation. Our method is much (~100 times) faster than the state-of-the-art Firth's test. It can also adjust for covariates and control type I error rates even when the case-control ratio is extremely unbalanced. Through application to PheWAS data from the Michigan Genomics Initiative, we show that the proposed method can control type I error rates while replicating previously known association signals even for traits with a very small number of cases and a large number of controls.
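The idea of approximating a score statistic's tail with a saddlepoint formula can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the score is taken as S = sum of g_i (Y_i - mu_i) with independent Bernoulli outcomes, the fitted null probabilities `mu` are treated as given, covariate adjustment of the genotypes is omitted, and the Barndorff-Nielsen form of the tail formula is used. The function name `spa_upper_tail`, the bisection bracket, and the toy data are all hypothetical.

```python
import math

def spa_upper_tail(s, g, mu):
    """Approximate P(S > s) for S = sum_i g_i * (Y_i - mu_i), Y_i ~ Bernoulli(mu_i).

    Only meaningful for s strictly inside the range of S and reachable
    within the fixed bisection bracket below.
    """
    mean_term = sum(gi * mi for gi, mi in zip(g, mu))

    def k0(t):   # cumulant generating function K(t)
        return sum(math.log(1 - mi + mi * math.exp(gi * t))
                   for gi, mi in zip(g, mu)) - t * mean_term

    def k1(t):   # first derivative K'(t)
        return sum(gi * mi * math.exp(gi * t) / (1 - mi + mi * math.exp(gi * t))
                   for gi, mi in zip(g, mu)) - mean_term

    def k2(t):   # second derivative K''(t)
        return sum(gi * gi * mi * (1 - mi) * math.exp(gi * t)
                   / (1 - mi + mi * math.exp(gi * t)) ** 2
                   for gi, mi in zip(g, mu))

    # K' is increasing, so solve K'(t) = s by bisection on a fixed bracket.
    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if k1(mid) < s:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)

    if abs(t_hat) < 1e-6:   # formula is unstable at the mean; use the normal tail
        return 0.5 * math.erfc(s / math.sqrt(2 * k2(0.0)))

    w = math.copysign(math.sqrt(2 * (t_hat * s - k0(t_hat))), t_hat)
    v = t_hat * math.sqrt(k2(t_hat))
    z = w + math.log(v / w) / w
    return 0.5 * math.erfc(z / math.sqrt(2))   # 1 - Phi(z)

g = [1.0] * 1000      # hypothetical genotypes (all 1, so the exact tail is binomial)
mu = [0.01] * 1000    # hypothetical fitted null probabilities (about 1% cases)
s = 25 - sum(mu)      # observed score: 25 cases versus 10 expected
p_spa = spa_upper_tail(s, g, mu)
p_normal = 0.5 * math.erfc(s / math.sqrt(2 * sum(m * (1 - m) for m in mu)))
print(f"saddlepoint: {p_spa:.3g}, normal: {p_normal:.3g}")
```

In this toy setting the normal approximation returns a much smaller tail probability than the saddlepoint value, which hints at why normal-based score tests can be miscalibrated when cases are rare.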
Subject(s)
Algorithms , Genome-Wide Association Study , Computer Simulation , Gene Frequency/genetics , Genomics , Humans , Numerical Analysis, Computer-Assisted , Phenotype , Polymorphism, Single Nucleotide/genetics , Reproducibility of Results , Time Factors
ABSTRACT
The KRAS mutation is the most common oncogenic driver in patients with non-small cell lung cancer (NSCLC). However, how self-reported race and/or ethnicity (SIRE), genetically inferred ancestry (GIA), and their interaction affect KRAS mutation remains largely unknown. Here, we investigated the associations between SIRE, quantitative GIA, and KRAS mutation and its allele-specific subtypes in a multi-ethnic cohort of 3,918 patients from the Boston Lung Cancer Survival cohort and the Chinese OrigiMed cohort with an independent validation cohort of 1,450 patients with NSCLC. This comprehensive analysis included detailed covariates such as age at diagnosis, sex, clinical stage, cancer histology, and smoking status. We report that SIRE is significantly associated with KRAS mutations, modified by sex, with SIRE-Asian patients showing lower rates of KRAS mutation, transversion substitution, and the allele-specific subtype KRAS G12C compared to SIRE-White patients after adjusting for potential confounders. Moreover, GIA was found to correlate with KRAS mutations, where patients with a higher proportion of European ancestry had an increased risk of KRAS mutations, especially more transition substitutions and KRAS G12D. Notably, among SIRE-White patients, an increase in European ancestry was linked to a higher likelihood of KRAS mutations, whereas an increase in admixed American ancestry was associated with a reduced likelihood, suggesting that quantitative GIA offers additional information beyond SIRE. The association of SIRE, GIA, and their interplay with KRAS driver mutations in NSCLC highlights the importance of incorporating both into population-based cancer research, aiming to refine clinical decision-making processes and mitigate health disparities.
Subject(s)
Alleles , Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Mutation , Proto-Oncogene Proteins p21(ras) , Humans , Carcinoma, Non-Small-Cell Lung/genetics , Carcinoma, Non-Small-Cell Lung/ethnology , Carcinoma, Non-Small-Cell Lung/pathology , Proto-Oncogene Proteins p21(ras)/genetics , Lung Neoplasms/genetics , Lung Neoplasms/ethnology , Lung Neoplasms/pathology , Male , Female , Middle Aged , Aged , Prevalence , Ethnicity/genetics , Racial Groups/genetics , Genetic Predisposition to Disease
ABSTRACT
Meta-analysis of whole genome sequencing/whole exome sequencing (WGS/WES) studies provides an attractive solution to the problem of collecting large sample sizes for discovering rare variants associated with complex phenotypes. Existing rare variant meta-analysis approaches are not scalable to biobank-scale WGS data. Here we present MetaSTAAR, a powerful and resource-efficient rare variant meta-analysis framework for large-scale WGS/WES studies. MetaSTAAR accounts for relatedness and population structure, can analyze both quantitative and dichotomous traits and boosts the power of rare variant tests by incorporating multiple variant functional annotations. Through meta-analysis of four lipid traits in 30,138 ancestrally diverse samples from 14 studies of the Trans Omics for Precision Medicine (TOPMed) Program, we show that MetaSTAAR performs rare variant meta-analysis at scale and produces results comparable to using pooled data. Additionally, we identified several conditionally significant rare variant associations with lipid traits. We further demonstrate that MetaSTAAR is scalable to biobank-scale cohorts through meta-analysis of TOPMed WGS data and UK Biobank WES data of ~200,000 samples.
Subject(s)
Genome-Wide Association Study , Lipids , Genome-Wide Association Study/methods , Whole Genome Sequencing/methods , Exome Sequencing , Phenotype , Lipids/genetics
ABSTRACT
Large scale screening is a critical tool in the life sciences, but is often limited by reagents, samples, or cost. An important recent example is the challenge of achieving widespread COVID-19 testing in the face of substantial resource constraints. To tackle this challenge, screening methods must efficiently use testing resources. However, given the global nature of the pandemic, they must also be simple (to aid implementation) and flexible (to be tailored for each setting). Here we propose HYPER, a group testing method based on hypergraph factorization. We provide theoretical characterizations under a general statistical model, and carefully evaluate HYPER with alternatives proposed for COVID-19 under realistic simulations of epidemic spread and viral kinetics. We find that HYPER matches or outperforms the alternatives across a broad range of testing-constrained environments, while also being simpler and more flexible. We provide an online tool to aid lab implementation: http://hyper.covid19-analysis.org .
Subject(s)
COVID-19 , COVID-19 Testing , Humans , Mass Screening , Pandemics/prevention & control , SARS-CoV-2
ABSTRACT
With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We propose an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrate the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and presented the genome-wide scale phenome-wide association results with the PheWeb browser.
Subject(s)
Biological Specimen Banks , Frailty , Frailty/genetics , Genome-Wide Association Study/methods , Humans , Phenomics , Phenotype
ABSTRACT
Modeling infectious disease dynamics has been critical throughout the COVID-19 pandemic. Of particular interest are the incidence, prevalence, and effective reproductive number (Rt). Estimating these quantities is challenging due to under-ascertainment, unreliable reporting, and time lags between infection, onset, and testing. We propose a Multilevel Epidemic Regression Model to Account for Incomplete Data (MERMAID) to jointly estimate Rt, ascertainment rates, incidence, and prevalence over time in one or multiple regions. Specifically, MERMAID allows for a flexible regression model of Rt that can incorporate geographic and time-varying covariates. To account for under-ascertainment, we (a) model the ascertainment probability over time as a function of testing metrics and (b) jointly model data on confirmed infections and population-based serological surveys. To account for delays between infection, onset, and reporting, we model stochastic lag times as missing data, and develop an EM algorithm to estimate the model parameters. We evaluate the performance of MERMAID in simulation studies, and assess its robustness by conducting sensitivity analyses in a range of scenarios of model misspecifications. We apply the proposed method to analyze COVID-19 daily confirmed infection counts, PCR testing data, and serological survey data across the United States. Based on our model, we estimate an overall COVID-19 prevalence of 12.5% (ranging from 2.4% in Maine to 20.2% in New York) and an overall ascertainment rate of 45.5% (ranging from 22.5% in New York to 81.3% in Rhode Island) in the United States from March to December 2020. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
ABSTRACT
Large-scale whole-genome sequencing studies have enabled the analysis of rare variants (RVs) associated with complex phenotypes. Commonly used RV association tests have limited scope to leverage variant functions. We propose STAAR (variant-set test for association using annotation information), a scalable and powerful RV association test method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. For the latter, we introduce 'annotation principal components', multidimensional summaries of in silico variant annotations. STAAR accounts for population structure and relatedness and is scalable for analyzing very large cohort and biobank whole-genome sequencing studies of continuous and dichotomous traits. We applied STAAR to identify RVs associated with four lipid traits in 12,316 discovery and 17,822 replication samples from the Trans-Omics for Precision Medicine Program. We discovered and replicated new RV associations, including disruptive missense RVs of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.
Subject(s)
Genetic Predisposition to Disease/genetics , Genetic Variation/genetics , Genome/genetics , Cholesterol, LDL/genetics , Computer Simulation , Genome-Wide Association Study/methods , Humans , Models, Genetic , Molecular Sequence Annotation/methods , Phenotype , Whole Genome Sequencing/methods
ABSTRACT
With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigate the asymptotic behavior of PCA under the generalized spiked population model. Based on our theoretical results, we propose a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we show that our methods can greatly reduce bias and improve prediction accuracy.
ABSTRACT
In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-the-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes in large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for > 1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
Subject(s)
Genome-Wide Association Study/methods , Case-Control Studies , Computer Simulation , Humans , Linear Models , Logistic Models , Models, Genetic , Phenotype , Polymorphism, Single Nucleotide
ABSTRACT
To identify genetic variation underlying atrial fibrillation, the most common cardiac arrhythmia, we performed a genome-wide association study of >1,000,000 people, including 60,620 atrial fibrillation cases and 970,216 controls. We identified 142 independent risk variants at 111 loci and prioritized 151 functional candidate genes likely to be involved in atrial fibrillation. Many of the identified risk variants fall near genes where more deleterious mutations have been reported to cause serious heart defects in humans (GATA4, MYH6, NKX2-5, PITX2, TBX5)1, or near genes important for striated muscle function and integrity (for example, CFL2, MYH7, PKP2, RBM20, SGCG, SSPN). Pathway and functional enrichment analyses also suggested that many of the putative atrial fibrillation genes act via cardiac structural remodeling, potentially in the form of an 'atrial cardiomyopathy'2, either during fetal heart development or as a response to stress in the adult heart.