ABSTRACT
Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare-variant (RV) associations with complex human diseases and traits. Variant-set analysis is a powerful approach to study RV association. However, existing methods have limited ability to analyze the noncoding genome. We propose a computationally efficient and robust noncoding RV association detection framework, STAARpipeline, to automatically annotate a whole-genome sequencing study and perform flexible noncoding RV association analysis, including gene-centric analysis and fixed window-based and dynamic window-based non-gene-centric analysis, by incorporating variant functional annotations. In gene-centric analysis, STAARpipeline uses STAAR to group noncoding variants based on functional categories of genes and to incorporate multiple functional annotations. In non-gene-centric analysis, STAARpipeline uses SCANG-STAAR to incorporate dynamic window sizes and multiple functional annotations. We apply STAARpipeline to identify noncoding RV sets associated with four lipid traits in 21,015 discovery samples from the Trans-Omics for Precision Medicine (TOPMed) program and replicate several of them in an additional 9,123 TOPMed samples. We also analyze five non-lipid TOPMed traits.
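To make the fixed window-based analysis more concrete, the following is a minimal Python sketch of how variants might be grouped into overlapping fixed-size windows before each window is tested as a variant set. The 2,000 bp window, the 1,000 bp step and the helper name `fixed_windows` are illustrative assumptions; none of them is stated in the abstract, and this is not STAARpipeline code.

```python
from bisect import bisect_left

def fixed_windows(positions, window_bp=2000, step_bp=1000):
    """Group variant positions into overlapping fixed-size windows.

    Returns a list of (start, end, variant_positions) tuples, one per
    non-empty half-open window [start, end). Window size and step are
    assumed values chosen only for illustration.
    """
    positions = sorted(positions)
    if not positions:
        return []
    windows = []
    start = positions[0]
    last = positions[-1]
    while start <= last:
        end = start + window_bp
        lo = bisect_left(positions, start)  # first variant at or after start
        hi = bisect_left(positions, end)    # first variant at or after end
        if hi > lo:
            windows.append((start, end, positions[lo:hi]))
        start += step_bp
    return windows

# Made-up variant positions on one chromosome.
variants = [100, 450, 1200, 2500, 2600, 5200]
for start, end, members in fixed_windows(variants):
    print(start, end, members)
```

Because the step is smaller than the window, neighbouring windows share variants, so a signal that straddles a window boundary is still captured whole by at least one window.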
Subject(s)
Genome-Wide Association Study , Genome , Humans , Genome-Wide Association Study/methods , Whole Genome Sequencing/methods , Phenotype , Genetic Variation
ABSTRACT
SUMMARY: Amidst the continuing spread of coronavirus disease 2019 (COVID-19), real-time data analysis and visualization remain critical for the general public to track the pandemic's impact and to inform policy making by officials. Multiple metrics permit the evaluation of the spread, infection and mortality of infectious diseases. For example, numbers of new cases and deaths provide easily interpretable measures of absolute impact within a given population and time frame, while the effective reproduction rate provides an epidemiological measure of the rate of spread. By evaluating multiple metrics concurrently, users can leverage complementary insights into the impact and current state of the pandemic when formulating prevention and safety plans for themselves and others. We describe COVID-19 Spread Mapper, a unified framework for estimating and quantifying the uncertainty in the smoothed daily effective reproduction number, case rate and death rate in a region using log-linear models. We apply this framework to characterize COVID-19 impact at multiple geographic resolutions, including by US county and state as well as by country, demonstrating the variation across resolutions and the need for harmonized efforts to control the pandemic. We provide an open-source online dashboard for real-time analysis and visualization of multiple key metrics, which are critical to evaluate the impact of COVID-19 and make informed policy decisions. AVAILABILITY AND IMPLEMENTATION: Our model and tool are publicly available as implemented in R and hosted at https://metrics.covid19-analysis.org/. The source code is freely available from https://github.com/lin-lab/COVID19-Rt and https://github.com/lin-lab/COVID19-Viz. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
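As a rough illustration of what a log-linear model of daily counts looks like, the sketch below fits log(count_t) = a + b*t by ordinary least squares and reports the implied daily growth rate and doubling time. This is a deliberately simplified stand-in, not the COVID-19 Spread Mapper model: the abstract does not give the model specification, the data are synthetic, and the growth rate estimated here is not the effective reproduction number, which additionally requires a serial-interval distribution.

```python
import math

def loglinear_growth(counts):
    """Fit log(count_t) = a + b*t by ordinary least squares.

    Returns (a, b), where b is the estimated daily exponential growth
    rate. Assumes every count is strictly positive.
    """
    n = len(counts)
    t = list(range(n))
    y = [math.log(c) for c in counts]
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b = sxy / sxx
    a = y_mean - b * t_mean
    return a, b

# Synthetic daily case counts growing 10% per day (made-up data).
daily_cases = [100 * 1.1 ** day for day in range(14)]
a, b = loglinear_growth(daily_cases)
print(f"growth rate per day: {b:.4f}")
print(f"doubling time (days): {math.log(2) / b:.1f}")
```

A production model would typically use a count likelihood (for example Poisson or negative binomial) and smoothing over time, which is also what makes uncertainty quantification possible.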
Subject(s)
COVID-19 , Humans , COVID-19/epidemiology , SARS-CoV-2 , Pandemics/prevention & control , Software
ABSTRACT
Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, uses eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy in identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance in identifying causal genes.
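One common way to run a gene-based test from GWAS summary statistics is a weighted sum of variant z-scores, standardized using the linkage disequilibrium (LD) correlation matrix. The sketch below shows that generic construction in Python; it is not the framework described in the abstract, and the z-scores, annotation weights and LD matrix are made-up values used only for illustration.

```python
import math

def gene_burden_z(z, weights, ld):
    """Combine variant z-scores into one gene-level z-score.

    z       : per-variant GWAS z-scores
    weights : annotation-derived weights (hypothetical values here)
    ld      : LD correlation matrix between the variants (list of lists)

    Under the null, sum(w_i * z_i) has variance w' R w, where R is the
    LD matrix, so dividing by its square root gives a standard normal.
    """
    numerator = sum(w * zi for w, zi in zip(weights, z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j]
        for i in range(len(z))
        for j in range(len(z))
    )
    return numerator / math.sqrt(variance)

def two_sided_p(z_score):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z_score) / math.sqrt(2))

# Made-up inputs for a gene with three variants.
z = [2.0, 1.5, -0.5]
weights = [1.0, 0.5, 0.2]
ld = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
z_gene = gene_burden_z(z, weights, ld)
print(f"gene-level z = {z_gene:.3f}, p = {two_sided_p(z_gene):.4f}")
```

Swapping in a different weight vector (for example coding-only versus tissue-specific regulatory weights) gives a different single-annotation test from the same summary statistics.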
Subject(s)
Genome-Wide Association Study/methods , Molecular Sequence Annotation/methods , Algorithms , Genome-Wide Association Study/standards , Humans , Molecular Sequence Annotation/standards , Polymorphism, Genetic , Quantitative Trait Loci , Reproducibility of Results
ABSTRACT
A key aim for current genome-wide association studies (GWAS) is to interrogate the full spectrum of genetic variation underlying human traits, including rare variants, across populations. Deep whole-genome sequencing is the gold standard to fully capture genetic variation, but remains prohibitively expensive for large sample sizes. Array genotyping interrogates a sparser set of variants, which can be used as a scaffold for genotype imputation to capture a wider set of variants. However, imputation quality depends crucially on reference panel size and genetic distance from the target population. Here, we consider sequencing a subset of GWAS participants and imputing the rest using a reference panel that includes both sequenced GWAS participants and an external reference panel. We investigate how imputation quality and GWAS power are affected by the number of participants sequenced for admixed populations (African and Latino Americans) and European population isolates (Sardinians and Finns), and identify powerful, cost-effective GWAS designs given current sequencing and array costs. For populations that are well-represented in existing reference panels, we find that array genotyping alone is cost-effective and well-powered to detect common- and rare-variant associations. For poorly represented populations, sequencing a subset of participants is often most cost-effective, and can substantially increase imputation quality and GWAS power.
Subject(s)
Genome, Human , Genome-Wide Association Study , Whole Genome Sequencing , Cost-Benefit Analysis , Gene Frequency/genetics , Genome-Wide Association Study/economics , Genotype , Humans , Phenotype , Polymorphism, Single Nucleotide/genetics , Whole Genome Sequencing/economics
ABSTRACT
MOTIVATION: Gene set enrichment analysis has been shown to be effective in identifying relevant biological pathways underlying complex diseases. Existing approaches lack the ability to quantify the enrichment levels accurately, which prevents the enrichment information from being further utilized in both upstream and downstream analyses. A modernized and rigorous approach for gene set enrichment analysis that emphasizes both hypothesis testing and enrichment estimation is much needed. RESULTS: We propose a novel computational method, Bayesian Analysis of Gene Set Enrichment (BAGSE), for gene set enrichment analysis. BAGSE is built on a Bayesian hierarchical model and fully accounts for the uncertainty embedded in the association evidence of individual genes. We adopt an empirical Bayes inference framework to fit the proposed hierarchical model by implementing an efficient EM algorithm. Through simulation studies, we illustrate that BAGSE yields accurate enrichment quantification while achieving power similar to that of the state-of-the-art methods. Further simulation studies show that BAGSE can effectively utilize the enrichment information to improve the power in gene discovery. Finally, we demonstrate the application of BAGSE in analyzing real data from a differential expression experiment and a transcriptome-wide association study. Our results indicate that the proposed statistical framework is effective in aiding the discovery of potentially causal pathways and gene networks. AVAILABILITY AND IMPLEMENTATION: BAGSE is implemented using the C++ programming language and is freely available from https://github.com/xqwen/bagse/. Simulated and real data used in this paper are also available at the GitHub repository for reproducibility purposes. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , Transcriptome , Bayes Theorem , Probability , Reproducibility of Results
ABSTRACT
PURPOSE: Erythropoietic protoporphyria (EPP), characterized by painful cutaneous photosensitivity, results from pathogenic variants in ferrochelatase (FECH). For 96% of patients, EPP results from coinheriting a rare pathogenic variant in trans of a common hypomorphic variant c.315-48T>C (minor allele frequency 0.05). The estimated prevalence of EPP derived from the number of diagnosed individuals in Europe is 0.00092%, but this may be conservative due to underdiagnosis. No study has estimated EPP prevalence using large genetic data sets. METHODS: Disease-associated FECH variants were identified in the UK Biobank, a data set of 500,953 individuals including 49,960 exome sequences. EPP prevalence was then estimated. The association of FECH variants with EPP-related traits was assessed. RESULTS: Analysis of pathogenic FECH variants in the UK Biobank provides evidence that EPP prevalence is 0.0059% (95% confidence interval [CI]: 0.0042-0.0076%), 1.7-3.0 times more common than previously thought in the UK. In homozygotes for the common c.315-48T>C FECH variant, there was a novel decrement in both erythrocyte mean corpuscular volume (MCV) and hemoglobin. CONCLUSION: The prevalence of EPP has been underestimated secondary to underdiagnosis. The common c.315-48T>C allele is associated with both MCV and hemoglobin, an association that could be important both for those with and without EPP.
Subject(s)
Protoporphyria, Erythropoietic , Biological Specimen Banks , Europe , Ferrochelatase/genetics , Humans , Mutation , Protoporphyria, Erythropoietic/diagnosis , Protoporphyria, Erythropoietic/epidemiology , Protoporphyria, Erythropoietic/genetics , United Kingdom/epidemiology
ABSTRACT
BACKGROUND: Identifying county-level characteristics associated with high coronavirus disease 2019 (COVID-19) burden can allow for data-driven, equitable allocation of public health intervention resources and reduce burdens on health care systems. METHODS: Synthesizing data from various government and nonprofit institutions for all 3142 United States (US) counties, we studied county-level characteristics that were associated with cumulative and weekly case and death rates through 12/21/2020. We used generalized linear mixed models to model cumulative and weekly (40 repeated measures per county) cases and deaths. Cumulative and weekly models included state fixed effects and county-specific random effects. Weekly models additionally allowed covariate effects to vary by season and included US Census region-specific B-splines to adjust for temporal trends. RESULTS: Rural counties, counties with more minorities and white/non-white segregation, and counties with more people with no high school diploma and with medical comorbidities were associated with higher cumulative COVID-19 case and death rates. In the spring, urban counties and counties with more minorities and white/non-white segregation were associated with increased weekly case and death rates. In the fall, rural counties were associated with larger weekly case and death rates. In the spring, summer, and fall, counties with more residents with socioeconomic disadvantage and medical comorbidities were associated with greater weekly case and death rates. CONCLUSIONS: These county-level associations are based on complete data from the entire country, come from a single modeling framework that longitudinally analyzes the US COVID-19 pandemic at the county level, and are applicable to guiding government resource allocation policies to different US counties.
Subject(s)
COVID-19 , Social Segregation , Humans , Pandemics , Rural Population , SARS-CoV-2 , United States/epidemiology
ABSTRACT
Summary: Estimating linkage disequilibrium (LD) is essential for a wide range of summary statistics-based association methods for genome-wide association studies. Large genetic datasets, e.g. the TOPMed WGS project and UK Biobank, enable more accurate and comprehensive LD estimates, but increase the computational burden of LD estimation. Here, we describe emeraLD (Efficient Methods for Estimation and Random Access of LD), a computational tool that leverages sparsity and haplotype structure to estimate LD up to 2 orders of magnitude faster than current tools. Availability and implementation: emeraLD is implemented in C++, and is open source under GPLv3. Source code and documentation are freely available at http://github.com/statgen/emeraLD. Supplementary information: Supplementary data are available at Bioinformatics online.
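To illustrate why sparsity helps LD estimation, the Python sketch below computes r² between two biallelic variants from the sets of haplotypes that carry each minor allele, so the work scales with the number of carriers rather than the total sample size. This is a generic illustration under assumed inputs, not emeraLD's implementation; the function name and the carrier sets are hypothetical.

```python
def r_squared(carriers_a, carriers_b, n_haplotypes):
    """Squared correlation (r^2) between two biallelic variants.

    carriers_a, carriers_b : sets of haplotype indices carrying the
                             minor allele at each variant
    n_haplotypes           : total number of haplotypes in the sample
    """
    p_a = len(carriers_a) / n_haplotypes
    p_b = len(carriers_b) / n_haplotypes
    # Only the carrier sets are intersected; non-carriers are never visited.
    p_ab = len(carriers_a & carriers_b) / n_haplotypes
    d = p_ab - p_a * p_b  # standard LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else float("nan")

# Made-up carrier sets in a sample of 1,000 haplotypes.
carriers_1 = {3, 17, 42, 88}
carriers_2 = {3, 17, 42, 90, 91}
print(round(r_squared(carriers_1, carriers_2, n_haplotypes=1000), 4))
```

For rare variants the carrier sets are tiny compared with the sample, which is where most of the saving over a dense genotype-vector correlation comes from.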
Subject(s)
Genome-Wide Association Study , Linkage Disequilibrium , Software , Computational Biology , Haplotypes
ABSTRACT
Meta-analysis of whole genome sequencing/whole exome sequencing (WGS/WES) studies provides an attractive solution to the problem of collecting large sample sizes for discovering rare variants associated with complex phenotypes. Existing rare variant meta-analysis approaches are not scalable to biobank-scale WGS data. Here we present MetaSTAAR, a powerful and resource-efficient rare variant meta-analysis framework for large-scale WGS/WES studies. MetaSTAAR accounts for relatedness and population structure, can analyze both quantitative and dichotomous traits and boosts the power of rare variant tests by incorporating multiple variant functional annotations. Through meta-analysis of four lipid traits in 30,138 ancestrally diverse samples from 14 studies of the Trans Omics for Precision Medicine (TOPMed) Program, we show that MetaSTAAR performs rare variant meta-analysis at scale and produces results comparable to using pooled data. Additionally, we identified several conditionally significant rare variant associations with lipid traits. We further demonstrate that MetaSTAAR is scalable to biobank-scale cohorts through meta-analysis of TOPMed WGS data and UK Biobank WES data of ~200,000 samples.
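A standard building block of score-based meta-analysis is that per-study score statistics and their null variances can be summed across studies without sharing individual-level data. The Python sketch below shows this for a single variant or a single burden score; it is a textbook construction under assumed inputs, not MetaSTAAR's algorithm, and the summary statistics are made up.

```python
import math

def meta_score_test(study_scores, study_variances):
    """Combine per-study score statistics for one variant or one burden.

    study_scores    : score statistic U_k from each study
    study_variances : variance V_k of U_k under the null, from each study
    Returns the meta-analysis z-score and its two-sided p-value.
    """
    u_total = sum(study_scores)
    v_total = sum(study_variances)
    z = u_total / math.sqrt(v_total)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Made-up summary statistics from three hypothetical studies.
scores = [4.2, 1.1, 3.0]
variances = [6.0, 2.5, 4.0]
z, p = meta_score_test(scores, variances)
print(f"meta z = {z:.3f}, p = {p:.4g}")
```

For variant-set tests the same idea extends to summing score vectors and covariance matrices, which is where storage cost becomes the bottleneck at biobank scale.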
Subject(s)
Genome-Wide Association Study , Lipids , Genome-Wide Association Study/methods , Whole Genome Sequencing/methods , Exome Sequencing , Phenotype , Lipids/genetics
ABSTRACT
Modeling infectious disease dynamics has been critical throughout the COVID-19 pandemic. Of particular interest are the incidence, prevalence, and effective reproductive number (Rt). Estimating these quantities is challenging due to under-ascertainment, unreliable reporting, and time lags between infection, onset, and testing. We propose a Multilevel Epidemic Regression Model to Account for Incomplete Data (MERMAID) to jointly estimate Rt, ascertainment rates, incidence, and prevalence over time in one or multiple regions. Specifically, MERMAID allows for a flexible regression model of Rt that can incorporate geographic and time-varying covariates. To account for under-ascertainment, we (a) model the ascertainment probability over time as a function of testing metrics and (b) jointly model data on confirmed infections and population-based serological surveys. To account for delays between infection, onset, and reporting, we model stochastic lag times as missing data, and develop an EM algorithm to estimate the model parameters. We evaluate the performance of MERMAID in simulation studies, and assess its robustness by conducting sensitivity analyses in a range of scenarios of model misspecifications. We apply the proposed method to analyze COVID-19 daily confirmed infection counts, PCR testing data, and serological survey data across the United States. Based on our model, we estimate an overall COVID-19 prevalence of 12.5% (ranging from 2.4% in Maine to 20.2% in New York) and an overall ascertainment rate of 45.5% (ranging from 22.5% in New York to 81.3% in Rhode Island) in the United States from March to December 2020. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
ABSTRACT
INTRODUCTION: Human metabolism and inflammation are closely related modulators of homeostasis and immunity. Metabolic profiling is a useful tool to understand the association between metabolism and inflammation at a systemic level. OBJECTIVE: To investigate the longitudinal associations between the concentration of plasma metabolites and biomarkers related to inflammation and oxidative stress. METHODS: We conducted a repeated cross-sectional analysis consisting of 8 short-term panels that included 88 healthy adult male welders in Massachusetts, USA. In each panel, we collected 1-6 repeated measurements of blood and urine. We used a human vascular injury panel assay and custom cytokine/chemokine assay to quantify inflammatory biomarker plasma levels, liquid chromatography-mass spectrometry to quantify the concentrations of 665 plasma metabolites, and a competitive enzyme-linked immunoassay to quantify urinary 8-OHdG and 8-isoprostane levels. We used linear mixed effects models to estimate the longitudinal association between each inflammatory and oxidative stress biomarker and each metabolite. RESULTS: At a 5% FDR threshold, we detected ≥1 metabolite association for 8 unique inflammatory and oxidative stress biomarkers: urinary 8-isoprostane, plasma C-reactive protein (CRP), serum amyloid A (SAA), intercellular adhesion molecule 1, circulating vascular cell adhesion molecule-1, interleukin 8 (IL-8), interleukin 10 (IL-10) and vascular endothelial growth factor.
Specifically, 3 metabolites in the androgenic steroids pathway were negatively associated with SAA; 3 dihydrosphingomyelins metabolites were positively associated with 1 or more of CRP, SAA, IL-8 and IL-10; 4 metabolites in acyl choline metabolism pathways were negatively associated with IL-8; 7 lysophospholipid metabolites were negatively associated with 1 or more of CRP, SAA and IL-8; 4 sphingomyelins were positively associated with CRP and/or SAA; and 10 metabolites in the xanthine pathway were positively associated with urinary 8-isoprostane. CONCLUSION: We found that metabolites in phospholipid groups had strong associations with multiple inflammatory biomarkers, especially CRP, SAA and IL-8. The mechanism of these associations warrants further investigation.
ABSTRACT
Identifying areas with high COVID-19 burden and their characteristics can help improve vaccine distribution and uptake, reduce burdens on health care systems, and allow for better allocation of public health intervention resources. Synthesizing data from various government and nonprofit institutions for 3,142 United States (US) counties as of 12/21/2020, we studied county-level characteristics that are associated with cumulative case and death rates using regression analyses. Our results showed that counties that are more rural, counties with more White/non-White segregation, and counties with higher percentages of people of color, in poverty, with no high school diploma, and with medical comorbidities such as diabetes and hypertension are associated with higher cumulative COVID-19 case and death rates. We identify the hardest-hit counties in the US using model-estimated case and death rates, which provide more reliable estimates of cumulative COVID-19 burdens than those using raw observed county-specific rates. Identification of counties with high disease burdens and understanding the characteristics of these counties can help inform policies to improve vaccine distribution, deployment and uptake, prevent overwhelming health care systems, and enhance testing access, personal protection equipment access, and other resource allocation efforts, all of which can help save more lives for vulnerable communities. SIGNIFICANCE STATEMENT: We found that counties that are more rural, counties with more White/non-White segregation, and counties with higher percentages of people of color, in poverty, with no high school diploma, and with medical comorbidities such as diabetes and hypertension are associated with higher cumulative COVID-19 case and death rates. We also identified individual counties with high cumulative COVID-19 burden.
Identification of counties with high disease burdens and understanding the characteristics of these counties can help inform policies to improve vaccine distribution, deployment and uptake, prevent overwhelming health care systems, and enhance testing access, personal protection equipment access, and other resource allocation efforts, all of which can help save more lives for vulnerable communities.
ABSTRACT
We propose a new computational framework, probabilistic transcriptome-wide association study (PTWAS), to investigate causal relationships between gene expressions and complex traits. PTWAS applies the established principles from instrumental variables analysis and takes advantage of probabilistic eQTL annotations to delineate and tackle the unique challenges arising in TWAS. PTWAS not only confers higher power than the existing methods but also provides novel functionalities to evaluate the causal assumptions and estimate tissue- or cell-type-specific gene-to-trait effects. We illustrate the power of PTWAS by analyzing the eQTL data across 49 tissues from GTEx (v8) and GWAS summary statistics from 114 complex traits.
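The instrumental variables idea behind TWAS-style causal analysis can be illustrated with the simplest single-instrument estimator, the Wald ratio: the effect of a variant on the trait divided by its effect on gene expression. The Python sketch below shows only that textbook estimator with a first-order standard error; it is not the PTWAS procedure, which the abstract describes as using probabilistic eQTL annotations, and all effect sizes are made up.

```python
import math

def wald_ratio(beta_gwas, se_gwas, beta_eqtl):
    """Single-instrument estimate of the gene-expression effect on a trait.

    beta_gwas : variant effect on the trait (from GWAS summary statistics)
    se_gwas   : standard error of beta_gwas
    beta_eqtl : variant effect on gene expression (treated as known here)
    """
    effect = beta_gwas / beta_eqtl
    # First-order standard error that ignores uncertainty in beta_eqtl.
    se = abs(se_gwas / beta_eqtl)
    z = effect / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return effect, se, p

# Made-up effect sizes for one hypothetical eQTL.
effect, se, p = wald_ratio(beta_gwas=0.06, se_gwas=0.02, beta_eqtl=0.4)
print(f"effect = {effect:.3f}, se = {se:.3f}, p = {p:.4f}")
```

The estimate is only interpretable as causal if the variant affects the trait solely through expression of that gene, which is the kind of assumption the abstract says PTWAS provides tools to evaluate.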