ABSTRACT
Large-scale, multi-ethnic whole-genome sequencing (WGS) studies, such as the National Human Genome Research Institute Genome Sequencing Program's Centers for Common Disease Genomics (CCDG), play an important role in increasing diversity for genetic research. Before performing association analyses, assessing Hardy-Weinberg equilibrium (HWE) is a crucial step in quality control procedures to remove low quality variants and ensure valid downstream analyses. Diverse WGS studies contain ancestrally heterogeneous samples; however, commonly used HWE methods assume that the samples are homogeneous. Therefore, directly applying these to the whole dataset can yield statistically invalid results. To account for this heterogeneity, HWE can be tested on subsets of samples that have genetically homogeneous ancestries and the results aggregated at each variant. To facilitate valid HWE subset testing, we developed a semi-supervised learning approach that predicts homogeneous ancestries based on the genotype. This method provides a convenient tool for estimating HWE in the presence of population structure and missing self-reported race and ethnicities in diverse WGS studies. In addition, assessing HWE within the homogeneous ancestries provides reliable HWE estimates that will directly benefit downstream analyses, including association analyses in WGS studies. We applied our proposed method on the CCDG dataset, predicting homogeneous genetic ancestry groups for 60,545 multi-ethnic WGS samples to assess HWE within each group.
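The effect of population structure on HWE testing can be illustrated with a small sketch. It uses the textbook one-degree-of-freedom chi-square HWE test, not necessarily the test used in the study, and the group names and genotype counts are invented for illustration. Each hypothetical ancestry group is in equilibrium on its own, yet the pooled sample shows a strong apparent departure.

```python
import math

def hwe_chisq(n_aa, n_ab, n_bb):
    """Textbook chi-square HWE test (1 df) from genotype counts.

    Returns (statistic, p_value). Illustrative only; exact tests are
    usually preferred for rare variants.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical genotype counts (AA, AB, BB) in two homogeneous groups
groups = {"group_A": (81, 18, 1), "group_B": (1, 18, 81)}

per_group = {name: hwe_chisq(*counts) for name, counts in groups.items()}
pooled_counts = tuple(sum(c[i] for c in groups.values()) for i in range(3))
pooled = hwe_chisq(*pooled_counts)

for name, (stat, p_value) in per_group.items():
    print(name, round(stat, 4), p_value)
print("pooled", round(pooled[0], 4), pooled[1])
```

How the per-group results are combined at each variant is not specified in the abstract; any rule layered on top of this sketch (for example, flagging a variant that fails in any group) would be an assumption.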
Subjects
Supervised Machine Learning, Whole Genome Sequencing, Humans, Whole Genome Sequencing/methods, Human Genome, Population Genetics/methods, Ethnicity/genetics, Genome-Wide Association Study/methods, Single Nucleotide Polymorphism, Genotype
ABSTRACT
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, encompassing issues related to computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, with 15 times faster computation and half the memory of the current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
Subjects
Genome-Wide Association Study, Multifactorial Inheritance, Humans, Multifactorial Inheritance/genetics, Genome-Wide Association Study/methods, Machine Learning, Genetic Predisposition to Disease, Single Nucleotide Polymorphism
ABSTRACT
Inflammation biomarkers can provide valuable insight into the role of inflammatory processes in many diseases and conditions. Sequencing-based analyses of such biomarkers can also serve as an exemplar of the genetic architecture of quantitative traits. To evaluate the biological insight that can be provided by a multi-ancestry, whole-genome based association study, we performed a comprehensive analysis of 21 inflammation biomarkers from up to 38,465 individuals with whole-genome sequencing from the Trans-Omics for Precision Medicine (TOPMed) program (with varying sample size by trait, where the minimum sample size was n = 737 for MMP-1). We identified 22 distinct single-variant associations across 6 traits (E-selectin, intercellular adhesion molecule 1, interleukin-6, lipoprotein-associated phospholipase A2 activity and mass, and P-selectin) that remained significant after conditioning on previously identified associations for these inflammatory biomarkers. We further expanded upon known biomarker associations by pairing the single-variant analysis with a rare variant set-based analysis that further identified 19 significant rare variant set-based associations with 5 traits. These signals were distinct from both significant single variant association signals within TOPMed and genetic signals observed in prior studies, demonstrating the complementary value of performing both single and rare variant analyses when analyzing quantitative traits. We also confirm several previously reported signals from semi-quantitative proteomics platforms. Many of these signals demonstrate the extensive allelic heterogeneity and ancestry-differentiated variant-trait associations common for inflammation biomarkers, a characteristic we hypothesize will be increasingly observed with well-powered, large-scale analyses of complex traits.
Subjects
Biomarkers, Genome-Wide Association Study, Inflammation, Precision Medicine, Whole Genome Sequencing, Humans, Precision Medicine/methods, Inflammation/genetics, Genome-Wide Association Study/methods, Whole Genome Sequencing/methods, Single Nucleotide Polymorphism, Quantitative Trait Loci, Genetic Predisposition to Disease, Female, Interleukin-6/genetics
ABSTRACT
As countries in the world review interventions for containing the pandemic of coronavirus disease 2019 (COVID-19), important lessons can be drawn from the study of the full transmission dynamics of its causative agent, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), in Wuhan (China), where vigorous non-pharmaceutical interventions have suppressed the local outbreak of this disease [1]. Here we use a modelling approach to reconstruct the full-spectrum dynamics of COVID-19 in Wuhan between 1 January and 8 March 2020 across 5 periods defined by events and interventions, on the basis of 32,583 laboratory-confirmed cases [1]. Accounting for presymptomatic infectiousness [2], time-varying ascertainment rates, transmission rates and population movements [3], we identify two key features of the outbreak: high covertness and high transmissibility. We estimate that 87% (lower bound, 53%) of the infections before 8 March 2020 were unascertained (potentially including asymptomatic and mildly symptomatic individuals), and a basic reproduction number (R0) of 3.54 (95% credible interval 3.40-3.67) in the early outbreak, much higher than that of severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) [4,5]. We observe that multipronged interventions had considerable positive effects on controlling the outbreak, decreasing the reproduction number to 0.28 (95% credible interval 0.23-0.33) and, by projection, reducing the total infections in Wuhan by 96.0% as of 8 March 2020. We also explore the probability of resurgence following the lifting of all interventions after 14 consecutive days of no ascertained infections; we estimate this probability at 0.32 and 0.06 on the basis of models with 87% and 53% unascertained cases, respectively, highlighting the risk posed by substantial covert infections when changing control measures. These results have important implications when considering strategies of continuing surveillance and interventions to eventually contain outbreaks of COVID-19.
Subjects
Coronavirus Infections/transmission, Biological Models, Viral Pneumonia/transmission, COVID-19, China/epidemiology, Coronavirus Infections/epidemiology, Coronavirus Infections/prevention & control, Epidemiological Monitoring, Female, Humans, Male, Pandemics/prevention & control, Viral Pneumonia/epidemiology, Viral Pneumonia/prevention & control, Reproducibility of Results, Stochastic Processes
ABSTRACT
The US global leadership in science and technology has greatly benefitted from immigrants from other countries, most notably from China in the recent decades. However, feeling the pressure of potential federal investigations since the 2018 launch of the China Initiative, scientists of Chinese descent in the United States now face higher incentives to leave the United States and lower incentives to apply for federal grants. Analyzing data pertaining to institutional affiliations of more than 200 million scientific papers, we find a steady increase in the return migration of scientists of Chinese descent from the United States to China. We also conducted a survey of scientists of Chinese descent employed by US universities in tenured or tenure-track positions (n = 1,304), with results revealing general feelings of fear and anxiety that lead them to consider leaving the United States and/or stop applying for federal grants. If the situation is not corrected, American science will likely suffer the loss of scientific talent to China and other countries.
ABSTRACT
Attempts to identify and prioritize functional DNA elements in coding and non-coding regions, particularly through use of in silico functional annotation data, continue to increase in popularity. However, specific functional roles can vary widely from one variant to another, making it challenging to summarize different aspects of variant function with a one-dimensional rating. Here we propose multi-dimensional annotation-class integrative estimation (MACIE), an unsupervised multivariate mixed-model framework capable of integrating annotations of diverse origin to assess multi-dimensional functional roles for both coding and non-coding variants. Unlike existing one-dimensional scoring methods, MACIE views variant functionality as a composite attribute encompassing multiple characteristics and estimates the joint posterior functional probabilities of each genomic position. This estimate offers more comprehensive and interpretable information in the presence of multiple aspects of functionality. Applied to a variety of independent coding and non-coding datasets, MACIE demonstrates powerful and robust performance in discriminating between functional and non-functional variants. We also show an application of MACIE to fine-mapping and heritability enrichment analysis by using the lipids GWAS summary statistics data from the European Network for Genetic and Genomic Epidemiology Consortium.
Subjects
Human Genome, Genome-Wide Association Study, Human Genome/genetics, Genome-Wide Association Study/methods, Genomics, Humans, Molecular Sequence Annotation, Single Nucleotide Polymorphism/genetics, Probability
ABSTRACT
Large biobank-scale whole genome sequencing (WGS) studies are rapidly identifying a multitude of coding and non-coding variants. They provide an unprecedented resource for illuminating the genetic basis of human diseases. Variant functional annotations play a critical role in WGS analysis, result interpretation, and prioritization of disease- or trait-associated causal variants. Existing functional annotation databases have limited scope to perform online queries and functionally annotate the genotype data of large biobank-scale WGS studies. We develop the Functional Annotation of Variants Online Resources (FAVOR) to meet these pressing needs. FAVOR provides a comprehensive multi-faceted variant functional annotation online portal that summarizes and visualizes findings of all possible nine billion single nucleotide variants (SNVs) across the genome. It allows for rapid variant-, gene- and region-level queries of variant functional annotations. FAVOR integrates variant functional information from multiple sources to describe the functional characteristics of variants and facilitates prioritizing plausible causal variants influencing human phenotypes. Furthermore, we provide a scalable annotation tool, FAVORannotator, to functionally annotate large-scale WGS studies and efficiently store the genotype and their variant functional annotation data in a single file using the annotated Genomic Data Structure (aGDS) format, making downstream analysis more convenient. FAVOR and FAVORannotator are available at https://favor.genohub.org.
Subjects
Human Genome, Software, Humans, Molecular Sequence Annotation, Genomics, Genotype, Genetic Variation
ABSTRACT
Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (φ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates φ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admixture events, and varying marker densities, and achieves higher accuracy compared to KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
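As a rough illustration of how the kinship coefficient and the zero-IBD-sharing probability relate to detected IBD segments, the sketch below applies the standard identities φ = k1/4 + k2/2 and π0 = 1 - k1 - k2, where k1 and k2 are the genome fractions shared IBD1 and IBD2. This is the unadjusted textbook calculation, not RAFFI's data-driven estimator; the function name, segment lengths, and genome length are hypothetical.

```python
def relatedness_from_ibd(ibd1_segments_cm, ibd2_segments_cm, genome_length_cm):
    """Naive kinship (phi) and zero-IBD probability (pi0) from segment lengths.

    ibd1_segments_cm / ibd2_segments_cm: lengths (in cM) of segments where a
    pair shares one or two haplotypes IBD. No correction for phasing or
    genotyping errors is attempted here.
    """
    k1 = sum(ibd1_segments_cm) / genome_length_cm
    k2 = sum(ibd2_segments_cm) / genome_length_cm
    pi0 = 1.0 - k1 - k2
    phi = k1 / 4.0 + k2 / 2.0
    return phi, pi0

# Hypothetical pair resembling full siblings: k1 = 0.5, k2 = 0.25
phi, pi0 = relatedness_from_ibd([1000.0, 800.0], [900.0], 3600.0)
print(phi, pi0)
```

Because only detected segments enter the sums, missed or fragmented segments bias φ downward, which is the kind of error a quality-aware adjustment is meant to address.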
Subjects
Genome-Wide Association Study/statistics & numerical data, Genotyping Techniques/statistics & numerical data, Genetic Models, Biological Specimen Banks, Human Genome/genetics, Haplotypes/genetics, Humans, Pedigree, Single Nucleotide Polymorphism/genetics
ABSTRACT
Whole-genome sequencing (WGS) can improve assessment of low-frequency and rare variants, particularly in non-European populations that have been underrepresented in existing genomic studies. The genetic determinants of C-reactive protein (CRP), a biomarker of chronic inflammation, have been extensively studied, with existing genome-wide association studies (GWASs) conducted in >200,000 individuals of European ancestry. In order to discover novel loci associated with CRP levels, we examined a multi-ancestry population (n = 23,279) with WGS (~38× coverage) from the Trans-Omics for Precision Medicine (TOPMed) program. We found evidence for eight distinct associations at the CRP locus, including two variants that have not been identified previously (rs11265259 and rs181704186), both of which are non-coding and more common in individuals of African ancestry (~10% and ~1% minor allele frequency, respectively, and rare or monomorphic in 1000 Genomes populations of East Asian, South Asian, and European ancestry). We show that the minor (G) allele of rs181704186 is associated with lower CRP levels and decreased transcriptional activity and protein binding in vitro, providing a plausible molecular mechanism for this African ancestry-specific signal. The individuals homozygous for rs181704186-G have a mean CRP level of 0.23 mg/L, in contrast to individuals heterozygous for rs181704186 with mean CRP of 2.97 mg/L and major allele homozygotes with mean CRP of 4.11 mg/L. This study demonstrates the utility of WGS in multi-ethnic populations to drive discovery of complex trait associations of large effect and to identify functional alleles in noncoding regulatory regions.
Subjects
Asian People/genetics, Black People/genetics, C-Reactive Protein/genetics, Genetic Predisposition to Disease, Single Nucleotide Polymorphism, White People/genetics, Whole Genome Sequencing/methods, Cohort Studies, Gene Frequency, Genome-Wide Association Study, Humans, Linkage Disequilibrium
ABSTRACT
Set-based association tests are widely popular in genetic association settings for their ability to aggregate weak signals and reduce multiple testing burdens. In particular, a class of set-based tests including the Higher Criticism, Berk-Jones, and other statistics have recently been popularized for reaching a so-called detection boundary when signals are rare and weak. Such tests have been applied in two subtly different settings: (a) associating a genetic variant set with a single phenotype and (b) associating a single genetic variant with a phenotype set. A significant issue in practice is the choice of test, especially when deciding between innovated and generalized type methods for detection boundary tests. Conflicting guidance is present in the literature. This work describes how correlation structures generate marked differences in relative operating characteristics for settings (a) and (b). The implications for study design are significant. We also develop novel power bounds that facilitate the aforementioned calculations and allow for analysis of individual testing settings. In more concrete terms, our investigation is motivated by translational expression quantitative trait loci (eQTL) studies in lung cancer. These studies involve both testing for groups of variants associated with a single gene expression (multiple explanatory factors) and testing whether a single variant is associated with a group of gene expressions (multiple outcomes). Results are supported by a collection of simulation studies and illustrated through lung cancer eQTL examples.
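For readers unfamiliar with detection-boundary tests, a minimal sketch of the classical Higher Criticism statistic may help. It assumes independent p-values and therefore corresponds to neither the innovated nor the generalized variants discussed in the abstract, both of which account for correlation. The function name and the example p-values are illustrative only.

```python
import math

def higher_criticism(pvalues):
    """Classical Higher Criticism statistic for independent p-values.

    HC = max_i sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    where p_(1) <= ... <= p_(n) are the ordered p-values. Some versions
    restrict the maximum to a subset of i; this sketch uses all of them.
    """
    n = len(pvalues)
    best = -math.inf
    for i, p in enumerate(sorted(pvalues), start=1):
        if 0.0 < p < 1.0:  # skip degenerate values to avoid division by zero
            z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
            best = max(best, z)
    return best

# Evenly spread p-values (null-like) versus the same set with two tiny ones
null_like = [(i - 0.5) / 20 for i in range(1, 21)]
sparse_signal = [1e-6, 1e-6] + null_like[2:]

hc_null = higher_criticism(null_like)
hc_signal = higher_criticism(sparse_signal)
print(hc_null, hc_signal)
```

The statistic reacts strongly to a handful of very small p-values, which is why this family of tests is attractive when signals are rare and weak.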
ABSTRACT
SUMMARY: Amidst the continuing spread of coronavirus disease-19 (COVID-19), real-time data analysis and visualization remain critical for the general public to track the pandemic's impact and to inform policy making by officials. Multiple metrics permit the evaluation of the spread, infection and mortality of infectious diseases. For example, numbers of new cases and deaths provide easily interpretable measures of absolute impact within a given population and time frame, while the effective reproduction rate provides an epidemiological measure of the rate of spread. By evaluating multiple metrics concurrently, users can leverage complementary insights into the impact and current state of the pandemic when formulating prevention and safety plans for themselves and others. We describe COVID-19 Spread Mapper, a unified framework for estimating and quantifying the uncertainty in the smoothed daily effective reproduction number, case rate and death rate in a region using log-linear models. We apply this framework to characterize COVID-19 impact at multiple geographic resolutions, including by US county and state as well as by country, demonstrating the variation across resolutions and the need for harmonized efforts to control the pandemic. We provide an open-source online dashboard for real-time analysis and visualization of multiple key metrics, which are critical to evaluate the impact of COVID-19 and make informed policy decisions. AVAILABILITY AND IMPLEMENTATION: Our model and tool are publicly available as implemented in R and hosted at https://metrics.covid19-analysis.org/. The source code is freely available from https://github.com/lin-lab/COVID19-Rt and https://github.com/lin-lab/COVID19-Viz. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
COVID-19, Humans, COVID-19/epidemiology, SARS-CoV-2, Pandemics/prevention & control, Software
ABSTRACT
SUMMARY: We developed the variant-Set Test for Association using Annotation infoRmation (STAAR) workflow description language (WDL) workflow to facilitate the analysis of rare variants in whole genome sequencing association studies. The open-access STAAR workflow written in the WDL allows a user to perform rare variant testing for both gene-centric and genetic region approaches, enabling genome-wide, candidate and conditional analyses. It incorporates functional annotations into the workflow as introduced in the STAAR method in order to boost the rare variant analysis power. This tool was specifically developed and optimized to be implemented on cloud-based platforms such as BioData Catalyst Powered by Terra. It provides easy-to-use functionality for rare variant analysis that can be incorporated into an exhaustive whole genome sequencing analysis pipeline. AVAILABILITY AND IMPLEMENTATION: The workflow is freely available from https://dockstore.org/workflows/github.com/sheilagaynor/STAAR_workflow. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Cloud Computing, Software, Workflow, Genome, Genome-Wide Association Study
ABSTRACT
Sample sizes vary substantially across tissues in the Genotype-Tissue Expression (GTEx) project, where considerably fewer samples are available from certain inaccessible tissues, such as the substantia nigra (SSN), than from accessible tissues, such as blood. This severely limits power for identifying tissue-specific expression quantitative trait loci (eQTL) in undersampled tissues. Here we propose Surrogate Phenotype Regression Analysis (Spray) for leveraging information from a correlated surrogate outcome (eg, expression in blood) to improve inference on a partially missing target outcome (eg, expression in SSN). Rather than regarding the surrogate outcome as a proxy for the target outcome, Spray jointly models the target and surrogate outcomes within a bivariate regression framework. Unobserved values of either outcome are treated as missing data. We describe and implement an expectation conditional maximization algorithm for performing estimation in the presence of bilateral outcome missingness. Spray estimates the same association parameter estimated by standard eQTL mapping and controls the type I error even when the target and surrogate outcomes are truly uncorrelated. We demonstrate analytically and empirically, using simulations and GTEx data, that in comparison with marginally modeling the target outcome, jointly modeling the target and surrogate outcomes increases estimation precision and improves power.
Subjects
Algorithms, Quantitative Trait Loci, Phenotype, Regression Analysis
ABSTRACT
OBJECTIVES: The study investigated tumor burden dynamics on computed tomography (CT) scans in patients with advanced non-small-cell lung cancer (NSCLC) during first-line pembrolizumab plus chemotherapy, to provide imaging markers for overall survival (OS). METHODS: The study included 133 patients treated with first-line pembrolizumab plus platinum-doublet chemotherapy. Serial CT scans during therapy were assessed for tumor burden dynamics, which were studied for their association with OS. RESULTS: There were 67 responders, with an overall response rate of 50%. The tumor burden change at the best overall response ranged from -100.0% to +132.1% (median of -30%). Higher response rates were associated with younger age (p < 0.001) and higher programmed death-ligand 1 (PD-L1) expression levels (p = 0.01). Eighty-three patients (62%) showed tumor burden below the baseline burden throughout therapy. Using an 8-week landmark analysis, OS was longer in patients with tumor burden below the baseline burden in the first 8 weeks than in those who experienced ≥ 0% increase (median OS: 26.8 vs. 7.6 months, hazard ratio (HR): 0.36, p < 0.001). Tumor burden remaining below baseline throughout therapy was associated with significantly reduced hazards of death (HR: 0.72, p = 0.03) in the extended Cox models, after adjusting for other clinical variables. Pseudoprogression was noted in only one patient (0.8%). CONCLUSIONS: Tumor burden staying below the baseline burden throughout therapy was predictive of prolonged overall survival in patients with advanced NSCLC treated with first-line pembrolizumab plus chemotherapy, and may be used as a practical marker for therapeutic decisions in this widely used combination regimen.
CLINICAL RELEVANCE STATEMENT: The analysis of tumor burden dynamics on serial CT scans in reference to the baseline burden can provide an additional objective guide for treatment decision making in patients treated with first-line pembrolizumab plus chemotherapy for their advanced NSCLC. KEY POINTS:
• Tumor burden remaining below baseline burden during therapy predicted longer survival during first-line pembrolizumab plus chemotherapy.
• Pseudoprogression was noted in 0.8%, demonstrating the rarity of the phenomenon.
• Tumor burden dynamics may serve as an objective marker for treatment benefit to guide treatment decisions during first-line pembrolizumab plus chemotherapy.
Subjects
Non-Small-Cell Lung Carcinoma, Lung Neoplasms, Humans, Non-Small-Cell Lung Carcinoma/diagnostic imaging, Non-Small-Cell Lung Carcinoma/drug therapy, Non-Small-Cell Lung Carcinoma/metabolism, Lung Neoplasms/diagnostic imaging, Lung Neoplasms/drug therapy, Lung Neoplasms/metabolism, Humanized Monoclonal Antibodies/therapeutic use, Antineoplastic Combined Chemotherapy Protocols/therapeutic use
ABSTRACT
Rationale: Obstructive sleep apnea (OSA) is a common disorder associated with increased risk for cardiovascular disease, diabetes, and premature mortality. There is strong clinical and epidemiologic evidence supporting the importance of genetic factors influencing OSA but limited data implicating specific genes. Objectives: To search for rare variants contributing to OSA severity. Methods: Leveraging high-depth genomic sequencing data from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program and imputed genotype data from multiple population-based studies, we performed linkage analysis in the CFS (Cleveland Family Study), followed by multistage gene-based association analyses in independent cohorts for apnea-hypopnea index (AHI) in a total of 7,708 individuals of European ancestry. Measurements and Main Results: Linkage analysis in the CFS identified a suggestive linkage peak on chromosome 7q31 (LOD = 2.31). Gene-based analysis identified 21 noncoding rare variants in CAV1 (Caveolin-1) associated with lower AHI after accounting for multiple comparisons (P = 7.4 × 10⁻⁸). These noncoding variants together significantly contributed to the linkage evidence (P < 10⁻³). Follow-up analysis revealed significant associations between these variants and increased CAV1 expression, and increased CAV1 expression in peripheral monocytes was associated with lower AHI (P = 0.024) and higher minimum overnight oxygen saturation (P = 0.007). Conclusions: Rare variants in CAV1, a membrane-scaffolding protein essential in multiple cellular and metabolic functions, are associated with higher CAV1 gene expression and lower OSA severity, suggesting a novel target for modulating OSA severity.
Subjects
Obstructive Sleep Apnea, Humans, Caveolin 1/genetics, Obstructive Sleep Apnea/genetics, DNA Sequence Analysis, High-Throughput Nucleotide Sequencing
ABSTRACT
Set-based analysis that jointly tests the association of variants in a group has emerged as a popular tool for analyzing rare and low-frequency variants in sequencing studies. The existing set-based tests can suffer significant power loss when only a small proportion of variants are causal, and their powers can be sensitive to the number, effect sizes, and effect directions of the causal variants and the choices of weights. Here we propose an aggregated Cauchy association test (ACAT), a general, powerful, and computationally efficient p value combination method for boosting power in sequencing studies. First, by combining variant-level p values, we use ACAT to construct a set-based test (ACAT-V) that is particularly powerful in the presence of only a small number of causal variants in a variant set. Second, by combining different variant-set-level p values, we use ACAT to construct an omnibus test (ACAT-O) that combines the strength of multiple complementary set-based tests, including the burden test, sequence kernel association test (SKAT), and ACAT-V. Through analysis of extensively simulated data and the whole-genome sequencing data from the Atherosclerosis Risk in Communities (ARIC) study, we demonstrate that ACAT-V complements the SKAT and the burden test, and that ACAT-O has a substantially more robust and higher power than those of the alternative tests.
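The p value combination at the heart of ACAT can be sketched in a few lines. Each p value is mapped to a Cauchy variate via tan((0.5 - p)π), the variates are averaged with weights, and the combined p value is read from the standard Cauchy tail. The function name and the equal default weights are choices made for this illustration, and the numerical safeguards that production software needs for extremely small p values are omitted.

```python
import math

def acat(pvalues, weights=None):
    """Cauchy combination of p-values (minimal sketch).

    T = sum_i w_i * tan((0.5 - p_i) * pi), with weights normalised to sum to 1.
    Combined p-value = 0.5 - arctan(T) / pi (standard Cauchy tail approximation).
    Inputs must lie strictly between 0 and 1.
    """
    if weights is None:
        weights = [1.0] * len(pvalues)
    total = sum(weights)
    t = sum((w / total) * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(t) / math.pi

# Hypothetical variant-level p-values: one strong signal among null variants
combined = acat([1e-6, 0.5, 0.8, 0.9])
print(combined)
```

Because the tangent transform grows very quickly as p approaches zero, a single small p value dominates the sum, which is consistent with the abstract's point that ACAT-V is powerful when only a few variants in a set are causal.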
Subjects
Algorithms, Disease/genetics, Genetic Association Studies/methods, Genetic Variation, Human Genome, Genetic Models, DNA Sequence Analysis/methods, Computer Simulation, Statistical Data Interpretation, Humans
ABSTRACT
Whole-genome sequencing (WGS) studies are being widely conducted in order to identify rare variants associated with human diseases and disease-related traits. Classical single-marker association analyses for rare variants have limited power, and variant-set-based analyses are commonly used by researchers for analyzing rare variants. However, existing variant-set-based approaches need to pre-specify genetic regions for analysis; hence, they are not directly applicable to WGS data because of the large number of intergenic and intron regions that consist of a massive number of non-coding variants. The commonly used sliding-window method requires the pre-specification of fixed window sizes, which are often unknown a priori, are difficult to specify in practice, and are subject to limitations given that the sizes of genetic-association regions are likely to vary across the genome and phenotypes. We propose a computationally efficient and dynamic scan-statistic method (Scan the Genome [SCANG]) for analyzing WGS data; this method flexibly detects the sizes and the locations of rare-variant association regions without the need to specify a prior, fixed window size. The proposed method controls for the genome-wise type I error rate and accounts for the linkage disequilibrium among genetic variants. It allows the detected sizes of rare-variant association regions to vary across the genome. Through extensive simulated studies that consider a wide variety of scenarios, we show that SCANG substantially outperforms several alternative methods for detecting rare-variant-associations while controlling for the genome-wise type I error rates. We illustrate SCANG by analyzing the WGS lipids data from the Atherosclerosis Risk in Communities (ARIC) study.
Subjects
Algorithms, Computational Biology/methods, Genetic Variation, Human Genome, Genome-Wide Association Study, Whole Genome Sequencing/methods, Humans, Linkage Disequilibrium, Genetic Models
ABSTRACT
With advances in whole-genome sequencing (WGS) technology, more advanced statistical methods for testing genetic association with rare variants are being developed. Methods in which variants are grouped for analysis are also known as variant-set, gene-based, and aggregate unit tests. The burden test and sequence kernel association test (SKAT) are two widely used variant-set tests, which were originally developed for samples of unrelated individuals and later have been extended to family data with known pedigree structures. However, computationally efficient and powerful variant-set tests are needed to make analyses tractable in large-scale WGS studies with complex study samples. In this paper, we propose the variant-set mixed model association tests (SMMAT) for continuous and binary traits using the generalized linear mixed model framework. These tests can be applied to large-scale WGS studies involving samples with population structure and relatedness, such as in the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program. SMMATs share the same null model for different variant sets, and a virtue of this null model, which includes covariates only, is that it needs to be fit only once for all tests in each genome-wide analysis. Simulation studies show that all the proposed SMMATs correctly control type I error rates for both continuous and binary traits in the presence of population structure and relatedness. We also illustrate our tests in a real data example of analysis of plasma fibrinogen levels in the TOPMed program (n = 23,763), using the Analysis Commons, a cloud-based computing platform.
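The contrast between the burden test and SKAT mentioned above can be made concrete with a toy calculation. The sketch assumes unrelated samples and an intercept-only null model, so it is not the mixed-model version proposed in the abstract; the data, the equal weights, and the function names are invented, and p-value computation is left out. Residuals from the null model are computed once and reused for every variant, mirroring the "fit the null model once" idea.

```python
def score_statistics(genotypes, phenotype):
    """Per-variant score statistics under an intercept-only null model.

    genotypes: list of variants, each a list of allele dosages per sample.
    Residuals from the null model are computed once and shared by all variants.
    """
    mean_y = sum(phenotype) / len(phenotype)
    residuals = [y - mean_y for y in phenotype]
    return [sum(g * r for g, r in zip(variant, residuals)) for variant in genotypes]

def burden_statistic(scores, weights):
    """Burden: square of the weighted sum of scores (effects can cancel)."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2

def skat_statistic(scores, weights):
    """SKAT: weighted sum of squared scores (robust to opposite directions)."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))

# Toy data: variant 1 raises the trait, variant 2 lowers it
phenotype = [2.0, 2.0, 0.0, 0.0, 1.0, 1.0]
genotypes = [
    [1, 1, 0, 0, 0, 0],  # carriers have high trait values
    [0, 0, 1, 1, 0, 0],  # carriers have low trait values
]
weights = [1.0, 1.0]

scores = score_statistics(genotypes, phenotype)
burden = burden_statistic(scores, weights)
skat = skat_statistic(scores, weights)
print(scores, burden, skat)
```

With effects in opposite directions the burden statistic collapses to zero while the SKAT statistic does not, which is the usual motivation for considering both kinds of test.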
Subjects
Genetic Association Studies , Models, Genetic , Whole Genome Sequencing , Chromosomes, Human, Pair 4/genetics , Cloud Computing , Female , Fibrinogen/analysis , Fibrinogen/genetics , Genetics, Population , Humans , Male , National Heart, Lung, and Blood Institute (U.S.) , Precision Medicine , Research Design , Time Factors , United States
ABSTRACT
Average arterial oxyhemoglobin saturation during sleep (AvSpO2S) is a clinically relevant measure of physiological stress associated with sleep-disordered breathing, and this measure predicts incident cardiovascular disease and mortality. Using high-depth whole-genome sequencing data from the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) project and focusing on genes with linkage evidence on chromosome 8p23 [1,2], we observed that six coding and 51 noncoding variants in a gene that encodes the GTPase-activating protein (DLC1) are significantly associated with AvSpO2S and replicated in independent subjects. The combined DLC1 association evidence of discovery and replication cohorts reaches genome-wide significance in European Americans (p = 7.9 × 10⁻⁷). A risk score for these variants, built on an independent dataset, explains 0.97% of the AvSpO2S variation and contributes to the linkage evidence. The 51 noncoding variants are enriched in regulatory features in a human lung fibroblast cell line and contribute to DLC1 expression variation. Mendelian randomization analysis using these variants indicates a significant causal effect of DLC1 expression in fibroblasts on AvSpO2S. Multiple sources of information, including genetic variants, gene expression, and methylation, consistently suggest that DLC1 is a gene associated with AvSpO2S.
Subjects
Chromosomes, Human, Pair 8/genetics , GTPase-Activating Proteins/genetics , Oxyhemoglobins/genetics , Sleep/genetics , Tumor Suppressor Proteins/genetics , Genetic Linkage/genetics , Genome-Wide Association Study , Humans , Whole Genome Sequencing/methods
ABSTRACT
A common complementary strategy in Genome-Wide Association Studies (GWAS) is to perform Gene Set Analysis (GSA), which tests for the association between one phenotype of interest and an entire set of Single Nucleotide Polymorphisms (SNPs) residing in selected genes. While there exist many tools for performing GSA, popular methods often include a number of ad-hoc steps that are difficult to justify statistically, provide complicated interpretations based on permutation inference, and demonstrate poor operating characteristics. Additionally, the lack of gold standard gene set lists can produce misleading results and create difficulties in comparing analyses even across the same phenotype. We introduce the Generalized Berk-Jones (GBJ) statistic for GSA, a permutation-free parametric framework that offers asymptotic power guarantees in certain set-based testing settings. To adjust for confounding introduced by different gene set lists, we further develop a GBJ step-down inference technique that can discriminate between gene sets driven to significance by single genes and those demonstrating group-level effects. We compare GBJ to popular alternatives through simulation and re-analysis of summary statistics from a large breast cancer GWAS, and we show how GBJ can increase power by incorporating information from multiple signals in the same gene. In addition, we illustrate how breast cancer pathway analysis can be confounded by the frequency of FGFR2 in pathway lists. Our approach is further validated on two other datasets of summary statistics generated from GWAS of height and schizophrenia.
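For orientation, the classical Berk-Jones statistic that GBJ generalizes can be sketched for the simplest setting of independent tests. This is not the GBJ statistic: it makes no adjustment for correlation among SNPs and computes no p-value, and the function name `berk_jones` and the toy p-values are assumptions made for illustration. The statistic compares each of the smallest ordered p-values with its expected rank fraction through a binomial log-likelihood ratio and takes the maximum.

```python
import math


def berk_jones(p_values):
    """Classical Berk-Jones statistic for independent p-values (a sketch).

    For the j-th smallest p-value p_(j), j = 1..d/2, compute
    d * KL(j/d, p_(j)) when p_(j) < j/d, where KL is the Bernoulli
    Kullback-Leibler divergence, and return the maximum (0.0 if no
    ordered p-value falls below its rank fraction).
    """
    d = len(p_values)
    ordered = sorted(p_values)
    best = 0.0
    for j in range(1, d // 2 + 1):
        p = ordered[j - 1]
        frac = j / d
        if 0.0 < p < frac:
            kl = frac * math.log(frac / p) + (1 - frac) * math.log((1 - frac) / (1 - p))
            best = max(best, d * kl)
    return best


# Toy p-values: one set with no small values, one with a single strong signal,
# and one with two moderate signals in the same set.
null_like = berk_jones([0.30, 0.60, 0.80, 0.95])
one_signal = berk_jones([0.001, 0.60, 0.80, 0.95])
two_signals = berk_jones([0.01, 0.02, 0.80, 0.95])
print(null_like, one_signal, two_signals)
```

Because the maximum is taken over several ordered p-values rather than only the smallest, a set with multiple moderate signals can score highly even when no single SNP is extreme, which mirrors the abstract's point about incorporating information from multiple signals in the same gene.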