Results 1 - 20 of 32
1.
EBioMedicine ; 105: 105195, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38870545

ABSTRACT

BACKGROUND: Response to antipsychotic drugs (APD) varies greatly among individuals and is affected by genetic factors. This study aims to demonstrate genome-wide associations between copy number variants (CNVs) and response to APD in patients with schizophrenia. METHODS: A total of 3030 patients of Han Chinese ethnicity randomly received APD (aripiprazole, olanzapine, quetiapine, risperidone, ziprasidone, haloperidol and perphenazine) treatment for six weeks. This study is a secondary data analysis. Percentage change on the Positive and Negative Syndrome Scale (PANSS) reduction was used to assess APD efficacy, and more than 50% change was considered as APD response. Associations between CNV burden, gene set, CNV loci and CNV break-point and APD efficacy were analysed. FINDINGS: Higher CNV losses burden decreased the odds of 6-week APD response (OR = 0.66 [0.44, 0.98]). CNV losses in synaptic pathway involved in neurotransmitters were associated with 2-week PANSS reduction rate. CNV involved in sialylation (1p31.1 losses) and cellular metabolism (19q13.32 gains) associated with 6-week PANSS reduction rate at genome-wide significant level. Additional 36 CNVs associated with PANSS factors improvement. The OR of protective CNVs for 6-week APD response was 3.10 (95% CI: 1.33-7.19) and risk CNVs was 8.47 (95% CI: 1.92-37.43). CNV interacted with genetic risk score on APD efficacy (Beta = -1.53, SE = 0.66, P = 0.021). The area under curve to differ 6-week APD response attained 80.45% (95% CI: 78.07%-82.82%). INTERPRETATION: Copy number variants contributed to poor APD efficacy and synaptic pathway involved in neurotransmitter was highlighted. FUNDING: National Natural Science Foundation of China, National Key R&D Program of China, China Postdoctoral Science Foundation.
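
The response definition above (more than 50% PANSS reduction) can be illustrated with a minimal sketch. The abstract does not state whether the minimum possible PANSS total (30, from thirty items each scored 1 to 7) was subtracted before computing the percentage, so the `floor` parameter and both function names are assumptions made for illustration only.

```python
def panss_percent_reduction(baseline, endpoint, floor=0):
    """Percentage reduction in PANSS total score from baseline to endpoint.

    `floor` is subtracted from the baseline in the denominator. floor=30
    removes the minimum possible PANSS total; floor=0 uses raw scores.
    Which convention the study used is not stated, so this is an assumption.
    """
    denominator = baseline - floor
    if denominator <= 0:
        raise ValueError("baseline must be greater than floor")
    return 100.0 * (baseline - endpoint) / denominator


def is_responder(baseline, endpoint, threshold=50.0, floor=0):
    """True when the reduction is strictly greater than `threshold` percent."""
    return panss_percent_reduction(baseline, endpoint, floor) > threshold


if __name__ == "__main__":
    # Hypothetical patient: total score falls from 90 to 45 over six weeks.
    print(panss_percent_reduction(90, 45))            # raw-score convention
    print(panss_percent_reduction(90, 45, floor=30))  # floor-corrected convention
```

The two conventions can classify the same patient differently, so the choice should be checked against the full paper before reuse.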

2.
PLoS Genet ; 20(1): e1011134, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38241355

ABSTRACT

It has been well established that cancer cells can evade immune surveillance by mutating themselves. Understanding genetic alterations in cancer cells that contribute to immune regulation could lead to better immunotherapy patient stratification and identification of novel immune-oncology (IO) targets. In this report, we describe our effort of genome-wide association analyses across 22 TCGA cancer types to explore the associations between genetic alterations in cancer cells and 74 immune traits. Results showed that the tumor microenvironment (TME) is shaped by different gene mutations in different cancer types. Out of the key genes that drive multiple immune traits, top hit KEAP1 in lung adenocarcinoma (LUAD) was selected for validation. It was found that KEAP1 mutations can explain more than 10% of the variance for multiple immune traits in LUAD. Using public scRNA-seq data, further analysis confirmed that KEAP1 mutations activate the NRF2 pathway and promote a suppressive TME. The activation of the NRF2 pathway is negatively correlated with lower T cell infiltration and higher T cell exhaustion. Meanwhile, several immune check point genes, such as CD274 (PD-L1), are highly expressed in NRF2-activated cancer cells. By integrating multiple RNA-seq data, a NRF2 gene signature was curated, which predicts anti-PD1 therapy response better than CD274 gene alone in a mixed cohort of different subtypes of non-small cell lung cancer (NSCLC) including LUAD, highlighting the important role of KEAP1-NRF2 axis in shaping the TME in NSCLC. Finally, a list of overexpressed ligands in NRF2 pathway activated cancer cells were identified and could potentially be targeted for TME remodeling in LUAD.


Subject(s)
Adenocarcinoma of Lung , Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Humans , Kelch-Like ECH-Associated Protein 1/genetics , Genome-Wide Association Study , NF-E2-Related Factor 2/genetics , Lung Neoplasms/genetics , Adenocarcinoma of Lung/genetics , Tumor Microenvironment/genetics , Prognosis
3.
Am J Hum Genet ; 110(5): 762-773, 2023 05 04.
Article in English | MEDLINE | ID: mdl-37019109

ABSTRACT

The ongoing release of large-scale sequencing data in the UK Biobank allows for the identification of associations between rare variants and complex traits. SAIGE-GENE+ is a valid approach to conducting set-based association tests for quantitative and binary traits. However, for ordinal categorical phenotypes, applying SAIGE-GENE+ with treating the trait as quantitative or binarizing the trait can cause inflated type I error rates or power loss. In this study, we propose a scalable and accurate method for rare-variant association tests, POLMM-GENE, in which we used a proportional odds logistic mixed model to characterize ordinal categorical phenotypes while adjusting for sample relatedness. POLMM-GENE fully utilizes the categorical nature of phenotypes and thus can well control type I error rates while remaining powerful. In the analyses of UK Biobank 450k whole-exome-sequencing data for five ordinal categorical traits, POLMM-GENE identified 54 gene-phenotype associations.


Subject(s)
Exome , Genome-Wide Association Study , Genome-Wide Association Study/methods , Exome/genetics , Biological Specimen Banks , Phenotype , Data Analysis , United Kingdom
4.
Schizophr Bull ; 49(1): 208-217, 2023 01 03.
Article in English | MEDLINE | ID: mdl-36179110

ABSTRACT

BACKGROUND AND HYPOTHESIS: Complex schizophrenia symptoms were recently conceptualized as interactive symptoms within a network system. However, it remains unknown how a schizophrenia network changed during acute antipsychotic treatment. The present study aimed to evaluate the interactive change of schizophrenia symptoms under seven antipsychotics from individual time series. STUDY DESIGN: Data on 3030 schizophrenia patients were taken from a multicenter randomized clinical trial and used to estimate the partial correlation cross-sectional networks and longitudinal random slope networks based on multivariate multilevel model. Thirty symptoms assessed by The Positive and Negative Syndrome Scale clustered the networks. STUDY RESULTS: Five stable communities were detected in cross-sectional networks and random slope networks that describe symptoms change over time. Delusions, emotional withdrawal, and lack of spontaneity and flow of conversation featured as central symptoms, and conceptual disorganization, hostility, uncooperativeness, and difficulty in abstract thinking featured as bridge symptoms, all showing high centrality in the random slope network. Acute antipsychotic treatment changed the network structure (M-test = 0.116, P < .001) compared to baseline, and responsive subjects showed lower global strength after treatment (11.68 vs 14.18, S-test = 2.503, P < .001) compared to resistant subjects. Central symptoms and bridge symptoms kept higher centrality across random slope networks of different antipsychotics. Quetiapine treatment network showed improvement in excitement symptoms, the one featured as both central and bridge symptom. CONCLUSION: Our findings revealed the central symptoms, bridge symptoms, cochanging features, and individualized features under different antipsychotics of schizophrenia. This brings implications for future targeted drug development and search for pathophysiological mechanisms.


Subject(s)
Antipsychotic Agents , Schizophrenia , Humans , Antipsychotic Agents/pharmacology , Antipsychotic Agents/therapeutic use , Schizophrenia/drug therapy , Schizophrenia/diagnosis , Cross-Sectional Studies , Quetiapine Fumarate/therapeutic use
6.
Nat Genet ; 54(10): 1466-1469, 2022 10.
Article in English | MEDLINE | ID: mdl-36138231

ABSTRACT

Several biobanks, including UK Biobank (UKBB), are generating large-scale sequencing data. An existing method, SAIGE-GENE, performs well when testing variants with minor allele frequency (MAF) ≤ 1%, but inflation is observed in variance component set-based tests when restricting to variants with MAF ≤ 0.1% or 0.01%. Here, we propose SAIGE-GENE+ with greatly improved type I error control and computational efficiency to facilitate rare variant tests in large-scale data. We further show that incorporating multiple MAF cutoffs and functional annotations can improve power and thus uncover new gene-phenotype associations. In the analysis of UKBB whole exome sequencing data for 30 quantitative and 141 binary traits, SAIGE-GENE+ identified 551 gene-phenotype associations.
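
As a rough illustration of the multiple-MAF-cutoff idea mentioned above, the sketch below computes minor allele frequencies from genotype counts and groups variants under the three cutoffs named in the abstract (1%, 0.1%, 0.01%). It is not the SAIGE-GENE+ implementation; the function names and the input layout (one list of 0/1/2 alternate-allele counts per variant) are assumptions.

```python
def minor_allele_frequency(alt_allele_counts):
    """MAF for one variant from per-sample alternate-allele counts (0, 1 or 2).

    Assumes diploid genotypes with no missing calls.
    """
    n_samples = len(alt_allele_counts)
    if n_samples == 0:
        raise ValueError("at least one sample is required")
    alt_freq = sum(alt_allele_counts) / (2 * n_samples)
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq)


def variants_by_maf_cutoff(mafs, cutoffs=(0.01, 0.001, 0.0001)):
    """Map each cutoff to the indices of variants with MAF <= cutoff.

    The sets are nested: a variant passing a strict cutoff also passes
    every looser one.
    """
    return {c: [i for i, maf in enumerate(mafs) if maf <= c] for c in cutoffs}


if __name__ == "__main__":
    mafs = [0.005, 0.0005, 0.00005, 0.2]  # hypothetical per-variant MAFs
    for cutoff, indices in variants_by_maf_cutoff(mafs).items():
        print(cutoff, indices)
```

Each resulting variant set would then feed a separate set-based test, with the per-cutoff results combined afterwards; that combination step is outside this sketch.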


Subject(s)
Genome-Wide Association Study , Gene Frequency/genetics , Genome-Wide Association Study/methods , Phenotype , Exome Sequencing
7.
Bioinformatics ; 38(18): 4337-4343, 2022 09 15.
Article in English | MEDLINE | ID: mdl-35876838

ABSTRACT

MOTIVATION: In the genome-wide association analysis of population-based biobanks, most diseases have low prevalence, which results in low detection power. One approach to tackle the problem is using family disease history, yet existing methods are unable to address type I error inflation induced by increased correlation of phenotypes among closely related samples, as well as unbalanced phenotypic distribution. RESULTS: We propose a new method for genetic association test with family disease history, mixed-model-based Test with Adjusted Phenotype and Empirical saddlepoint approximation, which controls for increased phenotype correlation by adopting a two-variance-component mixed model, accounts for case-control imbalance by using empirical saddlepoint approximation, and is flexible to incorporate any existing adjusted phenotypes, such as phenotypes from the LT-FH method. We show through simulation studies and analysis of UK Biobank data of white British samples and the Korean Genome and Epidemiology Study of Korean samples that the proposed method is robust and yields better calibration compared to existing methods while gaining power for detection of variant-phenotype associations. AVAILABILITY AND IMPLEMENTATION: The summary statistics and code generated in this study are available at https://github.com/styvon/TAPE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Genome-Wide Association Study/methods , Case-Control Studies , Phenotype , Computer Simulation
8.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35037014

ABSTRACT

Optimal methods could effectively improve the accuracy of predicting and identifying candidate driver genes. Various computational methods based on mutational frequency, network and function approaches have been developed to identify mutation driver genes in cancer genomes. However, a comprehensive evaluation of the performance levels of network-, function- and frequency-based methods is lacking. In the present study, we assessed and compared eight performance criteria for eight network-based, one function-based and three frequency-based algorithms using eight benchmark datasets. Under different conditions, the performance of approaches varied in terms of network, measurement and sample size. The frequency-based driverMAPS and network-based HotNet2 methods showed the best overall performance. Network-based algorithms using protein-protein interaction networks outperformed the function- and the frequency-based approaches. Precision, F1 score and Matthews correlation coefficient were low for most approaches. Thus, most of these algorithms require stringent cutoffs to correctly distinguish driver and non-driver genes. We constructed a website named Cancer Driver Catalog (http://159.226.67.237/sun/cancer_driver/), wherein we integrated the gene scores predicted by the foregoing software programs. This resource provides valuable guidance for cancer researchers and clinical oncologists prioritizing cancer driver gene candidates by using an optimal tool.
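
For readers unfamiliar with the three criteria reported as low above, the following sketch computes precision, F1 score and the Matthews correlation coefficient from a confusion matrix of predicted versus benchmark driver genes. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def driver_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and MCC from confusion-matrix counts.

    tp: predicted drivers that are benchmark drivers
    fp: predicted drivers that are not in the benchmark
    fn: benchmark drivers the method missed
    tn: non-drivers correctly left unflagged
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical tool: 100 genes flagged, 40 of them true drivers,
    # 10 true drivers missed, 890 non-drivers correctly ignored.
    print(driver_metrics(tp=40, fp=60, fn=10, tn=890))
```

Because true drivers are a small minority of all genes, a lenient cutoff inflates `fp` and drags precision down even when recall is high, which is consistent with the abstract's call for stringent cutoffs.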


Subject(s)
Neoplasms , Oncogenes , Algorithms , Computational Biology/methods , Gene Regulatory Networks , Humans , Mutation , Neoplasms/genetics , Software
9.
Nucleic Acids Res ; 50(D1): D72-D82, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34792166

ABSTRACT

Rapid advances in high-throughput sequencing technologies have led to the discovery of thousands of extrachromosomal circular DNAs (eccDNAs) in the human genome. Loss-of-function experiments are difficult to conduct on circular and linear chromosomes, as they usually overlap. Hence, it is challenging to interpret the molecular functions of eccDNAs. Here, we present CircleBase (http://circlebase.maolab.org), an integrated resource and analysis platform used to curate and interpret eccDNAs in multiple cell types. CircleBase identifies putative functional eccDNAs by incorporating sequencing datasets, computational predictions, and manual annotations. It classifies them into six sections including targeting genes, epigenetic regulations, regulatory elements, chromatin accessibility, chromatin interactions, and genetic variants. The eccDNA targeting and regulatory networks are displayed by informative visualization tools and then prioritized. Functional enrichment analyses revealed that the top-ranked cancer cell eccDNAs were enriched in oncogenic pathways such as the Ras and PI3K-Akt signaling pathways. In contrast, eccDNAs from healthy individuals were not significantly enriched. CircleBase provides a user-friendly interface for searching, browsing, and analyzing eccDNAs in various cell/tissue types. Thus, it is useful to screen for potential functional eccDNAs and interpret their molecular mechanisms in human cancers and other diseases.


Subject(s)
Chromosomes/genetics , DNA, Circular/genetics , Databases, Genetic , Extrachromosomal Inheritance/genetics , Cell Lineage/genetics , Cytoplasm/genetics , Genome, Human/genetics , High-Throughput Nucleotide Sequencing , Humans
10.
Blood Adv ; 5(14): 2839-2851, 2021 07 27.
Article in English | MEDLINE | ID: mdl-34283174

ABSTRACT

Individuals with monogenic disorders can experience variable phenotypes that are influenced by genetic variation. To investigate this in sickle cell disease (SCD), we performed whole-genome sequencing (WGS) of 722 individuals with hemoglobin HbSS or HbSβ⁰-thalassemia from Baylor College of Medicine and from the St. Jude Children's Research Hospital Sickle Cell Clinical Research and Intervention Program (SCCRIP) longitudinal cohort study. We developed pipelines to identify genetic variants that modulate sickle hemoglobin polymerization in red blood cells and combined these with pain-associated variants to build a polygenic score (PGS) for acute vaso-occlusive pain (VOP). Overall, we interrogated the α-thalassemia deletion -α3.7 and 133 candidate single-nucleotide polymorphisms (SNPs) across 66 genes for associations with VOP in 327 SCCRIP participants followed longitudinally over 6 years. Twenty-one SNPs in 9 loci were associated with VOP, including 3 (BCL11A, MYB, and the β-like globin gene cluster) that regulate erythrocyte fetal hemoglobin (HbF) levels and 6 (COMT, TBC1D1, KCNJ6, FAAH, NR3C1, and IL1A) that were associated previously with various pain syndromes. An unweighted PGS integrating all 21 SNPs was associated with the VOP event rate (estimate, 0.35; standard error, 0.04; P = 5.9 × 10⁻¹⁴) and VOP event occurrence (estimate, 0.42; standard error, 0.06; P = 4.1 × 10⁻¹³). These associations were stronger than those of any single locus. Our findings provide insights into the genetic modulation of VOP in children with SCD. More generally, we demonstrate the utility of WGS for investigating genetic contributions to the variable expression of SCD-associated morbidities.
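
An unweighted polygenic score of the kind described above is, in its simplest form, a count of effect alleles across the scored SNPs with every SNP given equal weight. The sketch below shows that form only; how the study oriented each SNP (which allele was counted) is not given in the abstract, so the input coding and the function name are assumptions.

```python
def unweighted_pgs(effect_allele_counts):
    """Unweighted polygenic score: the sum of effect-allele counts.

    `effect_allele_counts` holds one value per scored SNP for a single
    individual, each 0, 1 or 2 copies of the allele assumed to raise the
    trait. Every SNP contributes with weight 1.
    """
    for count in effect_allele_counts:
        if count not in (0, 1, 2):
            raise ValueError(f"invalid allele count: {count!r}")
    return sum(effect_allele_counts)


if __name__ == "__main__":
    # Hypothetical individual genotyped at 21 SNPs.
    genotypes = [0, 1, 2, 1, 0, 0, 1, 2, 1, 1, 0, 0, 2, 1, 0, 1, 1, 0, 2, 0, 1]
    print(unweighted_pgs(genotypes))
```

With 21 SNPs the score ranges from 0 to 42. A weighted score would instead multiply each count by a per-SNP effect estimate before summing.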


Subject(s)
Anemia, Sickle Cell , Fetal Hemoglobin , Anemia, Sickle Cell/complications , Anemia, Sickle Cell/genetics , Child , Fetal Hemoglobin/genetics , Humans , Longitudinal Studies , Pain , Polymorphism, Single Nucleotide
11.
Front Genet ; 12: 682638, 2021.
Article in English | MEDLINE | ID: mdl-34211504

ABSTRACT

With the advances in genotyping technologies and electronic health records (EHRs), large biobanks have been great resources to identify novel genetic associations and gene-environment interactions on a genome-wide and even a phenome-wide scale. To date, several phenome-wide association studies (PheWAS) have been performed on biobank data, which provides comprehensive insights into many aspects of human genetics and biology. Although inspiring, PheWAS on large-scale biobank data encounters new challenges including computational burden, unbalanced phenotypic distribution, and genetic relationship. In this paper, we first discuss these new challenges and their potential impact on data analysis. Then, we summarize approaches that are scalable and robust in GWAS and PheWAS. This review can serve as a practical guide for geneticists, epidemiologists, and other medical researchers to identify genetic variations associated with health-related phenotypes in large-scale biobank data analysis. Meanwhile, it can also help statisticians to gain a comprehensive and up-to-date understanding of the current technical tool development.

12.
Am J Hum Genet ; 108(5): 825-839, 2021 05 06.
Article in English | MEDLINE | ID: mdl-33836139

ABSTRACT

In genome-wide association studies, ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, because of the lack of analysis tools, methods designed for binary or quantitative traits are commonly used inappropriately to analyze categorical phenotypes. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, proportional odds logistic mixed model (POLMM). POLMM is computationally efficient to analyze large datasets with hundreds of thousands of samples, can control type I error rates at a stringent significance level regardless of the phenotypic distribution, and is more powerful than alternative methods. In contrast, the standard linear mixed model approaches cannot control type I error rates for rare variants when the phenotypic distribution is unbalanced, although they performed well when testing common variants. We applied POLMM to 258 ordinal categorical phenotypes on array genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which, 424 variants (7.2%) are rare variants with MAF < 0.01.
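
The proportional odds logistic model named above ties an ordinal outcome to covariates through cumulative probabilities. The sketch below covers the fixed-effects part only and assumes the parameterization logit P(Y <= j) = cutpoint_j - eta, where eta is the linear predictor; the random effect that POLMM adds for sample relatedness is omitted, and all names are illustrative.

```python
import math


def _expit(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(cutpoints, eta):
    """P(Y = j) for j = 1..J under a proportional odds model.

    Assumes logit P(Y <= j) = cutpoints[j-1] - eta, with J - 1 strictly
    increasing cutpoints shared by all individuals. A larger eta shifts
    probability mass toward higher categories.
    """
    if any(b <= a for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    # Cumulative probabilities, ending with P(Y <= J) = 1.
    cumulative = [_expit(c - eta) for c in cutpoints] + [1.0]
    probabilities = []
    previous = 0.0
    for cum in cumulative:
        probabilities.append(cum - previous)
        previous = cum
    return probabilities


if __name__ == "__main__":
    # Hypothetical three-level trait with cutpoints at -1 and 1.
    print(category_probabilities([-1.0, 1.0], eta=0.0))
    print(category_probabilities([-1.0, 1.0], eta=2.0))
```

Treating the same trait as quantitative would assume equal spacing between categories, and binarizing it would discard the ordering within each half, which is the loss of information the abstract warns about.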


Subject(s)
Computer Simulation , Genome-Wide Association Study , Models, Genetic , Phenotype , Biological Specimen Banks , Child , Female , Humans , Male , Research Design , United Kingdom
13.
Blood ; 137(2): 155-167, 2021 01 14.
Article in English | MEDLINE | ID: mdl-33156908

ABSTRACT

The histone mark H3K27me3 and its reader/writer polycomb repressive complex 2 (PRC2) mediate widespread transcriptional repression in stem and progenitor cells. Mechanisms that regulate this activity are critical for hematopoietic development but are poorly understood. Here we show that the E3 ubiquitin ligase F-box only protein 11 (FBXO11) relieves PRC2-mediated repression during erythroid maturation by targeting its newly identified substrate bromo adjacent homology domain-containing 1 (BAHD1), an H3K27me3 reader that recruits transcriptional corepressors. Erythroblasts lacking FBXO11 are developmentally delayed, with reduced expression of maturation-associated genes, most of which harbor bivalent histone marks at their promoters. In FBXO11-/- erythroblasts, these gene promoters bind BAHD1 and fail to recruit the erythroid transcription factor GATA1. The BAHD1 complex interacts physically with PRC2, and depletion of either component restores FBXO11-deficient erythroid gene expression. Our studies identify BAHD1 as a novel effector of PRC2-mediated repression and reveal how a single E3 ubiquitin ligase eliminates PRC2 repression at many developmentally poised bivalent genes during erythropoiesis.


Subject(s)
Chromosomal Proteins, Non-Histone/metabolism , Erythropoiesis/physiology , F-Box Proteins/metabolism , Gene Expression Regulation/physiology , Polycomb Repressive Complex 2/metabolism , Protein-Arginine N-Methyltransferases/metabolism , Cell Line , Erythroblasts/metabolism , Humans , Proteolysis
14.
Am J Hum Genet ; 107(2): 222-233, 2020 08 06.
Article in English | MEDLINE | ID: mdl-32589924

ABSTRACT

With increasing biobanking efforts connecting electronic health records and national registries to germline genetics, the time-to-event data analysis has attracted increasing attention in the genetics studies of human diseases. In time-to-event data analysis, the Cox proportional hazards (PH) regression model is one of the most used approaches. However, existing methods and tools are not scalable when analyzing a large biobank with hundreds of thousands of samples and endpoints, and they are not accurate when testing low-frequency and rare variants. Here, we propose a scalable and accurate method, SPACox (a saddlepoint approximation implementation based on the Cox PH regression model), that is applicable for genome-wide scale time-to-event data analysis. SPACox requires fitting a Cox PH regression model only once across the genome-wide analysis and then uses a saddlepoint approximation (SPA) to calibrate the test statistics. Simulation studies show that SPACox is 76-252 times faster than other existing alternatives, such as gwasurvivr, 185-511 times faster than the standard Wald test, and more than 6,000 times faster than the Firth correction and can control type I error rates at the genome-wide significance level regardless of minor allele frequencies. Through the analysis of UK Biobank inpatient data of 282,871 white British European ancestry samples, we show that SPACox can efficiently analyze large sample sizes and accurately control type I error rates. We identified 611 loci associated with time-to-event phenotypes of 12 common diseases, of which 38 loci would be missed within a logistic regression framework with a binary phenotype defined as event occurrence status during the follow-up period.


Subject(s)
Genome-Wide Association Study/methods , Biological Specimen Banks , Case-Control Studies , Data Analysis , Gene Frequency/genetics , Humans , Logistic Models , Phenotype , Proportional Hazards Models , Sample Size , United Kingdom , White People/genetics
15.
Nat Genet ; 52(6): 634-639, 2020 06.
Article in English | MEDLINE | ID: mdl-32424355

ABSTRACT

With very large sample sizes, biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, region-based multiple-variant aggregate tests are commonly used to increase power for association tests. However, because of the substantial computational cost, existing region-based tests cannot analyze hundreds of thousands of samples while accounting for confounders such as population stratification and sample relatedness. Here we propose a scalable generalized mixed-model region-based association test, SAIGE-GENE, that is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples and can account for unbalanced case-control ratios for binary traits. Through extensive simulation studies and analysis of the HUNT study with 69,716 Norwegian samples and the UK Biobank data with 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large-sample data (N > 400,000) with type I error rates well controlled.


Subject(s)
Biological Specimen Banks/statistics & numerical data , Case-Control Studies , Exome , Linear Models , Genetic Markers , Humans , Lipoproteins, HDL/genetics , Models, Genetic , Multifactorial Inheritance , Norway , United Kingdom , Waist-Hip Ratio
17.
Neuron ; 105(6): 975-991.e7, 2020 03 18.
Article in English | MEDLINE | ID: mdl-31926610

ABSTRACT

Alzheimer's disease (AD) displays a long asymptomatic stage before dementia. We characterize AD stage-associated molecular networks by profiling 14,513 proteins and 34,173 phosphosites in the human brain with mass spectrometry, highlighting 173 protein changes in 17 pathways. The altered proteins are validated in two independent cohorts, showing partial RNA dependency. Comparisons of brain tissue and cerebrospinal fluid proteomes reveal biomarker candidates. Combining with 5xFAD mouse analysis, we determine 15 Aβ-correlated proteins (e.g., MDK, NTN1, SMOC1, SLIT2, and HTRA1). 5xFAD shows a proteomic signature similar to symptomatic AD but exhibits activation of autophagy and interferon response and lacks human-specific deleterious events, such as downregulation of neurotrophic factors and synaptic proteins. Multi-omics integration prioritizes AD-related molecules and pathways, including amyloid cascade, inflammation, complement, WNT signaling, TGF-β and BMP signaling, lipid metabolism, iron homeostasis, and membrane transport. Some Aβ-correlated proteins are colocalized with amyloid plaques. Thus, the multilayer omics approach identifies protein networks during AD progression.


Subject(s)
Alzheimer Disease/metabolism , Brain/metabolism , Disease Progression , Metabolic Networks and Pathways , Proteome/metabolism , Proteomics , Aged , Aged, 80 and over , Animals , Biomarkers/metabolism , Female , Humans , Male , Mice , Mice, Mutant Strains , Middle Aged , Phosphoproteins/metabolism
18.
Biostatistics ; 21(1): 33-49, 2020 01 01.
Article in English | MEDLINE | ID: mdl-30007308

ABSTRACT

It has been well acknowledged that methods for secondary trait (ST) association analyses under a case-control design (ST$_{\text{CC}}$) should carefully consider the sampling process to avoid biased risk estimates. A similar situation also exists in the extreme phenotype sequencing (EPS) designs, which is to select subjects with extreme values of continuous primary phenotype for sequencing. EPS designs are commonly used in modern epidemiological and clinical studies such as the well-known National Heart, Lung, and Blood Institute Exome Sequencing Project. Although naïve generalized regression or ST$_{\text{CC}}$ method could be applied, their validity is questionable due to difference in statistical designs. Herein, we propose a general prospective likelihood framework to perform association testing for binary and continuous STs under EPS designs (STEPS), which can also incorporate covariates and interaction terms. We provide a computationally efficient and robust algorithm to obtain the maximum likelihood estimates. We also present two empirical mathematical formulas for power/sample size calculations to facilitate planning of binary/continuous STs association analyses under EPS designs. Extensive simulations and application to a genome-wide association study of benign ethnic neutropenia under an EPS design demonstrate the superiority of STEPS over all its alternatives above.


Subject(s)
Genetic Association Studies/methods , Models, Theoretical , Computer Simulation , Humans , Likelihood Functions , Phenotype
19.
Stat Methods Med Res ; 29(2): 466-480, 2020 02.
Article in English | MEDLINE | ID: mdl-30945605

ABSTRACT

In epidemiology cohort studies, exposure data are collected in sub-studies based on a primary outcome (PO) of interest, as with the extreme-value sampling design (EVSD), to investigate their correlation. Secondary outcomes (SOs) data are also readily available, enabling researchers to assess the correlations between the exposure and the SOs. However, when the EVSD is used, the data for SOs are not representative samples of a general population; thus, many commonly used statistical methods, such as the generalized linear model (GLM), are not valid. A prospective likelihood method has been developed to associate SOs with single-nucleotide polymorphisms under an extreme phenotype sequencing design. In this paper, we describe the application of the prospective likelihood method (STEVSD) to exposure-SO association analysis under an EVSD. We undertook extensive simulations to assess the performance of the STEVSD method in associating binary and continuous exposures with SOs, comparing it to the simple GLM method that ignores the EVSD. To demonstrate the cost-benefit of the STEVSD method, we also mimicked the design of two new retrospective studies, as would be done in actual practice, based on the PO of interest, which was the same as the SO in the EVSD study. We then analyzed these data by using the GLM method and compared its power to that of the STEVSD method. We demonstrated the usefulness of the STEVSD method by applying it to a benign ethnic neutropenia dataset. Our results indicate that the STEVSD method can control type I error well, whereas the GLM method cannot do so owing to its ignorance of EVSD, and that the STEVSD method is cost-effective because it has statistical power similar to that of two new retrospective studies that require collecting new exposure data for selected individuals.


Subject(s)
Cost-Benefit Analysis , Molecular Epidemiology/statistics & numerical data , Outcome Assessment, Health Care/statistics & numerical data , Bias , Data Interpretation, Statistical , Humans , Likelihood Functions , Polymorphism, Single Nucleotide , Retrospective Studies , Sampling Studies
20.
Am J Hum Genet ; 106(1): 3-12, 2020 01 02.
Article in English | MEDLINE | ID: mdl-31866045

ABSTRACT

In biobank data analysis, most binary phenotypes have unbalanced case-control ratios, and this can cause inflation of type I error rates. Recently, a saddle point approximation (SPA) based single-variant test has been developed to provide an accurate and scalable method to test for associations of such phenotypes. For gene- or region-based multiple-variant tests, a few methods exist that can adjust for unbalanced case-control ratios; however, these methods are either less accurate when case-control ratios are extremely unbalanced or not scalable for large data analyses. To address these problems, we propose SKAT- and SKAT-O- type region-based tests; in these tests, the single-variant score statistic is calibrated based on SPA and efficient resampling (ER). Through simulation studies, we show that the proposed method provides well-calibrated p values. In contrast, when the case-control ratio is 1:99, the unadjusted approach has greatly inflated type I error rates (90 times that of exome-wide sequencing α = 2.5 × 10⁻⁶). Additionally, the proposed method has similar computation time to the unadjusted approaches and is scalable for large sample data. In our application, the UK Biobank whole-exome sequence data analysis of 45,596 unrelated European samples and 791 PheCode phenotypes identified 10 rare-variant associations with p value < 10⁻⁷, including the associations between JAK2 and myeloproliferative disease, HOXB13 and cancer of prostate, and F11 and congenital coagulation defects. All analysis summary results are publicly available through a web-based visual server, and this availability can help facilitate the identification of the genetic basis of complex diseases.
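
The exome-wide significance level quoted above matches a Bonferroni correction of a 0.05 family-wise error rate over roughly 20,000 genes. The abstract reports only the resulting α, so the 0.05 and 20,000 inputs below are the conventional values and are an assumption here, as is the function name.

```python
def bonferroni_alpha(family_wise_alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return family_wise_alpha / n_tests


if __name__ == "__main__":
    # Assumed inputs: 0.05 family-wise error rate, about 20,000 gene-based tests.
    alpha = bonferroni_alpha(0.05, 20_000)
    print(alpha)

    # "90 times" the nominal level, as reported for the unadjusted approach
    # at a 1:99 case-control ratio, expressed as an empirical type I error rate.
    print(90 * alpha)
```

Read this way, the unadjusted test would reject a true null at about 2.25 in 10,000 tests rather than 2.5 in a million.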


Subject(s)
Biological Specimen Banks , Exome Sequencing/methods , Exome/genetics , Genome-Wide Association Study , Phenomics , Polymorphism, Single Nucleotide , Case-Control Studies , Computer Simulation , Humans , Numerical Analysis, Computer-Assisted , Phenotype , United Kingdom