ABSTRACT
We describe a novel single sample gene set testing method for cancer transcriptomics data named tissue-adjusted pathway analysis of cancer (TPAC). The TPAC method leverages information about the normal tissue-specificity of human genes to compute a robust multivariate distance score that quantifies gene set dysregulation in each profiled tumor. Because the null distribution of the TPAC scores has an accurate gamma approximation, both population and sample-level inference is supported. As we demonstrate through an analysis of gene expression data for 21 solid human cancers from The Cancer Genome Atlas (TCGA) and associated normal tissue expression data from the Human Protein Atlas (HPA), TPAC gene set scores are more strongly associated with patient prognosis than the scores generated by existing single sample gene set testing methods.
Subjects
Neoplasms, Humans, Neoplasms/genetics, Gene Expression Profiling/methods
ABSTRACT
Summary: Doublets are usually considered an unwanted artifact of single-cell RNA-sequencing (scRNA-seq) and are only identified in datasets for the sake of removal. However, if cells have a juxtacrine interaction with one another in situ and maintain this association through an scRNA-seq processing pipeline that only partially dissociates the tissue, these doublets can provide meaningful biological information regarding the intercellular signals and processes occurring in the analyzed tissue. This is especially true for cases such as the immune compartment of the tumor microenvironment, where the frequency and the type of immune cell juxtacrine interactions can be a prognostic indicator. We developed Cell type-specific Interaction Analysis using Doublets in scRNA-seq (CIcADA) as a pipeline for identifying and analyzing biologically meaningful doublets in scRNA-seq data. CIcADA identifies putative doublets using multi-label cell type scores and characterizes interaction dynamics through a comparison against synthetic doublets of the same cell type composition. In performing CIcADA on several scRNA-seq tumor datasets, we found that the identified doublets were consistently upregulating expression of immune response genes. Availability and implementation: An R package implementing the CIcADA method is in development and will be released on CRAN, but for now it is available at https://github.com/schiebout/CAMML.
ABSTRACT
Motivation: Doublets are usually considered an unwanted artifact of single-cell RNA-sequencing (scRNA-seq) and are only identified in datasets for the sake of removal. However, if cells have a juxtacrine attachment to one another in situ and maintain this association through an scRNA-seq processing pipeline that only partially dissociates the tissue, these doublets can provide meaningful biological information regarding the interactions and cell processes occurring in the analyzed tissue. This is especially true for cases such as the immune compartment of the tumor microenvironment, where the frequency and type of immune cell juxtacrine interactions can be a prognostic indicator. Results: We developed Cell type-specific Interaction Analysis using Doublets in scRNA-seq (CIcADA) as a pipeline for identifying and analyzing biological doublets in scRNA-seq data. CIcADA identifies putative doublets using multi-label cell type scores and characterizes interaction dynamics through a comparison against synthetic doublets of the same cell type composition. In performing CIcADA on several scRNA-seq tumor datasets, we found that the identified doublets were consistently upregulating expression of immune response genes. Contact: Courtney.T.Schiebout.GR@Dartmouth.edu , Hildreth.R.Frost@Dartmouth.edu.
ABSTRACT
Chimeric-antigen receptor (CAR) T-cell therapy has shown remarkable efficacy against hematologic tumors. Yet, CAR T-cell therapy has had little success against solid tumors due to obstacles presented by the tumor microenvironment (TME) of these cancers. Here, we show that CAR T cells armored with the engineered IL-2 superkine Super2 and IL-33 were able to promote tumor control as a single-agent therapy. IFNγ and perforin were dispensable for the effects of Super2- and IL-33-armored CAR T cells. Super2 and IL-33 synergized to shift leukocyte proportions in the TME and to recruit and activate a broad repertoire of endogenous innate and adaptive immune cells including tumor-specific T cells. However, depletion of CD8+ T cells or NK cells did not disrupt tumor control, suggesting that broad immune activation compensated for loss of individual cell subsets. Thus, we have shown that Super2 and IL-33 CAR T cells can promote antitumor immunity in multiple solid tumor models and can potentially overcome antigen loss, highlighting the potential of this universal CAR T-cell platform for the treatment of solid tumors.
Subjects
Neoplasms, Tumor Microenvironment, Humans, Adoptive Immunotherapy, Interleukin-2, Interleukin-33
ABSTRACT
Inferring the cell types in single-cell RNA-sequencing (scRNA-seq) data is of particular importance for understanding the potential cellular mechanisms and phenotypes occurring in complex tissues, such as the tumor-immune microenvironment (TME). The sparsity and noise of scRNA-seq data, combined with the fact that immune cell types often occur on a continuum, make cell typing of TME scRNA-seq data a significant challenge. Several single-label cell typing methods have been put forth to address the limitations of noise and sparsity, but accounting for the often overlapped spectrum of cell types in the immune TME remains an obstacle. To address this, we developed a new scRNA-seq cell-typing method, Cell-typing using variance Adjusted Mahalanobis distances with Multi-Labeling (CAMML). CAMML leverages cell type-specific weighted gene sets to score every cell in a dataset for every potential cell type. This allows cells to be labelled either by their highest scoring cell type as a single label classification or based on a score cut-off to give multi-label classification. For single-label cell typing, CAMML performance is comparable to existing cell typing methods, SingleR and Garnett. For scenarios where cells may exhibit features of multiple cell types (e.g., undifferentiated cells), the multi-label classification supported by CAMML offers important benefits relative to the current state-of-the-art methods. By integrating data across studies, omics platforms, and species, CAMML serves as a robust and adaptable method for overcoming the challenges of scRNA-seq analysis.
Subjects
Computational Biology, Single-Cell Analysis, RNA/genetics, RNA Sequence Analysis, Exome Sequencing
ABSTRACT
BACKGROUND: Over the past decades, approaches for diagnosing and treating cancer have seen significant improvement. However, the variability of patient and tumor characteristics has limited progress on methods for prognosis prediction. The development of high-throughput omics technologies now provides multiple approaches for characterizing tumors. Although a large number of published studies have focused on integration of multi-omics data and use of pathway-level models for cancer prognosis prediction, there still exists a gap of knowledge regarding the prognostic landscape across multi-omics data for multiple cancer types using both gene-level and pathway-level predictors. METHODS: In this study, we systematically evaluated three often available types of omics data (gene expression, copy number variation and somatic point mutation) covering both DNA-level and RNA-level features. We evaluated the landscape of predictive performance of these three omics modalities for 33 cancer types in the TCGA using a Lasso or Group Lasso-penalized Cox model and either gene or pathway level predictors. RESULTS: We constructed the prognostic landscape using three types of omics data for 33 cancer types on both the gene and pathway levels. Based on this landscape, we found that predictive performance is cancer type dependent and we also highlighted the cancer types and omics modalities that support the most accurate prognostic models. In general, models estimated on gene expression data provide the best predictive performance on either gene or pathway level and adding copy number variation or somatic point mutation data to gene expression data does not improve predictive performance, with some exceptional cohorts including low grade glioma and thyroid cancer. In general, pathway-level models have better interpretative performance, higher stability and smaller model size across multiple cancer types and omics data types relative to gene-level models. 
CONCLUSIONS: Based on this landscape and comprehensive comparison, models estimated on gene expression data provide the best predictive performance on either gene or pathway level. Pathway-level models have better interpretative performance, higher stability and smaller model size relative to gene-level models.
Subjects
DNA Copy Number Variations, Gene Expression Profiling/methods, Gene Expression, Neoplasms/genetics, Point Mutation, Cohort Studies, Genetic Databases, Humans, Neoplasms/mortality, Neoplasms/pathology, Predictive Value of Tests, Prognosis, Proportional Hazards Models
ABSTRACT
The genetic alterations that underlie cancer development are highly tissue-specific with the majority of driving alterations occurring in only a few cancer types and with alterations common to multiple cancer types often showing a tissue-specific functional impact. This tissue-specificity means that the biology of normal tissues carries important information regarding the pathophysiology of the associated cancers, information that can be leveraged to improve the power and accuracy of cancer genomic analyses. Research exploring the use of normal tissue data for the analysis of cancer genomics has primarily focused on the paired analysis of tumor and adjacent normal samples. Efforts to leverage the general characteristics of normal tissue for cancer analysis have received less attention, with most investigations focusing on understanding the tissue-specific factors that lead to individual genomic alterations or dysregulated pathways within a single cancer type. To address this gap and support scenarios where adjacent normal tissue samples are not available, we explored the genome-wide association between the transcriptomes of 21 solid human cancers and their associated normal tissues as profiled in healthy individuals. While the average gene expression profiles of normal and cancerous tissue may appear distinct, with normal tissues more similar to other normal tissues than to the associated cancer types, when transformed into relative expression values, i.e., the ratio of expression in one tissue or cancer relative to the mean in other tissues or cancers, the close association between gene activity in normal tissues and related cancers is revealed.
As we demonstrate through an analysis of tumor data from The Cancer Genome Atlas and normal tissue data from the Human Protein Atlas, this association between tissue-specific and cancer-specific expression values can be leveraged to improve the prognostic modeling of cancer, the comparative analysis of different cancer types, and the analysis of cancer and normal tissue pairs.
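The relative expression transform mentioned in this abstract (expression in one tissue or cancer divided by the mean in the others) can be illustrated with a minimal sketch. The function name `relative_expression` and the toy matrix are hypothetical and do not come from the study; the sketch also assumes strictly positive expression values and at least two tissues per gene.

```python
def relative_expression(expr):
    """Return relative expression values for a genes-by-tissues matrix.

    Each entry is divided by the mean of the same gene's values in all
    OTHER tissues (columns). Assumes positive values and >= 2 columns.
    """
    result = []
    for row in expr:
        total = sum(row)
        n = len(row)
        rel_row = []
        for value in row:
            # Mean of this gene's expression across the remaining tissues.
            mean_others = (total - value) / (n - 1)
            rel_row.append(value / mean_others)
        result.append(rel_row)
    return result


# Hypothetical toy data: 2 genes (rows) measured in 3 tissues (columns).
toy = [
    [8.0, 2.0, 2.0],  # gene enriched in the first tissue
    [3.0, 3.0, 3.0],  # gene expressed uniformly
]
rel = relative_expression(toy)
```

In this toy case the tissue-enriched gene gets a ratio well above 1 in the first tissue, while the uniformly expressed gene gets a ratio of 1 everywhere, which is the kind of tissue-specific signal the abstract refers to.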
Subjects
Neoplasms/genetics, Computational Biology, Genetic Databases/statistics & numerical data, Female, Gene Expression, Gene Expression Profiling/statistics & numerical data, Humans, Male, Organ Specificity/genetics, Principal Component Analysis, RNA-Seq, Reference Values, Survival Analysis
ABSTRACT
A substantial fraction of patients with stage I-III colorectal adenocarcinoma (CRC) experience disease relapse after surgery with curative intent. However, biomarkers for predicting the likelihood of CRC relapse have not been fully explored. Therefore, we assessed the association between tumor infiltration by a broad array of innate and adaptive immune cell types and CRC relapse risk. We implemented a discovery-validation design including a discovery dataset from Moffitt Cancer Center (MCC; Tampa, FL) and three independent validation datasets: (1) GSE41258, (2) the Molecular Epidemiology of Colorectal Cancer (MECC) study, and (3) GSE39582. Infiltration by 22 immune cell types was inferred from tumor gene expression data, and the association between immune infiltration by each cell type and relapse-free survival was assessed using Cox proportional hazards regression. Within each of the four independent cohorts, CD4+ memory activated T cell (HR: 0.93, 95% CI: 0.90-0.96; FDR = 0.0001) infiltration was associated with longer time to disease relapse, independent of stage, microsatellite instability, and adjuvant therapy. Based on our meta-analysis across the four datasets, 10 innate and adaptive immune cell types were associated with disease relapse, of which 2 were internally validated using multiplex immunofluorescence. Moreover, immune cell type infiltration was a better predictor of disease relapse than Consensus Molecular Subtype (CMS) and other expression-based biomarkers (Immune-AIC_MCC: 238.1-238.9; CMS-AIC_MCC: 241.0). These data suggest that transcriptome-derived immune profiles are prognostic indicators of CRC relapse and quantification of both innate and adaptive immune cell types may serve as candidate biomarkers for predicting prognosis and guiding frequency and modality of disease surveillance.
Subjects
Colorectal Neoplasms, Transcriptome, Colorectal Neoplasms/genetics, Humans, Microsatellite Instability, Prognosis, Recurrence
ABSTRACT
BACKGROUND: Genomic profiling of solid human tumors by projects such as The Cancer Genome Atlas (TCGA) has provided important information regarding the somatic alterations that drive cancer progression and patient survival. Although researchers have successfully leveraged TCGA data to build prognostic models, most efforts have focused on specific cancer types and a targeted set of gene-level predictors. Less is known about the prognostic ability of pathway-level variables in a pan-cancer setting. To address these limitations, we systematically evaluated and compared the prognostic ability of somatic point mutation (SPM) and copy number variation (CNV) data, gene-level and pathway-level models for a diverse set of TCGA cancer types and predictive modeling approaches. RESULTS: We evaluated gene-level and pathway-level penalized Cox proportional hazards models using SPM and CNV data for 29 different TCGA cohorts. We measured predictive accuracy as the concordance index for predicting survival outcomes. Our comprehensive analysis suggests that the use of pathway-level predictors did not offer superior predictive power relative to gene-level models for all cancer types but had the advantages of robustness and parsimony. We identified a set of cohorts for which somatic alterations could not predict prognosis, and a unique cohort LGG, for which SPM data was more predictive than CNV data and the predictive accuracy is good for all model types. We found that the pathway-level predictors provide superior interpretative value and that there is often a serious collinearity issue for the gene-level models while pathway-level models avoided this issue. CONCLUSION: Our comprehensive analysis suggests that when using somatic alterations data for cancer prognosis prediction, pathway-level models are more interpretable, stable and parsimonious compared to gene-level models. 
Pathway-level models also avoid the issue of collinearity, which can be serious for gene-level somatic alterations. The prognostic power of somatic alterations is highly variable across different cancer types and we have identified a set of cohorts for which somatic alterations could not predict prognosis. In general, CNV data predicts prognosis better than SPM data with the exception of the LGG cohort.
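The concordance index used above as the accuracy measure can be sketched as follows. The abstract does not state which estimator was used, so this assumes the common Harrell form and, for brevity, ignores pairs with tied survival times; the function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index for right-censored survival data.

    times:       observed follow-up times
    events:      1 if the event (e.g. death) was observed, 0 if censored
    risk_scores: model output where a HIGHER score means higher risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed or censored time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event has higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the third subject is censored at time 6.
c_index = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.4, 0.5, 0.1],
)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of survival times by predicted risk, which is how the per-cohort accuracies in the abstract should be read.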
Subjects
DNA Copy Number Variations/genetics, Point Mutation/genetics, Humans, Prognosis
ABSTRACT
CD8 T cell differentiation is orchestrated by dynamic metabolic changes that direct activation, proliferation, cytotoxic function, and epigenetic changes. We report that the BTB-ZF family transcriptional repressor Zbtb20 negatively regulates CD8 T cell metabolism and memory differentiation in mice. Effector and memory CD8 T cells with conditional Zbtb20 deficiency displayed enhanced mitochondrial and glycolytic metabolism, and memory CD8 T cells had enhanced spare respiratory capacity. Furthermore, Zbtb20-deficient CD8 T cells displayed increased flexibility in the use of mitochondrial fuel sources. Phenotypic and transcriptional skewing toward the memory fate was observed during the CD8 T cell response to Listeria monocytogenes. Memory cells mounted larger secondary responses and conferred better protection following tumor challenge. These data suggest that inactivation of Zbtb20 may offer an approach to enhance metabolic activity and flexibility and improve memory CD8 T cell differentiation, useful attributes for T cells used in adoptive immunotherapy.
Subjects
Energy Metabolism/genetics, Listeriosis/immunology, Neoplasms/immunology, Cytotoxic T-Lymphocytes/immunology, Transcription Factors/metabolism, Adoptive Transfer, Animals, Cell Differentiation/genetics, Cell Differentiation/immunology, Animal Disease Models, Energy Metabolism/immunology, Neoplastic Gene Expression Regulation/immunology, Glycolysis/genetics, Glycolysis/immunology, Humans, Immunologic Memory/genetics, Listeria monocytogenes/immunology, Listeriosis/microbiology, Lymphocyte Activation, Mice, Knockout Mice, Mitochondria/metabolism, Neoplasms/genetics, Neoplasms/pathology, Neoplasms/therapy, Cytotoxic T-Lymphocytes/metabolism, Transcription Factors/genetics
ABSTRACT
BACKGROUND: Cancer prognosis prediction is valuable for patients and clinicians because it allows them to appropriately manage care. A promising direction for improving the performance and interpretation of expression-based predictive models involves the aggregation of gene-level data into biological pathways. While many studies have used pathway-level predictors for cancer survival analysis, a comprehensive comparison of pathway-level and gene-level prognostic models has not been performed. To address this gap, we characterized the performance of penalized Cox proportional hazard models built using either pathway- or gene-level predictors for the cancers profiled in The Cancer Genome Atlas (TCGA) and pathways from the Molecular Signatures Database (MSigDB). RESULTS: When analyzing TCGA data, we found that pathway-level models are more parsimonious, more robust, more computationally efficient and easier to interpret than gene-level models with similar predictive performance. For example, both pathway-level and gene-level models have an average Cox concordance index of ~0.85 for the TCGA glioma cohort; however, the gene-level model has twice as many predictors on average, the predictor composition is less stable across cross-validation folds and estimation takes 40 times as long as for the pathway-level model. When the complex correlation structure of the data is broken by permutation, the pathway-level model has greater predictive performance while still retaining superior interpretative power, robustness, parsimony and computational efficiency relative to the gene-level models. For example, the average concordance index of the pathway-level model increases to 0.88 while the gene-level model falls to 0.56 for the TCGA glioma cohort using survival times simulated from uncorrelated gene expression data.
CONCLUSION: The results of this study show that when the correlations among gene expression values are low, pathway-level analyses can yield better predictive performance, greater interpretative power, more robust models and less computational cost relative to a gene-level model. When correlations among genes are high, a pathway-level analysis provides equivalent predictive power compared to a gene-level analysis while retaining the advantages of interpretability, robustness and computational efficiency.
Subjects
Neoplasms/mortality, Cohort Studies, Gene Expression, Glioma/genetics, Glioma/mortality, Humans, Genetic Models, Neoplasms/genetics, Prognosis, Proportional Hazards Models
ABSTRACT
Approximately 20% of patients with colorectal adenocarcinoma present with metastases at the time of diagnosis, and therapies that specifically target these metastases are lacking. We present a novel approach for investigating transcriptomic differences between primary colorectal adenocarcinoma and distant metastases, which may help to identify primary tumors with high risk for future dissemination and to inform the development of metastasis-targeted therapies. To effectively compare the transcriptomes of primary colorectal adenocarcinoma and metastatic lesions at both the gene and pathway levels, we eliminated tissue specificity of the "host" organs where tumors are located and adjusted for confounders such as exposure to chemotherapy and radiation, and found that metastases were characterized by reduced epithelial-mesenchymal transition (EMT) but increased MYC target and DNA-repair pathway activities. FBN2 and MMP3 were the most differentially expressed genes between primary tumors and metastases. The two subtypes of colorectal adenocarcinoma metastases that were identified, EMT inflammatory and proliferative, were distinct from the consensus molecular subtype (CMS) 3, suggesting subtype exclusivity. In summary, this study highlights transcriptomic differences between primary tumors and colorectal adenocarcinoma metastases and delineates pathways that are activated in metastases that could be targeted in colorectal adenocarcinoma patients with metastatic disease. SIGNIFICANCE: These findings identify a colorectal adenocarcinoma metastasis-specific gene-expression signature that is free from potentially confounding background signals coming from treatment exposure and the normal host tissue that the metastasis is now situated within.
Subjects
Adenocarcinoma/genetics, Adenocarcinoma/pathology, Colorectal Neoplasms/genetics, Colorectal Neoplasms/pathology, Female, Gene Expression Profiling, Neoplastic Gene Expression Regulation, Humans, Liver Neoplasms/genetics, Liver Neoplasms/secondary, Lung Neoplasms/genetics, Lung Neoplasms/secondary, Male, Middle Aged
ABSTRACT
Renal clear cell carcinoma (RCC) is the most common type of kidney cancer and has a high propensity for metastasis. While treatment with immune checkpoint inhibitors, such as anti-PD-1, has shown modest improvements in survival for RCC, it is difficult to distinguish responders from non-responders. Attempts to elucidate the mechanisms associated with differential response to checkpoint inhibitors have been limited by small sample sizes, making it difficult to detect meaningful associations. We utilized existing large datasets from The Cancer Genome Atlas (TCGA) to first find predictors of disease aggressiveness in the tumor microenvironment (TME) and hypothesized that these same predictors may influence response to immunotherapy. We found that primary metastatic (M1-stage IV) tumors exhibit high immune infiltration and high TP53-inactivation induced senescence activity compared to non-metastatic (M0-Stage I/II) tumors. Moreover, some TME features inferred from deconvolution algorithms, which differ between M0 and M1 tumors, also influence overall survival. A focused analysis identified interactions between tumor TP53-inactivation induced senescence activity and expression of inflammatory molecules in pre-treatment RCC tumors, which predict both change in tumor size and response to checkpoint blockade therapy. We also noted that the frequency of inactivating mutations in the protein polybromo-1 (PBRM1) gene was negatively associated with TP53-inactivation induced senescence enrichment. Our findings suggest a mechanism by which tumor TP53-inactivation induced senescence can modulate the TME and thereby influence outcome from checkpoint blockade therapy.
ABSTRACT
BACKGROUND: Cancer develops when pathways controlling cell survival, cell fate or genome maintenance are disrupted by the somatic alteration of key driver genes. Understanding how pathway disruption is driven by somatic alterations is thus essential for an accurate characterization of cancer biology and identification of therapeutic targets. Unfortunately, current cancer pathway analysis methods fail to fully model the relationship between somatic alterations and pathway activity. RESULTS: To address these limitations, we developed a multi-omics method for identifying biologically important pathways and genes in human cancer. Our approach combines single-sample pathway analysis with multi-stage, lasso-penalized regression to find pathways whose gene expression can be explained largely in terms of gene-level somatic alterations in the tumor. Importantly, this method can analyze case-only data sets, does not require information regarding pathway topology and supports personalized pathway analysis using just somatic alteration data for a limited number of cancer-associated genes. The practical effectiveness of this technique is illustrated through an analysis of data from The Cancer Genome Atlas using gene sets from the Molecular Signatures Database. CONCLUSIONS: Novel insights into the pathophysiology of human cancer can be obtained from statistical models that predict expression-based pathway activity in terms of non-silent somatic mutations and copy number variation. These models enable the identification of biologically important pathways and genes and support personalized pathway analysis in cases where gene expression data is unavailable.
Subjects
Genomics/methods, Neoplasms/genetics, Humans, Neoplasms/pathology
ABSTRACT
Gene set testing is an important bioinformatics technique that addresses the challenges of power, interpretation and replication. To better support the analysis of large and highly overlapping gene set collections, researchers have recently developed a number of multiset methods that jointly evaluate all gene sets in a collection to identify a parsimonious group of functionally independent sets. Unfortunately, current multiset methods all use binary indicators for gene and gene set activity and assume that a gene is active if any containing gene set is active. This simplistic model limits performance on many types of genomic data. To address this limitation, we developed gene set Selection via LASSO Penalized Regression (SLPR), a novel mapping of multiset gene set testing to penalized multiple linear regression. The SLPR method assumes a linear relationship between continuous measures of gene activity and the activity of all gene sets in the collection. As we demonstrate via simulation studies and the analysis of TCGA data using MSigDB gene sets, the SLPR method outperforms existing multiset methods when the true biological process is well approximated by continuous activity measures and a linear association between genes and gene sets.
Subjects
Adenocarcinoma/genetics, Tumor Biomarkers/genetics, Squamous Cell Carcinoma/genetics, Neoplastic Gene Expression Regulation, Lung Neoplasms/genetics, Statistical Models, Neoplasm Proteins/genetics, Adenocarcinoma/diagnosis, Adenocarcinoma/pathology, Adenocarcinoma of Lung, Benchmarking, Squamous Cell Carcinoma/diagnosis, Squamous Cell Carcinoma/pathology, Computational Biology/statistics & numerical data, Differential Diagnosis, Humans, Lung Neoplasms/diagnosis, Lung Neoplasms/pathology, Metabolic Networks and Pathways/genetics, Biological Models, Multigene Family, Regression Analysis
ABSTRACT
BACKGROUND: Gene set testing, or pathway analysis, is a bioinformatics technique that performs statistical testing on biologically meaningful sets of genomic variables. Although originally developed for supervised analyses, i.e., to test the association between gene sets and an outcome variable, gene set testing also has important unsupervised applications, e.g., p-value weighting. For unsupervised testing, however, few effective gene set testing methods are available with support especially poor for several biologically relevant use cases. RESULTS: In this paper, we describe two new unsupervised gene set testing methods based on random matrix theory, the Marčenko-Pastur Distribution Test (MPDT) and the Tracy-Widom Test (TWT), that support both self-contained and competitive null hypotheses. For the self-contained case, we contrast our proposed tests with the classic multivariate test based on a modified likelihood ratio criterion. For the competitive case, we compare the new tests against a competitive version of the classic test and our recently developed Spectral Gene Set Enrichment (SGSE) method. Evaluation of the TWT and MPDT methods is based on both simulation studies and a weighted p-value analysis of two real gene expression data sets using gene sets drawn from MSigDB collections. CONCLUSIONS: The MPDT and TWT methods are novel and effective tools for unsupervised gene set analysis with superior statistical performance relative to existing techniques and the ability to generate biologically important results on real genomic data sets.
Subjects
Algorithms, Tumor Biomarkers/genetics, Gene Expression Profiling/methods, Leukemia/genetics, Theoretical Models, Computer Simulation, Gene Regulatory Networks, Genomics/methods, Humans
ABSTRACT
Although gene-environment (G×E) interactions play an important role in many biological systems, detecting these interactions within genome-wide data can be challenging due to the loss in statistical power incurred by multiple hypothesis correction. To address the challenge of poor power and the limitations of existing multistage methods, we recently developed a screening-testing approach for G×E interaction detection that combines elastic net penalized regression with joint estimation to support a single omnibus test for the presence of G×E interactions. In our original work on this technique, however, we did not assess type I error control or power and evaluated the method using just a single, small bladder cancer data set. In this paper, we extend the original method in two important directions and provide a more rigorous performance evaluation. First, we introduce a hierarchical false discovery rate approach to formally assess the significance of individual G×E interactions. Second, to support the analysis of truly genome-wide data sets, we incorporate a score statistic-based prescreening step to reduce the number of single nucleotide polymorphisms prior to fitting the first stage penalized regression model. To assess the statistical properties of our method, we compare the type I error rate and statistical power of our approach with competing techniques using both simple simulation designs as well as designs based on real disease architectures. Finally, we demonstrate the ability of our approach to identify biologically plausible SNP-education interactions relative to Alzheimer's disease status using genome-wide association study data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
Subjects
Gene-Environment Interaction, Genetic Models, Rheumatoid Arthritis/genetics, Rheumatoid Arthritis/pathology, Breast Neoplasms/genetics, Breast Neoplasms/pathology, Type 2 Diabetes Mellitus/genetics, Type 2 Diabetes Mellitus/pathology, Female, Genome-Wide Association Study, Humans, Linkage Disequilibrium, Single Nucleotide Polymorphism
ABSTRACT
Statistical interactions between markers of genetic variation, or gene-gene interactions, are believed to play an important role in the etiology of many multifactorial diseases and other complex phenotypes. Unfortunately, detecting gene-gene interactions is extremely challenging due to the large number of potential interactions and ambiguity regarding marker coding and interaction scale. For many data sets, there is insufficient statistical power to evaluate all candidate gene-gene interactions. In these cases, a global test for gene-gene interactions may be the best option. Global tests have much greater power relative to multiple individual interaction tests and can be used on subsets of the markers as an initial filter prior to testing for specific interactions. In this paper, we describe a novel global test for gene-gene interactions, the global epistasis test (GET), that is based on results from random matrix theory. As we show via simulation studies based on previously proposed models for common diseases including rheumatoid arthritis, type 2 diabetes, and breast cancer, our proposed GET method has superior performance characteristics relative to existing global gene-gene interaction tests. A glaucoma GWAS data set is used to demonstrate the practical utility of the GET method.
Subjects
Arthritis, Rheumatoid/genetics , Breast Neoplasms/genetics , Diabetes Mellitus, Type 2/genetics , Epistasis, Genetic/genetics , Genetic Markers/genetics , Genome-Wide Association Study , Glaucoma/genetics , Polymorphism, Single Nucleotide/genetics , Computer Simulation , Female , Humans , Models, Genetic , Phenotype
ABSTRACT
BACKGROUND: Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. RESULTS: We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. CONCLUSIONS: Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
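The combination step described above can be illustrated with a minimal Python sketch of the weighted Z-method. It is an illustration only, not the authors' implementation: the function names, the example values, and in particular the weight formula (PC variance multiplied by one minus the Tracy-Widom p-value) are assumptions, since the abstract does not state how the scaling is applied. The Tracy-Widom p-values are taken as given inputs.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method.

    Each p-value is mapped to a z-score, the z-scores are averaged with
    the given weights, and the result is mapped back to a p-value.
    """
    eps = 1e-15  # keep p-values strictly inside (0, 1) for inv_cdf
    z_scores = [
        _STD_NORMAL.inv_cdf(1.0 - min(max(p, eps), 1.0 - eps))
        for p in p_values
    ]
    denom = math.sqrt(sum(w * w for w in weights))
    if denom == 0.0:
        raise ValueError("at least one weight must be non-zero")
    z_combined = sum(w * z for w, z in zip(weights, z_scores)) / denom
    return 1.0 - _STD_NORMAL.cdf(z_combined)


def pc_weights(pc_variances, tw_p_values):
    """Hypothetical weighting: PC variance scaled by (1 - Tracy-Widom p).

    ASSUMPTION: the abstract only says the variance is "scaled by" the
    Tracy-Widom p-values; this particular formula is a guess chosen so
    that PCs with non-significant eigenvalues receive little weight.
    """
    return [v * (1.0 - p) for v, p in zip(pc_variances, tw_p_values)]


# Illustrative (made-up) inputs for one gene set across three PCs.
pc_level_p = [0.01, 0.20, 0.90]      # gene set vs. PC association p-values
pc_variances = [5.0, 2.0, 0.5]       # variance explained by each PC
tw_p = [0.001, 0.04, 0.95]           # Tracy-Widom p-value of each eigenvalue

combined_p = weighted_z(pc_level_p, pc_weights(pc_variances, tw_p))
```

Under this weighting, a PC whose eigenvalue is not significant by the Tracy-Widom test contributes almost nothing to the combined p-value, which matches the stated goal of ignoring the parts of the data most likely to represent noise.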
Subjects
Algorithms , Biomarkers, Tumor/genetics , Gene Expression Profiling/methods , Leukemia/genetics , Lymphoma, Large B-Cell, Diffuse/genetics , Microarray Analysis/methods , Cluster Analysis , Gene Regulatory Networks , Genomics/methods , Humans
ABSTRACT
Gene-environment (G × E) interactions are biologically important for a wide range of environmental exposures and clinical outcomes. Because of the large number of potential interactions in genome-wide association data, the standard approach fits one model per G × E interaction with multiple hypothesis correction (MHC) used to control the type I error rate. Although sometimes effective, using one model per candidate G × E interaction test has two important limitations: low power due to MHC and omitted variable bias. To avoid the coefficient estimation bias associated with independent models, researchers have used penalized regression methods to jointly test all main effects and interactions in a single regression model. Although penalized regression supports joint analysis of all interactions, can be used with hierarchical constraints, and offers excellent predictive performance, it cannot assess the statistical significance of G × E interactions or compute meaningful estimates of effect size. To address the challenge of low power, researchers have separately explored screening-testing, or two-stage, methods in which the set of potential G × E interactions is first filtered and then tested for interactions with MHC only applied to the tests actually performed in the second stage. Although two-stage methods are statistically valid and effective at improving power, they still test multiple separate models and so are impacted by MHC and biased coefficient estimation. To remedy the challenges of both poor power and omitted variable bias encountered with traditional G × E interaction detection methods, we propose a novel approach that combines elements of screening-testing and hierarchical penalized regression. Specifically, our proposed method uses, in the first stage, an elastic net-penalized multiple logistic regression model to jointly estimate either the marginal association filter statistic or the gene-environment correlation filter statistic for all candidate genetic markers. In the second stage, a single multiple logistic regression model is used to jointly assess marginal terms and G × E interactions for all genetic markers that pass the first stage filter. A single likelihood-ratio test is used to determine whether any of the interactions are statistically significant. We demonstrate the efficacy of our method relative to alternative G × E detection methods on a bladder cancer data set.
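The single likelihood-ratio test in the second stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the two nested logistic models (marginal terms only, and marginal terms plus all G × E interactions) have already been fitted elsewhere, that their maximized log-likelihoods are available, and that the usual asymptotic chi-square reference distribution applies, which the abstract does not specify. The function names and the numeric inputs are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with a = df / 2
    and z = x / 2, starting from the closed forms for df = 1 and df = 2.
    """
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    z = x / 2.0
    if df % 2:                       # odd df: start from df = 1
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:                            # even df: start from df = 2
        a, q = 1.0, math.exp(-z)
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def omnibus_lrt(loglik_reduced, loglik_full, n_interactions):
    """Omnibus likelihood-ratio test for the presence of any interaction.

    loglik_reduced: log-likelihood of the model with marginal terms only.
    loglik_full:    log-likelihood of the model that adds all G x E terms.
    n_interactions: number of interaction coefficients being tested.
    Returns the test statistic and its asymptotic chi-square p-value.
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    return statistic, chi2_sf(statistic, n_interactions)


# Hypothetical log-likelihoods for models with and without 2 interactions.
stat, p_value = omnibus_lrt(-100.0, -97.0, n_interactions=2)
```

Because only one test is performed regardless of how many markers pass the first-stage filter, no multiple hypothesis correction is needed at this step; the degrees of freedom simply grow with the number of interaction terms.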