1.
PLoS Comput Biol ; 20(1): e1011717, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38206988

ABSTRACT

We describe a novel single sample gene set testing method for cancer transcriptomics data named tissue-adjusted pathway analysis of cancer (TPAC). The TPAC method leverages information about the normal tissue-specificity of human genes to compute a robust multivariate distance score that quantifies gene set dysregulation in each profiled tumor. Because the null distribution of the TPAC scores has an accurate gamma approximation, both population and sample-level inference is supported. As we demonstrate through an analysis of gene expression data for 21 solid human cancers from The Cancer Genome Atlas (TCGA) and associated normal tissue expression data from the Human Protein Atlas (HPA), TPAC gene set scores are more strongly associated with patient prognosis than the scores generated by existing single sample gene set testing methods.


Subject(s)
Neoplasms , Humans , Neoplasms/genetics , Gene Expression Profiling/methods
2.
PLoS Comput Biol ; 20(4): e1012084, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38683883

ABSTRACT

We have developed a new, and analytically novel, single sample gene set testing method called Reconstruction Set Test (RESET). RESET quantifies gene set importance based on the ability of set genes to reconstruct values for all measured genes. RESET is realized using a computationally efficient randomized reduced rank reconstruction algorithm (available via the RESET R package on CRAN) that can effectively detect patterns of differential abundance and differential correlation for self-contained and competitive scenarios. As demonstrated using real and simulated scRNA-seq data, RESET provides superior performance at a lower computational cost relative to other single sample approaches.


Subject(s)
Algorithms , Computational Biology , Computational Biology/methods , Humans , Gene Expression Profiling/methods , Computer Simulation
3.
BMC Bioinformatics ; 25(1): 212, 2024 Jun 13.
Article in English | MEDLINE | ID: mdl-38872103

ABSTRACT

BACKGROUND: A vital step in analyzing single-cell data is ascertaining which cell types are present in a dataset, and at what abundance. In many diseases, the proportions of varying cell types can have important implications for health and prognosis. Most approaches for cell type annotation have centered around cell typing for single-cell RNA-sequencing (scRNA-seq) and have had promising success. However, reliable methods are lacking for many other single-cell modalities such as single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq), which quantifies the extent to which genes of interest in each cell are epigenetically "open" for expression. RESULTS: To leverage the informative potential of scATAC-seq data, we developed CAMML with the integration of chromatin accessibility (CAraCAl), a bioinformatic method that performs cell typing on scATAC-seq data. CAraCAl performs cell typing by scoring each cell for its enrichment of cell type-specific gene sets. These gene sets are composed of the most upregulated or downregulated genes present in each cell type according to projected gene activity. CONCLUSIONS: We found that CAraCAl does not improve performance beyond CAMML when scRNA-seq is present, but if only scATAC-seq is available, CAraCAl performs cell typing relatively successfully. As such, we also discuss best practices for cell typing and the strengths and weaknesses of various cell annotation options.


Subject(s)
Chromatin , Computational Biology , Chromatin/metabolism , Chromatin/genetics , Chromatin/chemistry , Computational Biology/methods , Humans , Single-Cell Analysis/methods , Software , Sequence Analysis, RNA/methods , Transposases/metabolism , Transposases/genetics
4.
Bioinformatics ; 38(23): 5206-5213, 2022 Nov 30.
Article in English | MEDLINE | ID: mdl-36214642

ABSTRACT

MOTIVATION: Cell typing is a critical task in the analysis of single-cell data, particularly when studying complex diseased tissues. Unfortunately, the sparsity and noise of single-cell data make accurate cell typing of individual cells difficult. To address these challenges, we previously developed the CAMML method for multi-label cell typing of single-cell RNA-sequencing (scRNA-seq) data. CAMML uses weighted gene sets to score each profiled cell for multiple potential cell types. While CAMML outperforms other scRNA-seq cell typing techniques, it only leverages transcriptomic data so cannot take advantage of newer multi-omic single-cell assays that jointly profile gene expression and protein abundance (e.g. joint scRNA-seq/CITE-seq). RESULTS: We developed the CAMML with the Integration of Marker Proteins (ChIMP) method to support multi-label cell typing of individual cells jointly profiled via scRNA-seq and CITE-seq. ChIMP combines cell type scores computed on scRNA-seq data via the CAMML approach with discretized CITE-seq measurements for cell type marker proteins. The multi-omic cell type scores generated by ChIMP allow researchers to more precisely and conservatively cell type joint scRNA-seq/CITE-seq data. AVAILABILITY AND IMPLEMENTATION: An implementation of this work is available on CRAN at https://cran.r-project.org/web/packages/CAMML/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Gene Expression Profiling/methods , Software , Transcriptome
5.
PLoS Comput Biol ; 18(5): e1010091, 2022 May.
Article in English | MEDLINE | ID: mdl-35584140

ABSTRACT

Research in human-associated microbiomes often involves the analysis of taxonomic count tables generated via high-throughput sequencing. It is difficult to apply statistical tools as the data is high-dimensional, sparse, and compositional. An approachable way to alleviate high-dimensionality and sparsity is to aggregate variables into pre-defined sets. Set-based analysis is ubiquitous in the genomics literature and has demonstrable impact on improving interpretability and power of downstream analysis. Unfortunately, there is a lack of sophisticated set-based analysis methods specific to microbiome taxonomic data, where current practice often employs abundance summation as a technique for aggregation. This approach prevents comparison across sets of different sizes, does not preserve inter-sample distances, and amplifies protocol bias. Here, we attempt to fill this gap with a new single-sample taxon enrichment method that uses a novel log-ratio formulation based on the competitive null hypothesis commonly used in the enrichment analysis literature. Our approach, titled competitive balances for taxonomic enrichment analysis (CBEA), generates sample-specific enrichment scores as the scaled log-ratio of the subcomposition defined by taxa within a set and the subcomposition defined by its complement. We provide sample-level significance testing by estimating an empirical null distribution of our test statistic with valid p-values. Herein, we demonstrate, using both real data applications and simulations, that CBEA controls for type I error, even under high sparsity and high inter-taxa correlation scenarios. Additionally, CBEA provides informative scores that can be inputs to downstream analyses such as prediction tasks.
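The abstract above describes the CBEA score as a scaled log-ratio between the taxa inside a set and the taxa in its complement. The following is a minimal sketch of one way such a score could be computed for a single sample. It assumes an isometric log-ratio style balance (log-ratio of geometric means, scaled by the set sizes) and a pseudocount for zero counts; the function name `balance_score`, the pseudocount, and the exact scaling are illustrative assumptions and are not taken from the paper or its software.

```python
import math


def balance_score(counts, set_idx, pseudocount=1.0):
    """Scaled log-ratio of geometric means: taxa in the set vs. its complement.

    counts      : list of non-negative taxon counts for ONE sample
    set_idx     : indices of the taxa that belong to the set
    pseudocount : added to every count so that zeros can be logged
                  (an assumption of this sketch, not of the paper)
    """
    n = len(counts)
    in_set = set(set_idx)
    if not in_set or len(in_set) >= n:
        raise ValueError("set must be a non-empty proper subset of the taxa")
    if any(i < 0 or i >= n for i in in_set):
        raise ValueError("set index out of range")

    inside = [counts[i] + pseudocount for i in range(n) if i in in_set]
    outside = [counts[i] + pseudocount for i in range(n) if i not in in_set]

    def log_gmean(values):
        # Log of the geometric mean is the mean of the logs.
        return sum(math.log(v) for v in values) / len(values)

    p, q = len(inside), len(outside)
    # Scaling term used by isometric log-ratio balances.
    scale = math.sqrt(p * q / (p + q))
    return scale * (log_gmean(inside) - log_gmean(outside))
```

A positive score indicates that the set is enriched relative to the rest of the sample, a negative score that it is depleted, and a score of zero that the two geometric means are equal. Because the score is a ratio, it is unchanged when all pseudocounted values are multiplied by a common factor, which is the property that makes log-ratio scores attractive for compositional data.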


Subject(s)
Microbiota , Genomics/methods , High-Throughput Nucleotide Sequencing , Humans , Microbiota/genetics
6.
PLoS Comput Biol ; 17(6): e1009085, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34143767

ABSTRACT

The genetic alterations that underlie cancer development are highly tissue-specific with the majority of driving alterations occurring in only a few cancer types and with alterations common to multiple cancer types often showing a tissue-specific functional impact. This tissue-specificity means that the biology of normal tissues carries important information regarding the pathophysiology of the associated cancers, information that can be leveraged to improve the power and accuracy of cancer genomic analyses. Research exploring the use of normal tissue data for the analysis of cancer genomics has primarily focused on the paired analysis of tumor and adjacent normal samples. Efforts to leverage the general characteristics of normal tissue for cancer analysis have received less attention with most investigations focusing on understanding the tissue-specific factors that lead to individual genomic alterations or dysregulated pathways within a single cancer type. To address this gap and support scenarios where adjacent normal tissue samples are not available, we explored the genome-wide association between the transcriptomes of 21 solid human cancers and their associated normal tissues as profiled in healthy individuals. While the average gene expression profiles of normal and cancerous tissue may appear distinct, with normal tissues more similar to other normal tissues than to the associated cancer types, when transformed into relative expression values, i.e., the ratio of expression in one tissue or cancer relative to the mean in other tissues or cancers, the close association between gene activity in normal tissues and related cancers is revealed. As we demonstrate through an analysis of tumor data from The Cancer Genome Atlas and normal tissue data from the Human Protein Atlas, this association between tissue-specific and cancer-specific expression values can be leveraged to improve the prognostic modeling of cancer, the comparative analysis of different cancer types, and the analysis of cancer and normal tissue pairs.
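The relative expression transformation mentioned in this abstract (expression in one tissue or cancer divided by the mean in the other tissues or cancers) can be illustrated with a short sketch for a single gene. The function name `relative_expression`, the dictionary input, and the handling of a zero denominator are assumptions made here for illustration; the paper may normalize or log-transform the values differently.

```python
import math


def relative_expression(mean_expr):
    """Ratio of each tissue's expression to the mean of all OTHER tissues.

    mean_expr : dict mapping tissue (or cancer) name -> mean expression
                of one gene in that tissue, on a non-negative scale.
    Returns a dict with the same keys; a value above 1 means the gene is
    more active in that tissue than in the others on average.
    """
    n = len(mean_expr)
    if n < 2:
        raise ValueError("need at least two tissues to form a ratio")

    total = sum(mean_expr.values())
    ratios = {}
    for tissue, value in mean_expr.items():
        others_mean = (total - value) / (n - 1)
        # Zero expression everywhere else leaves the ratio undefined;
        # returning NaN is a choice of this sketch.
        ratios[tissue] = value / others_mean if others_mean > 0 else math.nan
    return ratios
```

For example, a gene with mean expression 8.0 in liver and 2.0 in both lung and kidney has a relative expression of 4.0 in liver and 0.4 in each of the other two tissues.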


Subject(s)
Neoplasms/genetics , Computational Biology , Databases, Genetic/statistics & numerical data , Female , Gene Expression , Gene Expression Profiling/statistics & numerical data , Humans , Male , Organ Specificity/genetics , Principal Component Analysis , RNA-Seq , Reference Values , Survival Analysis
7.
J Immunol ; 205(10): 2649-2666, 2020 Nov 15.
Article in English | MEDLINE | ID: mdl-32998985

ABSTRACT

CD8 T cell differentiation is orchestrated by dynamic metabolic changes that direct activation, proliferation, cytotoxic function, and epigenetic changes. We report that the BTB-ZF family transcriptional repressor Zbtb20 negatively regulates CD8 T cell metabolism and memory differentiation in mice. Effector and memory CD8 T cells with conditional Zbtb20 deficiency displayed enhanced mitochondrial and glycolytic metabolism, and memory CD8 T cells had enhanced spare respiratory capacity. Furthermore, Zbtb20-deficient CD8 T cells displayed increased flexibility in the use of mitochondrial fuel sources. Phenotypic and transcriptional skewing toward the memory fate was observed during the CD8 T cell response to Listeria monocytogenes. Memory cells mounted larger secondary responses and conferred better protection following tumor challenge. These data suggest that inactivation of Zbtb20 may offer an approach to enhance metabolic activity and flexibility and improve memory CD8 T cell differentiation, useful attributes for T cells used in adoptive immunotherapy.


Subject(s)
Energy Metabolism/genetics , Listeriosis/immunology , Neoplasms/immunology , T-Lymphocytes, Cytotoxic/immunology , Transcription Factors/metabolism , Adoptive Transfer , Animals , Cell Differentiation/genetics , Cell Differentiation/immunology , Disease Models, Animal , Energy Metabolism/immunology , Gene Expression Regulation, Neoplastic/immunology , Glycolysis/genetics , Glycolysis/immunology , Humans , Immunologic Memory/genetics , Listeria monocytogenes/immunology , Listeriosis/microbiology , Lymphocyte Activation , Mice , Mice, Knockout , Mitochondria/metabolism , Neoplasms/genetics , Neoplasms/pathology , Neoplasms/therapy , T-Lymphocytes, Cytotoxic/metabolism , Transcription Factors/genetics
8.
BMC Microbiol ; 21(1): 238, 2021 Aug 28.
Article in English | MEDLINE | ID: mdl-34454437

ABSTRACT

BACKGROUND: The infant intestinal microbiome plays an important role in metabolism and immune development with impacts on lifelong health. The linkage between the taxonomic composition of the microbiome and its metabolic phenotype is undefined and complicated by redundancies in the taxon-function relationship within microbial communities. To inform a more mechanistic understanding of the relationship between the microbiome and health, we performed an integrative statistical and machine learning-based analysis of microbe taxonomic structure and metabolic function in order to characterize the taxa-function relationship in early life. RESULTS: Stool samples collected from infants enrolled in the New Hampshire Birth Cohort Study (NHBCS) at approximately 6 weeks (n = 158) and 12 months (n = 282) of age were profiled using targeted and untargeted nuclear magnetic resonance (NMR) spectroscopy as well as DNA sequencing of the V4-V5 hypervariable region from the bacterial 16S rRNA gene. There was significant inter-omic concordance based on Procrustes analysis (6 weeks: p = 0.056; 12 months: p = 0.001), however this association was no longer significant when accounting for phylogenetic relationships using generalized UniFrac distance metric (6 weeks: p = 0.376; 12 months: p = 0.069). Sparse canonical correlation analysis showed significant correlation, as well as identifying sets of microbe/metabolites driving microbiome-metabolome relatedness. Performance of machine learning models varied across different metabolites, with support vector machines (radial basis function kernel) being the consistently top ranked model. However, predictive R² values demonstrated poor predictive performance across all models assessed (avg: -5.06% at 6 weeks; -3.7% at 12 months). Conversely, the Spearman correlation metric was higher (avg: 0.344 at 6 weeks; 0.265 at 12 months). This demonstrated that taxonomic relative abundance was not predictive of metabolite concentrations.
CONCLUSIONS: Our results suggest a degree of overall association between taxonomic profiles and metabolite concentrations. However, lack of predictive capacity for stool metabolic signatures reflects, in part, the possible role of functional redundancy in defining the taxa-function relationship in early life as well as the bidirectional nature of the microbiome-metabolome association. Our results provide evidence in favor of a multi-omic approach for microbiome studies, especially those focused on health outcomes.


Subject(s)
Bacteria/genetics , Feces/microbiology , Gastrointestinal Microbiome/genetics , Gastrointestinal Microbiome/physiology , Metabolome , Bacteria/classification , Bacteria/isolation & purification , Birth Cohort , Female , Humans , Infant , Machine Learning , Male , Phylogeny , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA
9.
BMC Cancer ; 21(1): 1053, 2021 Sep 25.
Article in English | MEDLINE | ID: mdl-34563154

ABSTRACT

BACKGROUND: Over the past decades, approaches for diagnosing and treating cancer have seen significant improvement. However, the variability of patient and tumor characteristics has limited progress on methods for prognosis prediction. The development of high-throughput omics technologies now provides multiple approaches for characterizing tumors. Although a large number of published studies have focused on integration of multi-omics data and use of pathway-level models for cancer prognosis prediction, there still exists a gap of knowledge regarding the prognostic landscape across multi-omics data for multiple cancer types using both gene-level and pathway-level predictors. METHODS: In this study, we systematically evaluated three often available types of omics data (gene expression, copy number variation and somatic point mutation) covering both DNA-level and RNA-level features. We evaluated the landscape of predictive performance of these three omics modalities for 33 cancer types in the TCGA using a Lasso or Group Lasso-penalized Cox model and either gene or pathway level predictors. RESULTS: We constructed the prognostic landscape using three types of omics data for 33 cancer types on both the gene and pathway levels. Based on this landscape, we found that predictive performance is cancer type dependent and we also highlighted the cancer types and omics modalities that support the most accurate prognostic models. In general, models estimated on gene expression data provide the best predictive performance on either gene or pathway level and adding copy number variation or somatic point mutation data to gene expression data does not improve predictive performance, with some exceptional cohorts including low grade glioma and thyroid cancer. In general, pathway-level models have better interpretative performance, higher stability and smaller model size across multiple cancer types and omics data types relative to gene-level models. 
CONCLUSIONS: Based on this landscape and comprehensive comparison, models estimated on gene expression data provide the best predictive performance on either gene or pathway level. Pathway-level models have better interpretative performance, higher stability and smaller model size relative to gene-level models.


Subject(s)
DNA Copy Number Variations , Gene Expression Profiling/methods , Gene Expression , Neoplasms/genetics , Point Mutation , Cohort Studies , Databases, Genetic , Humans , Neoplasms/mortality , Neoplasms/pathology , Predictive Value of Tests , Prognosis , Proportional Hazards Models
10.
BMC Bioinformatics ; 21(1): 467, 2020 Oct 20.
Article in English | MEDLINE | ID: mdl-33081688

ABSTRACT

BACKGROUND: Genomic profiling of solid human tumors by projects such as The Cancer Genome Atlas (TCGA) has provided important information regarding the somatic alterations that drive cancer progression and patient survival. Although researchers have successfully leveraged TCGA data to build prognostic models, most efforts have focused on specific cancer types and a targeted set of gene-level predictors. Less is known about the prognostic ability of pathway-level variables in a pan-cancer setting. To address these limitations, we systematically evaluated and compared the prognostic ability of somatic point mutation (SPM) and copy number variation (CNV) data, gene-level and pathway-level models for a diverse set of TCGA cancer types and predictive modeling approaches. RESULTS: We evaluated gene-level and pathway-level penalized Cox proportional hazards models using SPM and CNV data for 29 different TCGA cohorts. We measured predictive accuracy as the concordance index for predicting survival outcomes. Our comprehensive analysis suggests that the use of pathway-level predictors did not offer superior predictive power relative to gene-level models for all cancer types but had the advantages of robustness and parsimony. We identified a set of cohorts for which somatic alterations could not predict prognosis, and a unique cohort LGG, for which SPM data was more predictive than CNV data and the predictive accuracy is good for all model types. We found that the pathway-level predictors provide superior interpretative value and that there is often a serious collinearity issue for the gene-level models while pathway-level models avoided this issue. CONCLUSION: Our comprehensive analysis suggests that when using somatic alterations data for cancer prognosis prediction, pathway-level models are more interpretable, stable and parsimonious compared to gene-level models. 
Pathway-level models also avoid the issue of collinearity, which can be serious for gene-level somatic alterations. The prognostic power of somatic alterations is highly variable across different cancer types and we have identified a set of cohorts for which somatic alterations could not predict prognosis. In general, CNV data predicts prognosis better than SPM data with the exception of the LGG cohort.
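Several abstracts in this listing report predictive accuracy as the concordance index for survival outcomes. As a reminder of what that number measures, the following is a minimal sketch of Harrell's concordance index for right-censored data. It assumes that higher risk scores should correspond to shorter survival and it ignores pairs with tied survival times; the function name `concordance_index` and these conventions are choices of this sketch, not details taken from the papers.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    times  : observed follow-up time per subject
    events : 1/True if the event (e.g. death) was observed, 0/False if censored
    risks  : model risk score per subject (higher = worse predicted prognosis)
    """
    if not (len(times) == len(events) == len(risks)):
        raise ValueError("times, events and risks must have equal length")

    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier failure had the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs (all subjects censored or tied)")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ordering, and values below 0.5 indicate a model that ranks patients in the wrong direction.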


Subject(s)
DNA Copy Number Variations/genetics , Point Mutation/genetics , Humans , Prognosis
11.
BMC Bioinformatics ; 21(1): 76, 2020 Feb 28.
Article in English | MEDLINE | ID: mdl-32111152

ABSTRACT

BACKGROUND: Cancer prognosis prediction is valuable for patients and clinicians because it allows them to appropriately manage care. A promising direction for improving the performance and interpretation of expression-based predictive models involves the aggregation of gene-level data into biological pathways. While many studies have used pathway-level predictors for cancer survival analysis, a comprehensive comparison of pathway-level and gene-level prognostic models has not been performed. To address this gap, we characterized the performance of penalized Cox proportional hazard models built using either pathway- or gene-level predictors for the cancers profiled in The Cancer Genome Atlas (TCGA) and pathways from the Molecular Signatures Database (MSigDB). RESULTS: When analyzing TCGA data, we found that pathway-level models are more parsimonious, more robust, more computationally efficient and easier to interpret than gene-level models with similar predictive performance. For example, both pathway-level and gene-level models have an average Cox concordance index of ~ 0.85 for the TCGA glioma cohort, however, the gene-level model has twice as many predictors on average, the predictor composition is less stable across cross-validation folds and estimation takes 40 times as long as compared to the pathway-level model. When the complex correlation structure of the data is broken by permutation, the pathway-level model has greater predictive performance while still retaining superior interpretative power, robustness, parsimony and computational efficiency relative to the gene-level models. For example, the average concordance index of the pathway-level model increases to 0.88 while the gene-level model falls to 0.56 for the TCGA glioma cohort using survival times simulated from uncorrelated gene expression data. 
CONCLUSION: The results of this study show that when the correlations among gene expression values are low, pathway-level analyses can yield better predictive performance, greater interpretative power, more robust models and less computational cost relative to a gene-level model. When correlations among genes are high, a pathway-level analysis provides equivalent predictive power compared to a gene-level analysis while retaining the advantages of interpretability, robustness and computational efficiency.


Subject(s)
Neoplasms/mortality , Cohort Studies , Gene Expression , Glioma/genetics , Glioma/mortality , Humans , Models, Genetic , Neoplasms/genetics , Prognosis , Proportional Hazards Models
12.
Bioinformatics ; 34(17): 2957-2964, 2018 Sep 01.
Article in English | MEDLINE | ID: mdl-29659714

ABSTRACT

Motivation: Gene set testing, or pathway analysis, has become a critical tool for the analysis of high-dimensional genomic data. Although the function and activity of many genes and higher-level processes is tissue-specific, gene set testing is typically performed in a tissue agnostic fashion, which impacts statistical power and the interpretation and replication of results. Results: To address this challenge, we have developed a bioinformatics approach to compute tissue-specific weights for individual gene sets using information on tissue-specific gene activity from the Human Protein Atlas (HPA). We used this approach to create a public repository of tissue-specific gene set weights for 37 different human tissue types from the HPA and all collections in the Molecular Signatures Database. To demonstrate the validity and utility of these weights, we explored three different applications: the functional characterization of human tissues, multi-tissue analysis for systemic diseases and tissue-specific gene set testing. Availability and implementation: All data used in the reported analyses is publicly available. An R implementation of the method and tissue-specific weights for MSigDB gene set collections can be downloaded at http://www.dartmouth.edu/~hrfrost/TissueSpecificGeneSets.


Subject(s)
Computational Biology , Proteins/analysis , Computational Biology/methods , Humans , Physiological Phenomena
13.
Nucleic Acids Res ; 45(12): e114, 2017 Jul 07.
Article in English | MEDLINE | ID: mdl-28472344

ABSTRACT

Gene set testing is an important bioinformatics technique that addresses the challenges of power, interpretation and replication. To better support the analysis of large and highly overlapping gene set collections, researchers have recently developed a number of multiset methods that jointly evaluate all gene sets in a collection to identify a parsimonious group of functionally independent sets. Unfortunately, current multiset methods all use binary indicators for gene and gene set activity and assume that a gene is active if any containing gene set is active. This simplistic model limits performance on many types of genomic data. To address this limitation, we developed gene set Selection via LASSO Penalized Regression (SLPR), a novel mapping of multiset gene set testing to penalized multiple linear regression. The SLPR method assumes a linear relationship between continuous measures of gene activity and the activity of all gene sets in the collection. As we demonstrate via simulation studies and the analysis of TCGA data using MSigDB gene sets, the SLPR method outperforms existing multiset methods when the true biological process is well approximated by continuous activity measures and a linear association between genes and gene sets.


Subject(s)
Adenocarcinoma/genetics , Biomarkers, Tumor/genetics , Carcinoma, Squamous Cell/genetics , Gene Expression Regulation, Neoplastic , Lung Neoplasms/genetics , Models, Statistical , Neoplasm Proteins/genetics , Adenocarcinoma/diagnosis , Adenocarcinoma/pathology , Adenocarcinoma of Lung , Benchmarking , Carcinoma, Squamous Cell/diagnosis , Carcinoma, Squamous Cell/pathology , Computational Biology/statistics & numerical data , Diagnosis, Differential , Humans , Lung Neoplasms/diagnosis , Lung Neoplasms/pathology , Metabolic Networks and Pathways/genetics , Models, Biological , Multigene Family , Regression Analysis
14.
BMC Bioinformatics ; 19(1): 479, 2018 Dec 12.
Article in English | MEDLINE | ID: mdl-30541428

ABSTRACT

BACKGROUND: Cancer develops when pathways controlling cell survival, cell fate or genome maintenance are disrupted by the somatic alteration of key driver genes. Understanding how pathway disruption is driven by somatic alterations is thus essential for an accurate characterization of cancer biology and identification of therapeutic targets. Unfortunately, current cancer pathway analysis methods fail to fully model the relationship between somatic alterations and pathway activity. RESULTS: To address these limitations, we developed a multi-omics method for identifying biologically important pathways and genes in human cancer. Our approach combines single-sample pathway analysis with multi-stage, lasso-penalized regression to find pathways whose gene expression can be explained largely in terms of gene-level somatic alterations in the tumor. Importantly, this method can analyze case-only data sets, does not require information regarding pathway topology and supports personalized pathway analysis using just somatic alteration data for a limited number of cancer-associated genes. The practical effectiveness of this technique is illustrated through an analysis of data from The Cancer Genome Atlas using gene sets from the Molecular Signatures Database. CONCLUSIONS: Novel insights into the pathophysiology of human cancer can be obtained from statistical models that predict expression-based pathway activity in terms of non-silent somatic mutations and copy number variation. These models enable the identification of biologically important pathways and genes and support personalized pathway analysis in cases where gene expression data is unavailable.


Subject(s)
Genomics/methods , Neoplasms/genetics , Humans , Neoplasms/pathology
15.
Genet Epidemiol ; 40(8): 689-701, 2016 Dec.
Article in English | MEDLINE | ID: mdl-27386793

ABSTRACT

Statistical interactions between markers of genetic variation, or gene-gene interactions, are believed to play an important role in the etiology of many multifactorial diseases and other complex phenotypes. Unfortunately, detecting gene-gene interactions is extremely challenging due to the large number of potential interactions and ambiguity regarding marker coding and interaction scale. For many data sets, there is insufficient statistical power to evaluate all candidate gene-gene interactions. In these cases, a global test for gene-gene interactions may be the best option. Global tests have much greater power relative to multiple individual interaction tests and can be used on subsets of the markers as an initial filter prior to testing for specific interactions. In this paper, we describe a novel global test for gene-gene interactions, the global epistasis test (GET), that is based on results from random matrix theory. As we show via simulation studies based on previously proposed models for common diseases including rheumatoid arthritis, type 2 diabetes, and breast cancer, our proposed GET method has superior performance characteristics relative to existing global gene-gene interaction tests. A glaucoma GWAS data set is used to demonstrate the practical utility of the GET method.


Subject(s)
Arthritis, Rheumatoid/genetics , Breast Neoplasms/genetics , Diabetes Mellitus, Type 2/genetics , Epistasis, Genetic/genetics , Genetic Markers/genetics , Genome-Wide Association Study , Glaucoma/genetics , Polymorphism, Single Nucleotide/genetics , Computer Simulation , Female , Humans , Models, Genetic , Phenotype
16.
Genet Epidemiol ; 40(7): 544-557, 2016 Nov.
Article in English | MEDLINE | ID: mdl-27578615

ABSTRACT

Although gene-environment (G×E) interactions play an important role in many biological systems, detecting these interactions within genome-wide data can be challenging due to the loss in statistical power incurred by multiple hypothesis correction. To address the challenge of poor power and the limitations of existing multistage methods, we recently developed a screening-testing approach for G×E interaction detection that combines elastic net penalized regression with joint estimation to support a single omnibus test for the presence of G×E interactions. In our original work on this technique, however, we did not assess type I error control or power and evaluated the method using just a single, small bladder cancer data set. In this paper, we extend the original method in two important directions and provide a more rigorous performance evaluation. First, we introduce a hierarchical false discovery rate approach to formally assess the significance of individual G×E interactions. Second, to support the analysis of truly genome-wide data sets, we incorporate a score statistic-based prescreening step to reduce the number of single nucleotide polymorphisms prior to fitting the first stage penalized regression model. To assess the statistical properties of our method, we compare the type I error rate and statistical power of our approach with competing techniques using both simple simulation designs as well as designs based on real disease architectures. Finally, we demonstrate the ability of our approach to identify biologically plausible SNP-education interactions relative to Alzheimer's disease status using genome-wide association study data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).


Subject(s)
Gene-Environment Interaction , Models, Genetic , Arthritis, Rheumatoid/genetics , Arthritis, Rheumatoid/pathology , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Diabetes Mellitus, Type 2/genetics , Diabetes Mellitus, Type 2/pathology , Female , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Polymorphism, Single Nucleotide
17.
BMC Bioinformatics ; 17(1): 442, 2016 Nov 04.
Article in English | MEDLINE | ID: mdl-27809777

ABSTRACT

BACKGROUND: Gene set testing, or pathway analysis, is a bioinformatics technique that performs statistical testing on biologically meaningful sets of genomic variables. Although originally developed for supervised analyses, i.e., to test the association between gene sets and an outcome variable, gene set testing also has important unsupervised applications, e.g., p-value weighting. For unsupervised testing, however, few effective gene set testing methods are available, with support especially poor for several biologically relevant use cases. RESULTS: In this paper, we describe two new unsupervised gene set testing methods based on random matrix theory, the Marčenko-Pastur Distribution Test (MPDT) and the Tracy-Widom Test (TWT), that support both self-contained and competitive null hypotheses. For the self-contained case, we contrast our proposed tests with the classic multivariate test based on a modified likelihood ratio criterion. For the competitive case, we compare the new tests against a competitive version of the classic test and our recently developed Spectral Gene Set Enrichment (SGSE) method. Evaluation of the TWT and MPDT methods is based on both simulation studies and a weighted p-value analysis of two real gene expression data sets using gene sets drawn from MSigDB collections. CONCLUSIONS: The MPDT and TWT methods are novel and effective tools for unsupervised gene set analysis with superior statistical performance relative to existing techniques and the ability to generate biologically important results on real genomic data sets.
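
As background for the random matrix theory mentioned in this abstract, the following is a minimal sketch of the support edges of the Marčenko-Pastur distribution, which describes the bulk of sample covariance eigenvalues for pure-noise data. It is not the MPDT or TWT procedure from the paper; the function name `marchenko_pastur_edges` and its parameters are hypothetical and chosen only for illustration.

```python
from math import sqrt


def marchenko_pastur_edges(n_samples, n_variables, variance=1.0):
    """Return the (lower, upper) edges of the Marchenko-Pastur bulk.

    Assumes independent noise with the given variance and an aspect
    ratio gamma = n_variables / n_samples. This is only the textbook
    support formula, not a statistical test.
    """
    if n_samples <= 0 or n_variables <= 0:
        raise ValueError("n_samples and n_variables must be positive")
    gamma = n_variables / n_samples
    lower = variance * (1.0 - sqrt(gamma)) ** 2
    upper = variance * (1.0 + sqrt(gamma)) ** 2
    return lower, upper


# Illustrative use: 100 variables measured on 400 samples.
lower_edge, upper_edge = marchenko_pastur_edges(400, 100)
```

Sample eigenvalues well above the upper edge are the ones a noise-only model struggles to explain, which is the intuition that tests of this kind build on.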


Subject(s)
Algorithms , Biomarkers, Tumor/genetics , Gene Expression Profiling/methods , Leukemia/genetics , Models, Theoretical , Computer Simulation , Gene Regulatory Networks , Genomics/methods , Humans
18.
BMC Bioinformatics ; 16: 70, 2015 Mar 03.
Article in English | MEDLINE | ID: mdl-25879888

ABSTRACT

BACKGROUND: Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. RESULTS: We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. CONCLUSIONS: Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
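
The weighted Z-method named in this abstract can be illustrated with a minimal sketch. It combines one-sided p-values as Z = sum(w_i * Z_i) / sqrt(sum(w_i^2)), where Z_i is the standard normal quantile of 1 - p_i. The function name `weighted_z_combine` is hypothetical, and the weights in the example are arbitrary numbers; the paper's actual weights (PC variance scaled by Tracy-Widom p-values) are not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z_combine(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method.

    Each p-value must lie strictly between 0 and 1, because the
    normal quantile function is undefined at the endpoints.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")
    weight_norm = sqrt(sum(w * w for w in weights))
    if weight_norm == 0:
        raise ValueError("at least one weight must be non-zero")
    std_normal = NormalDist()
    # Small p-values map to large positive Z-scores.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / weight_norm
    # Convert the combined Z-score back to a one-sided p-value.
    return 1.0 - std_normal.cdf(combined_z)


# Illustrative use: three PC-level p-values with arbitrary example weights.
combined_p = weighted_z_combine([0.01, 0.20, 0.70], [3.0, 1.5, 0.5])
```

Giving a component a weight near zero effectively removes its p-value from the combination, which is how a weighting scheme can down-weight components that look like noise.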


Subject(s)
Algorithms , Biomarkers, Tumor/genetics , Gene Expression Profiling/methods , Leukemia/genetics , Lymphoma, Large B-Cell, Diffuse/genetics , Microarray Analysis/methods , Cluster Analysis , Gene Regulatory Networks , Genomics/methods , Humans
19.
Bioinformatics ; 30(12): 1698-706, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24574114

ABSTRACT

MOTIVATION: Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets. RESULTS: We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome resulting in increased gene set enrichment power and better replication of enrichment results. AVAILABILITY AND IMPLEMENTATION: http://cran.r-project.org/web/packages/EMVC/index.html.


Subject(s)
Algorithms , Gene Expression Profiling , Molecular Sequence Annotation , Humans , Leukemia/genetics , Leukemia/metabolism , Oligonucleotide Array Sequence Analysis , Tumor Suppressor Protein p53/genetics , Tumor Suppressor Protein p53/metabolism
20.
Appl Netw Sci ; 9(1): 14, 2024.
Article in English | MEDLINE | ID: mdl-38699246

ABSTRACT

We present a novel approach for computing a variant of eigenvector centrality for multilayer networks with inter-layer constraints on node importance. Specifically, we consider a multilayer network defined by multiple edge-weighted, potentially directed, graphs over the same set of nodes with each graph representing one layer of the network and no inter-layer edges. As in the standard eigenvector centrality construction, the importance of each node in a given layer is based on the weighted sum of the importance of adjacent nodes in that same layer. Unlike standard eigenvector centrality, we assume that the adjacency relationship and the importance of adjacent nodes may be based on distinct layers. Importantly, this type of centrality constraint is only partially supported by existing frameworks for multilayer eigenvector centrality that use edges between nodes in different layers to capture inter-layer dependencies. For our model, constrained, layer-specific eigenvector centrality values are defined by a system of independent eigenvalue problems and dependent pseudo-eigenvalue problems, whose solution can be efficiently realized using an interleaved power iteration algorithm. We refer to this model, and the associated algorithm, as the Constrained Multilayer Centrality (CMLC) method. The characteristics of this approach, and of standard techniques based on inter-layer edges, are demonstrated on both a simple multilayer network and on a range of random graph models. An R package implementing the CMLC method along with example vignettes is available at https://hrfrost.host.dartmouth.edu/CMLC/.
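
For orientation, the standard single-layer construction that this abstract starts from can be sketched with plain power iteration. This is not the interleaved CMLC algorithm, which is available in the authors' R package; it only shows the baseline where each node's importance is the weighted sum of its neighbours' importance. The function name `eigenvector_centrality`, the undirected adjacency-matrix input, and the example graph are all assumptions made for illustration.

```python
def eigenvector_centrality(adjacency, iterations=200, tol=1e-12):
    """Approximate eigenvector centrality by power iteration.

    `adjacency` is a square list of lists of non-negative edge weights
    for an undirected graph. Bipartite graphs may not converge with
    this plain iteration.
    """
    n = len(adjacency)
    if n == 0 or any(len(row) != n for row in adjacency):
        raise ValueError("adjacency must be a non-empty square matrix")
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # A node's new score is the weighted sum of its neighbours' scores.
        updated = [sum(adjacency[i][j] * scores[j] for j in range(n)) for i in range(n)]
        total = sum(updated)
        if total == 0:
            raise ValueError("graph has no edges")
        # Normalize so that the scores sum to one.
        updated = [value / total for value in updated]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


# Illustrative graph: a triangle (nodes 0, 1, 2) with node 3 attached to node 0.
example_graph = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(example_graph)
```

In this example node 0 receives the highest score and the pendant node 3 the lowest. The multilayer setting described in the abstract differs in that the adjacency relationship and the neighbour importance may come from different layers.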
