ABSTRACT
Motivation: Large datasets containing multiple clinical and omics measurements for each subject motivate the development of new statistical methods to integrate these data to advance scientific discovery. Model: We propose bootstrap evaluation of association matrices (BEAM), which integrates multiple omics profiles with multiple clinical endpoints. BEAM associates a set of omic features with clinical endpoints via regression models and then uses bootstrap resampling to determine statistical significance of the set. Unlike existing methods, BEAM uniquely accommodates an arbitrary number of omic profiles and endpoints. Results: In simulations, BEAM performed similarly to the theoretically best simple test and outperformed other integrated analysis methods. In an example pediatric leukemia application, BEAM identified several genes with biological relevance established by a CRISPR assay that had been missed by univariate screens and other integrated analysis methods. Thus, BEAM is a powerful, flexible, and robust tool to identify genes for further laboratory and/or clinical research evaluation. Availability: Source code, documentation, and a vignette for BEAM are available on GitHub at: https://github.com/annaSeffernick/BEAMR. The R package is available from CRAN at: https://cran.r-project.org/package=BEAMR. Contact: Stanley.Pounds@stjude.org. Supplementary Information: Supplementary data are available at the journal's website.
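The general idea of pairing an association matrix with bootstrap resampling can be illustrated with a minimal sketch. This is not the BEAMR implementation: the function names `association_matrix` and `bootstrap_set_pvalue`, the use of simple least-squares slopes, and the standardized-distance p-value are all assumptions made for illustration. The sketch resamples subjects, recomputes the feature-by-endpoint slope matrix each time, and asks how far the null (all slopes zero) lies from the center of the bootstrap cloud.

```python
import numpy as np


def association_matrix(X, Y):
    """Simple-regression slope of every endpoint (column of Y)
    on every feature (column of X), one pair at a time."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # slope for feature j and endpoint k: cov(x_j, y_k) / var(x_j)
    return (Xc.T @ Yc) / (Xc ** 2).sum(axis=0)[:, None]


def bootstrap_set_pvalue(X, Y, n_boot=500, seed=0):
    """Illustrative set-level p-value (an assumption, not BEAM's exact rule):
    the share of bootstrap association matrices lying at least as far from
    the bootstrap center as the all-zero (null) matrix does."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    boots = np.empty((n_boot, X.shape[1] * Y.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample subjects with replacement
        boots[b] = association_matrix(X[idx], Y[idx]).ravel()
    center = boots.mean(axis=0)
    scale = boots.std(axis=0, ddof=1)
    # standardized Euclidean distance of each replicate from the center
    dist = np.sqrt((((boots - center) / scale) ** 2).sum(axis=1))
    # standardized distance of the null point (zero matrix) from the center
    null_dist = np.sqrt(((center / scale) ** 2).sum())
    return (1 + (dist >= null_dist).sum()) / (n_boot + 1)
```

Here `X` would hold the omic features of one set (subjects in rows) and `Y` the clinical endpoints. Continuous endpoints are assumed; censored or binary endpoints would need the corresponding regression model in place of the least-squares slope.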
ABSTRACT
T-lineage acute lymphoblastic leukaemia (T-ALL) is a high-risk tumour [1] that has eluded comprehensive genomic characterization, which is partly due to the high frequency of noncoding genomic alterations that result in oncogene deregulation [2,3]. Here we report an integrated analysis of genome and transcriptome sequencing of tumour and remission samples from more than 1,300 uniformly treated children with T-ALL, coupled with epigenomic and single-cell analyses of malignant and normal T cell precursors. This approach identified 15 subtypes with distinct genomic drivers, gene expression patterns, developmental states and outcomes. Analyses of chromatin topology revealed multiple mechanisms of enhancer deregulation that involve enhancers and genes in a subtype-specific manner, thereby demonstrating widespread involvement of the noncoding genome. We show that the immunophenotypically described, high-risk entity of early T cell precursor ALL is superseded by a broader category of 'early T cell precursor-like' leukaemia. This category has a variable immunophenotype and diverse genomic alterations of a core set of genes that encode regulators of hematopoietic stem cell development. Using multivariable outcome models, we show that genetic subtypes, driver and concomitant genetic alterations independently predict treatment failure and survival. These findings provide a roadmap for the classification, risk stratification and mechanistic understanding of this disease.
Subjects
Genome, Human , Genomics , Precursor T-Cell Lymphoblastic Leukemia-Lymphoma , Child , Female , Humans , Male , Chromatin/genetics , Chromatin/metabolism , Enhancer Elements, Genetic/genetics , Epigenomics , Gene Expression Regulation, Leukemic , Genome, Human/genetics , Precursor T-Cell Lymphoblastic Leukemia-Lymphoma/genetics , Precursor T-Cell Lymphoblastic Leukemia-Lymphoma/pathology , Single-Cell Analysis , Transcriptome/genetics , T-Lymphocytes/cytology , T-Lymphocytes/pathology
ABSTRACT
While time-to-event data are often continuous, there are several instances where discrete survival data, which are inherently ordinal, may be available or are more appropriate or useful. Several discrete survival models exist, but the forward continuation ratio model with a complementary log-log link has a survival interpretation and is closely related to the Cox proportional hazards model, despite being an ordinal model. This model has previously been implemented in the high-dimensional setting using the ordinal generalized monotone incremental forward stagewise algorithm. Here, we propose a Bayesian penalized forward continuation ratio model with a complementary log-log link and explore different priors to perform variable selection and regularization. Through simulations, we show that our Bayesian model outperformed the existing frequentist method in terms of variable selection performance, and that a 10% prior inclusion probability performed better than 1% or 50%. We also illustrate our model on a publicly available acute myeloid leukemia dataset to identify genomic features associated with discrete survival. We identified nine features that map to ten unique genes, five of which have been previously associated with leukemia in the literature. In conclusion, our proposed Bayesian model is flexible, allows simultaneous variable selection and uncertainty quantification, and performed well in simulation studies and application to real data.
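The link between the forward continuation ratio model and the Cox model can be made concrete with a small numerical sketch. It assumes the parameterization cloglog(h_j) = alpha_j + x'beta for the discrete hazard h_j = P(Y = j | Y >= j, x); the sign convention and the function names are illustrative assumptions, not taken from the authors' software. Under this form the survival function is exp(-sum_k exp(alpha_k + x'beta)), so covariates shift the complementary log-log hazard by the same amount in every interval, as in a grouped-time proportional hazards model.

```python
import numpy as np


def discrete_hazards(alpha, x, beta):
    """Discrete hazard in each interval under a complementary log-log link:
    h_j = 1 - exp(-exp(alpha_j + x'beta))."""
    eta = alpha + x @ beta  # alpha: (J,), x: (p,), beta: (p,)
    return 1.0 - np.exp(-np.exp(eta))


def survival_curve(alpha, x, beta):
    """Probability of surviving past each interval: product of (1 - h_k)."""
    return np.cumprod(1.0 - discrete_hazards(alpha, x, beta))
```

Because 1 - h_j = exp(-exp(alpha_j + x'beta)), the product in `survival_curve` collapses to the exponential of a negative cumulative sum, which is the grouped-time analogue of the Cox survival function.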
Subjects
Algorithms , Genomics , Bayes Theorem , Proportional Hazards Models , Computer Simulation
ABSTRACT
For many high-dimensional genomic and epigenomic datasets, the outcome of interest is ordinal. While these ordinal outcomes are often thought of as the observed cutpoints of some latent continuous variable, some ordinal outcomes are truly discrete and are composed of the subjective combination of several factors. The nonlinear stereotype logistic model, which does not assume proportional odds, was developed for these 'assessed' ordinal variables. It has previously been extended to the frequentist high-dimensional feature selection setting, but the Bayesian framework provides some distinct advantages in terms of simultaneous uncertainty quantification and variable selection. Here, we review the stereotype model and Bayesian variable selection methods and demonstrate how to combine them to select genomic features associated with discrete ordinal outcomes. We compared the Bayesian and frequentist methods in terms of variable selection performance. We additionally applied the Bayesian stereotype method to an acute myeloid leukemia RNA-sequencing dataset to further demonstrate its variable selection abilities by identifying features associated with the European LeukemiaNet prognostic risk score.
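As a rough illustration of the stereotype logistic model, the sketch below computes category probabilities under the standard form log(P(Y = j) / P(Y = J)) = alpha_j + phi_j * x'beta, with the last category as reference (alpha_J = 0, phi_J = 0) and phi_1 = 1 for identifiability. The function name and the example values are hypothetical, and the priors used for Bayesian variable selection are not shown.

```python
import numpy as np


def stereotype_probs(alpha, phi, x, beta):
    """Category probabilities for a stereotype logistic model.
    alpha and phi have one entry per category; the reference category
    carries alpha = 0 and phi = 0."""
    eta = alpha + phi * (x @ beta)  # one linear predictor per category
    eta = eta - eta.max()           # stabilize the exponentials
    w = np.exp(eta)
    return w / w.sum()
```

The scale parameters phi let a single coefficient vector beta act with different strength on each category, which is how the model avoids the proportional odds assumption while remaining more parsimonious than a full multinomial model.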
Subjects
Genomics , Logistic Models , Bayes Theorem , Risk Factors
ABSTRACT
The stage of cancer is a discrete ordinal response that indicates the aggressiveness of disease and is often used by physicians to determine the type and intensity of treatment to be administered. For example, the FIGO stage in cervical cancer is based on the size and depth of the tumor as well as the level of spread. It may be of clinical relevance to identify molecular features from high-throughput genomic assays that are associated with the stage of cervical cancer to elucidate pathways related to tumor aggressiveness, identify improved molecular features that may be useful for staging, and identify therapeutic targets. High-throughput RNA-Seq data and corresponding clinical data (including stage) for cervical cancer patients have been made available through The Cancer Genome Atlas Project (TCGA). We recently described penalized Bayesian ordinal response models that can be used for variable selection for over-parameterized datasets, such as the TCGA-CESC dataset. Herein, we describe our ordinalbayes R package, available from the Comprehensive R Archive Network (CRAN), which enhances the runjags R package by enabling users to easily fit cumulative logit models when the outcome is ordinal and the number of predictors exceeds the sample size, P > N, such as for TCGA and other high-throughput genomic data. We demonstrate the use of this package by applying it to the TCGA cervical cancer dataset. Our ordinalbayes package can be used to fit models to high-dimensional datasets, and it effectively performs variable selection.
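For readers unfamiliar with cumulative logit models, a minimal sketch of how category probabilities arise from cutpoints and a linear predictor follows. It assumes the common parameterization logit P(Y <= j | x) = alpha_j - x'beta with increasing cutpoints alpha_j; the sign convention used inside ordinalbayes may differ, and the function below is illustrative only and does not call the package.

```python
import numpy as np


def cumulative_logit_probs(cutpoints, x, beta):
    """Category probabilities from a cumulative logit model with
    logit P(Y <= j | x) = cutpoints[j] - x'beta (assumed convention)."""
    eta = cutpoints - x @ beta
    cum = 1.0 / (1.0 + np.exp(-eta))     # P(Y <= j) for j = 1..J-1
    cum = np.concatenate([cum, [1.0]])   # P(Y <= J) = 1
    return np.diff(cum, prepend=0.0)     # P(Y = j) as successive differences
```

With J ordered categories there are J - 1 cutpoints, and a positive coefficient shifts probability mass toward higher categories (for example, later stages) under this convention.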
ABSTRACT
BACKGROUND: Racial/ethnic disparities in health reflect a combination of genetic and environmental causes, and DNA methylation may be an important mediator. We compared in an exploratory manner the blood DNA methylome of Japanese Americans (JPA) versus European Americans (EUA). METHODS: Genome-wide buffy coat DNA methylation was profiled among healthy Multiethnic Cohort participant women who were Japanese (JPA; n = 30) or European (EUA; n = 28) Americans aged 60-65. Differentially methylated CpGs by race/ethnicity (DM-CpGs) were identified by linear regression (Bonferroni-corrected P < 0.1) and analyzed in relation to corresponding gene expression, a priori selected single nucleotide polymorphisms (SNPs), and blood biomarkers of inflammation and metabolism using Pearson or Spearman correlations (FDR < 0.1). RESULTS: We identified 174 DM-CpGs with the majority hypermethylated in JPA compared to EUA (n = 133), often in promoter regions (n = 48). Half (51%) of the genes corresponding to the DM-CpGs were involved in liver function and liver disease, and the methylation in nine genes was significantly correlated with gene expression for DM-CpGs. A total of 156 DM-CpGs were associated with rs7489665 (SH2B1). Methylation of DM-CpGs was correlated with blood levels of the cytokine MIP1B (n = 146). We confirmed some of the DM-CpGs in the TCGA adjacent non-tumor liver tissue of Asians versus EUA. CONCLUSION: We found a number of differentially methylated CpGs in blood DNA between JPA and EUA women with a potential link to liver disease, specific SNPs, and systemic inflammation. These findings may support further research on the role of DNA methylation in mediating some of the higher risk of liver disease among JPA.
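The two multiplicity corrections named in the methods can be sketched as follows. The abstract does not state which FDR procedure was used, so the Benjamini-Hochberg adjustment here is an assumption, and the function names are illustrative. Bonferroni multiplies each p-value by the number of tests, while Benjamini-Hochberg scales each sorted p-value by m / rank and then enforces monotonicity.

```python
import numpy as np


def bonferroni(pvals):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR), in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # running minimum from the largest p-value downward keeps the
    # adjusted values non-decreasing in the raw p-values
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj
```

A CpG would then be retained when its adjusted p-value falls below the chosen threshold, 0.1 in the study described above.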