Results 1 - 20 of 223
1.
Bayesian Anal ; 19(4): 1067-1095, 2024 Dec.
Article in English | MEDLINE | ID: mdl-39465034

ABSTRACT

Functional concurrent, or varying-coefficient, regression models are a form of functional data analysis methods in which functional covariates and outcomes are collected concurrently. Two active areas of research for this class of models are identifying influential functional covariates and clustering their relations across observations. In various applications, researchers have applied and developed methods to address these objectives separately. However, no approach currently performs both tasks simultaneously. In this paper, we propose a fully Bayesian functional concurrent regression mixture model that simultaneously performs functional variable selection and clustering for subject-specific trajectories. Our approach introduces a novel spiked Ewens-Pitman attraction prior that identifies and clusters subjects' trajectories marginally for each functional covariate while using similarities in subjects' auxiliary covariate patterns to inform clustering allocation. Using simulated data, we evaluate the clustering, variable selection, and parameter estimation performance of our approach and compare its performance with alternative spiked processes. We then apply our method to functional data collected in a novel, smartphone-based smoking cessation intervention study to investigate individual-level dynamic relations between smoking behaviors and potential risk factors.

2.
Stat Med ; 2024 Oct 31.
Article in English | MEDLINE | ID: mdl-39479896

ABSTRACT

Broadening eligibility criteria in cancer trials has been advocated to represent the intended patient population more accurately. The advantages are clear in terms of generalizability and recruitment; however, there are some important considerations in terms of design for efficiency and patient safety. While toxicity may be expected to be homogeneous across these subpopulations, designs should be able to recommend safe and precise doses if subpopulations with different toxicity profiles exist. Dose-finding designs accounting for patient heterogeneity have been proposed, but existing methods assume that the source of heterogeneity is known. We propose a broadened eligibility dose-finding design to address the situation of unknown patient heterogeneity in phase I cancer clinical trials where eligibility is expanded, and multiple eligibility criteria could potentially lead to different optimal doses for patient subgroups. The design offers a two-in-one approach to dose-finding by simultaneously selecting patient criteria that differentiate the maximum tolerated dose (MTD), using stochastic search variable selection, and recommending the subpopulation-specific MTD if needed. Our simulation study compares the proposed design to the naive approach of assuming patient homogeneity and demonstrates favorable operating characteristics across a wide range of scenarios, allocating patients more often to their true MTD during the trial, recommending more than one MTD when needed, and identifying criteria that differentiate the patient population. The proposed design highlights the advantages of adding more variability at an early stage and demonstrates how assuming patient homogeneity can lead to unsafe or sub-therapeutic dose recommendations.

3.
Bioinform Biol Insights ; 18: 11779322241271535, 2024.
Article in English | MEDLINE | ID: mdl-39286768

ABSTRACT

Tumor heterogeneity is a challenge to designing effective and targeted therapies. Glioma-type identification depends on specific molecular and histological features, which are defined by the official World Health Organization (WHO) classification of the central nervous system (CNS). These guidelines are constantly updated to support the diagnosis process, which affects all the successive clinical decisions. In this context, the search for new potential diagnostic and prognostic targets, characteristic of each glioma type, is crucial to support the development of novel therapies. Based on The Cancer Genome Atlas (TCGA) glioma RNA-sequencing data set updated according to the 2016 and 2021 WHO guidelines, we proposed a 2-step variable selection approach for biomarker discovery. Our framework encompasses the graphical lasso algorithm to estimate sparse networks of genes carrying diagnostic information. These networks are then used as input for a regularized Cox survival regression model, allowing the identification of a smaller subset of genes with prognostic value. In each step, the results derived from the 2016 and 2021 classes were discussed and compared. For both WHO glioma classifications, our analysis identifies potential biomarkers, characteristic of each glioma type. Yet, better results were obtained for the WHO CNS classification in 2021, thereby supporting recent efforts to include molecular data on glioma classification.

4.
Entropy (Basel) ; 26(9)2024 Sep 16.
Article in English | MEDLINE | ID: mdl-39330127

ABSTRACT

Variable selection methods have been extensively developed for and applied to cancer genomics data to identify important omics features associated with complex disease traits, including cancer outcomes. However, the reliability and reproducibility of the findings are in question if valid inferential procedures are not available to quantify the uncertainty of the findings. In this article, we provide a gentle but systematic review of high-dimensional frequentist and Bayesian inferential tools under sparse models which can yield uncertainty quantification measures, including confidence (or Bayesian credible) intervals, p values and false discovery rates (FDR). Connections in high-dimensional inferences between the two realms have been fully exploited under the "unpenalized loss function + penalty term" formulation for regularization methods and the "likelihood function × shrinkage prior" framework for regularized Bayesian analysis. In particular, we advocate for robust Bayesian variable selection in cancer genomics studies due to its ability to accommodate disease heterogeneity in the form of heavy-tailed errors and structured sparsity while providing valid statistical inference. The numerical results show that robust Bayesian analysis incorporating exact sparsity has yielded not only superior estimation and identification results but also valid Bayesian credible intervals under nominal coverage probabilities compared with alternative methods, especially in the presence of heavy-tailed model errors and outliers.
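
The two formulations named in this abstract can be written out generically as follows. This is a sketch in assumed notation (loss L, penalty pen with tuning parameter λ, likelihood f, prior π), not the notation of the review itself.

```latex
% Regularization: unpenalized loss + penalty term
\hat{\beta}_{\mathrm{pen}} = \arg\min_{\beta} \Big\{ L(\beta; y, X) + \sum_{j=1}^{p} \mathrm{pen}_{\lambda}(\beta_j) \Big\}

% Regularized Bayesian analysis: likelihood function x shrinkage prior
\pi(\beta \mid y, X) \propto f(y \mid X, \beta)\, \pi(\beta)
```

As one standard instance of the connection, with a Gaussian likelihood and independent Laplace priors on the coefficients, the posterior mode coincides with the LASSO estimate obtained from squared-error loss with an L1 penalty.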

5.
Ann Appl Stat ; 18(2): 1360-1377, 2024 Jun.
Article in English | MEDLINE | ID: mdl-39328363

ABSTRACT

Environmental exposures such as cigarette smoking influence health outcomes through intermediate molecular phenotypes, such as the methylome, transcriptome, and metabolome. Mediation analysis is a useful tool for investigating the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposures and health outcomes. However, little work has been done on mediation analysis when the mediators are high-dimensional and the outcome is a survival endpoint, and none of it has provided a robust measure of total mediation effect. To this end, we propose an estimation procedure for Mediation Analysis of Survival outcome and High-dimensional omics mediators (MASH) based on sure independence screening for putative mediator variable selection and a second-moment-based measure of total mediation effect for survival data analogous to the R² measure in a linear model. Extensive simulations showed good performance of MASH in estimating the total mediation effect and identifying true mediators. By applying MASH to the metabolomics data of 1919 subjects in the Framingham Heart Study, we identified five metabolites as mediators of the effect of cigarette smoking on coronary heart disease risk (total mediation effect, 51.1%) and two metabolites as mediators between smoking and risk of cancer (total mediation effect, 50.7%). Application of MASH to a diffuse large B-cell lymphoma genomics data set identified copy-number variations for eight genes as mediators between the baseline International Prognostic Index score and overall survival.
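
The screening step mentioned in this abstract can be illustrated with a minimal sketch of sure independence screening in its simplest generic form: rank candidate variables by the absolute marginal correlation with a continuous response and keep the top few. This is not the MASH procedure, which targets survival outcomes; the function names `pearson` and `sis_screen` are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def sis_screen(columns, y, keep):
    """Indices of the `keep` columns with the largest |marginal correlation| with y."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Toy data (made up): columns 0 and 2 track y closely, column 1 does not.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 2.0, 1.0, 2.0],
    [5.0, 4.0, 3.0, 2.0, 1.5],
]
y = [1.1, 2.0, 2.9, 4.2, 5.0]
selected = sis_screen(columns, y, keep=2)
```

In practice the retained set is usually passed to a second-stage penalized or model-based selection step, since marginal screening alone can miss jointly important variables.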

6.
Stat Med ; 43(26): 4928-4983, 2024 Nov 20.
Article in English | MEDLINE | ID: mdl-39260448

ABSTRACT

Data irregularity in cancer genomics studies has been widely observed in the form of outliers and heavy-tailed distributions in the complex traits. In the past decade, robust variable selection methods have emerged as powerful alternatives to the nonrobust ones to identify important genes associated with heterogeneous disease traits and build superior predictive models. In this study, to keep the remarkable features of the quantile LASSO and fully Bayesian regularized quantile regression while overcoming their disadvantage in the analysis of high-dimensional genomics data, we propose the spike-and-slab quantile LASSO through a fully Bayesian spike-and-slab formulation under the robust likelihood by adopting the asymmetric Laplace distribution (ALD). The proposed robust method has inherited the prominent properties of selective shrinkage and self-adaptivity to the sparsity pattern from the spike-and-slab LASSO (Ročková and George, J Am Stat Associat, 2018, 113(521): 431-444). Furthermore, the spike-and-slab quantile LASSO has a computational advantage to locate the posterior modes via soft-thresholding rule guided Expectation-Maximization (EM) steps in the coordinate descent framework, a phenomenon rarely observed for robust regularization with nondifferentiable loss functions. We have conducted comprehensive simulation studies with a variety of heavy-tailed errors in both homogeneous and heterogeneous model settings to demonstrate the superiority of the spike-and-slab quantile LASSO over its competing methods. The advantage of the proposed method has been further demonstrated in case studies of the lung adenocarcinomas (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA).
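
Two building blocks named in this abstract have short closed forms, sketched below under generic textbook definitions rather than the authors' implementation: the quantile (check) loss, whose minimization corresponds to maximizing an asymmetric Laplace likelihood, and the soft-thresholding operator used in coordinate descent updates. The function names are hypothetical.

```python
def check_loss(residual, tau):
    """Quantile (check) loss: rho_tau(r) = r * (tau - 1[r < 0]), with 0 < tau < 1."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0), with lam >= 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # values inside [-lam, lam] are shrunk exactly to zero


# At tau = 0.25, a negative residual is penalized three times as heavily
# as a positive residual of the same size.
loss_pos = check_loss(2.0, 0.25)
loss_neg = check_loss(-2.0, 0.25)
shrunk = [soft_threshold(z, 1.0) for z in (3.0, -0.5, -3.0)]
```

The exact zeros produced by soft-thresholding are what give LASSO-type estimators their variable selection behavior.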


Subjects
Bayes Theorem; Computer Simulation; Genomics; Lung Neoplasms; Neoplasms; Humans; Genomics/methods; Neoplasms/genetics; Lung Neoplasms/genetics; Models, Statistical; Skin Neoplasms/genetics; Melanoma/genetics; Likelihood Functions
7.
Can J Stat ; 52(3): 900-923, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39319323

ABSTRACT

When analyzing data combined from multiple sources (e.g., hospitals, studies), the heterogeneity across different sources must be accounted for. In this paper, we consider high-dimensional linear regression models for integrative data analysis. We propose a new adaptive clustering penalty (ACP) method to simultaneously select variables and cluster source-specific regression coefficients with sub-homogeneity. We show that the estimator based on the ACP method enjoys a strong oracle property under certain regularity conditions. We also develop an efficient algorithm based on the alternating direction method of multipliers (ADMM) for parameter estimation. We conduct simulation studies to compare the performance of the proposed method to three existing methods (a fused LASSO with adjacent fusion, a pairwise fused LASSO, and a multi-directional shrinkage penalty method). Finally, we apply the proposed method to the multi-center Childhood Adenotonsillectomy Trial to identify sub-homogeneity in the treatment effects across different study sites.


8.
IISE Trans Healthc Syst Eng ; 14(2): 130-140, 2024.
Article in English | MEDLINE | ID: mdl-39055377

ABSTRACT

Radiation therapy (RT) is a frontline approach to treating cancer. While the target of radiation dose delivery is the tumor, there is an inevitable spill of dose to nearby normal organs causing complications. This phenomenon is known as radiotherapy toxicity. To predict the outcome of the toxicity, statistical models can be built based on dosimetric variables received by the normal organ at risk (OAR), known as Normal Tissue Complication Probability (NTCP) models. To tackle the challenge of the high dimensionality of dosimetric variables and limited clinical sample sizes, statistical models with variable selection techniques are viable choices. However, existing variable selection techniques are data-driven and do not integrate medical domain knowledge into the model formulation. We propose a knowledge-constrained generalized linear model (KC-GLM). KC-GLM includes a new mathematical formulation to translate three pieces of domain knowledge into non-negativity, monotonicity, and adjacent similarity constraints on the model coefficients. We further propose an equivalent transformation of the KC-GLM formulation, which makes it possible to solve the model coefficients using existing optimization solvers. Furthermore, we compare KC-GLM and several well-known variable selection techniques via a simulation study and on two real datasets of prostate cancer and lung cancer, respectively. These experiments show that KC-GLM selects variables with better interpretability, avoids producing counter-intuitive and misleading results, and has better prediction accuracy.

9.
Stat Med ; 43(20): 3792-3814, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-38923006

ABSTRACT

Integrative analysis has emerged as a prominent tool in biomedical research, offering a solution to the "small n and large p" challenge. Leveraging the powerful capabilities of deep learning in extracting complex relationships between genes and diseases, our objective in this study is to incorporate deep learning into the framework of integrative analysis. Recognizing the redundancy within candidate features, we introduce a dedicated feature selection layer in the proposed integrative deep learning method. To further improve the performance of feature selection, an ensemble learning method draws on the rich body of previous research to identify "prior information". This leads to the proposed prior assisted integrative deep learning (PANDA) method. We demonstrate the superiority of the PANDA method through a series of simulation studies, showing its clear advantages over competing approaches in both feature selection and outcome prediction. Finally, a skin cutaneous melanoma (SKCM) dataset is extensively analyzed by the PANDA method to show its practical application.


Subjects
Deep Learning; Melanoma; Skin Neoplasms; Humans; Melanoma/genetics; Computer Simulation; Algorithms
10.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38412301

ABSTRACT

Ordinal class labels are frequently observed in classification studies across various fields. In medical science, patients' responses to a drug can be arranged in the natural order, reflecting their recovery postdrug administration. The severity of the disease is often recorded using an ordinal scale, such as cancer grades or tumor stages. We propose a method based on the linear discriminant analysis (LDA) that generates a sparse, low-dimensional discriminant subspace reflecting the class orders. Unlike existing approaches that focus on predictors marginally associated with ordinal labels, our proposed method selects variables that collectively contribute to the ordinal labels. We employ the optimal scoring approach for LDA as a regularization framework, applying an ordinality penalty to the optimal scores and a sparsity penalty to the coefficients for the predictors. We demonstrate the effectiveness of our approach using a glioma dataset, where we predict cancer grades based on gene expression. A simulation study with various settings validates the competitiveness of our classification performance and demonstrates the advantages of our approach in terms of the interpretability of the estimated classifier with respect to the ordinal class labels.


Subjects
Algorithms; Neoplasms; Humans; Discriminant Analysis; Computer Simulation; Neoplasms/genetics; Neoplasms/metabolism
12.
Int J Biostat ; 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38348882

ABSTRACT

In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.

13.
Spectrochim Acta A Mol Biomol Spectrosc ; 310: 123897, 2024 Apr 05.
Article in English | MEDLINE | ID: mdl-38266599

ABSTRACT

Attenuated total reflectance (ATR) Fourier transform infrared (FTIR) spectroscopy is a promising rapid, reagent-free, and low-cost technique considered for clinical translation. It allows the proteome, lipidome, and metabolome of biofluids to be characterized at once. Metainflammatory disorders share a constellation of chronic systemic inflammation, oxidative stress, aberrant adipogenesis, and hypoxia that significantly increases cardiovascular and cancer risk. As a result, these patients have elevated concentrations of cfDNA in the bloodstream. Considering this, DNA amplicons were analyzed by ATR-FTIR at three concentrations in 1:100 dilution steps (IU/mL): 718, 7.18, and 0.0718. The generated IR spectrum was used as a guide for variable selection. The main peaks in the biofingerprint (1800-900 cm⁻¹) give important information about the base, base-sugar, phosphate, and sugar-phosphate transitions of DNA. To validate our variable selection method in blood plasma, 38 control subjects and 12 subjects with metabolic syndrome were used. Using the wavenumbers of the peaks in the biofingerprint of the DNA amplicons, a discriminant analysis model with Mahalanobis distance was generated for blood plasma, and 100% discrimination accuracy was obtained. In addition, the interval 1475-1188 cm⁻¹ showed the greatest sensitivity to variation in the concentration of DNA amplicons, so curve fitting with a Gaussian function was performed, obtaining an adjusted R² of 0.993. PCA with Mahalanobis distance in the interval 1475-1188 cm⁻¹ obtained an accuracy of 96%, and PLS-DA modeling in the interval 1475-1088 cm⁻¹ obtained AUC = 0.991 with a sensitivity of 95% and a specificity of 100%. Therefore, ATR-FTIR spectroscopy with variable selection guided by DNA IR peaks is a promising and efficient method to be applied in metainflammatory disorders.
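
As a rough illustration of discriminant analysis with Mahalanobis distance, the sketch below assigns a sample to the group whose mean is nearest under a pooled covariance. It is restricted to two features so the matrix inverse can be written in closed form, the data are made up, and the function names are hypothetical; it is not the model fitted in the study.

```python
def mean_vec(rows):
    """Column means of a list of 2-feature samples."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def pooled_cov(groups):
    """Pooled within-group 2x2 covariance matrix over a list of groups."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    dof = 0
    for rows in groups:
        m = mean_vec(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        dof += len(rows) - 1
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]


def mahalanobis_sq(x, m, cov):
    """Squared Mahalanobis distance (x - m)' cov^{-1} (x - m) for a 2x2 cov."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - m[0], x[1] - m[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))


def classify(x, groups):
    """Label of the group whose mean is closest to x in Mahalanobis distance."""
    cov = pooled_cov(list(groups.values()))
    return min(groups, key=lambda g: mahalanobis_sq(x, mean_vec(groups[g]), cov))


# Made-up intensities at two selected wavenumbers for two groups.
groups = {
    "control": [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3)],
    "case": [(3.0, 3.1), (3.2, 2.8), (2.9, 3.3), (3.1, 2.9)],
}
label_a = classify((1.0, 1.1), groups)
label_b = classify((3.0, 3.0), groups)
```

Unlike Euclidean distance, the Mahalanobis distance rescales by the covariance, so correlated or differently scaled spectral variables do not dominate the assignment.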


Subjects
Phosphates; Plasma; Humans; Spectroscopy, Fourier Transform Infrared/methods; Discriminant Analysis; Sugars; Ataxia Telangiectasia Mutated Proteins
14.
Am J Hum Genet ; 111(2): 213-226, 2024 02 01.
Article in English | MEDLINE | ID: mdl-38171363

ABSTRACT

The aim of fine mapping is to identify genetic variants causally contributing to complex traits or diseases. Existing fine-mapping methods employ Bayesian discrete mixture priors and depend on a pre-specified maximum number of causal variants, which may lead to sub-optimal solutions. In this work, we propose a Bayesian fine-mapping method called h2-D2, utilizing a continuous global-local shrinkage prior. We also present an approach to define credible sets of causal variants in continuous prior settings. Simulation studies demonstrate that h2-D2 outperforms current state-of-the-art fine-mapping methods such as SuSiE and FINEMAP in accurately identifying causal variants and estimating their effect sizes. We further applied h2-D2 to prostate cancer analysis and discovered some previously unknown causal variants. In addition, we inferred 369 target genes associated with the detected causal variants and several pathways that were significantly over-represented by these genes, shedding light on their potential roles in prostate cancer development and progression.


Subjects
Prostatic Neoplasms; Quantitative Trait Loci; Male; Humans; Bayes Theorem; Polymorphism, Single Nucleotide/genetics; Computer Simulation; Prostatic Neoplasms/genetics; Genome-Wide Association Study/methods
15.
Phytopathology ; 114(2): 393-404, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37581435

ABSTRACT

Peanuts grown in tropical, subtropical, and temperate regions are susceptible to stem rot, which is a soilborne disease caused by Athelia rolfsii. Due to the lack of reliable environmental-based scheduling recommendations, stem rot control relies heavily on fungicides that are applied at predetermined intervals. We conducted inoculated field experiments for six site-years in North Florida to examine the relationship between germination of A. rolfsii sclerotia (the inoculum), stem rot symptom development in the peanut crop, and environmental factors such as soil temperature (ST), soil moisture, relative humidity (RH), precipitation, evapotranspiration, and solar radiation. Window-pane analysis with hourly and daily environmental data for 5- to 28-day periods before each disease assessment was used to select model predictors by means of correlation analysis, regularized regression, and exhaustive feature selection. Our results indicated that within-canopy ST (at 0.05 m belowground) and RH (at 0.15 m aboveground) were the most important environmental variables that influenced the progress of mycelial activity in susceptible peanut crops. Decision tree analysis resulted in an easy-to-interpret one-variable model (adjusted R² = 0.51, Akaike information criterion [AIC] = 324, root average square error [RASE] = 14.21) or two-variable model (adjusted R² = 0.61, AIC = 306, RASE = 10.95) that provided an action threshold for various disease scenarios based on number of hours of canopy RH above 90% and ST between 25 and 35°C in a 14-day window. Coupling an existing preseason risk index for stem rot, such as Peanut Rx, with the environmentally based predictors identified in this study would be a logical next step to optimize stem rot management. Copyright © 2024 The Author(s). This is an open access article distributed under the CC BY 4.0 International license.


Subjects
Arachis; Plant Diseases; Plant Diseases/prevention & control; Crops, Agricultural; Soil; Disease Management
16.
bioRxiv ; 2024 Mar 07.
Article in English | MEDLINE | ID: mdl-37662296

ABSTRACT

Environmental exposures such as cigarette smoking influence health outcomes through intermediate molecular phenotypes, such as the methylome, transcriptome, and metabolome. Mediation analysis is a useful tool for investigating the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposures and health outcomes. However, little work has been done on mediation analysis when the mediators are high-dimensional and the outcome is a survival endpoint, and none of it has provided a robust measure of total mediation effect. To this end, we propose an estimation procedure for Mediation Analysis of Survival outcome and High-dimensional omics mediators (MASH) based on sure independence screening for putative mediator variable selection and a second-moment-based measure of total mediation effect for survival data analogous to the R² measure in a linear model. Extensive simulations showed good performance of MASH in estimating the total mediation effect and identifying true mediators. By applying MASH to the metabolomics data of 1919 subjects in the Framingham Heart Study, we identified five metabolites as mediators of the effect of cigarette smoking on coronary heart disease risk (total mediation effect, 51.1%) and two metabolites as mediators between smoking and risk of cancer (total mediation effect, 50.7%). Application of MASH to a diffuse large B-cell lymphoma genomics data set identified copy-number variations for eight genes as mediators between the baseline International Prognostic Index score and overall survival.

17.
Stat Methods Med Res ; 33(1): 3-23, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38155567

ABSTRACT

Generalized linear mixed models are commonly used to describe relationships between correlated responses and covariates in medical research. In this paper, we propose a simple and easily implementable regularized estimation approach to select both fixed and random effects in generalized linear mixed models. Specifically, we propose to construct and optimize the objective functions using the confidence distributions of model parameters, as opposed to using the observed data likelihood functions, to perform effect selection. Two estimation methods are developed. The first uses the joint confidence distribution of model parameters to select fixed and random effects simultaneously. The second uses the marginal confidence distributions of model parameters to select fixed and random effects separately. With a proper choice of regularization parameters in the adaptive LASSO framework, we show the consistency and oracle properties of the proposed regularized estimators. Simulation studies have been conducted to assess the performance of the proposed estimators and demonstrate computational efficiency. Our method has also been applied to two longitudinal cancer studies to identify demographic and clinical factors associated with patient health outcomes after cancer therapies.


Subjects
Neoplasms; Humans; Linear Models; Likelihood Functions; Computer Simulation; Longitudinal Studies
18.
Stat Med ; 42(30): 5616-5629, 2023 12 30.
Article in English | MEDLINE | ID: mdl-37806971

ABSTRACT

A wealth of gene expression data generated by high-throughput techniques provides exciting opportunities for studying gene-gene interactions systematically. Gene-gene interactions in a biological system are tightly regulated and are often highly dynamic. The interactions can change flexibly under various internal cellular signals or external stimuli. Previous studies have developed statistical methods to examine these dynamic changes in gene-gene interactions. However, due to the massive number of possible gene combinations that need to be considered in a typical genomic dataset, intensive computation is a common challenge for exploring gene-gene interactions. On the other hand, oftentimes only a small proportion of gene combinations exhibit dynamic co-expression changes. To solve this problem, we propose Bayesian variable selection approaches based on spike-and-slab priors. The proposed algorithms reduce the computational intensity by focusing on identifying subsets of promising gene combinations in the search space. We also adopt a Bayesian multiple hypothesis testing procedure to identify strong dynamic gene co-expression changes. Simulation studies are performed to compare the proposed approaches with existing exhaustive search heuristics. We demonstrate the implementation of our proposed approach to study the association between gene co-expression patterns and overall survival using the RNA-sequencing dataset from The Cancer Genome Atlas breast cancer BRCA-US project.


Subjects
Algorithms; Genomics; Humans; Bayes Theorem; Computer Simulation; Heuristics
19.
Stat Methods Med Res ; 32(12): 2455-2471, 2023 12.
Article in English | MEDLINE | ID: mdl-37823396

ABSTRACT

Standard survival models such as the proportional hazards model contain a single regression component, corresponding to the scale of the hazard. In contrast, we consider the so-called "multi-parameter regression" approach whereby covariates enter the model through multiple distributional parameters simultaneously, for example, scale and shape parameters. This approach has previously been shown to achieve flexibility with relatively low model complexity. However, beyond a stepwise type selection method, variable selection methods are underdeveloped in the multi-parameter regression survival modeling setting. Therefore, we propose penalized multi-parameter regression estimation procedures using the following penalties: least absolute shrinkage and selection operator, smoothly clipped absolute deviation, and adaptive least absolute shrinkage and selection operator. We compare these procedures using extensive simulation studies and an application to data from an observational lung cancer study; the Weibull multi-parameter regression model is used throughout as a running example.
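
The three penalties listed in this abstract have standard closed forms, sketched below for a single coefficient. The definitions follow the usual textbook formulas (SCAD with the conventional default a = 3.7) and are not taken from the paper; the function names and the adaptive weight argument are hypothetical, with the weight typically set from an initial estimate such as 1/|initial coefficient|.

```python
def lasso_penalty(beta, lam):
    """LASSO penalty: lam * |beta|."""
    return lam * abs(beta)


def adaptive_lasso_penalty(beta, lam, weight):
    """Adaptive LASSO penalty: lam * weight * |beta|, with a coefficient-specific weight."""
    return lam * weight * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty (requires a > 2): linear near zero, quadratic blend, then constant."""
    t = abs(beta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


# SCAD matches LASSO for small coefficients but stops growing for large ones,
# which reduces the shrinkage bias on strong effects.
small = (lasso_penalty(0.5, 1.0), scad_penalty(0.5, 1.0))
large = (lasso_penalty(10.0, 1.0), scad_penalty(10.0, 1.0))
```

In a multi-parameter regression setting, one such penalty term would be applied to each coefficient in each regression component (for example, scale and shape), possibly with separate tuning parameters.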


Subjects
Lung Neoplasms; Humans; Proportional Hazards Models; Computer Simulation; Multivariate Analysis
20.
Stat Med ; 42(28): 5266-5284, 2023 12 10.
Article in English | MEDLINE | ID: mdl-37715500

ABSTRACT

In recent years, comprehensive cancer genomics platforms, such as The Cancer Genome Atlas (TCGA), have provided access to an enormous amount of high-throughput genomic data for each patient, including gene expression, DNA copy number alterations, DNA methylation, and somatic mutation. While the integration of these multi-omics datasets has the potential to provide novel insights that can lead to personalized medicine, most existing approaches focus only on gene-level analysis and lack the ability to facilitate biological findings at the pathway level. In this article, we propose Bayes-InGRiD (Bayesian Integrative Genomics Robust iDentification of cancer subgroups), a novel pathway-guided Bayesian sparse latent factor model for the simultaneous identification of cancer patient subgroups (clustering) and key molecular features (variable selection) within a unified framework, based on the joint analysis of continuous, binary, and count data. By utilizing pathway (gene set) information, Bayes-InGRiD not only enhances the accuracy and robustness of cancer patient subgroup and key molecular feature identification, but also promotes biological understanding and interpretation. Finally, to facilitate efficient posterior sampling, an alternative Gibbs sampler for logistic and negative binomial models is proposed using Pólya-Gamma mixtures of normals to represent latent variables for binary and count data, which yields a conditionally Gaussian representation of the posterior. The R package "INGRID" implementing the proposed approach is currently available on our research group's GitHub webpage (https://dongjunchung.github.io/INGRID/).


Subjects
Genomics; Neoplasms; Humans; Bayes Theorem; Neoplasms/genetics; Models, Statistical; DNA Methylation