ABSTRACT
This paper tackles the challenge of estimating correlations between higher-level biological variables (e.g., proteins and gene pathways) when only lower-level measurements are directly observed (e.g., peptides and individual genes). Existing methods typically aggregate lower-level data into higher-level variables and then estimate correlations from the aggregated data. However, different aggregation methods can yield different correlation estimates because they target different higher-level quantities. Our solution is a latent factor model that estimates these higher-level correlations directly from lower-level data, without data aggregation. We further introduce a shrinkage estimator that ensures positive definiteness and improves the accuracy of the estimated correlation matrix, and we establish the asymptotic normality of our estimator, enabling efficient computation of P-values for identifying significant correlations. The effectiveness of our approach is demonstrated through comprehensive simulations and analyses of proteomics and gene expression datasets. We develop the R package highcor implementing our method.
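For intuition only, the base-R sketch below shows a generic linear shrinkage of a correlation estimate toward the identity matrix, which guarantees positive definiteness once the shrinkage weight is large enough; the matrix `R_hat` and weight `lambda` are made up, and this is not the estimator developed in the paper.

```r
# Generic linear shrinkage of a correlation estimate toward the identity.
# Illustrative sketch only, not the estimator proposed in the paper.
shrink_to_identity <- function(R_hat, lambda = 0.2) {
  p <- nrow(R_hat)
  (1 - lambda) * R_hat + lambda * diag(p)
}

# Toy example: a symmetric "correlation" estimate that is not positive definite.
R_hat <- matrix(c(1,  0.9,  0.7,
                  0.9, 1,  -0.9,
                  0.7, -0.9, 1), nrow = 3, byrow = TRUE)
min(eigen(R_hat, symmetric = TRUE)$values)                              # negative: indefinite
min(eigen(shrink_to_identity(R_hat, 0.5), symmetric = TRUE)$values)     # positive after shrinkage
```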
ABSTRACT
Biological networks, which summarize regulatory interactions and other relationships between molecules, are important for the analysis of human diseases. Understanding and constructing networks for molecules such as DNA, RNA, and proteins can help elucidate the mechanisms of complex biological systems. Gaussian graphical models (GGMs) are popular tools for estimating biological networks, yet reconstructing GGMs from high-dimensional datasets remains challenging because current methods do not handle the resulting sparsity and high dimensionality well. Here, we develop a new GGM, the GR2D2 (Graphical $R^2$-induced Dirichlet Decomposition) model, based on the R2D2 priors for linear models, and we provide a data-augmented block Gibbs sampler algorithm. The R code is available at https://github.com/RavenGan/GR2D2. The GR2D2 estimator shows superior performance in estimating precision matrices compared with existing techniques across various simulation settings. When the true precision matrix is sparse and of high dimension, GR2D2 provides estimates with the smallest information divergence from the underlying truth. We also compare the GR2D2 estimator with the graphical horseshoe estimator in five cancer RNA-seq gene expression datasets grouped by three cancer types. Our results show that GR2D2 successfully identifies common cancer pathways and cancer-specific pathways for each dataset.
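As background on what a GGM encodes, the base-R sketch below (unrelated to the GR2D2 code) shows that zero off-diagonal entries of a precision matrix correspond to zero partial correlations, i.e., missing edges, even though the covariance matrix itself is dense.

```r
# In a GGM, a zero off-diagonal entry of the precision matrix Omega means the
# corresponding pair of variables is conditionally independent (no edge).
# Partial correlation: rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj).
Omega <- matrix(c( 2, -1,  0,
                  -1,  2, -1,
                   0, -1,  2), nrow = 3, byrow = TRUE)   # chain graph 1 - 2 - 3

partial_cor <- -cov2cor(Omega)
diag(partial_cor) <- 1
partial_cor                # zero in position (1, 3): no edge between variables 1 and 3

Sigma <- solve(Omega)      # the covariance matrix itself is NOT sparse
round(Sigma, 3)
```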
Subjects
Algorithms, Oncogenes, Humans, Linear Models, Computer Simulation, RNA
ABSTRACT
Biomarkers are often measured in bulk to diagnose patients, monitor patient conditions, and research novel drug pathways. These measurements frequently suffer from detection limits that result in missing and untrustworthy values. Missing biomarkers are therefore commonly imputed so that downstream analyses can be conducted with modern statistical methods that cannot normally handle data subject to informative censoring. This work develops an empirical Bayes $g$-modeling method for imputing and denoising biomarker measurements. We establish superior estimation properties compared with popular methods in simulations and with real data, providing useful biomarker estimates for downstream analysis.
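To convey the flavor of empirical Bayes deconvolution (this is a generic NPMLE-style sketch, not the authors' g-modeling implementation, and it ignores detection limits), the base-R code below fits a discrete prior on a grid by EM and denoises simulated measurements by their posterior means.

```r
set.seed(1)
# Simulated noisy biomarker measurements: X_i = theta_i + N(0, 1) noise.
theta <- sample(c(0, 3), size = 500, replace = TRUE, prob = c(0.8, 0.2))
x     <- rnorm(500, mean = theta, sd = 1)

# NPMLE-style EM for a discrete prior g on a grid, then posterior-mean denoising.
grid <- seq(-2, 6, by = 0.1)
lik  <- outer(x, grid, function(xi, t) dnorm(xi, mean = t, sd = 1))
w    <- rep(1 / length(grid), length(grid))    # initial prior weights
for (iter in 1:200) {
  post <- sweep(lik, 2, w, "*")
  post <- post / rowSums(post)                 # posterior over the grid for each x_i
  w    <- colMeans(post)                       # EM update of the prior weights
}
theta_hat <- drop(post %*% grid)               # posterior-mean (denoised) estimates
c(raw_mse = mean((x - theta)^2), eb_mse = mean((theta_hat - theta)^2))  # EB typically smaller
```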
Subjects
Bayes Theorem, Biomarkers, Computer Simulation, Humans, Biomarkers/analysis, Statistical Models, Nonparametric Statistics, Statistical Data Interpretation
ABSTRACT
We propose a computationally and statistically efficient divide-and-conquer (DAC) algorithm to fit sparse Cox regression to massive datasets where the sample size $n_0$ is exceedingly large and the covariate dimension $p$ is not small but $n_0 \gg p$. The proposed algorithm achieves computational efficiency through a one-step linear approximation followed by a least squares approximation to the partial likelihood (PL). This sequence of linearizations enables us to maximize the PL using only a small subset of the data and to perform penalized estimation via a fast approximation to the PL. The algorithm is applicable to both time-independent and time-dependent survival data. Simulations suggest that the proposed DAC algorithm substantially outperforms the full-sample estimators and an existing DAC algorithm in computational speed, while achieving statistical efficiency similar to that of the full-sample estimators. The proposed algorithm was applied to extraordinarily large survival datasets to predict heart failure-specific readmission within 30 days among Medicare heart failure patients.
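A deliberately simple divide-and-conquer scheme, shown only to illustrate the idea of combining chunk-wise Cox fits (it is not the paper's one-step linearization plus least-squares approximation, and the simulated data and chunking are arbitrary):

```r
library(survival)

# Simple DAC illustration: fit a Cox model on each chunk and combine the chunk
# estimates by inverse-variance (information) weighting.
set.seed(1)
n <- 20000; p <- 5
x <- matrix(rnorm(n * p), n, p)
event_time <- rexp(n, rate = exp(drop(x %*% c(0.5, -0.5, 0.25, 0, 0))))
cens_time  <- rexp(n, rate = 0.2)
time   <- pmin(event_time, cens_time)
status <- as.numeric(event_time <= cens_time)

chunks    <- split(seq_len(n), rep(1:10, length.out = n))   # 10 subsets
info_sum  <- matrix(0, p, p)
score_sum <- rep(0, p)
for (idx in chunks) {
  fit  <- coxph(Surv(time[idx], status[idx]) ~ x[idx, ])
  info <- solve(vcov(fit))                  # chunk-level information matrix
  info_sum  <- info_sum + info
  score_sum <- score_sum + info %*% coef(fit)
}
beta_dac <- drop(solve(info_sum, score_sum))   # combined DAC estimate
round(beta_dac, 3)
```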
Subjects
Algorithms, Medicare, Aged, Computer Simulation, Humans, Least-Squares Analysis, Proportional Hazards Models, United States
ABSTRACT
Early-phase (phase I) clinical studies aim to investigate the safety and the underlying dose-toxicity relationship of a drug or drug combination. Although little may be known about the compound's properties, it is crucial to consider quantitative information from any studies previously conducted on the same drug. A meta-analytic approach can properly account for between-study heterogeneity and is readily extended to prediction or shrinkage applications. Here we propose a simple and robust two-stage approach for estimating the maximum tolerated dose(s) using penalized logistic regression and Bayesian random-effects meta-analysis methodology. Implementation is facilitated using standard R packages. The properties of the proposed methods are investigated in Monte Carlo simulations, and the investigations are motivated and illustrated by two examples from oncology.
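A rough sketch of the two-stage logic under simplifying assumptions (hypothetical dose-toxicity data, a 25% target toxicity rate, plain rather than penalized logistic regression, and REML meta-analysis via metafor in place of a fully Bayesian random-effects model):

```r
library(metafor)

# Hypothetical grouped dose-toxicity data from three previous studies.
studies <- list(
  data.frame(dose = c(1, 2, 4, 8),     n = c(3, 3, 3, 3),    tox = c(0, 0, 1, 2)),
  data.frame(dose = c(2, 4, 8, 16),    n = c(3, 3, 6, 3),    tox = c(0, 1, 2, 3)),
  data.frame(dose = c(1, 2, 4, 8, 16), n = c(3, 3, 3, 3, 3), tox = c(0, 0, 0, 1, 2))
)

# Stage 1: per-study logistic dose-toxicity fit; the log-MTD is the log dose with
# 25% toxicity risk, with a delta-method variance.
target <- 0.25
log_mtd <- var_log_mtd <- numeric(length(studies))
for (s in seq_along(studies)) {
  fit <- glm(cbind(tox, n - tox) ~ log(dose), family = binomial, data = studies[[s]])
  b <- coef(fit); V <- vcov(fit)
  log_mtd[s] <- (qlogis(target) - b[1]) / b[2]
  grad <- c(-1 / b[2], -log_mtd[s] / b[2])
  var_log_mtd[s] <- drop(t(grad) %*% V %*% grad)
}

# Stage 2: random-effects pooling of the study-specific log-MTDs.
pooled <- rma(yi = log_mtd, vi = var_log_mtd, method = "REML")
exp(c(mtd = as.numeric(coef(pooled)), lower = pooled$ci.lb, upper = pooled$ci.ub))
```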
Subjects
Medical Oncology, Research Design, Bayes Theorem, Computer Simulation, Drug Dose-Response Relationship, Humans, Logistic Models, Maximum Tolerated Dose, Monte Carlo Method
ABSTRACT
Shrinkage estimation in a meta-analysis framework may be used to facilitate dynamic borrowing of information. This framework can be used to analyze a new study in the light of previous data, which may differ in design (e.g., a randomized controlled trial and a clinical registry). We show how the common study weights arise in effect and shrinkage estimation, and how they may be generalized to Bayesian meta-analysis. We then develop simple ways to compute bounds on the weights, so that the contribution of the external evidence may be assessed a priori. These considerations are illustrated and discussed using numerical examples, including applications to the treatment of Creutzfeldt-Jakob disease and to fetal monitoring to prevent metabolic acidosis. The target study's contribution to the resulting estimate is shown to be bounded below; therefore, concerns that the evidence could easily be overwhelmed by external data are largely unwarranted.
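The lower bound on the target study's contribution can be seen directly in the normal-normal shrinkage formula; the base-R sketch below uses made-up numbers.

```r
# Normal-normal shrinkage estimate for a target (new) study:
#   theta_new = (1 - B) * y_new + B * mu_hat,  B = v_new / (v_new + tau2),
# where mu_hat pools the external studies. The weight on the study's own data,
# 1 - B = tau2 / (tau2 + v_new), is the limiting (smallest) value reached when
# the external evidence becomes infinitely precise, so the target study's
# contribution is bounded below no matter how much external data is added.
own_data_weight <- function(v_new, tau2) tau2 / (tau2 + v_new)

v_new <- 0.04   # variance of the new study's effect estimate (made-up)
tau2  <- 0.02   # between-study heterogeneity variance (made-up)
own_data_weight(v_new, tau2)            # about 0.33

y_new <- 0.50; mu_hat <- 0.10           # made-up estimates
B <- 1 - own_data_weight(v_new, tau2)
(1 - B) * y_new + B * mu_hat            # shrinkage estimate for the new study
```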
Subjects
Bayes Theorem, Randomized Controlled Trials as Topic
ABSTRACT
Multicollinearity among predictor variables is a frequent issue in longitudinal data analysis with linear mixed models (LMMs), arising in a host of business, biomedical, and epidemiological applications. We consider an efficient estimation strategy for high-dimensional applications in which the number of parameters exceeds the number of observations. In this paper, we are interested in estimating the fixed-effects parameters of the LMM when some prior information is assumed to be available in the form of linear restrictions on the parameters. We propose pretest and shrinkage estimation strategies using the ridge full model as the base estimator. We establish the asymptotic distributional bias and risks of the suggested estimators and investigate their relative performance with respect to the ridge full model estimator. Furthermore, we compare the numerical performance of LASSO-type estimators with the pretest and shrinkage ridge estimators. The methodology is investigated using simulation studies and then demonstrated on an application exploring how effective brain connectivity in the default mode network (DMN) may be related to genetics within the context of Alzheimer's disease.
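The pretest and Stein-type shrinkage constructions can be illustrated in a plain linear model (ignoring the random effects and the ridge base estimator of the paper); the sketch below, with simulated data, treats "the last q coefficients are zero" as the linear restriction.

```r
# Pretest and positive-part shrinkage estimators in a plain linear model (the
# paper's LMM setting, ridge base estimator, and asymptotic risk analysis are
# not reproduced). Restriction: the last q coefficients are zero.
set.seed(1)
n <- 200; p <- 8; q <- 4
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:(p - q)] %*% c(2, -1, 1, 0.5)) + rnorm(n)   # restriction actually holds

fit_full <- lm(y ~ X)
fit_sub  <- lm(y ~ X[, 1:(p - q)])

b_full <- coef(fit_full)[-1]
b_sub  <- c(coef(fit_sub)[-1], rep(0, q))
b2     <- b_full[(p - q + 1):p]                               # coefficients under test
V22    <- vcov(fit_full)[-1, -1][(p - q + 1):p, (p - q + 1):p]
Tn     <- drop(t(b2) %*% solve(V22, b2))                      # Wald statistic for the restriction

b_pretest <- if (Tn > qchisq(0.95, df = q)) b_full else b_sub
b_shrink  <- b_sub + max(0, 1 - (q - 2) / Tn) * (b_full - b_sub)   # positive-part Stein-type
round(cbind(full = b_full, restricted = b_sub, pretest = b_pretest, shrinkage = b_shrink), 3)
```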
ABSTRACT
This article concerns conditionally formulated multivariate Gaussian Markov random fields (MGMRFs) for modeling multivariate local dependencies with unknown dependence parameters subject to a positivity constraint. In the context of Bayesian hierarchical modeling of lattice data in general and Bayesian disease mapping in particular, analytic and simulation studies provide new insights into various approaches to posterior estimation of dependence parameters under a "hard" or "soft" positivity constraint, including the well-known strict diagonal dominance criterion and options for hierarchical priors. Hierarchical centering is examined as a means of gaining computational efficiency in Bayesian estimation of multivariate generalized linear mixed effects models in the presence of spatial confounding and weakly identified model parameters. Simulated data on irregular or regular lattices, and three datasets from the multivariate and spatiotemporal disease mapping literature, are used for illustration. The present investigation also sheds light on the use of the deviance information criterion for model comparison, choice, and interpretation in the context of posterior risk predictions, judged by the borrowing-of-information and bias-precision tradeoff. The article concludes with a summary discussion and directions for future work. Potential applications of MGMRFs in spatial information fusion and image analysis are briefly mentioned.
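One concrete face of the "hard" positivity constraint is that a conditionally specified GMRF precision matrix built from a lattice adjacency is positive definite whenever it is strictly diagonally dominant; the base-R toy below uses a univariate 3 x 3 lattice, not the multivariate models of the article.

```r
# Univariate toy (not the multivariate MGMRF of the article): for a proper CAR
# precision matrix Q = D - rho * W on a lattice, |rho| < 1 makes Q strictly
# diagonally dominant and hence positive definite ("hard" positivity constraint).
nr <- 3; nc <- 3; n <- nr * nc
coords <- as.matrix(expand.grid(row = 1:nr, col = 1:nc))
W <- matrix(0, n, n)
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  if (sum(abs(coords[i, ] - coords[j, ])) == 1) W[i, j] <- W[j, i] <- 1   # rook neighbours
}
D <- diag(rowSums(W))

min(eigen(D - 0.95 * W, symmetric = TRUE)$values)   # > 0: valid precision matrix
min(eigen(D - 1.05 * W, symmetric = TRUE)$values)   # < 0: positivity constraint violated
```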
Subjects
Statistical Models, Bayes Theorem, Computer Simulation, Humans, Linear Models, Normal Distribution
ABSTRACT
BACKGROUND/AIMS: Alzheimer's disease (AD) is a chronic neurodegenerative disease that causes memory loss and a decline in cognitive abilities. AD is the sixth leading cause of death in the USA, affecting an estimated 5 million Americans. To assess the association between multiple genetic variants and multiple measurements of structural changes in the brain, a recent study of AD used a multivariate measure of linear dependence, the RV coefficient. The authors decomposed the RV coefficient into contributions from individual variants and displayed these contributions graphically. METHODS: We investigate the properties of such a "contribution plot" in terms of an underlying linear model, and discuss shrinkage estimation of the components of the plot when the correlation signal may be sparse. RESULTS: The contribution plot is applied to simulated data and to genomic and brain imaging data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). CONCLUSIONS: The contribution plot with shrinkage estimation can reveal truly associated explanatory variables.
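The RV coefficient itself is simple to compute from column-centered data blocks; the base-R sketch below uses simulated matrices (not ADNI data) and does not reproduce the contribution decomposition.

```r
# RV coefficient between two multivariate blocks measured on the same subjects:
# RV(X, Y) = tr(XX'YY') / sqrt(tr((XX')^2) * tr((YY')^2)) for column-centered X, Y.
rv_coefficient <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(Y, center = TRUE, scale = FALSE)
  Sxy <- crossprod(X, Y)     # X'Y
  Sxx <- crossprod(X)        # X'X
  Syy <- crossprod(Y)
  sum(Sxy^2) / sqrt(sum(Sxx^2) * sum(Syy^2))
}

set.seed(1)
X <- matrix(rnorm(50 * 4), 50, 4)                                          # e.g., genetic variants
Y <- X[, 1:2] %*% matrix(rnorm(8), 2, 4) + matrix(rnorm(50 * 4), 50, 4)    # related imaging measures
rv_coefficient(X, Y)                           # clearly above the value for unrelated blocks
rv_coefficient(X, matrix(rnorm(50 * 4), 50, 4))
```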
Subjects
Alzheimer Disease/diagnostic imaging, Alzheimer Disease/genetics, Biomarkers/metabolism, Brain/diagnostic imaging, Neuroimaging, Apolipoproteins E/genetics, Computer Simulation, Genotype, Humans, Phenotype, Single Nucleotide Polymorphism/genetics
ABSTRACT
We discuss alternative estimators of the population total given a dual-frame random-digit-dial (RDD) telephone survey in which samples are selected from landline and cell phone sampling frames. The estimators are subject to sampling and nonsampling errors. To reduce sampling variability when an optimal balance of landline and cell phone samples is not feasible, we develop an application of shrinkage estimation. We also demonstrate the implications of a differential nonresponse mechanism by telephone status for survey weighting. We illustrate these ideas using data from the National Immunization Survey-Child, a large dual-frame RDD telephone survey sponsored by the Centers for Disease Control and Prevention and conducted to measure the vaccination status of American children aged 19 to 35 months.
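A textbook Hartley-type composite estimator conveys the dual-frame idea; the totals and the weight `lambda` below are made up, and this is not the NIS-Child weighting scheme.

```r
# Hartley-type composite estimator for a dual-frame (landline + cell) design.
# Domains: a = landline-only, b = cell-only, ab = covered by both frames.
# All totals below are made up; lambda is normally chosen to minimize variance,
# e.g. lambda_opt = V(Y_ab_cell) / (V(Y_ab_landline) + V(Y_ab_cell)) when the
# two overlap-domain estimates are independent.
composite_total <- function(Y_a, Y_b, Y_ab_landline, Y_ab_cell, lambda) {
  Y_a + Y_b + lambda * Y_ab_landline + (1 - lambda) * Y_ab_cell
}

composite_total(Y_a = 120000, Y_b = 90000,
                Y_ab_landline = 310000, Y_ab_cell = 295000, lambda = 0.4)
```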
Subjects
Health Surveys, Telephone, Vaccination/statistics & numerical data, Centers for Disease Control and Prevention (U.S.), Preschool Child, Female, Humans, Infant, Male, Research Design, Sampling Studies, United States
ABSTRACT
This paper discusses a number of methods for adjusting treatment effect estimates in clinical trials where differential effects in several subpopulations are suspected. In such situations, the estimates from the most extreme subpopulation are often overinterpreted. The paper focuses on the construction of simultaneous confidence intervals intended to provide a more realistic assessment of the uncertainty around these extreme results. The methods from simultaneous inference are compared with shrinkage estimates arising from Bayesian hierarchical models by discussing salient features of both approaches in a typical application.
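For the simplest case of independent subgroup estimates, the simultaneous critical value has a closed form; the base-R sketch below uses made-up subgroup effects and is only meant to show how the adjusted intervals widen.

```r
# Simultaneous 95% intervals for K independent subgroup estimates: the critical
# value c solves (2 * pnorm(c) - 1)^K = 0.95, widening each interval relative to
# the unadjusted 1.96 and tempering the most extreme subgroup result.
est <- c(A = 0.10, B = 0.35, C = -0.05, D = 0.60)   # made-up subgroup effects
se  <- c(0.15, 0.18, 0.20, 0.25)
K   <- length(est)

c_simult <- qnorm((1 + 0.95^(1 / K)) / 2)           # about 2.49 for K = 4
data.frame(estimate = est,
           lower    = est - c_simult * se,
           upper    = est + c_simult * se)
```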
Subjects
Biometry/methods, Asthma/therapy, Bayes Theorem, Phase I Clinical Trials as Topic, Confidence Intervals, Humans, Statistical Models, Selection Bias, Uncertainty
ABSTRACT
There are challenges in designing pediatric trials arising from special ethical issues and the relatively small accessible patient population. The application of conventional phase 3 trial designs to pediatrics is not realistic in some therapeutic areas. To address this issue, we propose various approaches for designing pediatric trials that incorporate data available from adult studies using James-Stein shrinkage estimation, empirical shrinkage estimation, and Bayesian methods. We also apply the concept of consistency used in multi-regional trials to pediatric trials. The performance of these methods is assessed through representative scenarios and an example using actual Type 2 diabetes mellitus (T2DM) trials.
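A minimal normal-normal borrowing sketch (not one of the specific James-Stein, empirical shrinkage, or Bayesian designs evaluated in the paper): the adult estimate acts as a prior whose variance is inflated by a hypothetical discount factor.

```r
# Minimal normal-normal borrowing of an adult estimate into a pediatric analysis.
# The adult result acts as a prior whose variance is inflated by 1/discount, so
# discount = 1 means full borrowing and discount -> 0 means no borrowing.
borrow_adult <- function(ped_est, ped_se, adult_est, adult_se, discount = 0.5) {
  prior_var <- adult_se^2 / discount
  w   <- (1 / prior_var) / (1 / prior_var + 1 / ped_se^2)   # weight on the adult prior
  est <- w * adult_est + (1 - w) * ped_est
  se  <- sqrt(1 / (1 / prior_var + 1 / ped_se^2))
  c(estimate = est, se = se, adult_weight = w)
}

# Hypothetical treatment effects (the adult trial is larger, hence more precise):
borrow_adult(ped_est = -0.30, ped_se = 0.25, adult_est = -0.55, adult_se = 0.08)
```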
Subjects
Clinical Trials as Topic/standards, Type 2 Diabetes Mellitus/therapy, Pediatrics/standards, Practice Guidelines as Topic/standards, Child, Clinical Trials as Topic/ethics, Type 2 Diabetes Mellitus/diagnosis, Type 2 Diabetes Mellitus/epidemiology, Humans, Pediatrics/ethics
ABSTRACT
Genome-wide association studies (GWAS) have detected large numbers of variants associated with complex human traits and diseases. However, the proportion of variance explained by GWAS-significant single nucleotide polymorphisms has usually been small, which has brought interest in the use of whole-genome regression (WGR) methods. However, there has been limited research on the factors that affect the prediction accuracy (PA) of WGRs when applied to human data from distantly related individuals. Here, we examine, using real human genotypes and simulated phenotypes, how trait complexity, marker-quantitative trait loci (QTL) linkage disequilibrium (LD), and the model used affect the performance of WGRs. Our results indicate that the estimated rate of missing heritability depends on the extent of marker-QTL LD but is not greatly affected by trait complexity. Regarding PA, our results indicate that: (a) under perfect marker-QTL LD, WGRs can achieve moderately high prediction accuracy, and with simple genetic architectures variable selection methods outperform shrinkage procedures; and (b) under imperfect marker-QTL LD, variable selection methods can achieve reasonably good PA with simple or moderately complex genetic architectures, but their PA deteriorates as trait complexity increases, and with highly complex traits both variable selection and shrinkage methods perform poorly. This was confirmed with an analysis of human height.
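The shrinkage-versus-variable-selection contrast can be reproduced generically with glmnet (ridge, alpha = 0, versus lasso, alpha = 1) on simulated genotypes with a simple architecture; this is not the data or the full set of WGR models used in the study.

```r
library(glmnet)

# Shrinkage (ridge) vs. variable selection (lasso) for whole-genome regression on
# simulated genotypes with few QTL, where variable selection should predict better.
set.seed(1)
n <- 1000; p <- 2000; n_qtl <- 20
X <- matrix(rbinom(n * p, 2, 0.3), n, p)                  # SNP genotypes coded 0/1/2
beta <- rep(0, p); beta[sample(p, n_qtl)] <- rnorm(n_qtl, sd = 0.5)
y <- drop(X %*% beta) + rnorm(n)

train <- sample(n, 800)
fit_ridge <- cv.glmnet(X[train, ], y[train], alpha = 0)
fit_lasso <- cv.glmnet(X[train, ], y[train], alpha = 1)

cor(predict(fit_ridge, X[-train, ], s = "lambda.min"), y[-train])   # prediction accuracy
cor(predict(fit_lasso, X[-train, ], s = "lambda.min"), y[-train])
```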
Assuntos
Doença/genética , Genoma Humano , Modelos Genéticos , Locos de Características Quantitativas , Simulação por Computador , Estudo de Associação Genômica Ampla , Humanos , Desequilíbrio de Ligação , Análise de RegressãoRESUMO
Many existing cohort studies designed to investigate health effects of environmental exposures also collect data on genetic markers. The Early Life Exposures in Mexico to Environmental Toxicants project, for instance, has been genotyping single nucleotide polymorphisms on candidate genes involved in metal and nutrient metabolism and in metabolic pathways potentially shared with the environmental exposures. Given the longitudinal nature of these cohort studies, rich exposure and outcome data are available to address novel questions regarding gene-environment interaction (G × E). Latent variable (LV) models have been used effectively for dimension reduction, helping with multiple testing and multicollinearity issues in the presence of correlated multivariate exposures and outcomes. In this paper, we first propose a modeling strategy, based on LV models, to examine the association between repeated outcome measures (e.g., child weight) and a set of correlated exposure biomarkers (e.g., prenatal lead exposure). We then construct novel tests for G × E effects within the LV framework to examine effect modification of the outcome-exposure association by genetic factors (e.g., the hemochromatosis gene). We consider two scenarios: one allowing the LV models to depend on genes and the other assuming independence between the LV models and genes. We combine the two sets of estimates by shrinkage estimation to trade off bias and efficiency in a data-adaptive way. Using simulations, we evaluate the properties of the shrinkage estimates and, in particular, demonstrate the need for this data-adaptive shrinkage given repeated outcome measures, possibly repeated exposure measures, and time-varying gene-environment associations.
Subjects
Environmental Exposure/statistics & numerical data, Gene-Environment Interaction, Statistical Models, Biostatistics/methods, Preschool Child, Computer Simulation, Female, Hemochromatosis Protein, Histocompatibility Antigens Class I/genetics, Humans, Infant, Newborn Infant, Lead Poisoning/etiology, Lead Poisoning/genetics, Longitudinal Studies, Membrane Proteins/genetics, Mexico, Genetic Models, Single Nucleotide Polymorphism, Pregnancy, Prenatal Exposure Delayed Effects/etiology, Prenatal Exposure Delayed Effects/genetics
ABSTRACT
A major challenge in cancer epidemiologic studies, especially those of rare cancers, is observing enough cases. To address this, researchers often join forces, bringing multiple studies together to achieve large sample sizes, increased power in hypothesis testing, and improved efficiency in effect estimation. Combining studies, however, renders the analysis difficult owing to heterogeneity in the pooled data. In this article, motivated by a collaborative nested case-control (NCC) study of ovarian cancer in three cohorts from the United States, Sweden, and Italy, we investigate the use of penalty-regularized partial likelihood estimation in the context of pooled NCC studies to achieve two goals. First, we propose an adaptive group lasso (agLASSO) penalized approach to simultaneously identify important variables and estimate their effects. Second, we propose a composite agLASSO penalized approach to identify variables with heterogeneous effects. Both methods are readily implemented with the group coordinate gradient descent algorithm and are shown to enjoy the oracle property. We conduct simulation studies to evaluate the performance of our proposed approaches in finite samples under various heterogeneity settings, and we apply them to the pooled ovarian cancer study.
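The proposed agLASSO methods are not sketched here; instead, the closely related (non-group) adaptive lasso for a pooled Cox model can be approximated with glmnet by supplying coefficient-specific penalty weights, as in the hypothetical example below.

```r
library(glmnet)

# Hypothetical pooled survival data (the NCC sampling design is ignored here).
# Sketch of a (non-group) adaptive lasso Cox fit via glmnet; the paper's adaptive
# group lasso and composite agLASSO are not reproduced.
set.seed(1)
n <- 800; p <- 12
X  <- matrix(rnorm(n * p), n, p)
lp <- drop(X %*% c(0.8, -0.8, 0.5, rep(0, p - 3)))
event_time <- rexp(n, rate = exp(lp))
cens_time  <- rexp(n, rate = 0.2)
y <- cbind(time = pmin(event_time, cens_time), status = as.numeric(event_time <= cens_time))

init <- coef(cv.glmnet(X, y, family = "cox", alpha = 0), s = "lambda.min")   # ridge initial fit
w    <- 1 / pmax(abs(as.numeric(init)), 1e-4)                                # adaptive penalty weights
fit  <- cv.glmnet(X, y, family = "cox", penalty.factor = w)
coef(fit, s = "lambda.min")   # nonzero rows are the selected variables
```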
Subjects
Case-Control Studies, Statistical Data Interpretation, Proportional Hazards Models, Computer Simulation, Female, Humans, Italy/epidemiology, Likelihood Functions, Ovarian Neoplasms/epidemiology, Sweden/epidemiology, United States/epidemiology
ABSTRACT
While there has been extensive research developing gene-environment interaction (GEI) methods in case-control studies, little attention has been given to sparse and efficient modeling of GEI in longitudinal studies. In a two-way table for GEI with rows and columns as categorical variables, a conventional saturated interaction model involves estimation of a specific parameter for each cell, with constraints ensuring identifiability. The estimates are unbiased but are potentially inefficient because the number of parameters to be estimated can grow quickly with increasing categories of row/column factors. On the other hand, Tukey's one-degree-of-freedom model for non-additivity treats the interaction term as a scaled product of row and column main effects. Because of the parsimonious form of interaction, the interaction estimate leads to enhanced efficiency, and the corresponding test could lead to increased power. Unfortunately, Tukey's model gives biased estimates and low power if the model is misspecified. When screening multiple GEIs where each genetic and environmental marker may exhibit a distinct interaction pattern, a robust estimator for interaction is important for GEI detection. We propose a shrinkage estimator for interaction effects that combines estimates from both Tukey's and saturated interaction models and use the corresponding Wald test for testing interaction in a longitudinal setting. The proposed estimator is robust to misspecification of interaction structure. We illustrate the proposed methods using two longitudinal studies-the Normative Aging Study and the Multi-ethnic Study of Atherosclerosis.
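The two interaction models being combined can be illustrated in base R on simulated (non-longitudinal) data: the saturated model spends one degree of freedom per cell, while Tukey's one-degree-of-freedom term is commonly implemented by adding the squared fitted values of the additive model. The shrinkage combination itself is not reproduced.

```r
# Simulated two-way G x E layout: saturated interaction model vs. Tukey's
# one-degree-of-freedom non-additivity term (squared fitted values of the
# additive model).
set.seed(1)
d <- expand.grid(G = factor(1:3), E = factor(1:4), rep = 1:10)
d$y <- with(d, 0.5 * as.numeric(G) + 0.3 * as.numeric(E) +
              0.2 * (as.numeric(G) - 2) * (as.numeric(E) - 2.5)) + rnorm(nrow(d), sd = 0.5)

fit_additive  <- lm(y ~ G + E, data = d)
fit_saturated <- lm(y ~ G * E, data = d)
anova(fit_additive, fit_saturated)              # (3 - 1) * (4 - 1) = 6 interaction df

d$tukey <- fitted(fit_additive)^2               # Tukey's single non-additivity term
summary(lm(y ~ G + E + tukey, data = d))$coefficients["tukey", ]   # 1-df interaction test
```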
Subjects
Atherosclerosis/etiology, Atherosclerosis/genetics, Environmental Exposure/adverse effects, Gene-Environment Interaction, Lead/adverse effects, Aged, Aged 80 and over, Aging/physiology, Atherosclerosis/ethnology, Bone and Bones/drug effects, Bone and Bones/metabolism, Computer Simulation, Environmental Exposure/statistics & numerical data, Ethnicity/genetics, Ethnicity/statistics & numerical data, Female, Humans, Iron/metabolism, Lead/metabolism, Least-Squares Analysis, Likelihood Functions, Longitudinal Studies, Male, Middle Aged, Genetic Models, United States/epidemiology, United States Department of Veterans Affairs
ABSTRACT
Asking direct questions about sensitive traits in face-to-face surveys is an intricate issue, and one solution is the randomized response technique (RRT). As the most widely used indirect questioning technique for obtaining truthful data on sensitive traits in survey sampling, RRT has been applied in a variety of fields, including behavioral science, socio-economics, psychology, epidemiology, biomedicine, criminology, data masking, public health engineering, conservation studies, and ecological studies. This paper explores methods for strengthening the randomized response technique with additional information relevant to the parameter of interest. Specifically, we propose hybrid estimators that are more efficient than the existing estimator based on the Kuk (1990) [31] family of randomized response models. The proposed estimators incorporate pertinent information available from either historical records or expert opinion. When auxiliary information is available, the regression-cum-ratio estimator is found to best enhance estimation under the Kuk (1990) [31] model, while Thompson (1968) [49] shrinkage estimation yields a more precise and accurate estimator of the sensitive proportion. To support the mathematical findings, a detailed numerical investigation of comparative performance is also conducted; the performance analysis provides overwhelming evidence in favor of the proposed strategies.
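For intuition about RRT estimation, the base-R simulation below uses Warner's classic model rather than Kuk's (1990) card-based model or the proposed hybrid estimators.

```r
# Warner's classic randomized response model: each respondent answers the
# sensitive statement with probability p and its negation with probability 1 - p,
# so P(yes) = (1 - p) + (2p - 1) * pi and pi_hat = (lambda_hat + p - 1) / (2p - 1).
set.seed(1)
n       <- 2000
pi_true <- 0.15                    # true prevalence of the sensitive trait
p       <- 0.7                     # probability the sensitive statement is drawn

trait    <- rbinom(n, 1, pi_true)
question <- rbinom(n, 1, p)        # 1 = "I have the trait", 0 = "I do not have the trait"
answer   <- ifelse(question == 1, trait, 1 - trait)

lambda_hat <- mean(answer)
pi_hat  <- (lambda_hat + p - 1) / (2 * p - 1)
var_hat <- lambda_hat * (1 - lambda_hat) / (n * (2 * p - 1)^2)   # Warner's variance estimator
c(pi_hat = pi_hat, se = sqrt(var_hat))
```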
ABSTRACT
Basket trials allow simultaneous evaluation of a single therapy across multiple cancer types or subtypes of the same cancer. Since the same treatment is tested across all baskets, it may be desirable to borrow information across them to improve the statistical precision and power in estimating and detecting the treatment effects in different baskets. We review recent developments in Bayesian methods for the design and analysis of basket trials, focusing on the mechanism of information borrowing. We explain the common components of these methods, such as a prior model for the treatment effects that embodies an assumption of exchangeability. We also discuss the distinct features of these methods that lead to different degrees of borrowing. Through simulation studies, we demonstrate the impact of information borrowing on the operating characteristics of these methods and discuss its broader implications for drug development. Examples of basket trials are presented in both phase I and phase II settings.
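A minimal empirical-Bayes illustration of borrowing across baskets (far simpler than the Bayesian designs reviewed): basket response rates share a common Beta prior fitted crudely by the method of moments, and each basket estimate is shrunk toward the overall rate.

```r
# Crude empirical-Bayes borrowing across baskets: responses in basket k are
# Binomial(n_k, p_k) with p_k ~ Beta(a, b); (a, b) is fitted by the method of
# moments to the observed rates and each basket is estimated by its posterior mean.
resp <- c(1, 4, 6, 2, 9)          # responders per basket (made-up)
n    <- c(10, 15, 20, 12, 25)
phat <- resp / n

m <- mean(phat); v <- var(phat)
ab_sum <- max(m * (1 - m) / v - 1, 0.1)   # crude moment-based Beta prior "sample size"
a <- m * ab_sum; b <- (1 - m) * ab_sum

posterior_mean <- (resp + a) / (n + a + b)    # basket rates shrunk toward the overall mean
round(rbind(raw = phat, borrowed = posterior_mean), 3)
```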
ABSTRACT
We propose a generalized double Pareto prior for Bayesian shrinkage estimation and inference in linear models. The prior can be obtained via a scale mixture of Laplace or normal distributions, forming a bridge between the Laplace and normal-Jeffreys priors. While it has a spike at zero like the Laplace density, it also has Student's t-like tail behavior. Bayesian computation is straightforward via a simple Gibbs sampling algorithm. We investigate the properties of the maximum a posteriori estimator, as sparse estimation plays an important role in many problems, reveal connections with some well-established regularization procedures, and show some asymptotic results. The performance of the prior is tested through simulations and an application.
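The heavy-tailed behavior produced by mixing a Laplace scale over a Gamma distribution can be seen by simulation; the mixing distribution and parameter values below are placeholders, not the paper's exact generalized double Pareto parameterization.

```r
# Simulation of the scale-mixture idea: mixing the rate of a Laplace density over
# a Gamma distribution keeps a peak at zero but produces polynomial, Student's
# t-like tails. Parameters are placeholders.
set.seed(1)
n <- 1e6
rate    <- rgamma(n, shape = 3, rate = 2)                              # Gamma-mixed Laplace rate
mixture <- rexp(n, rate = rate) * sample(c(-1, 1), n, replace = TRUE)
laplace <- rexp(n, rate = 1.5)  * sample(c(-1, 1), n, replace = TRUE)  # fixed-rate Laplace

quantile(abs(mixture), c(0.5, 0.99, 0.999))   # similar scale near zero, far heavier tails
quantile(abs(laplace), c(0.5, 0.99, 0.999))
```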
ABSTRACT
This article describes some potential uses of Bayesian estimation for time-series and panel data models by incorporating information from prior probabilities (i.e., priors) in addition to observed data. Drawing on econometrics and other literatures, we illustrate the use of informative "shrinkage" or "small variance" priors (including so-called "Minnesota priors") while extending prior work on the general cross-lagged panel model (GCLM). Using a panel dataset of national income and subjective well-being (SWB), we describe three key benefits of these priors. First, they shrink parameter estimates toward zero or toward each other for time-varying parameters, which lends additional support for an income-to-SWB effect that is not supported under maximum likelihood (ML). Second, these priors increase model parsimony and the stability of estimates (keeping them within more reasonable bounds) and thus improve out-of-sample predictions and interpretability, meaning the estimated effects should also be more trustworthy than under ML. Third, these priors allow the estimation of models that are under-identified under ML, permitting higher-order lagged effects and time-varying parameters that cannot be estimated from observed data alone. In conclusion, we note some of the responsibilities that come with the use of priors which, departing from typical commentaries on their scientific applications, we describe as involving reflection on how best to apply modeling tools to address matters of worldly concern.
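A minimal sketch of a "small variance" prior using brms, with a single lag and a simulated stand-in data set rather than the GCLM and panel data analyzed in the article:

```r
library(brms)

# Simulated stand-in data; `swb_lag` and `income_lag` mimic lagged panel predictors.
set.seed(1)
d <- data.frame(swb_lag = rnorm(200), income_lag = rnorm(200))
d$swb <- 0.4 * d$swb_lag + 0.1 * d$income_lag + rnorm(200)

# "Small variance" shrinkage prior on the lagged coefficients, a simplified
# stand-in for the Minnesota-style priors discussed in the article.
fit <- brm(swb ~ swb_lag + income_lag, data = d,
           prior = prior(normal(0, 0.1), class = "b"))
summary(fit)   # lagged effects are pulled toward zero relative to a flat-prior/ML fit
```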