ABSTRACT
It is becoming increasingly common for researchers to incorporate external information from large studies to improve the accuracy of statistical inference, rather than relying on a modestly sized data set collected internally. With some new predictors only available internally, we aim to build improved regression models based on individual-level data from an "internal" study while incorporating summary-level information from "external" models. We propose a meta-analysis framework along with two weighted estimators, constructed as composites of empirical Bayes estimators, which combine the estimates from different external models. The proposed framework is flexible and robust in that (i) it can incorporate external models that use a slightly different set of covariates; (ii) it identifies the most relevant external information and diminishes the influence of information that is less compatible with the internal data; and (iii) it balances the bias-variance trade-off while preserving most of the efficiency gain. The proposed estimators are more efficient than the naïve analysis of the internal data and other naïve combinations of external estimators.
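A minimal sketch in base R of the kind of shrink-and-combine step such composite empirical Bayes estimators build on, not the paper's actual estimator: each external estimate is shrunk toward the internal one according to how compatible it appears, and the shrunken estimates are then averaged with the same compatibility weights (all names and numbers below are hypothetical).

```r
# Hypothetical sketch: combine an internal estimate with K external estimates
# via empirical-Bayes-type shrinkage, then average the shrunken estimators.
eb_combine <- function(beta_int, var_int, beta_ext) {
  # Shrinkage weight per external model: the larger the apparent discrepancy
  # with the internal estimate, the less that external model is trusted.
  w <- var_int / (var_int + (beta_int - beta_ext)^2)
  beta_eb <- w * beta_ext + (1 - w) * beta_int   # one EB estimator per external model
  # Final estimator: a weighted composite of the EB estimators.
  sum(w * beta_eb) / sum(w)
}

eb_combine(beta_int = 0.50, var_int = 0.04, beta_ext = c(0.45, 0.48, 0.90))
```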
Subject(s)
Statistical Models, Humans, Bayes Theorem, Statistical Data Interpretation, Bias
ABSTRACT
BACKGROUND: Head and neck cancer (HNC) has low 5-year survival, and evidence-based recommendations for tertiary prevention are lacking. Aspirin improves outcomes for cancers at other sites, but its role in HNC tertiary prevention remains understudied. METHODS: HNC patients were recruited in the University of Michigan Head and Neck Cancer Specialized Program of Research Excellence (SPORE) from 2003 to 2014. Aspirin data were collected through medical record review; outcomes (overall mortality, HNC-specific mortality, and recurrence) were collected through medical record review, the Social Security Death Index, or LexisNexis. Cox proportional hazards models were used to evaluate the associations between aspirin use at diagnosis (yes/no) and HNC outcomes. RESULTS: We observed no statistically significant associations between aspirin and cancer outcomes in our HNC patient cohort (n = 1161) (HNC-specific mortality: HR = 0.91, 95% CI = 0.68-1.21; recurrence: HR = 0.94, 95% CI = 0.73-1.19). In analyses stratified by anatomic site, HPV status, and disease stage, we observed no association in any stratum examined, with the possible exception of a lower risk of recurrence in oropharynx patients (HR = 0.60, 95% CI 0.35-1.04). CONCLUSIONS: Our findings do not support a protective association between aspirin use and cancer-specific death or recurrence in HNC patients, with the possible exception of a lower risk of recurrence in oropharynx patients.
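For readers who want to reproduce this type of analysis, a minimal sketch of a Cox proportional hazards model of the form described, on simulated stand-in data (all variable names are hypothetical; the actual cohort variables come from medical record review):

```r
library(survival)

# Simulated stand-in data; 'aspirin' plays the role of aspirin use at diagnosis.
set.seed(1)
d <- data.frame(time    = rexp(300, rate = 0.1),
                event   = rbinom(300, 1, 0.4),
                aspirin = rbinom(300, 1, 0.3),
                age     = rnorm(300, 60, 10),
                stage   = factor(sample(1:4, 300, replace = TRUE)))

fit <- coxph(Surv(time, event) ~ aspirin + age + stage, data = d)
summary(fit)   # hazard ratio and 95% CI for aspirin use at diagnosis
cox.zph(fit)   # check the proportional hazards assumption
```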
Subject(s)
Aspirin, Head and Neck Neoplasms, Humans, Aspirin/therapeutic use, Head and Neck Neoplasms/drug therapy, Proportional Hazards Models
ABSTRACT
PURPOSE: The Surveillance, Epidemiology, and End Results (SEER) cancer registries provide information about survival duration and cause of death for cancer patients. Baseline demographic and tumor characteristics such as age, sex, race, year of diagnosis, and tumor stage can inform the expected survival time of patients, but their associations with survival may not be constant over the post-diagnosis period. METHODS: Using SEER data, we examined whether the associations of patient and tumor characteristics with survival vary over time, and we assessed how these relationships differ across 14 cancer sites. Standard Cox proportional hazards models were extended to allow for time-varying associations and incorporated into a competing-risks framework, separately modeling cancer-specific and other-cause deaths. For each cancer site and each of the five factors, we estimated the relative hazard ratio and absolute hazard over time in the presence of competing risks. RESULTS: Our comprehensive consideration of patient and tumor characteristics when estimating time-varying hazards showed that the associations of age, tumor stage at diagnosis, and race/ethnicity with risk of death (cancer-specific and other-cause) change over time for many cancers; sex and year of diagnosis exhibit some time-varying patterns as well. Stage at diagnosis had the largest associations with survival. CONCLUSION: These findings suggest that proportional hazards assumptions are often violated when examining associations of patient characteristics with cancer survival post-diagnosis. We discuss several interesting results where the relative hazards are time-varying and suggest possible interpretations. Given the time-varying associations of several important covariates with survival after cancer diagnosis observed in this pan-cancer analysis, the plausibility of the proportional hazards assumption, and the interpretation of estimates when it fails, should be considered in survival analyses, as flawed inference may have implications for cancer care and policy.
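One standard way to fit such time-varying associations is a Cox model with a time-transform term. The sketch below uses the packaged veteran data rather than SEER, and the same cause-specific model would be fit once per cause of death, censoring the competing cause; it is an illustration of the technique, not the paper's scalable algorithm.

```r
library(survival)

# Illustration on the packaged 'veteran' data (not SEER): a tt() term lets the
# effect of Karnofsky score vary with follow-up time. For competing risks, the
# cause-specific model is fit separately for each cause, censoring the other.
fit_tv <- coxph(Surv(time, status) ~ age + celltype + tt(karno),
                data = veteran,
                tt = function(x, t, ...) cbind(x, x * log(t + 1)))
summary(fit_tv)  # first tt coefficient: baseline effect; second: change with log(time)
```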
Asunto(s)
Neoplasias , Programa de VERF , Humanos , Masculino , Femenino , Neoplasias/mortalidad , Neoplasias/epidemiología , Neoplasias/patología , Persona de Mediana Edad , Anciano , Modelos de Riesgos Proporcionales , Estados Unidos/epidemiología , Tasa de Supervivencia , Adulto , Factores de TiempoRESUMEN
We aim to estimate parameters in a generalized linear model (GLM) for a binary outcome when, in addition to the raw data from the internal study, more than 1 external study provides summary information in the form of parameter estimates from fitting GLMs with varying subsets of the internal study covariates. We propose an adaptive penalization method that exploits the external summary information and gains efficiency for estimation, and that is both robust and computationally efficient. The robust property comes from exploiting the relationship between parameters of a GLM and parameters of a GLM with omitted covariates and from downweighting external summary information that is less compatible with the internal data through a penalization. The computational burden associated with searching for the optimal tuning parameter for the penalization is reduced by using adaptive weights and by using an information criterion when searching for the optimal tuning parameter. Simulation studies show that the proposed estimator is robust against various types of population distribution heterogeneity and also gains efficiency compared to direct maximum likelihood estimation. The method is applied to improve a logistic regression model that predicts high-grade prostate cancer making use of parameter estimates from 2 external models.
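A highly simplified sketch of the penalization idea on simulated toy data: the internal logistic log-likelihood is penalized toward external coefficient estimates for the shared covariates, with fixed adaptive weights that downweight incompatible external models. The actual method additionally maps full-model parameters to each reduced external model and tunes the penalty with an information criterion; both steps are omitted here, and a smooth squared penalty is used so optim() applies directly.

```r
# Toy data and hypothetical external summaries (ext$idx: shared coefficient
# positions, ext$est: external estimates).
set.seed(1)
n <- 500
X <- cbind(1, matrix(rnorm(n * 3), n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, 0.5, 0.2))))
ext_list <- list(list(idx = 2, est = 0.95),   # roughly compatible external estimate
                 list(idx = 3, est = 1.80))   # incompatible external estimate

beta_init <- coef(glm(y ~ X - 1, family = binomial))
w_adapt <- sapply(ext_list, function(e) 1 / sum((beta_init[e$idx] - e$est)^2))

neg_pen_loglik <- function(beta, lambda = 5) {
  eta <- drop(X %*% beta)
  nll <- -sum(y * eta - log(1 + exp(eta)))
  pen <- sum(mapply(function(e, w) w * sum((beta[e$idx] - e$est)^2), ext_list, w_adapt))
  nll + lambda * pen
}

beta_hat <- optim(beta_init, neg_pen_loglik, method = "BFGS")$par
cbind(mle = beta_init, penalized = beta_hat)   # compatible info pulls; incompatible barely does
```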
Asunto(s)
Modelos Estadísticos , Masculino , Humanos , Modelos Lineales , Análisis de Regresión , Funciones de Verosimilitud , Modelos Logísticos , Simulación por ComputadorRESUMEN
We consider the setting where (1) an internal study builds a linear regression model for prediction based on individual-level data, (2) some external studies have fitted similar linear regression models that use only subsets of the covariates and provide coefficient estimates for the reduced models without individual-level data, and (3) there is heterogeneity across these study populations. The goal is to integrate the external model summary information into fitting the internal model to improve prediction accuracy. We adapt the James-Stein shrinkage method to propose estimators that are no worse and are oftentimes better in the prediction mean squared error after information integration, regardless of the degree of study population heterogeneity. We conduct comprehensive simulation studies to investigate the numerical performance of the proposed estimators. We also apply the method to enhance a prediction model for patella bone lead level in terms of blood lead level and other covariates by integrating summary information from published literature.
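A minimal sketch of James-Stein-type shrinkage toward an externally informed target, on simulated data: the internal OLS estimate is pulled toward a target coefficient vector (here a hard-coded stand-in for what would be derived from the external reduced-model coefficients), with a positive-part shrinkage factor driven by the standardized distance between the two.

```r
# Simulated illustration; the target vector is a stand-in for an estimate
# constructed from external reduced-model summary coefficients.
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 3), n))
y <- drop(X %*% c(1, 0.5, -0.5, 0.2)) + rnorm(n)

fit_int  <- lm(y ~ X - 1)
beta_int <- coef(fit_int)
V        <- vcov(fit_int)

beta_target <- c(1, 0.45, -0.55, 0)   # stand-in for the externally informed target

# Positive-part James-Stein-type shrinkage: shrink more when the internal
# estimate sits close to the target relative to its sampling variability.
d2 <- drop(t(beta_int - beta_target) %*% solve(V) %*% (beta_int - beta_target))
w  <- max(0, 1 - (length(beta_int) - 2) / d2)
beta_js <- beta_target + w * (beta_int - beta_target)
beta_js
```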
Asunto(s)
Simulación por Computador , Humanos , Modelos Lineales , Biometría/métodos , Plomo/sangre , Rótula , Modelos Estadísticos , Interpretación Estadística de DatosRESUMEN
Joint models for longitudinal and time-to-event data are often employed to calculate dynamic individualized predictions used in numerous applications of precision medicine. Two components of joint models that influence the accuracy of these predictions are the shape of the longitudinal trajectories and the functional form linking the longitudinal outcome history to the hazard of the event. Finding a single well-specified model that produces accurate predictions for all subjects and follow-up times can be challenging, especially when considering multiple longitudinal outcomes. In this work, we use the concept of super learning and avoid selecting a single model. In particular, we specify a weighted combination of the dynamic predictions calculated from a library of joint models with different specifications. The weights are selected to optimize a predictive accuracy metric using V-fold cross-validation. We use as predictive accuracy measures the expected quadratic prediction error and the expected predictive cross-entropy. In a simulation study, we found that the super learning approach produces results very similar to the Oracle model, which was the model with the best performance in the test datasets. All proposed methodology is implemented in the freely available R package JMbayes2.
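The weight-selection step can be sketched separately from the joint-model fitting (which JMbayes2 handles): given cross-validated predicted event probabilities at a horizon from L candidate models, choose convex weights minimizing the expected quadratic prediction error. The predictions below are mocked up purely for illustration.

```r
# Mock cross-validated predictions from L candidate models for a binary
# event-by-horizon indicator y; the real workflow obtains these from joint
# models fitted with JMbayes2 under V-fold cross-validation.
set.seed(1)
n <- 500; L <- 3
y <- rbinom(n, 1, 0.3)
P <- cbind(runif(n), plogis(rnorm(n, qlogis(0.3))), rbeta(n, 2, 5))

brier_of_weights <- function(theta) {
  w <- exp(theta) / sum(exp(theta))   # softmax keeps weights on the simplex
  mean((y - drop(P %*% w))^2)         # expected quadratic prediction error
}

opt <- optim(rep(0, L), brier_of_weights)
w_hat <- exp(opt$par) / sum(exp(opt$par))
round(w_hat, 3)                       # super-learning weights for the L models
```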
Asunto(s)
Medicina de Precisión , Humanos , Simulación por Computador , Medicina de Precisión/métodosRESUMEN
Biomedical data often exhibit jumps or abrupt changes. For example, women's basal body temperature may jump at ovulation, menstruation, implantation, and miscarriage. These sudden changes make these data challenging to model: many methods will oversmooth the sharp changes or overfit in response to measurement error. We develop horseshoe process regression (HPR) to address this problem. We define a horseshoe process as a stochastic process in which each increment is horseshoe-distributed. We use the horseshoe process as a nonparametric Bayesian prior for modeling a potentially nonlinear association between an outcome and its continuous predictor, which we implement via Stan and in the R package HPR. We provide guidance and extensions to advance HPR's use in applied practice: we introduce a Bayesian imputation scheme to allow for interpolation at unobserved values of the predictor within the HPR; include additional covariates via a partial linear model framework; and allow for monotonicity constraints. We find that HPR performs well when fitting functions that have sharp changes. We apply HPR to model women's basal body temperatures over the course of the menstrual cycle.
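To see what the prior implies, here is an illustrative draw from a horseshoe process on a grid: each increment is normal with a half-Cauchy local scale times a global scale, so the path stays mostly flat but allows occasional sharp jumps. The global scale value is an arbitrary choice for this sketch; model fitting itself is done via Stan in the HPR package.

```r
# One prior draw from a horseshoe process over a grid of the predictor.
set.seed(2)
m      <- 200                     # grid points of the continuous predictor
tau    <- 0.05                    # global scale (assumed value for illustration)
lambda <- abs(rcauchy(m - 1))     # half-Cauchy local scales, one per increment
incr   <- rnorm(m - 1, 0, tau * lambda)   # horseshoe-distributed increments
f      <- c(0, cumsum(incr))      # cumulative path: flat stretches with sharp jumps
plot(f, type = "l", xlab = "grid index", ylab = "f")
```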
Asunto(s)
Temperatura Corporal , Ciclo Menstrual , Femenino , Humanos , Teorema de Bayes , Ciclo Menstrual/fisiología , Menstruación , Modelos LinealesRESUMEN
A common practice in clinical trials is to evaluate a treatment effect on an intermediate outcome when the true outcome of interest would be difficult or costly to measure. We consider how to validate intermediate outcomes in a causally valid way when the trial outcomes are time-to-event. Using counterfactual outcomes, those that would be observed if the counterfactual treatment had been given, the causal association paradigm assesses the relationship of the treatment effect on the surrogate outcome with the treatment effect on the true, primary outcome. In particular, we propose illness-death models to accommodate the censored and semicompeting risk structure of survival data. The proposed causal version of these models involves estimable and counterfactual frailty terms. Via these multistate models, we characterize what a valid surrogate would look like using a causal effect predictiveness plot. We evaluate the estimation properties of a Bayesian method using Markov chain Monte Carlo and assess sensitivity to our model assumptions. Our motivating data source is a localized prostate cancer clinical trial where the two survival outcomes are time to distant metastasis and time to death.
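For orientation, a generic shared-frailty illness-death model specifies three transition hazards, with state 0 = randomization, 1 = the surrogate event (e.g., distant metastasis), and 2 = death; the notation below is generic rather than the paper's causal parameterization, which introduces partly counterfactual, arm-specific frailties.

```latex
\lambda_{01}(t \mid \gamma) = \gamma\,\lambda_{01,0}(t), \qquad
\lambda_{02}(t \mid \gamma) = \gamma\,\lambda_{02,0}(t), \qquad
\lambda_{12}(t \mid t_1, \gamma) = \gamma\,\lambda_{12,0}(t \mid t_1),
```

where γ is a subject-level frailty that induces dependence between the semicompeting events and t_1 is the time of the surrogate event.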
Asunto(s)
Fragilidad , Modelos Estadísticos , Humanos , Teorema de Bayes , BiomarcadoresRESUMEN
Valid surrogate endpoints S can be used as a substitute for a true outcome of interest T to measure treatment efficacy in a clinical trial. We propose a causal inference approach to validate a surrogate by incorporating longitudinal measurements of the true outcomes using a mixed modeling approach, and we define models and quantities for validation that may vary across the study period using principal surrogacy criteria. We consider a surrogate-dependent treatment efficacy curve that allows us to validate the surrogate at different time points. We extend these methods to accommodate a delayed-start treatment design where all patients eventually receive the treatment. Not all parameters are identified in the general setting. We apply a Bayesian approach for estimation and inference, utilizing more informative prior distributions for selected parameters. We consider the sensitivity of these prior assumptions as well as assumptions of independence among certain counterfactual quantities conditional on pretreatment covariates to improve identifiability. We examine the frequentist properties (bias of point and variance estimates, credible interval coverage) of a Bayesian imputation method. Our work is motivated by a clinical trial of a gene therapy where the functional outcomes are measured repeatedly throughout the trial.
Subject(s)
Statistical Models, Humans, Bayes Theorem, Biomarkers, Treatment Outcome, Causality
ABSTRACT
Consider the setting where (i) individual-level data are collected to build a regression model for the association between an event of interest and certain covariates, and (ii) some risk calculators predicting the risk of the event using less detailed covariates are available, possibly as algorithmic black boxes with little information available about how they were built. We propose a general empirical-likelihood-based framework to integrate the rich auxiliary information contained in the calculators into fitting the regression model, to make the estimation of regression parameters more efficient. Two methods are developed, one using working models to extract the calculator information and one making a direct use of calculator predictions without working models. Theoretical and numerical investigations show that the calculator information can substantially reduce the variance of regression parameter estimation. As an application, we study the dependence of the risk of high grade prostate cancer on both conventional risk factors and newly identified molecular biomarkers by integrating information from the Prostate Biopsy Collaborative Group (PBCG) risk calculator, which was built based on conventional risk factors alone.
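Schematically, and in generic notation not taken from the paper, empirical-likelihood information integration maximizes the nonparametric likelihood of the internal data subject to moment constraints that encode agreement with the external calculator:

```latex
\max_{\beta,\; p_1,\dots,p_n} \;\sum_{i=1}^{n} \log p_i
\quad \text{subject to} \quad
p_i \ge 0, \qquad \sum_{i=1}^{n} p_i = 1, \qquad \sum_{i=1}^{n} p_i\, g(Y_i, X_i; \beta) = 0,
```

where g stacks the score functions of the internal regression model with constraints requiring the calculator's risk predictions to be calibrated under the fitted model; the two proposed methods differ in how the calculator information is turned into such constraints (through working models or directly from the calculator's predictions).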
ABSTRACT
Dynamic prediction uses patient information collected during follow-up to produce individualized survival predictions at given time points beyond treatment or diagnosis. This allows clinicians to obtain updated predictions of a patient's prognosis that can be used in making personalized treatment decisions. Two commonly used approaches for dynamic prediction are landmarking and joint modeling. Landmarking does not constitute a comprehensive probability model, and joint modeling often requires strong distributional assumptions and computationally intensive methods for estimation. We introduce an alternative approximate approach for dynamic prediction that aims to overcome the limitations of both methods while achieving good predictive performance. We separately specify the marker and failure time distributions conditional on surviving up to a prediction time of interest and use standard variable selection and goodness-of-fit techniques to identify the best-fitting models. Taking advantage of its analytic tractability and easy two-stage estimation, we use a Gaussian copula to link the marginal distributions smoothly at each prediction time with an association function. With simulation studies, we examine the proposed method's performance. We illustrate its use for dynamic prediction in an application to predicting death for heart valve transplant patients using longitudinal left ventricular mass index information.
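The copula step itself is straightforward to sketch: given the marginal conditional distributions at prediction time s (a marker CDF F_y and an event-time CDF F_t, both conditional on surviving to s) and a Gaussian-copula correlation rho, the updated survival prediction has a closed form. The marginals below are toy placeholders; in practice they come from the separately fitted, best-fitting models.

```r
# Sketch of the Gaussian-copula update (assumed notation): return
# P(T > u | Y = y, T > s) from the conditional marginals and correlation rho.
dynamic_surv <- function(u, y, F_t, F_y, rho) {
  zt <- qnorm(F_t(u))   # probit transform of the conditional event-time CDF
  zy <- qnorm(F_y(y))   # probit transform of the conditional marker CDF
  1 - pnorm((zt - rho * zy) / sqrt(1 - rho^2))
}

# Toy marginals for illustration only:
dynamic_surv(u = 5, y = 1.2,
             F_t = function(t) pexp(t, rate = 0.1),
             F_y = function(y) pnorm(y, mean = 1, sd = 0.5),
             rho = 0.4)
```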
Subject(s)
Statistical Models, Biomarkers/analysis, Computer Simulation, Humans, Normal Distribution, Probability, Prognosis
ABSTRACT
The goal in personalized medicine is to individualize treatment using patient characteristics and improve health outcomes. Selection of the optimal dose must balance the effect of dose on both treatment efficacy and toxicity outcomes. We consider a setting with one binary efficacy and one binary toxicity outcome. The goal is to find the optimal dose for each patient using clinical features and biomarkers from an available dataset. We propose to use flexible machine learning methods, such as random forest and Gaussian process models, to build models for efficacy and toxicity depending on dose and biomarkers. A copula is used to model the joint distribution of the two outcomes, and the estimates are constrained to have non-decreasing dose-efficacy and dose-toxicity relationships. Numerical utilities are elicited from clinicians for each potential bivariate outcome. For each patient, the optimal dose is chosen to maximize the posterior mean of the utility function. We also propose alternative approaches to optimal dose selection that add toxicity-based constraints or account for the uncertainty in the estimation of the utility function. The proposed methods are evaluated in a simulation study comparing expected utility outcomes under various estimated optimal dose rules. Gaussian process models tended to have better performance than random forest. Enforcing monotonicity during modeling provided small benefits. Whether and how correlation between efficacy and toxicity was modeled had little effect on performance. The proposed methods are illustrated with a study of patients with liver cancer treated with stereotactic body radiation therapy.
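A schematic of the dose-selection step, with toy probability models and utilities standing in for the fitted random forest or Gaussian process models and the clinician-elicited values, and assuming independence between efficacy and toxicity for simplicity:

```r
# Expected utility of a dose for a patient with covariates x, given models for
# P(efficacy) and P(toxicity) and elicited utilities U for the four outcomes.
expected_utility <- function(dose, x, p_eff, p_tox, U) {
  pe <- p_eff(dose, x); pt <- p_tox(dose, x)
  pe * (1 - pt) * U["eff_notox"] + pe * pt * U["eff_tox"] +
    (1 - pe) * (1 - pt) * U["noeff_notox"] + (1 - pe) * pt * U["noeff_tox"]
}

U <- c(eff_notox = 100, eff_tox = 40, noeff_notox = 30, noeff_tox = 0)  # toy utilities
doses <- seq(1, 5, by = 0.1)
eu <- sapply(doses, expected_utility, x = c(biomarker = 0.8),
             p_eff = function(d, x) plogis(-2 + 0.6 * d + x["biomarker"]),
             p_tox = function(d, x) plogis(-4 + 0.8 * d),
             U = U)
doses[which.max(eu)]   # estimated optimal dose for this patient
```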
Subject(s)
Machine Learning, Biomarkers, Computer Simulation, Humans, Normal Distribution, Treatment Outcome
ABSTRACT
Survival modeling with time-varying coefficients has proven useful in analyzing time-to-event data with one or more distinct failure types. When studying the cause-specific etiology of breast and prostate cancers using the large-scale data from the Surveillance, Epidemiology, and End Results (SEER) Program, we encountered two major challenges that existing methods for estimating time-varying coefficients cannot tackle. First, these methods, dependent on expanding the original data in a repeated measurement format, result in formidable time and memory consumption as the sample size escalates to over one million. In this case, even a well-configured workstation cannot accommodate their implementations. Second, when the large-scale data under analysis include binary predictors with near-zero variance (e.g., only 0.6% of patients in our SEER prostate cancer data had tumors regional to the lymph nodes), existing methods suffer from numerical instability due to ill-conditioned second-order information. The estimation accuracy deteriorates further with multiple competing risks. To address these issues, we propose a proximal Newton algorithm with a shared-memory parallelization scheme and tests of significance and nonproportionality for the time-varying effects. A simulation study shows that our scalable approach reduces the time and memory costs by orders of magnitude and enjoys improved estimation accuracy compared with alternative approaches. Applications to the SEER cancer data demonstrate the real-world performance of the proximal Newton algorithm.
Subject(s)
Prostatic Neoplasms, Algorithms, Humans, Male, Prostatic Neoplasms/epidemiology, SEER Program, Sample Size
ABSTRACT
Head and neck squamous cell carcinoma (HNSCC) is a morbid cancer with poor outcomes. Statins possess anticancer properties such as immunomodulatory and anti-inflammatory effects. The objective of our study was to estimate the association between statin use among untreated HNSCC patients and overall death, disease-specific death, and recurrence. HNSCC patients were recruited to participate in the University of Michigan Head and Neck Cancer Specialized Program of Research Excellence (SPORE) from 2003 to 2014. Statin use data were collected through medical record review. Participants were considered statin users if they used a statin at or after diagnosis. Outcome data were collected through medical record review, the Social Security Death Index, or LexisNexis. Our analytic cohort included 1638 participants. Cox proportional hazards models were used to estimate the association between ever statin use and HNSCC outcomes. Statin use was seen in 36.0% of participants. We observed a statistically significant inverse association between ever using a statin and overall death (HR = 0.75, 95% CI = 0.63-0.88) and HNSCC-specific death (HR = 0.79, 95% CI = 0.63-0.99), and a nonstatistically significant inverse association for recurrence (HR = 0.85, 95% CI = 0.70-1.04). When investigating the association between statin use and HNSCC outcomes using interaction terms between statin use and human papillomavirus (HPV) status, statistically significant interactions for HNSCC-specific death and recurrence were identified (HNSCC-specific death: HPV-positive HR = 0.41, 95% CI = 0.21-0.84; HPV-negative HR = 1.04, 95% CI = 0.71-1.51; p-interaction = 0.02; recurrence: HPV-positive HR = 0.49, 95% CI = 0.29-0.84; HPV-negative HR = 1.03, 95% CI = 0.74-1.43; p-interaction = 0.02). Statin use may be protective for adverse outcomes in HNSCC patients, particularly those with HPV-positive disease. If true, these findings could have a meaningful impact on tertiary prevention for this cancer.
ABSTRACT
Multiple imputation by chained equations (MICE) has emerged as a popular approach for handling missing data. A central challenge for applying MICE is determining how to incorporate outcome information into covariate imputation models, particularly for complicated outcomes. Often, we have a particular analysis model in mind, and we would like to ensure congeniality between the imputation and analysis models. We propose a novel strategy for directly incorporating the analysis model into the handling of missing data. In our proposed approach, multiple imputations of missing covariates are obtained without using outcome information. We then utilize the strategy of imputation stacking, where multiple imputations are stacked on top of each other to create a large data set. The analysis model is then incorporated through weights. Instead of applying Rubin's combining rules, we obtain parameter estimates by fitting a weighted version of the analysis model on the stacked data set. We propose a novel estimator for obtaining standard errors for this stacked and weighted analysis. Our estimator is based on the observed data information principle in Louis' work and can be applied for analyzing stacked multiple imputations more generally. Our approach for analyzing stacked multiple imputations is the first method that can be easily applied (using R package StackImpute) for a wide variety of standard analysis models and missing data settings.
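A rough sketch of the stack-and-weight workflow with mice on toy data, for a linear analysis model of y on x1 and x2 with x1 partly missing. The weighting shown (the analysis model's density for y given the imputed covariates, normalized within subject) is a schematic version of the idea; point estimates come from the weighted fit, while the Louis-type standard errors are what StackImpute supplies.

```r
library(mice)

# Toy data with x1 missing at random (illustration only).
set.seed(3)
n <- 300
x2 <- rnorm(n); x1 <- 0.5 * x2 + rnorm(n); y <- 1 + x1 - x2 + rnorm(n)
x1[rbinom(n, 1, plogis(-0.5 + x2)) == 1] <- NA
dat <- data.frame(id = 1:n, y, x1, x2)

# Step 1: impute the covariate WITHOUT using the outcome.
imp <- mice(dat[, c("x1", "x2")], m = 10, printFlag = FALSE)
stacked <- complete(imp, action = "long")     # the m imputations stacked
stacked$y  <- rep(dat$y,  times = 10)
stacked$id <- rep(dat$id, times = 10)

# Step 2: weight each stacked row by the analysis model's density for y given
# the imputed covariates, normalized within subject (schematic version).
fit0 <- lm(y ~ x1 + x2, data = stacked)
dens <- dnorm(stacked$y, mean = predict(fit0), sd = summary(fit0)$sigma)
stacked$w <- ave(dens, stacked$id, FUN = function(v) v / sum(v))

# Step 3: fit the weighted analysis model on the stacked data; Louis-type
# standard errors for this fit are provided by the StackImpute package.
fit_w <- lm(y ~ x1 + x2, data = stacked, weights = w)
coef(fit_w)
```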
Subject(s)
Statistical Models, Research Design
ABSTRACT
Dynamic prediction methods incorporate longitudinal biomarker information to produce updated, more accurate predictions of conditional survival probability. There are two approaches for obtaining dynamic predictions: (1) a joint model of the longitudinal marker and survival process, and (2) an approximate approach that specifies a model for a specific component of the joint distribution. In the case of a binary marker, an illness-death model is an example of a joint modeling approach that is unified and produces consistent predictions. However, previous literature has shown that approximate approaches, such as landmarking, with additional flexibility can have good predictive performance. One such approach proposes using a Gaussian copula to model the joint distribution of conditional continuous marker and survival distributions. It has the advantage of specifying established, flexible models for the marginals for which goodness-of-fit can be assessed, and has easy estimation that can be implemented in standard software. In this article, we provide a Gaussian copula approach for dynamic prediction to accommodate a binary marker using a continuous latent variable formulation. We compare the predictive performance of this approach to joint modeling and landmarking using simulations and demonstrate its use for obtaining dynamic predictions in an application to a prostate cancer study.
Subject(s)
Statistical Models, Prostatic Neoplasms, Biomarkers/analysis, Humans, Male, Normal Distribution, Probability
ABSTRACT
Not-at-random missingness presents a challenge in addressing missing data in many health research applications. In this article, we propose a new approach to account for not-at-random missingness after multiple imputation through weighted analysis of stacked multiple imputations. The weights are easily calculated as a function of the imputed data and assumptions about the not-at-random missingness. We demonstrate through simulation that the proposed method has excellent performance when the missingness model is correctly specified. In practice, the missingness mechanism will not be known. We show how we can use our approach in a sensitivity analysis framework to evaluate the robustness of model inference to different assumptions about the missingness mechanism, and we provide R package StackImpute to facilitate implementation as part of routine sensitivity analyses. We apply the proposed method to account for not-at-random missingness in human papillomavirus test results in a study of survival for patients diagnosed with oropharyngeal cancer.
Subject(s)
Statistical Models, Research Design, Computer Simulation, Statistical Data Interpretation, Humans
ABSTRACT
A surrogate endpoint S in a clinical trial is an outcome that may be measured earlier or more easily than the true outcome of interest T. In this work, we extend causal inference approaches to validate such a surrogate using potential outcomes. The causal association paradigm assesses the relationship of the treatment effect on the surrogate with the treatment effect on the true endpoint. Using the principal surrogacy criteria, we utilize the joint conditional distribution of the potential outcomes T, given the potential outcomes S. In particular, our setting of interest allows us to assume the surrogate under the placebo, S(0), is zero-valued, and we incorporate baseline covariates in the setting of normally distributed endpoints. We develop Bayesian methods to incorporate conditional independence and other modeling assumptions and explore their impact on the assessment of surrogacy. We demonstrate our approach via simulation and data that mimics an ongoing study of a muscular dystrophy gene therapy.
Subject(s)
Bayes Theorem, Biomarkers, Causality, Humans
ABSTRACT
Dynamic treatment regimes (DTRs) comprise a sequence of treatment decision rules in which treatment is adapted over time in response to changes in an individual's disease progression and health care history. In medical practice, nested test-and-treat strategies are common because they improve cost-effectiveness. For example, among patients at risk of prostate cancer, only those with high prostate-specific antigen (PSA) need a biopsy, which is costly and invasive, to confirm the diagnosis and help determine the treatment if needed. The decision about treatment happens after the biopsy and is thus nested within the decision of whether to do the test. However, existing statistical methods cannot accommodate this natural nesting of the treatment decision within the test decision. We therefore developed a new statistical learning method, step-adjusted tree-based reinforcement learning, to evaluate DTRs within such a nested multistage dynamic decision framework using observational data. At each step within each stage, we combined robust semiparametric estimation via augmented inverse probability weighting with a tree-based reinforcement learning method to handle the counterfactual optimization. Simulation studies demonstrated robust performance of the proposed methods under different scenarios. We further applied our method to evaluate the necessity of prostate biopsy and identify the optimal test-and-treat regimes for prostate cancer patients using data from the Johns Hopkins University prostate cancer active surveillance dataset.
Subject(s)
Prostatic Neoplasms, Research Design, Computer Simulation, Humans, Male, Probability, Prostatic Neoplasms/diagnosis, Prostatic Neoplasms/therapy
ABSTRACT
In medical research, the Brier score (BS) and the area under the receiver operating characteristic (ROC) curve (AUC) are two common metrics used to evaluate prediction models of a binary outcome, such as using biomarkers to predict the risk of developing a disease in the future. Assessing an existing prediction model using data with missing covariate values is challenging. In this article, we propose inverse probability weighted (IPW) and augmented inverse probability weighted (AIPW) estimates of AUC and BS to handle the missing data. An alternative approach uses multiple imputation (MI), which requires a model for the distribution of the missing variable. We evaluated the performance of IPW and AIPW in comparison with MI in simulation studies under missing completely at random, missing at random, and missing not at random scenarios. When there are missing observations in the data, MI and IPW can be used to obtain unbiased estimates of BS and AUC if the imputation model for the missing variable or the model for the missingness, respectively, is correctly specified. MI is more efficient than IPW. Our simulation results suggest that AIPW can be more efficient than IPW and is doubly robust to misspecification of either the missingness model or the imputation model. The outcome variable should be included in the model for the missing variable under all scenarios, while it only needs to be included in the missingness model if the missingness depends on the outcome. We illustrate these methods using an example from prostate cancer.
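A minimal sketch of the IPW idea on simulated data: complete cases are weighted by the inverse of their estimated probability of being complete, with the outcome included in the missingness model, and the Brier score of an existing risk model is computed with those weights. The AIPW version (not shown) adds an augmentation term built from an imputation model for the missing covariate.

```r
# Toy data: covariate x, needed by an existing risk model, is sometimes missing,
# with missingness depending on the outcome y and on z.
set.seed(4)
n <- 1000
z <- rnorm(n); x <- 0.5 * z + rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x + 0.5 * z))
risk <- plogis(-1 + x + 0.5 * z)                           # existing model's predicted risks
obs  <- rbinom(n, 1, plogis(1 + 0.8 * y - 0.5 * z)) == 1   # is x observed?

# Missingness model: include the outcome, since missingness depends on it.
p_obs <- fitted(glm(obs ~ y + z, family = binomial))

w <- obs / p_obs                       # IPW weights (zero for incomplete cases)
brier_ipw <- sum(w * (y - risk)^2) / sum(w)
brier_ipw
```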