ABSTRACT
Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we used data from a real-world breast cancer survival study. A simulated dataset with 30% Missing At Random (MAR) values was used. Several single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, missCforest) and multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics used were Gower's distance, estimation bias, empirical standard error, coverage rate, length of confidence interval, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index scores. The analysis revealed that in terms of Gower's distance, CART and missForest were the most accurate, while missMDA and CART excelled for binary covariates; missForest and miceCART were superior for continuous covariates. When assessing bias and accuracy in regression estimates, miceCART and miceRF exhibited the least bias. Overall, the various imputation methods demonstrated greater efficiency than complete-case analysis (CCA), with MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores.
Despite offering better predictive accuracy, the study found that SI methods introduced more bias into the regression coefficients compared to MI methods. This study underlines the importance of selecting appropriate imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of methods across the different performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.
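As a rough illustration of two of the model-free metrics named above, the sketch below computes NRMSE for a continuous covariate and PFC for a categorical covariate, restricted to the entries that were artificially set to missing. The function names, the argument layout, and the choice to normalize by the variance of the true values are assumptions made for this example; the study may define the metrics slightly differently.

```python
import numpy as np

def nrmse(x_true, x_imputed, missing_mask):
    """Normalized RMSE over the entries that were missing (continuous variable).

    Assumption: normalization by the variance of the true values on the
    missing entries, which is one common convention.
    """
    x_true = np.asarray(x_true, dtype=float)
    x_imputed = np.asarray(x_imputed, dtype=float)
    mask = np.asarray(missing_mask, dtype=bool)
    mse = np.mean((x_true[mask] - x_imputed[mask]) ** 2)
    return float(np.sqrt(mse / np.var(x_true[mask])))

def pfc(c_true, c_imputed, missing_mask):
    """Proportion of falsely classified entries among the missing ones."""
    c_true = np.asarray(c_true)
    c_imputed = np.asarray(c_imputed)
    mask = np.asarray(missing_mask, dtype=bool)
    return float(np.mean(c_true[mask] != c_imputed[mask]))

# Toy data: the last two entries of each variable were masked, then imputed.
mask = [False, False, True, True]
print(nrmse([1, 2, 3, 4], [1, 2, 2, 5], mask))
print(pfc(["a", "b", "a", "b"], ["a", "b", "b", "b"], mask))
```

Both metrics require the true values, so they can only be computed on simulated missingness, which is consistent with the simulated 30% MAR design described in the abstract.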
Subjects
Breast Neoplasms , Machine Learning , Humans , Breast Neoplasms/mortality , Female , Reproducibility of Results , Algorithms , Prognosis , Data Interpretation, Statistical , Survival Analysis
ABSTRACT
In the presence of competing causes of event occurrence (e.g., death), the interest might lie not only in overall survival but also in the so-called net survival, that is, the hypothetical survival that would be observed if the disease under study were the only possible cause of death. Net survival estimation is commonly based on the excess hazard approach, in which the hazard rate of individuals is assumed to be the sum of a disease-specific hazard rate and an expected hazard rate, the latter supposed to be correctly approximated by the mortality rates obtained from general population life tables. However, this assumption might not be realistic if the study participants are not comparable with the general population. Also, the hierarchical structure of the data can induce a correlation between the outcomes of individuals coming from the same clusters (e.g., hospital, registry). We proposed an excess hazard model that corrects simultaneously for these two sources of bias, instead of dealing with them independently as before. We assessed the performance of this new model and compared it with three similar models, using an extensive simulation study as well as an application to breast cancer data from a multicenter clinical trial. The new model performed better than the others in terms of bias, root mean square error, and empirical coverage rate. The proposed approach might be useful to account simultaneously for the hierarchical structure of the data and the non-comparability bias in studies such as long-term multicenter clinical trials, when there is interest in the estimation of net survival.
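The additive decomposition behind the excess hazard approach can be sketched numerically. The example below assumes piecewise-constant hazards on a grid of intervals and covers only the basic decomposition (observed hazard = excess hazard + expected hazard). It does not implement the proposed correction for non-comparability or clustering. The function name and inputs are hypothetical.

```python
import numpy as np

def net_survival(interval_lengths, observed_hazard, expected_hazard):
    """Net survival at the end of each interval under piecewise-constant hazards.

    Assumption: excess hazard = observed hazard - expected hazard, where the
    expected hazard would come from a general population life table.
    """
    dt = np.asarray(interval_lengths, dtype=float)
    excess = np.asarray(observed_hazard, dtype=float) - np.asarray(expected_hazard, dtype=float)
    cumulative_excess = np.cumsum(excess * dt)  # integrated excess hazard
    return np.exp(-cumulative_excess)

# Two one-year intervals with illustrative (made-up) hazard values.
print(net_survival([1.0, 1.0], [0.3, 0.2], [0.1, 0.1]))
```

If the expected hazard taken from the life table is too low or too high for the study participants, the excess hazard absorbs the difference, which is the non-comparability bias the abstract refers to.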
Subjects
Breast Neoplasms , Humans , Female , Proportional Hazards Models , Survival Analysis , Computer Simulation , Bias
ABSTRACT
BACKGROUND: Methods for estimating relative survival are widely used in population-based cancer survival studies. These methods are based on splitting the observed (overall) mortality into excess mortality (due to cancer) and background mortality (due to other causes, as expected in the general population). The latter is derived from life tables usually stratified by age, sex, and calendar year but not by other covariates (such as deprivation level or socioeconomic status), which may be lacking even though they influence background mortality. The absence of these covariates leads to inaccurate background mortality and thus to biases in estimating the excess mortality. These biases may be avoided by adjusting the background mortality for these covariates whenever they are available. METHODS: In this work, we propose a regression model of excess mortality that corrects for potentially inaccurate background mortality by introducing age-dependent multiplicative parameters through breakpoints, which gives some flexibility. The performance of this model was first assessed with one and two breakpoints in an intensive simulation study; the method was then applied to French population-based data on colorectal cancer. RESULTS: The proposed model proved useful in the simulations and in the application to real data; it limited the bias in parameter estimates of the excess mortality in several scenarios and improved the results and the generalizability of Touraine's proportional hazards model. CONCLUSION: The proposed model is a good approach to reliably correct inaccurate background mortality by introducing multiplicative parameters that depend on age and on an additional variable through breakpoints.
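One way to picture an age-dependent multiplicative correction with breakpoints is a piecewise-constant multiplier applied to the life-table hazard. The sketch below is only an assumed form chosen for illustration: the function name, the piecewise-constant shape, and the numeric values are hypothetical and are not taken from the paper.

```python
import numpy as np

def corrected_background_hazard(age, population_hazard, breakpoints, alphas):
    """Scale the life-table hazard by a multiplier that changes at age breakpoints.

    Assumption: alphas has len(breakpoints) + 1 entries, one per age class,
    and an age equal to a breakpoint belongs to the upper class.
    """
    age = np.asarray(age, dtype=float)
    hazard = np.asarray(population_hazard, dtype=float)
    # Index of the age class each individual falls into.
    age_class = np.searchsorted(np.asarray(breakpoints, dtype=float), age, side="right")
    return np.asarray(alphas, dtype=float)[age_class] * hazard

# One breakpoint at age 70: multiplier 1.0 below it, 1.5 from 70 onward.
print(corrected_background_hazard([65, 70, 80], [0.01, 0.02, 0.05], [70], [1.0, 1.5]))
```

In a regression setting the multipliers would be estimated jointly with the excess mortality parameters rather than fixed in advance as they are here.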
Subjects
Neoplasms , Bias , Computer Simulation , Humans , Proportional Hazards Models , Research Design
ABSTRACT
Net survival is used in epidemiological studies to assess excess mortality due to a given disease when causes of death are unreliable. By correcting for the general population mortality, it allows comparisons between regions or periods and thus evaluation of health policies. The Pohar-Perme non-parametric estimator of net survival has been recently proposed, soon followed by an appropriate log-rank-type test. However, log-rank tests are known to be suboptimal in non-proportional settings (e.g., crossing of the hazard functions). In classical survival analysis, one solution is to compare the restricted mean survival times. A difference in restricted mean survival time represents a life benefit or loss over the studied period. In the present article, the restricted mean net survival time was used to derive a specific test statistic to compare net survivals in proportional and non-proportional hazards settings. The new test was generalized to more than two groups and to stratified analysis. The test performance was assessed in a simulation study and compared with that of the log-rank-type test, and its use was illustrated on data from a population-based colorectal cancer registry. The new test for net survival comparisons proved robust to non-proportionality and performed well in proportional hazards situations. Furthermore, it is also suited to the classical survival framework.
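The restricted mean survival time is the area under the survival curve up to a horizon tau. The sketch below computes it for a right-continuous step curve, such as a Kaplan-Meier estimate or, by analogy, a net survival estimate. The function name and the toy curve are assumptions for illustration, and the test statistic derived in the article is not reproduced.

```python
import numpy as np

def restricted_mean_survival_time(jump_times, survival_probs, tau):
    """Area under a right-continuous step survival curve on [0, tau].

    jump_times: increasing times at which the curve changes value.
    survival_probs: curve value from each jump time onward.
    """
    times = np.asarray(jump_times, dtype=float)
    surv = np.asarray(survival_probs, dtype=float)
    keep = times < tau
    # Interval edges: 0, the jump times before tau, then tau itself.
    edges = np.concatenate(([0.0], times[keep], [tau]))
    # Curve height on each interval: 1 before the first jump, then S(t_k).
    heights = np.concatenate(([1.0], surv[keep]))
    return float(np.sum(heights * np.diff(edges)))

# Toy curve: 1*1 + 0.8*1 + 0.5*0.5 = 2.05 time units over [0, 2.5].
print(restricted_mean_survival_time([1.0, 2.0, 3.0], [0.8, 0.5, 0.2], tau=2.5))
```

Computing this quantity for each group and taking the difference gives the life benefit or loss over the studied period mentioned in the abstract.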
Subjects
Survival Rate , Causality , Computer Simulation , Humans , Proportional Hazards Models , Survival Analysis
ABSTRACT
In randomized clinical trials (RCT), the analysis is based on the intent-to-treat principle to avoid any selection bias in the constitution of groups. However, estimates of overall survival can be biased when significant crossover occurs because the separation of randomized groups is lost. To handle these switches, the inverse probability of censoring weighting (IPCW) method has been proposed; however, it is still rarely used in RCT, notably because of its complex implementation. In particular, for time-to-event outcomes, it can be difficult to format data, especially when time-dependent covariates have to be managed, with different measurement times between patients. This paper aims to present the R package ipcwswitch with some guidance for the analysis of the treatment effect on survival in a hypothetical setting where all patients would have continued to take the randomized treatment. After a brief reminder of the key principles of the IPCW method, each step of the implementation is described using a toy example. The guidelines are illustrated in a case study, SHIVA01, that aimed at evaluating the benefit of therapy based on tumour molecular profiling for advanced cancers.
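The core of the IPCW idea can be sketched in a few lines: patients who switch treatment are artificially censored, and each remaining patient-interval is weighted by the inverse of the probability of still being unswitched. The sketch below is written in Python for illustration and is not the ipcwswitch API. The function names are hypothetical, and the per-interval probabilities are assumed to come from a previously fitted model (for example, a Cox or pooled logistic model).

```python
import numpy as np

def ipcw_weights(p_uncensored):
    """Unstabilized weights for one patient.

    p_uncensored: conditional probability of remaining uncensored (not
    switching) in each successive interval, given the covariate history.
    """
    p = np.asarray(p_uncensored, dtype=float)
    return 1.0 / np.cumprod(p)

def stabilized_ipcw_weights(p_numerator, p_denominator):
    """Stabilized weights: a baseline-covariates-only model in the numerator,
    the full time-dependent model in the denominator (an assumed convention)."""
    num = np.cumprod(np.asarray(p_numerator, dtype=float))
    den = np.cumprod(np.asarray(p_denominator, dtype=float))
    return num / den

# Illustrative (made-up) probabilities for a patient followed over two intervals.
print(ipcw_weights([0.9, 0.8]))
print(stabilized_ipcw_weights([0.95, 0.9], [0.9, 0.8]))
```

The weights are then used in a weighted survival analysis on the long-format data, which is the data-formatting step the abstract describes as difficult.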
Subjects
Data Interpretation, Statistical , Neoplasms , Software , Survival Analysis , Humans , Neoplasms/mortality , Neoplasms/therapy , Randomized Controlled Trials as Topic
ABSTRACT
BACKGROUND: Net survival, a measure of survival in a setting where patients could die only from the cancer under study, may be compared between treatment groups using either "cause-specific methods", when the causes of death are known and accurate, or "population-based methods", when the causes are missing or inaccurate. The latter methods rely on the assumption that mortality due to causes other than cancer is the same as the expected mortality in the general population with the same demographic characteristics, derived from population life tables. This assumption may not hold in clinical trials, where patients are likely to be quite different from the general population because of patient selection criteria. METHODS: In this work, we propose and assess the performance of a new flexible population-based model to estimate long-term net survival in clinical trials that allows for cause-of-death misclassification and for effects of selection. Comparisons were made with cause-specific and other population-based methods in a simulation study and in an application to prostate cancer clinical trial data. RESULTS: In estimating net survival, cause-specific methods seemed to introduce important biases associated with the degree of misclassification of cancer deaths. The usual population-based method also provides biased estimates, depending on the strength of the selection effect. Compared to these methods, the new model was able to provide more accurate estimates of net survival in long-term clinical trials. CONCLUSION: The new model paves the way for new methodological developments in the field of net survival methods in multicenter clinical trials.
Subjects
Clinical Trials as Topic/methods , Data Accuracy , Prostatic Neoplasms/mortality , Survival Analysis , Aged , Cause of Death , Computer Simulation , Diethylstilbestrol/therapeutic use , Humans , Male , Prostatic Neoplasms/drug therapy , Research Design
ABSTRACT
For estimating the causal effect of treatment exposure on the occurrence of adverse events, inverse probability weights (IPW) can be used in marginal structural models to correct for time-dependent confounding. The R package ipw allows IPW estimation by modeling the relationship between the exposure and confounders via several regression models, among which is the Cox model. For right-censored data and time-dependent exposures such as treatment switches, the ipw package allows a single switch, assuming that patients are treated once and for all. To accommodate multiple switches, we extend this package by implementing a function that allows for multiple and intermittent exposure status in the estimation of IPW using a survival model. This extension accounts for the whole treatment exposure trajectory in the estimation of IPW. The impact of the estimated weights on the estimated causal effect, with both methods, is assessed in a simulation study. Then, the function is illustrated on a real dataset from a nationwide prospective observational cohort including patients with inflammatory bowel disease. In this study, patients received one or multiple medications (thiopurines, methotrexate, and anti-TNF) over time. We used a Cox marginal structural model to assess the effect of thiopurine exposure on the cause-specific hazard for cancer incidence, considering other treatments as confounding factors. To this end, we used our extended function, which is available online in the Supporting Information.
Subjects
Biometry/methods , Models, Statistical , Endpoint Determination , Female , Humans , Male , Observational Studies as Topic , Probability , Proportional Hazards Models , Time Factors
ABSTRACT
In population-based cancer studies, it is often of interest to compare cancer survival between different populations. However, in such studies, the exact causes of death are often unavailable or unreliable. Net survival methods were developed to overcome this difficulty. Net survival is the survival that would be observed if the disease under study were the only possible cause of death. The Pohar-Perme estimator (PPE) is a nonparametric consistent estimator of net survival. In this article, we present a log-rank-type test for comparing net survival functions (as estimated by PPE) between several groups. We put the test within the counting process framework to introduce the inverse probability weighting procedure as required by the PPE. We built a stratified version to control for categorical covariates that affect the outcome. We performed simulation studies to evaluate the performance of this test and applied it to real data.
Subjects
Colorectal Neoplasms/mortality , Models, Statistical , Survival Analysis , Computer Simulation , Humans , Probability
ABSTRACT
Regression-based relative survival models are commonly used in population-based cancer studies to estimate the real impact on the excess mortality of covariates that influence overall mortality. Usually, the mortality observed in a study cohort is corrected by the expected mortality hazard in the general population, which is given by life tables provided by national statistics institutes. These life tables are stratified by age, sex, calendar year, and, sometimes, other demographic data (ethnicity, deprivation, and others). However, in most cases, the same demographic data are not available for the study cohort and the general population; this leads to differences between the expected mortality of the general population and that of the study cohort. More generally, the absence of some demographic variables in life tables may introduce a measurement bias into the estimation of the excess mortality. In the present article, we used a simulation approach with different plausible scenarios to evaluate the impact of an additional life-table variable on excess mortality estimates and to study the extent and the direction of the biases in estimating the effect of each covariate on the excess mortality. We showed that the use of a life table that lacks stratification by a variable present in the excess hazard model results in a measurement bias not only in the estimate of the effect of this variable but also, to a lesser extent, in the estimates of the effects of the other covariates included in the model. We also illustrated this measurement bias in a population-based colorectal cancer analysis.