Search | BVS Integralidade em Saúde

A flexible approach for variable selection in large-scale healthcare database studies with missing covariate and outcome data.

Lin, Jung-Yi Joyce; Hu, Liangyuan; Huang, Chuyue; Jiayi, Ji; Lawrence, Steven; Govindarajulu, Usha.

BMC Med Res Methodol; 22(1): 132, 2022 May 04.

Article in English | MEDLINE | ID: mdl-35508974

ABSTRACT

BACKGROUND: Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can recover the performance achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets. METHODS: We propose an inference-based method, called RR-BART, which leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees, and uses Rubin's rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of MAR data. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART and compare it with the bootstrap imputation-based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women using data from the Study of Women's Health Across the Nation (SWAN). RESULTS: The simulation study suggests that even in complex conditions of nonlinearity and nonadditivity with a large percentage of missingness, RR-BART can reasonably recover both the prediction and the variable selection performance achievable on the fully observed data. RR-BART matches the best performance that the bootstrap imputation-based methods can achieve with the optimal selection threshold value. In addition, RR-BART demonstrates a substantially stronger ability to detect discrete predictors. Furthermore, RR-BART offers substantial computational savings. When implemented on the SWAN data, RR-BART adds to the literature by selecting a set of predictors that had been less commonly identified as risk factors but have substantial biological justification.
CONCLUSION: The proposed variable selection method for MAR data, RR-BART, offers both computational efficiency and good operating characteristics and is useful in large-scale healthcare database studies.
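
The pooling step named in METHODS can be illustrated with a minimal sketch of Rubin's rule for a single variable importance measure. The abstract does not give the formulas or the selection criterion, so the function names (`pool_rubin`, `is_selected`), the normal-approximation interval, and the threshold comparison are assumptions made for illustration, not the authors' implementation.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one quantity across m imputed datasets with Rubin's rule.

    estimates: per-imputation point estimates of the importance measure
    variances: per-imputation variances of those estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def is_selected(estimates, variances, threshold, z=1.96):
    """Hypothetical selection rule (an assumption, not from the abstract):
    keep the variable if the lower bound of a normal-approximation
    interval around the pooled estimate exceeds `threshold`."""
    q_bar, total = pool_rubin(estimates, variances)
    return q_bar - z * sqrt(total) > threshold
```

A full treatment would use a t reference distribution with Rubin's degrees of freedom instead of the fixed `z`; the constant is used here only to keep the sketch short.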

Subjects

Delivery of Health Care, Statistical Models, Bayes Theorem, Computer Simulation, Statistical Data Interpretation, Female, Humans, Likelihood Functions, Middle Aged

Comparison of statistical and machine learning models for healthcare cost data: a simulation study motivated by Oncology Care Model (OCM) data.

Mazumdar, Madhu; Lin, Jung-Yi Joyce; Zhang, Wei; Li, Lihua; Liu, Mark; Dharmarajan, Kavita; Sanderson, Mark; Isola, Luis; Hu, Liangyuan.

BMC Health Serv Res; 20(1): 350, 2020 Apr 25.

Article in English | MEDLINE | ID: mdl-32334595

ABSTRACT

BACKGROUND: The Oncology Care Model (OCM) was developed as a payment model to encourage participating practices to provide better-quality care for cancer patients at a lower cost. The risk-adjustment model used in OCM is a Gamma generalized linear model (Gamma GLM) with a log link. The predicted value of expense for the episodes identified for our academic medical center (AMC), based on the model fitted to the national data, did not correlate well with our observed expense. This motivated us to fit the Gamma GLM to our AMC data and compare it with two other flexible modeling methods: Random Forest (RF) and Partially Linear Additive Quantile Regression (PLAQR). We also performed a simulation study to assess the comparative performance of these methods and examined the impact of non-linearity and interaction effects, two understudied aspects in the field of cost prediction. METHODS: The simulation was designed with an outcome of cost generated from four distributions: Gamma, Weibull, Log-normal with a heteroscedastic error term, and heavy-tailed. Simulation parameters both similar to and different from OCM data were considered. The performance metrics considered were the root mean square error (RMSE), mean absolute prediction error (MAPE), and cost accuracy (CA). Bootstrap resampling was used to estimate the operating characteristics of the performance metrics, which were described by boxplots. RESULTS: RF attained the best performance, with the lowest RMSE and MAPE and the highest CA, in most of the scenarios. When the models were misspecified, their performance was further differentiated. Model performance differed more for non-exponential than exponential outcome distributions.
CONCLUSIONS: RF outperformed Gamma GLM and PLAQR in predicting overall and top decile costs. RF demonstrated improved prediction under various scenarios common in healthcare cost modeling. Additionally, RF did not require prespecification of the outcome distribution, nonlinear effects, or interaction terms. Therefore, RF appears to be the best tool to predict average cost. However, when the goal is to estimate extreme expenses, e.g., high-cost episodes, the accuracy gained by RF versus its computational costs may need to be considered.
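
Two of the performance metrics named in METHODS, RMSE and MAPE, can be sketched together with a simple bootstrap of their sampling distribution. This is an illustration under stated assumptions: the abstract does not say how the resampling was carried out, so resampling (observed, predicted) pairs with replacement is an assumption, and the function names are hypothetical. Cost accuracy (CA) is omitted because the abstract does not define it.

```python
import random
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error between observed and predicted costs."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean absolute prediction error (the abstract's MAPE)."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def bootstrap_metric(metric, observed, predicted, n_boot=1000, seed=0):
    """Resample (observed, predicted) pairs with replacement and
    recompute `metric` on each resample (assumed resampling scheme)."""
    rng = random.Random(seed)
    n = len(observed)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([observed[i] for i in idx],
                             [predicted[i] for i in idx]))
    return values
```

The list returned by `bootstrap_metric` is the kind of input a boxplot would summarize, one box per model and metric.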

Subjects

Health Care Costs/statistics & numerical data, Machine Learning, Statistical Models, Computer Simulation, Humans, Linear Models, Medical Oncology/economics, Risk Adjustment
