Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets.

Li, JiaHang; Guo, ShuXia; Ma, RuLin; He, Jia; Zhang, XiangHui; Rui, DongSheng; Ding, YuSong; Li, Yu; Jian, LeYao; Cheng, Jing; Guo, Heng

Li J; Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China.
Guo S; Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China.
Ma R; Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China.
He J; Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China.
Zhang X; Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China.
Rui D; Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China.
Ding Y; Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China.
Li Y; Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China.
Jian L; Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China.
Cheng J; Key Laboratory for Prevention and Control of Emerging Infectious Diseases and Public Health Security, the Xinjiang Production and Construction Corps, Shihezi, Xinjiang, 832000, China.
Guo H; Department of Public Health, Shihezi University School of Medicine, North 2th Road, Shihezi, 832003, Xinjiang, China.

BMC Med Res Methodol; 24(1): 41, 2024 Feb 16.

Article in English | MEDLINE | ID: mdl-38365610

ABSTRACT

BACKGROUND:

Missing data is often unavoidable in cohort studies and can adversely affect a study's findings. We assess the effectiveness of eight commonly used statistical and machine learning (ML) imputation methods for handling missing data in predictive modelling of cohort study datasets. The evaluation is based on real data and on predictive models for cardiovascular disease (CVD) risk.

METHODS:

The data is from a real-world cohort study in Xinjiang, China. It includes personal information, physical examination data, questionnaires, and laboratory biochemical results from 10,164 subjects, with a total of 37 variables. The imputation methods chosen were simple imputation (Simple), regression imputation (Regression), expectation-maximization (EM), multiple imputation (MICE), K-nearest neighbor classification (KNN), clustering imputation (Cluster), random forest (RF), and decision tree (Cart). Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) were used to assess the performance of each imputation method at a missing rate of 20%. The datasets produced by the different imputation methods were then used to construct CVD risk prediction models with a support vector machine (SVM), and predictive performance was compared using the area under the curve (AUC).
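The evaluation design described above (hide a known fraction of values, impute them, then score the imputed values against the truth with MAE and RMSE) can be sketched as follows. This is a minimal illustration only, not the authors' code: the synthetic data, the function names, the choice of k = 5, and the distance used for KNN are all assumptions, and only two of the eight methods (Simple and KNN) are shown.

```python
import math
import random

def mask_values(rows, rate, rng):
    """Copy `rows` and set a fraction `rate` of cells to None; also return the hidden positions."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = rng.sample(cells, int(round(rate * len(cells))))
    masked = [list(r) for r in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden

def impute_mean(masked):
    """Simple imputation: fill each gap with the mean of the observed values in its column."""
    means = []
    for j in range(len(masked[0])):
        observed = [r[j] for r in masked if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in masked]

def impute_knn(masked, k=5):
    """KNN imputation: fill each gap with the column mean over the k nearest rows.
    Distance is the root mean squared difference over columns observed in both rows."""
    fallback = impute_mean(masked)
    filled = [list(r) for r in masked]
    for i, row in enumerate(masked):
        for j, v in enumerate(row):
            if v is not None:
                continue
            candidates = []
            for other in masked:
                if other is row or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    candidates.append((math.sqrt(sum(shared) / len(shared)), other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [val for _, val in candidates[:k]]
            # Fall back to the column mean when no comparable neighbor exists.
            filled[i][j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
    return filled

def imputation_errors(truth, imputed, hidden):
    """Return (MAE, RMSE) computed only on the cells that were hidden."""
    errors = [imputed[i][j] - truth[i][j] for i, j in hidden]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Synthetic, strongly correlated data (an assumption; the study used real cohort data).
rng = random.Random(0)
truth = []
for _ in range(200):
    z = rng.gauss(0, 1)
    truth.append([z + rng.gauss(0, 0.1) for _ in range(4)])

masked, hidden = mask_values(truth, rate=0.20, rng=rng)  # 20% missing rate, as in the study
imputed = {"Simple": impute_mean(masked), "KNN": impute_knn(masked)}
scores = {name: imputation_errors(truth, data, hidden) for name, data in imputed.items()}

for name, (mae, rmse) in scores.items():
    print(f"{name}: MAE={mae:.4f} RMSE={rmse:.4f}")
```

Because the errors are measured only on the deliberately hidden cells, the same masked dataset can be passed to any number of imputation functions and compared on equal terms.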

RESULTS:

The most effective imputation results were attained by KNN (MAE 0.2032, RMSE 0.7438, AUC 0.730, CI 0.719-0.741) and RF (MAE 0.3944, RMSE 1.4866, AUC 0.777, CI 0.769-0.785). The next best performances were achieved by EM, Cart, and MICE, while Simple, Regression, and Cluster performed worst. The CVD risk prediction model constructed using the complete data (AUC 0.804, CI 0.796-0.812) differed from all other models with p<0.05.
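For readers unfamiliar with the AUC figures and intervals reported above, the sketch below computes an AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, and attaches a percentile bootstrap interval. The abstract does not say how the intervals were obtained, so the bootstrap is an assumption made here for illustration; the function names and the synthetic scores are likewise hypothetical.

```python
import random

def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # keep only resamples that contain both classes
            stats.append(auc_score(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Small hand-checkable case: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75

# Synthetic classifier scores (hypothetical): positives are shifted upward by 1.
rng = random.Random(1)
labels = [0] * 60 + [1] * 60
scores = [y + rng.gauss(0, 1) for y in labels]
auc = auc_score(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
print(f"AUC={auc:.3f} CI={ci_low:.3f}-{ci_high:.3f}")
```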

CONCLUSION:

KNN and RF exhibit superior performance and are more adept at imputing missing data in predictive modelling of cohort study datasets.

Subject(s)

Algorithms; Cardiovascular Diseases; Humans; Cohort Studies; Machine Learning; Support Vector Machine; Cardiovascular Diseases/diagnosis; Cardiovascular Diseases/epidemiology

Keywords

Cardiovascular disease; Cohort study; Imputation methods; Machine learning; Missing data

Full text: 1 Database: MEDLINE Main subject: Algorithms / Cardiovascular Diseases Limit: Humans Language: En Year: 2024 Document type: Article
