ABSTRACT
BACKGROUND: Missing data are often unavoidable in cohort studies and can adversely affect a study's findings. We assess the effectiveness of eight commonly used statistical and machine learning (ML) imputation methods for handling missing data in predictive modelling of cohort study datasets. The evaluation is based on real data and predictive models for cardiovascular disease (CVD) risk. METHODS: The data come from a real-world cohort study in Xinjiang, China, and include personal information, physical examination data, questionnaires, and laboratory biochemical results from 10,164 subjects, with a total of 37 variables. The imputation methods chosen were simple imputation (Simple), regression imputation (Regression), expectation-maximization (EM), multiple imputation (MICE), K-nearest neighbor (KNN), clustering imputation (Cluster), random forest (RF), and decision tree (Cart). Root mean square error (RMSE) and mean absolute error (MAE) were used to assess the performance of each imputation method at a missing rate of 20%. The datasets produced by each imputation method were then used to build a CVD risk prediction model with a support vector machine (SVM), and predictive performance was compared using the area under the curve (AUC). RESULTS: The most effective imputation results were attained by KNN (MAE: 0.2032, RMSE: 0.7438, AUC: 0.730, CI: 0.719-0.741) and RF (MAE: 0.3944, RMSE: 1.4866, AUC: 0.777, CI: 0.769-0.785). The next best performances were achieved by EM, Cart, and MICE, while Simple, Regression, and Cluster performed worst. The CVD risk prediction model built on the complete data (AUC: 0.804, CI: 0.796-0.812) differed from all other models at p<0.05. CONCLUSION: KNN and RF show superior performance and are better suited to imputing missing data in predictive modelling of cohort study datasets.
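The evaluation step described in METHODS (hide 20% of known values, impute them, then score the imputed values with RMSE and MAE) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data are synthetic, missingness is assumed to be completely at random, only the simplest method (mean imputation, standing in for "Simple") is shown, and the names mask_values, mean_impute, rmse and mae are hypothetical helpers.

```python
import math
import random


def mask_values(column, missing_rate, rng):
    """Hide a fraction of entries (set to None); return the masked copy and hidden indices."""
    n_missing = round(len(column) * missing_rate)
    idx = sorted(rng.sample(range(len(column)), n_missing))
    masked = list(column)
    for i in idx:
        masked[i] = None
    return masked, idx


def mean_impute(masked):
    """Fill each missing entry with the mean of the observed entries."""
    observed = [v for v in masked if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in masked]


def rmse(truth, imputed, idx):
    """Root mean square error over the artificially hidden entries only."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))


def mae(truth, imputed, idx):
    """Mean absolute error over the artificially hidden entries only."""
    return sum(abs(truth[i] - imputed[i]) for i in idx) / len(idx)


rng = random.Random(0)
# Synthetic stand-in for one standardized cohort variable (assumption, not study data).
truth = [rng.gauss(0.0, 1.0) for _ in range(1000)]
masked, idx = mask_values(truth, 0.20, rng)
imputed = mean_impute(masked)
print(f"RMSE={rmse(truth, imputed, idx):.4f}  MAE={mae(truth, imputed, idx):.4f}")
```

Scoring only the hidden entries keeps the error from being diluted by values that were never missing. Swapping mean_impute for another imputer would let the same harness compare methods on equal terms.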
Subjects
Algorithms, Cardiovascular Diseases, Humans, Cohort Studies, Machine Learning, Support Vector Machine, Cardiovascular Diseases/diagnosis, Cardiovascular Diseases/epidemiology
ABSTRACT
BACKGROUND: This study aimed to compare the ability of selected obesity-related indicators to identify metabolic syndrome (MetS) among normal-weight adults in rural Xinjiang. METHODS: A total of 4315 subjects were recruited in rural Xinjiang, and questionnaire, biochemical, and anthropometric data were collected from them. Binary logistic regression was used to analyze the association between the z-score of each index and MetS. The area under the receiver-operating characteristic (ROC) curve (AUC) was used to compare the diagnostic ability of each index. Nomogram models were established according to the cut-off value of each index, and their diagnostic ability was evaluated. RESULTS: After adjusting for confounding factors, each indicator was associated with MetS in both males and females. Triglyceride-glucose index (TyG index) showed the strongest association with MetS in both males (OR = 3.749, 95%CI: 3.173-4.429) and females (OR = 3.521, 95%CI: 2.990-4.148). Lipid accumulation product (LAP) showed the strongest diagnostic ability in both males (AUC = 0.831, 95%CI: 0.806-0.856) and females (AUC = 0.842, 95%CI: 0.820-0.864), with optimal cut-off values of 39.700 and 35.065, respectively. The identification ability of the TyG index (males AUC: 0.817, females AUC: 0.817) was slightly weaker than that of LAP. Waist-to-height ratio (WHtR) had an AUC (males: 0.717, females: 0.747) similar to that of the conicity index (CI) (males: 0.734, females: 0.749), whereas the identification ability of a body shape index (ABSI) (males AUC: 0.700, females AUC: 0.717) was relatively weak. The nomogram models reached an AUC of 0.876 (95%CI: 0.856-0.895) in males and 0.877 (95%CI: 0.856-0.896) in females, a significant improvement in identification ability over any single indicator. CONCLUSION: LAP and TyG index are effective indicators for identifying MetS among normal-weight adults in rural Xinjiang. Nomogram models including age, CI, LAP, and TyG index can significantly improve diagnostic ability.
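For readers unfamiliar with the indicators compared above, the sketch below computes three of them from the commonly cited definitions. The abstract does not state which formulas or units the study used, so the definitions here are assumptions: LAP as (waist in cm minus 65 for men or 58 for women) times triglycerides in mmol/L, TyG as ln(triglycerides in mg/dL times fasting glucose in mg/dL, divided by 2), and WHtR as waist divided by height. The function names are hypothetical, and treating a LAP value at or above the reported cut-off as a positive screen is also an assumption.

```python
import math

# Optimal LAP cut-off values reported in the abstract, by sex.
LAP_CUTOFF = {"male": 39.700, "female": 35.065}


def lap(waist_cm, tg_mmol_l, sex):
    """Lipid accumulation product (commonly cited definition, assumed here)."""
    if sex not in LAP_CUTOFF:
        raise ValueError("sex must be 'male' or 'female'")
    offset = 65.0 if sex == "male" else 58.0
    return (waist_cm - offset) * tg_mmol_l


def tyg_index(tg_mg_dl, glucose_mg_dl):
    """Triglyceride-glucose index (commonly cited definition, assumed here)."""
    return math.log(tg_mg_dl * glucose_mg_dl / 2.0)


def whtr(waist_cm, height_cm):
    """Waist-to-height ratio; both measurements in the same unit."""
    return waist_cm / height_cm


def lap_screen_positive(waist_cm, tg_mmol_l, sex):
    """True when LAP reaches the reported cut-off (the >= direction is an assumption)."""
    return lap(waist_cm, tg_mmol_l, sex) >= LAP_CUTOFF[sex]


# Hypothetical participant, not taken from the study data.
print(lap(90.0, 1.7, "male"), lap_screen_positive(90.0, 1.7, "male"))
print(tyg_index(150.0, 100.0), whtr(90.0, 170.0))
```

Note that LAP and TyG conventionally take triglycerides in different units (mmol/L and mg/dL), so a real pipeline would need an explicit unit conversion before computing both from the same laboratory value.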