Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods.

Sadeghi, Somayeh; Khalili, Davood; Ramezankhani, Azra; Mansournia, Mohammad Ali; Parsaeian, Mahboubeh


Affiliation

Sadeghi S; Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, P.O. Box 14155-6446, Tehran, Iran.
Khalili D; Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Ramezankhani A; Department of Biostatistics and Epidemiology, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Mansournia MA; Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Parsaeian M; Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, P.O. Box 14155-6446, Tehran, Iran. mansournia_ma@yahoo.com.

BMC Med Inform Decis Mak ; 22(1): 36, 2022 02 10.

Article in English | MEDLINE | ID: mdl-35139846

ABSTRACT


BACKGROUND:

Early detection and prediction of incident type 2 diabetes mellitus from baseline measurements could reduce future complications. Because the incidence of diabetes is low relative to non-diabetes, accurately predicting the minority (diabetes) class is challenging.

METHODS:

The performance of a deep neural network (DNN), extreme gradient boosting (XGBoost), and random forest (RF) was compared in predicting the minority diabetes class in Tehran Lipid and Glucose Study (TLGS) cohort data. Threshold adjustment, cost-sensitive learning, and over- and under-sampling strategies were compared as solutions to class imbalance for their effect on the algorithms' performance.
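Two of the strategies named above, cost-sensitive weighting and under-sampling, can be illustrated with a minimal sketch. The abstract does not state how class weights were set, so the inverse-frequency ("balanced") heuristic below is an assumption. The under-sampler shown is plain random under-sampling, which is simpler than the RENN method the study reports. The function names and toy data are hypothetical.

```python
from collections import Counter
import random


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * n_c).

    Assumption: a common heuristic, not necessarily the study's weighting.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def random_under_sample(rows, labels, seed=0):
    """Drop majority-class rows at random until all classes are the same size.

    Note: this is random under-sampling, not RENN.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = min(len(members) for members in by_class.values())
    sampled = []
    for y, members in by_class.items():
        for row in rng.sample(members, target):
            sampled.append((row, y))
    rng.shuffle(sampled)
    return [r for r, _ in sampled], [y for _, y in sampled]


# Toy data (made up): 90 non-diabetes (0) and 10 diabetes (1) records.
labels = [0] * 90 + [1] * 10
rows = [[i] for i in range(100)]

weights = balanced_class_weights(labels)          # minority class gets the larger weight
X_bal, y_bal = random_under_sample(rows, labels)  # 10 records per class
```

Weighting leaves the training set unchanged and only rescales each record's contribution to the loss, whereas under-sampling discards majority-class records before training.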

RESULTS:

On the original imbalanced data, DNN had the highest accuracy in predicting diabetes (54.8%) and outperformed XGBoost and RF in terms of AUROC, g-mean, and f1-measure. Choosing the threshold that maximizes the f1-measure improved g-mean and f1-measure in all three algorithms. Repeated edited nearest neighbors (RENN) under-sampling for DNN and cost-sensitive learning for the tree-based algorithms were the best solutions to the imbalance problem. In DNN, RENN increased the ROC AUC, Precision-Recall AUC, g-mean, and f1-measure from 0.857, 0.603, 0.713, and 0.575 to 0.862, 0.608, 0.773, and 0.583, respectively. Weighting improved g-mean and f1-measure from 0.667 and 0.554 to 0.776 and 0.588 in XGBoost, and from 0.659 and 0.543 to 0.775 and 0.566 in RF. In RF, the ROC and Precision-Recall AUCs also increased from 0.840 and 0.578 to 0.846 and 0.591, respectively.
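For readers unfamiliar with the metrics, the sketch below computes g-mean (the geometric mean of sensitivity and specificity) and the f1-measure from a confusion matrix, then picks the decision threshold that maximizes f1. It is an illustration only: the scores and labels are made up, the helper names are hypothetical, and the abstract does not describe the study's actual threshold search procedure.

```python
import math


def confusion(y_true, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)


def f1(tp, fp, tn, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores):
    """Try each observed score as a threshold and keep the one with the highest f1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1(*confusion(y_true, scores, t)))


# Toy predictions (made up): 3 positives among 10 records, low predicted scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]

default_cm = confusion(y_true, scores, 0.5)
best_t = best_f1_threshold(y_true, scores)
tuned_cm = confusion(y_true, scores, best_t)
```

On imbalanced data the predicted probabilities for the minority class tend to sit below 0.5, so lowering the threshold recovers missed positives. In practice the threshold would be selected on validation data rather than on the test set.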

CONCLUSION:

Of all the metrics, g-mean showed the largest increase under every imbalance solution. Weighting and threshold adjustment are efficient strategies and are faster than resampling methods for handling class imbalance. Among the sampling strategies, under-sampling methods performed better than the others.

Subject(s)

Diabetes Mellitus; Machine Learning; Algorithms; Humans; Iran; Neural Networks, Computer

Keywords

Cost-sensitive learning; Diabetes mellitus; Imbalanced data; Machine learning; Sampling strategies


Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Diabetes Mellitus / Machine Learning | Study type: Etiology_studies / Prognostic_studies / Risk_factors_studies / Screening_studies | Limits: Humans | Country/Region as subject: Asia | Language: En | Journal: BMC Med Inform Decis Mak | Journal subject: MEDICAL INFORMATICS | Year: 2022 | Document type: Article | Affiliation country: Iran
