Performance assessment of different machine learning approaches in predicting diabetic ketoacidosis in adults with type 1 diabetes using electronic health records data.

Li, Lin; Lee, Chuang-Chung; Zhou, Fang Liz; Molony, Cliona; Doder, Zoran; Zalmover, Evgeny; Sharma, Kristen; Juhaeri, Juhaeri; Wu, Chuntao

Affiliation

Li L; Sanofi U.S. LLC, Bridgewater, New Jersey, USA.
Lee CC; Sanofi U.S. LLC, Cambridge, Massachusetts, USA.
Zhou FL; Sanofi U.S. LLC, Bridgewater, New Jersey, USA.
Molony C; Sanofi U.S. LLC, Cambridge, Massachusetts, USA.
Doder Z; Sanofi U.S. LLC, Bridgewater, New Jersey, USA.
Zalmover E; Sanofi U.S. LLC, Bridgewater, New Jersey, USA.
Sharma K; Sanofi U.S. LLC, Bridgewater, New Jersey, USA.
Juhaeri J; Sanofi U.S. LLC, Bridgewater, New Jersey, USA.
Wu C; Sanofi U.S. LLC, Bridgewater, New Jersey, USA.

Pharmacoepidemiol Drug Saf; 30(5): 610-618, May 2021.

Article in English | MEDLINE | ID: mdl-33480091

ABSTRACT

PURPOSE:

To assess the performance of different machine learning (ML) approaches in identifying risk factors for diabetic ketoacidosis (DKA) and predicting DKA.

METHODS:

This study applied flexible ML (XGBoost, distributed random forest [DRF] and feedforward network) and conventional ML approaches (logistic regression and least absolute shrinkage and selection operator [LASSO]) to 3400 DKA cases and 11 780 controls nested in adults with type 1 diabetes identified from Optum® de-identified Electronic Health Record dataset (2007-2018). Area under the curve (AUC), accuracy, sensitivity and specificity were computed using fivefold cross validation, and their 95% confidence intervals (CI) were established using 1000 bootstrap samples. The importance of predictors was compared across these models.
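The bootstrap step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not say which bootstrap variant was used, so a percentile bootstrap is assumed here, and the function names (`auc`, `bootstrap_auc_ci`) and the synthetic scores are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed variant, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic held-out predictions (hypothetical; not data from the study).
_rng = random.Random(42)
demo_labels = [1] * 30 + [0] * 90
demo_scores = [_rng.gauss(0.6, 0.2) for _ in range(30)] + \
              [_rng.gauss(0.4, 0.2) for _ in range(90)]

point_estimate = auc(demo_labels, demo_scores)
ci_lower, ci_upper = bootstrap_auc_ci(demo_labels, demo_scores, n_boot=1000)
print(f"AUC = {point_estimate:.2f}, 95% CI = ({ci_lower:.2f}, {ci_upper:.2f})")
```

In a cross-validated setting, the same resampling would typically be applied to the held-out predictions rather than to the training fit, which is one reasonable reading of the procedure described.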

RESULTS:

In the training set, XGBoost and feedforward network yielded higher AUC values (0.89 and 0.86, respectively) than logistic regression (0.83), LASSO (0.83) and DRF (0.81). However, the AUC values were similar (0.82) among these approaches in the test set (95% CI range, 0.80-0.84). While the accuracy values were >0.8 and the specificity values were >0.9 for all models, the sensitivity values were only 0.4. The differences in these metrics across these models were minimal in the test set. All approaches selected some known risk factors for DKA among their top 10 features. XGBoost and DRF included more laboratory measurements or vital signs than the conventional ML approaches, while feedforward network included more sociodemographic variables.
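Accuracy above 0.8 alongside a sensitivity of 0.4 is arithmetically consistent when controls outnumber cases, as the sketch below illustrates. The confusion-matrix counts are hypothetical, chosen only to mirror the roughly 1:3.5 case-to-control ratio of the study population; they are not results reported by the authors, and the helper name `classification_metrics` is an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # share of cases correctly flagged
        "specificity": tn / (tn + fp),   # share of controls correctly cleared
    }


# Hypothetical test fold: 680 cases and 2356 controls (illustrative only).
metrics = classification_metrics(tp=272, fn=408, tn=2238, fp=118)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because controls dominate the denominator, correctly classified controls drive accuracy, so a model can miss most cases and still appear accurate overall.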

CONCLUSIONS:

In our empirical study, all ML approaches demonstrated similar performance, and identified overlapping, but different, top 10 predictors. The difference in selected top predictors needs further research.

Subjects

Diabetes Mellitus, Type 1; Diabetic Ketoacidosis; Adult; Diabetes Mellitus, Type 1/complications; Diabetes Mellitus, Type 1/epidemiology; Diabetic Ketoacidosis/diagnosis; Diabetic Ketoacidosis/epidemiology; Diabetic Ketoacidosis/etiology; Electronic Health Records; Humans; Logistic Models; Machine Learning

Keywords

AUC; diabetic ketoacidosis; least absolute shrinkage and selection operator; logistic regression; machine learning; prediction model

Full text: 1; Collections: 01-international; Database: MEDLINE; Main subject: Diabetic Ketoacidosis / Diabetes Mellitus, Type 1; Type of study: Diagnostic_studies / Etiology_studies / Prognostic_studies / Risk_factors_studies; Limits: Adult / Humans; Language: En; Journal: Pharmacoepidemiol Drug Saf; Journal subject: EPIDEMIOLOGY / DRUG THERAPY; Year of publication: 2021; Document type: Article; Country of affiliation: United States