Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness: A comparative of solutions in a COVID-19 mortality case study.
Ferri, Pablo; Romero-Garcia, Nekane; Badenes, Rafael; Lora-Pablos, David; Morales, Teresa García; Gómez de la Cámara, Agustín; García-Gómez, Juan M; Sáez, Carlos.
Affiliation
  • Ferri P; Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain. Electronic address: pabferb2@upv.es.
  • Romero-Garcia N; Departament de Cirugia, Universitat de València, Spain.
  • Badenes R; Departament de Cirugia, Universitat de València, Spain; Instituto INCLIVA, Hospital Clínico Universitario de Valencia, Spain; Department Anesthesiology, Surgical-Trauma Intensive Care and Pain Clinic, Hospital Clínic Universitari, Valencia, Spain.
  • Lora-Pablos D; Instituto de Investigación imas12, Hospital 12 de Octubre, Madrid, Spain; Facultad de Estudios Estadísticos, Universidad Complutense de Madrid, Spain.
  • Morales TG; Instituto de Investigación imas12, Hospital 12 de Octubre, Madrid, Spain.
  • Gómez de la Cámara A; Instituto de Investigación imas12, Hospital 12 de Octubre, Madrid, Spain.
  • García-Gómez JM; Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain.
  • Sáez C; Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain.
Comput Methods Programs Biomed ; 242: 107803, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37703700
ABSTRACT
BACKGROUND AND OBJECTIVE:

Reusing Electronic Health Records (EHRs) for Machine Learning (ML) often yields extremely incomplete and sparse tabular datasets, which can hinder model development and limit performance and generalization. In this study, we aimed to characterize the most effective data imputation techniques and ML models for dealing with highly missing numerical data in EHRs when only a very small fraction of the data is complete, as opposed to the usual case in which only a small number of values are missing.

METHODS:

We used a case study including full blood count laboratory data together with demographic and survival data in the context of COVID-19 hospital admissions, and evaluated 30 processing pipelines combining imputation methods with ML classifiers. The imputation methods included missing mask, translation and encoding, mean imputation, k-nearest neighbors imputation, Bayesian ridge regression imputation, and generative adversarial imputation networks. The classifiers included k-nearest neighbors, logistic regression, random forest, gradient boosting, and deep multilayer perceptron.
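The six imputation methods and five classifiers listed above multiply out to the 30 pipelines evaluated, which suggests (though the abstract does not state it) that every imputer was paired with every classifier. A minimal sketch of that grid follows; the string identifiers and the `build_pipeline_grid` helper are illustrative names, not taken from the study.

```python
from itertools import product

# Method names come from the abstract; the identifiers are illustrative.
IMPUTERS = [
    "missing_mask",
    "translation_and_encoding",
    "mean",
    "knn",
    "bayesian_ridge",
    "gain",  # generative adversarial imputation networks
]
CLASSIFIERS = [
    "knn",
    "logistic_regression",
    "random_forest",
    "gradient_boosting",
    "deep_mlp",
]


def build_pipeline_grid(imputers, classifiers):
    """Return every (imputer, classifier) pair as a list of tuples."""
    return list(product(imputers, classifiers))


# Assumption: the study used the full Cartesian product of both lists.
PIPELINES = build_pipeline_grid(IMPUTERS, CLASSIFIERS)
print(len(PIPELINES))  # 6 imputers x 5 classifiers = 30
```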

RESULTS:

Our results suggest that, in the presence of highly missing data, combining translation and encoding imputation (which considers informative missingness) with tree ensemble classifiers (random forest and gradient boosting) is a sensible choice when aiming to maximize performance in terms of area under the curve.
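The abstract does not define translation and encoding imputation. One plausible reading, shown here purely as an assumption, is to shift (translate) each numerical feature so that all observed values lie above a reserved code, and then write that code into the missing cells (encode). Missingness then stays visible to the classifier, and a tree can isolate the missing rows with a single split below the smallest observed value. The function name, the `missing_code` and `offset` parameters, and the sample values are all hypothetical.

```python
def translate_and_encode(column, missing_code=0.0, offset=1.0):
    """Shift observed values so the smallest equals `offset`, then
    replace missing entries (None) with `missing_code`.

    Assumed interpretation of the technique, not the authors' code.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        # Nothing observed: every entry receives the missing code.
        return [missing_code for _ in column]
    shift = offset - min(observed)
    return [missing_code if v is None else v + shift for v in column]


# Made-up laboratory values with two missing measurements.
hemoglobin = [13.5, None, 9.0, None, 15.0]
encoded = translate_and_encode(hemoglobin)
print(encoded)  # [5.5, 0.0, 1.0, 0.0, 7.0]
```

In a real pipeline the shift would be computed on the training split only and reused on validation and test data, so that no information leaks across splits.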

CONCLUSIONS:

Based on our findings, we recommend considering this imputer-classifier configuration when constructing models in the presence of extremely incomplete numerical data in EHRs.

Full text: 1 | Collections: 01-international | Database: MEDLINE | Main subject: Algorithms / COVID-19 | Study type: Clinical_trials / Prognostic_studies | Limit: Humans | Language: En | Journal: Comput Methods Programs Biomed | Journal subject: Medical Informatics | Year of publication: 2023 | Document type: Article