A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative.

Casiraghi, Elena; Wong, Rachel; Hall, Margaret; Coleman, Ben; Notaro, Marco; Evans, Michael D; Tronieri, Jena S; Blau, Hannah; Laraway, Bryan; Callahan, Tiffany J; Chan, Lauren E; Bramante, Carolyn T; Buse, John B; Moffitt, Richard A; Stürmer, Til; Johnson, Steven G; Raymond Shao, Yu; Reese, Justin; Robinson, Peter N; Paccanaro, Alberto; Valentini, Giorgio; Huling, Jared D; Wilkins, Kenneth J

Casiraghi E; AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy; Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
Wong R; Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA.
Hall M; Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA.
Coleman B; The Jackson Laboratory for Genomic Medicine, Farmington, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT, USA.
Notaro M; AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy.
Evans MD; Biostatistical Design and Analysis Center, Clinical and Translational Science Institute, University of Minnesota, Minneapolis, MN, USA.
Tronieri JS; Department of Psychiatry, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA.
Blau H; The Jackson Laboratory for Genomic Medicine, Farmington, USA.
Laraway B; University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.
Callahan TJ; University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.
Chan LE; College of Public Health and Human Sciences, Oregon State University, Corvallis, USA.
Bramante CT; Division of General Internal Medicine, University of Minnesota, Minneapolis, MN, USA.
Buse JB; NC Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Division of Endocrinology, Department of Medicine, University of North Carolina School of Medicine, USA.
Moffitt RA; Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA.
Stürmer T; Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Johnson SG; Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA.
Raymond Shao Y; Harvard-MIT Division of Health Sciences and Technology (HST), 260 Longwood Ave, Boston, USA; Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, USA.
Reese J; Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
Robinson PN; The Jackson Laboratory for Genomic Medicine, Farmington, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT, USA.
Paccanaro A; School of Applied Mathematics (EMAp), Fundação Getúlio Vargas, Rio de Janeiro, Brazil; Department of Computer Science, Royal Holloway, University of London, Egham, UK.
Valentini G; AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy.
Huling JD; Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA.
Wilkins KJ; Biostatistics Program, Office of the Director, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA.

J Biomed Inform; 139: 104295, 2023 Mar.

Article in English | MEDLINE | ID: mdl-36716983

ABSTRACT

Healthcare datasets obtained from Electronic Health Records have proven extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often have missing values in a high proportion of cases, and removing those cases may introduce severe bias. Several multiple imputation algorithms have been proposed to recover the missing information under an assumed missingness mechanism. Each algorithm has strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, selecting each algorithm's parameters and making data-related modeling choices are both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. Extensive experiments show that our approach can effectively highlight the most promising and best-performing missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of the behavior of the different models and of how that behavior changed as we modified their parameters. Our method is general and can be applied to different research fields and to datasets containing heterogeneous data types.

Subject(s)

COVID-19; Humans; Algorithms; Research Design; Bias; Probability

Keywords

COVID-19 severity assessment; Clinical informatics; Diabetic patients; Evaluation framework; Multiple Imputation

Full text: 1 | Database: MEDLINE | Main subject: COVID-19 | Study type: Etiology_studies / Prognostic_studies / Risk_factors_studies | Limit: Humans | Language: En | Year: 2023 | Document type: Article
