Understanding random resampling techniques for class imbalance correction and their consequences on calibration and discrimination of clinical risk prediction models.
Piccininni, Marco; Wechsung, Maximilian; Van Calster, Ben; Rohmann, Jessica L; Konigorski, Stefan; van Smeden, Maarten.
Affiliation
  • Piccininni M; Digital Health - Machine Learning Research Group, Hasso Plattner Institute for Digital Engineering, Potsdam, Germany; Digital Engineering Faculty, University of Potsdam, Potsdam, Germany; Institute of Public Health, Charité - Universitätsmedizin Berlin, Berlin, Germany. Electronic address: marco.pic
  • Wechsung M; Department of Mathematics and Statistics, York University, Toronto, Canada.
  • Van Calster B; Department of Development and Regeneration, KU Leuven, Leuven, Belgium; Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, the Netherlands; Leuven Unit for Health Technology Assessment Research (LUHTAR), KU Leuven, Leuven, Belgium.
  • Rohmann JL; Institute of Public Health, Charité - Universitätsmedizin Berlin, Berlin, Germany; Center for Stroke Research Berlin, Charité - Universitätsmedizin Berlin, Berlin, Germany.
  • Konigorski S; Digital Health - Machine Learning Research Group, Hasso Plattner Institute for Digital Engineering, Potsdam, Germany; Digital Engineering Faculty, University of Potsdam, Potsdam, Germany; Icahn School of Medicine at Mount Sinai, Hasso Plattner Institute for Digital Health at Mount Sinai, New York, N
  • van Smeden M; Julius Center for Health Sciences and Primary Care, UMC Utrecht, Utrecht University, Utrecht, the Netherlands.
J Biomed Inform ; 155: 104666, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38848886
ABSTRACT

OBJECTIVE:

Class imbalance is sometimes considered a problem when developing clinical prediction models and assessing their performance. To address it, correction strategies that manipulate the training dataset, such as random undersampling or oversampling, are frequently used. The aim of this article is to illustrate the consequences of these class imbalance correction strategies for the internal validity of clinical prediction models, in terms of calibration and discrimination performance.

METHODS:

We used both heuristic intuition and formal mathematical reasoning to characterize the relations between the conditional probabilities of interest and the probabilities targeted when using random undersampling or oversampling. We propose a plug-in estimator that represents a natural correction for predictions obtained from models that have been trained on artificially balanced datasets ("naïve" models). We conducted a Monte Carlo simulation with two different data generation processes and present a real-world example using data from the International Stroke Trial database to empirically demonstrate the consequences of applying random resampling techniques for class imbalance correction on calibration and discrimination (measured as the area under the ROC curve, AUC) for logistic regression and tree-based prediction models.
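The abstract does not give the form of the plug-in estimator, so the following is only a minimal sketch of one standard correction of this kind, not necessarily the authors' estimator. It assumes that resampling changes only the outcome prevalence, so that the odds predicted by the naïve model can be rescaled by the ratio of the original odds of the outcome to the odds in the resampled training data. The function name `correct_probability` and its parameters are hypothetical.

```python
def correct_probability(p_balanced, prevalence, balanced_prevalence=0.5):
    """Rescale a prediction from a model trained on resampled data.

    Assumption (not taken from the abstract): resampling only shifts the
    outcome prevalence, so predicted odds differ from the odds in the
    original population by a constant factor.
    """
    for value in (prevalence, balanced_prevalence):
        if not 0.0 < value < 1.0:
            raise ValueError("prevalences must lie strictly between 0 and 1")
    # Predictions of exactly 0 or 1 have undefined or infinite odds.
    if p_balanced <= 0.0:
        return 0.0
    if p_balanced >= 1.0:
        return 1.0

    naive_odds = p_balanced / (1.0 - p_balanced)
    original_prior_odds = prevalence / (1.0 - prevalence)
    resampled_prior_odds = balanced_prevalence / (1.0 - balanced_prevalence)

    # Multiply the naive odds by the ratio of prior odds, then map back
    # to the probability scale.
    corrected_odds = naive_odds * original_prior_odds / resampled_prior_odds
    return corrected_odds / (1.0 + corrected_odds)


# Hypothetical predictions from a model trained on a 50/50 resampled
# dataset, where the outcome prevalence in the original data is 10%.
naive = [0.2, 0.5, 0.8, 0.95]
corrected = [correct_probability(p, prevalence=0.1) for p in naive]
```

Under this assumption the correction is strictly increasing in the naïve prediction, so it leaves the ranking of individuals (and therefore the AUC) unchanged while shifting the predicted risks toward the original prevalence.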

RESULTS:

Across our simulations and in the real-world example, calibration of the naïve models was very poor. The models using the plug-in estimator generally outperformed the models relying on class imbalance correction in terms of calibration while achieving the same discrimination performance.

CONCLUSION:

Random resampling techniques for class imbalance correction do not generally improve discrimination performance (i.e., AUC), and their use is hard to justify when aiming at providing calibrated predictions. Improper use of such class imbalance correction techniques can lead to suboptimal data usage and less valid risk prediction models.

Full text: 1 Database: MEDLINE Main subject: Monte Carlo Method Limit: Humans Language: En Publication year: 2024 Document type: Article
