Processing imbalanced medical data at the data level with assisted-reproduction data as an example.
Zhu, Junliang; Pu, Shaowei; He, Jiaji; Su, Dongchao; Cai, Weijie; Xu, Xueying; Liu, Hongbo.
Affiliation
  • Zhu J; Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China.
  • Pu S; Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China.
  • He J; Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China.
  • Su D; Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China.
  • Cai W; Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China.
  • Xu X; Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China.
  • Liu H; Department of Health Statistics, School of Public Health, China Medical University, Shenyang, 110122, PR China. hbliu@cmu.edu.cn.
BioData Min ; 17(1): 29, 2024 Sep 04.
Article in En | MEDLINE | ID: mdl-39232851
ABSTRACT

OBJECTIVE:

Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios.

METHODS:

We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods were applied to datasets with low positive rates and small sample sizes to assess their effectiveness: two oversampling methods, SMOTE and ADASYN, and two undersampling methods, OSS (one-sided selection) and CNN (condensed nearest neighbor).
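As a rough illustration of why the evaluation combines several metrics, the sketch below computes Accuracy, Recall, Precision, F1-Score, and G-mean from binary predictions. It is a minimal sketch, not the authors' code: the function names `confusion_counts` and `imbalance_metrics` and the toy labels are assumptions made for this example, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def imbalance_metrics(y_true, y_pred):
    """Return the label-based metrics named in the abstract (hypothetical helper)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)            # geometric mean of both class recalls
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "g_mean": g_mean}

# Toy data with a 20% positive rate (made up for illustration).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                      # predicts the majority class only
mixed = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]          # one hit, one miss, one false alarm

print(imbalance_metrics(y_true, always_negative))
print(imbalance_metrics(y_true, mixed))
```

On the toy data, a classifier that always predicts the negative class still reaches high Accuracy while its Recall and G-mean are zero, which is why Accuracy alone is a weak guide on imbalanced data.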

RESULTS:

The logistic model's performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes.
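One plausible way to build a dataset with a chosen positive rate and sample size is stratified subsampling without replacement. The abstract does not describe the authors' exact construction procedure, so the following is only a sketch under that assumption; the function name `sample_with_positive_rate` and the synthetic label pool are hypothetical.

```python
import random

def sample_with_positive_rate(labels, n_total, positive_rate, seed=0):
    """Return indices of a subsample with the requested size and positive rate.

    Assumption: positives (label 1) and negatives (label 0) are drawn
    separately, without replacement, from the full label list.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(n_total * positive_rate)
    n_neg = n_total - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough samples in one class for this design")
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)  # avoid having all positives first
    return chosen

# Synthetic label pool (made up): 500 positives and 2000 negatives.
labels = [1] * 500 + [0] * 2000

# A subsample at the cut-offs reported above: 1500 records, 15% positive.
idx = sample_with_positive_rate(labels, n_total=1500, positive_rate=0.15)
n_positive = sum(labels[i] for i in idx)
print(len(idx), n_positive)
```

Repeating the call over a grid of positive rates and sample sizes, then fitting the same model on each subsample, would give the kind of comparison the results describe.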

CONCLUSIONS:

The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
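The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified, pure-Python variant written for illustration: the function name `smote_sketch`, the random choice of base sample, and the toy points are assumptions, and it is not the implementation used in the study. In practice a maintained library such as imbalanced-learn, which provides `SMOTE` and `ADASYN` classes, would normally be preferred.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority points by SMOTE-style interpolation.

    minority: list of numeric feature tuples from the positive class.
    Simplification (assumption): the base sample is picked at random for
    each synthetic point instead of looping over every minority sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base sample (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy minority class: four corner points of the unit square (made up).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_synthetic=10, k=2)
print(len(new_points))
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling adds positives without duplicating records exactly. ADASYN follows the same interpolation idea but generates more points around minority samples that are harder to classify.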

Full text: 1 Collection: 01-international Database: MEDLINE Language: En Journal: BioData Min Year: 2024 Document type: Article Country of publication: United Kingdom
