Search | BVS Enfermagem Research Portal

Effect of machine learning re-sampling techniques for imbalanced datasets in ¹⁸F-FDG PET-based radiomics model on prognostication performance in cohorts of head and neck cancer patients.

Xie, Chenyi; Du, Richard; Ho, Joshua Wk; Pang, Herbert H; Chiu, Keith Wh; Lee, Elaine Yp; Vardhanabhuti, Varut.

Eur J Nucl Med Mol Imaging ; 47(12): 2826-2835, 2020 11.

Article in English | MEDLINE | ID: mdl-32253486

ABSTRACT

PURPOSE: Biomedical data are frequently imbalanced, which makes it challenging to achieve good predictive performance with data-driven machine learning approaches. In this study, we investigated the impact of re-sampling techniques for imbalanced datasets on PET radiomics-based prognostication models in head and neck cancer (HNC) patients. METHODS: Radiomics analysis was performed in two cohorts of patients, including 166 patients newly diagnosed with nasopharyngeal carcinoma (NPC) in our centre and 182 HNC patients from an open database. Conventional PET parameters and robust radiomics features were extracted for correlation analysis of the overall survival (OS) and disease progression-free survival (DFS). We investigated a cross-combination of 10 re-sampling methods (oversampling, undersampling, and hybrid sampling) with 4 machine learning classifiers for survival prediction. Diagnostic performance was assessed in hold-out test sets. Statistical differences were analysed using Monte Carlo cross-validations by post hoc Nemenyi analysis. RESULTS: Oversampling techniques like ADASYN and SMOTE could improve prediction performance in terms of G-mean and F-measures in the minority class, without significant loss of F-measures in the majority class. We identified an optimal PET radiomics-based prediction model of OS (AUC of 0.82, G-mean of 0.77) for our NPC cohort. Similar findings that oversampling techniques improved the prediction performance were seen when this was tested on an external dataset, indicating generalisability. CONCLUSION: Our study showed a significant positive impact on the prediction performance in imbalanced datasets by applying re-sampling techniques. We have created an open-source solution for automated calculations and comparisons of multiple re-sampling techniques and machine learning classifiers for easy replication in future studies.

Subjects

Fluorodeoxyglucose F18, Head and Neck Neoplasms, Cohort Studies, Head and Neck Neoplasms/diagnostic imaging, Humans, Machine Learning, Progression-Free Survival
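The abstract above reports G-mean and names SMOTE among the oversampling techniques. As a rough illustration only, the sketch below computes the G-mean (geometric mean of sensitivity and specificity) for binary labels and generates synthetic minority samples by SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. The function names `g_mean` and `smote_like` are hypothetical, the interpolation is a simplified stand-in rather than the authors' open-source tool or a library implementation, and it assumes at least two minority samples.

```python
import math
import random


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def smote_like(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling (illustrative assumption).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (x for x in X_min if x is not base),
            key=lambda x: math.dist(base, x),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


if __name__ == "__main__":
    # A classifier that always predicts the majority class scores G-mean 0,
    # which is why G-mean is informative on imbalanced data.
    print(g_mean([1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]))
    minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
    print(smote_like(minority, n_new=3))
```

Accuracy alone can look high when the majority class dominates; G-mean drops to zero as soon as one class is never predicted correctly, which is one reason it suits the imbalanced setting described here.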

Imbalanced Data Correction Based PET/CT Radiomics Model for Predicting Lymph Node Metastasis in Clinical Stage T1 Lung Adenocarcinoma.

Lv, Jieqin; Chen, Xiaohui; Liu, Xinran; Du, Dongyang; Lv, Wenbing; Lu, Lijun; Wu, Hubing.

Front Oncol ; 12: 788968, 2022.

Article in English | MEDLINE | ID: mdl-35155231

ABSTRACT

OBJECTIVES: To develop and validate an imbalanced data correction based PET/CT radiomics model for predicting lymph node metastasis (LNM) in clinical stage T1 lung adenocarcinoma (LUAD). METHODS: A total of 183 patients (148/35 non-metastasis/LNM) with pathologically confirmed LUAD were retrospectively included. The cohort was divided into training and validation cohorts in a ratio of 7:3. A total of 487 radiomics features were extracted from PET and CT components separately for radiomics model construction. Four clinical features and seven PET/CT radiological features were extracted for traditional model construction. To balance the distribution of the majority (non-metastasis) class and the minority (LNM) class, imbalance-adjustment strategies using ten data re-sampling methods were adopted. Three multivariate models (denoted as Traditional, Radiomics, and Combined) were constructed using multivariable logistic regression analysis, where the combined model incorporated all of the significant clinical, radiological, and radiomics features. One hundred times repeated Monte Carlo cross-validation was used to assess the application order of feature selection and imbalance-adjustment strategies in the machine learning pipeline. Prediction performance of each model was evaluated using the area under the receiver operating characteristic curve (AUC) and Geometric mean score (G-mean). RESULTS: A total of 2 clinical parameters, 2 radiological features, 3 PET, and 5 CT radiomics features were significantly associated with LNM. The combined model with the Edited Nearest Neighbors (ENN) re-sampling method showed stronger prediction performance than the traditional model or the radiomics model, with an AUC of 0.94 (95%CI = 0.86-0.97) vs. 0.89 (95%CI = 0.79-0.93), 0.92 (95%CI = 0.85-0.97), and G-mean of 0.88 vs. 0.82, 0.80 in the training cohort, and an AUC of 0.75 (95%CI = 0.57-0.91) vs. 0.68 (95%CI = 0.36-0.83), 0.71 (95%CI = 0.48-0.83) and G-mean of 0.76 vs. 0.64, 0.51 in the validation cohort. Performing feature selection before data re-sampling gave a better result than the reverse order (AUC 0.76 ± 0.06 vs. 0.70 ± 0.07, p<0.001). CONCLUSIONS: The combined model (consisting of age, histological type, C/T ratio, MATV, and radiomics signature) integrated with the ENN re-sampling method had strong lymph node metastasis prediction performance for imbalanced cohorts in clinical stage T1 LUAD. Radiomics signatures extracted from PET/CT images could provide prediction information complementary to the traditional model.
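To make the Edited Nearest Neighbors (ENN) idea mentioned above more concrete, here is a minimal sketch under stated assumptions: it drops a majority-class sample when the majority vote of its k nearest neighbours (Euclidean distance) disagrees with its own label, and it leaves minority samples untouched. The function name `edited_nearest_neighbours`, the vote rule, and the toy data are illustrative choices, not the configuration used in the study. In line with the abstract's finding on ordering, such a step would be applied to the training data after feature selection.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, majority_label, k=3):
    """Simplified ENN undersampling (illustrative assumption).

    A majority-class sample is removed when most of its k nearest
    neighbours carry a different label; minority samples are always kept.
    Use an odd k with two classes to avoid tied votes.
    """
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == majority_label:
            others = [j for j in range(len(X)) if j != i]
            nearest = sorted(others, key=lambda j: math.dist(xi, X[j]))[:k]
            votes = Counter(y[j] for j in nearest)
            if votes.most_common(1)[0][0] != yi:
                continue  # neighbours disagree: treat as noisy and drop
        keep_X.append(xi)
        keep_y.append(yi)
    return keep_X, keep_y


if __name__ == "__main__":
    # Toy 1-D data: one majority point (5.05) sits inside the minority cluster.
    X = [[0.0], [0.1], [0.2], [0.3], [5.05], [5.0], [5.1], [5.2]]
    y = [0, 0, 0, 0, 0, 1, 1, 1]
    X_res, y_res = edited_nearest_neighbours(X, y, majority_label=0, k=3)
    print(Counter(y_res))
```

Unlike random undersampling, this kind of cleaning removes only majority samples that overlap the minority region, so the class ratio usually moves toward balance without necessarily reaching it.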

Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.

Varotto, Giulia; Susi, Gianluca; Tassi, Laura; Gozzo, Francesca; Franceschetti, Silvana; Panzica, Ferruccio.

Front Neuroinform ; 15: 715421, 2021.

Article in English | MEDLINE | ID: mdl-34867255

ABSTRACT

Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery. Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered. Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method. 
Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
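Random Undersampling (RUS) and Balanced Accuracy, both named in the abstract above, are simple enough to sketch directly. The code below is a minimal illustration under stated assumptions: `random_undersample` and `balanced_accuracy` are hypothetical helper names, every class is reduced to the size of the smallest class, and the data are toy values rather than SEEG features.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    X = [[i] for i in range(10)]
    y = [0] * 8 + [1] * 2  # 8 majority vs. 2 minority samples
    X_res, y_res = random_undersample(X, y)
    print(Counter(y_res))
    print(balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Resampling of this kind is normally applied to the training split only, with metrics such as balanced accuracy computed on untouched test data.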
