Hybrid Class Balancing Approach for Chemical Compound Toxicity Prediction.
Santiago-Gonzalez, Felipe; Martinez-Rodriguez, Jose L; García-Perez, Carlos; Juárez-Saldivar, Alfredo; Camacho-Cruz, Hugo E.
Affiliation
  • Santiago-Gonzalez F; Multidisciplinary Academic Unit Reynosa-Rodhe, Autonomous University of Tamaulipas, Mexico.
  • Martinez-Rodriguez JL; Faculty of Engineering and Science, Autonomous University of Tamaulipas, Mexico.
  • García-Perez C; Digitalization and Transformation Department, Helmholtz Zentrum München, Ingolstädter Landstrasse 1, Germany.
  • Juárez-Saldivar A; Multidisciplinary Academic Unit Reynosa-Aztlan, Autonomous University of Tamaulipas, Mexico.
  • Camacho-Cruz HE; FMM- Autonomous University of Tamaulipas, Sendero Nacional KM. 3.H. Matamoros, Mexico.
Article in En | MEDLINE | ID: mdl-39318212
ABSTRACT

INTRODUCTION:

Computational methods are crucial for efficient and cost-effective drug toxicity prediction. Unfortunately, the data used for prediction are often imbalanced, resulting in biased models that favor the majority class. This paper proposes a hybrid class balancing approach and evaluates its effect on the performance of computational models for toxicity prediction on Tox21 datasets.

METHODS:

The process begins by converting chemical compound structures (SMILES strings) from various bioassay datasets into molecular descriptors that can be processed by algorithms. Subsequently, undersampling and oversampling techniques are applied to the training data under two different schemes. In the first scheme (Individual), only one balancing technique (oversampling or undersampling) is used. In the second scheme (Hybrid), the training data are divided according to a ratio (e.g., 90-10), and a different balancing technique is applied to each portion. We considered eight resampling techniques (four oversampling and four undersampling), six molecular descriptors (based on MACCS, ECFP, and Mordred), and five classification models (KNN, MLP, RF, XGB, and SVM) over 10 bioassay datasets to determine the configurations that yield the best performance.
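
The Hybrid scheme can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation, and several details are assumptions: the function names (`random_undersample`, `random_oversample`, `hybrid_balance`) are hypothetical, the split is stratified by class, and plain random oversampling stands in for SMOTE, which generates synthetic samples instead of duplicating existing ones. Following the 90-10 example, random undersampling (RUS) is applied to the 90% portion and oversampling to the 10% portion.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature rows under their class label."""
    groups = {}
    for features, label in zip(X, y):
        groups.setdefault(label, []).append(features)
    return groups


def random_undersample(X, y, rng):
    """Randomly drop rows until every class matches the smallest class."""
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for features in rng.sample(rows, target):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def random_oversample(X, y, rng):
    """Randomly duplicate rows until every class matches the largest class.

    Assumption: this is a stand-in for SMOTE, not SMOTE itself.
    """
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def hybrid_balance(X, y, ratio=0.9, seed=0):
    """Split the training data by `ratio`, undersample the first portion,
    oversample the second, and concatenate the results."""
    rng = random.Random(seed)
    first_X, first_y, second_X, second_y = [], [], [], []
    # Assumption: the split is stratified so both portions contain every class.
    for label, rows in _group_by_class(X, y).items():
        rows = rows[:]
        rng.shuffle(rows)
        cut = round(len(rows) * ratio)
        first_X += rows[:cut]
        first_y += [label] * cut
        second_X += rows[cut:]
        second_y += [label] * (len(rows) - cut)
    under_X, under_y = random_undersample(first_X, first_y, rng)
    over_X, over_y = random_oversample(second_X, second_y, rng)
    return under_X + over_X, under_y + over_y


# Toy training set: 100 majority rows (label 0) and 20 minority rows (label 1).
X = [(float(i), 0.0) for i in range(100)] + [(float(i), 1.0) for i in range(20)]
y = [0] * 100 + [1] * 20
X_bal, y_bal = hybrid_balance(X, y, ratio=0.9, seed=0)
print(Counter(y_bal))
```

In this sketch each portion is balanced independently, so the combined training set is balanced as well. Only the training data is resampled; the test data keeps its original class distribution.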

RESULTS:

We defined three testing scenarios: without balancing techniques (baseline), Individual, and Hybrid. We found that using the ENN technique in the MACCS-MLP combination resulted in a 10.01% improvement in performance. The increase for ECFP6-2048 was 16.47% after incorporating a combination of the SMOTE (10%) and RUS (90%) techniques. Meanwhile, using the same combination of techniques, MORDRED-XGB showed the most significant increase in performance, achieving a 22.62% improvement.

CONCLUSION:

Integrating any of the class balancing schemes resulted in a minimum of 10.01% improvement in prediction performance compared to the best baseline configuration. In this study, Undersampling techniques were more appropriate due to the significant overlap among samples. By eliminating specific samples from the predominant class that are close to the minority class, this overlap is greatly reduced.
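
The overlap-reduction idea described above can be sketched with a simplified Edited Nearest Neighbours (ENN) style filter. This is an illustrative sketch under stated assumptions, not the configuration used in the study: the function name `enn_clean_majority` is hypothetical, distances are Euclidean, `k=3` is an arbitrary choice, and only majority-class samples are candidates for removal. A majority sample is dropped when most of its `k` nearest neighbours belong to a different class.

```python
import math
from collections import Counter


def enn_clean_majority(X, y, majority_label, k=3):
    """Remove majority-class rows whose k nearest neighbours mostly disagree
    with them; minority-class rows are always kept."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == majority_label:
            # Distances to every other row, nearest first.
            neighbours = sorted(
                (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
            )[:k]
            votes = Counter(y[j] for _, j in neighbours)
            if votes.most_common(1)[0][0] != yi:
                continue  # this row sits inside the other class's region
        keep_X.append(xi)
        keep_y.append(yi)
    return keep_X, keep_y


# Toy data: a majority cluster near the origin, a minority cluster near (5, 5),
# and one majority point, (5.1, 5.0), that overlaps the minority cluster.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.1, 5.0),
     (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_clean, y_clean = enn_clean_majority(X, y, majority_label=0, k=3)
print(Counter(y_clean))
```

The overlapping majority point is removed while both clusters are otherwise left intact, which is the effect the conclusion attributes to undersampling near the class boundary.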

Full text: 1 Database: MEDLINE Language: En Publication year: 2024 Document type: Article