ABSTRACT
The prediction of compound cytotoxicity is an important part of the drug discovery process. However, predictive models often perform poorly because the datasets come from high-throughput screening and are class-imbalanced. In this study, several strategies for building structure-activity relationship models of a cytotoxicity endpoint in the AID364 dataset were explored to address the class-imbalance problem. Random forest and AdaBoost were used as the base learners with 10 types of molecular fingerprints, and an ensemble method and six data-balancing methods were applied to balance the classes. The ensemble model using the MACCS fingerprint performed best, giving an area under the curve of 85.2% ± 0.35%, a sensitivity of 81.8% ± 0.65%, and a specificity of 76.0% ± 0.12% in fivefold cross-validation, and an area under the curve of 78.8%, a sensitivity of 55.5%, and a specificity of 78.5% in external validation. The approach also performed well on other datasets of different sizes and degrees of imbalance. To explore the structural commonality of cytotoxic compounds, several substructures were identified that can serve as an important reference for substructure alerts. These results indicate that the proposed models are helpful in predicting the cytotoxicity of chemicals.
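For readers unfamiliar with the terms, the following minimal Python sketch illustrates one common data-balancing strategy (random undersampling of the majority class) and how sensitivity and specificity are computed from binary predictions. It is an illustration only: the abstract does not name the six balancing methods used, so the choice of undersampling is an assumption, and the function names `random_undersample` and `sensitivity_specificity` are hypothetical and not taken from the study.

```python
import random


def random_undersample(X, y, seed=0):
    """Balance a binary dataset by randomly discarding majority-class rows.

    Assumption: labels are 1 (cytotoxic) and 0 (non-cytotoxic). This is one
    generic balancing method, not necessarily one used in the study.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    # Identify which class is smaller; keep all of it.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Draw an equally sized random subset of the larger class.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Sensitivity is the fraction of truly cytotoxic compounds that are predicted as cytotoxic, and specificity is the fraction of non-cytotoxic compounds predicted as non-cytotoxic. Reporting both, rather than accuracy alone, matters on imbalanced data, where a model that always predicts the majority class can still score a high accuracy.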