Exploring different strategies for imbalanced ADME data problem: case study on Caco-2 permeability modeling.

Pham-The, Hai; Casañola-Martin, Gerardo; Garrigues, Teresa; Bermejo, Marival; González-Álvarez, Isabel; Nguyen-Hai, Nam; Cabrera-Pérez, Miguel Ángel; Le-Thi-Thu, Huong

Affiliation

Pham-The H; Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hanoi, Vietnam.
Casañola-Martin G; Departament de Bioquímica i Biologia Molecular, Universitat de València, Burjassot, 46100, Valencia, Spain.
Garrigues T; Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Valencia, Spain.
Bermejo M; Facultad de Ingeniería Ambiental, Universidad Estatal Amazónica, Paso lateral km 2 1/2 via Napo, Puyo, Ecuador.
González-Álvarez I; Department of Pharmacy and Pharmaceutical Technology, University of Valencia, Burjassot, 46100, Valencia, Spain.
Nguyen-Hai N; Department of Engineering, Area of Pharmacy and Pharmaceutical Technology, Miguel Hernández University, 03550 Sant Joan d'Alacant, Alicante, Spain.
Cabrera-Pérez MÁ; Department of Engineering, Area of Pharmacy and Pharmaceutical Technology, Miguel Hernández University, 03550 Sant Joan d'Alacant, Alicante, Spain.
Le-Thi-Thu H; Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hanoi, Vietnam.

Mol Divers ; 20(1): 93-109, 2016 Feb.

Article in English | MEDLINE | ID: mdl-26643659

ABSTRACT

In many absorption, distribution, metabolism, and excretion (ADME) modeling problems, imbalanced data can degrade the classification performance of machine learning algorithms. Solutions for handling imbalanced datasets have been proposed, but their application to ADME modeling tasks is underexplored. In this paper, various strategies, including cost-sensitive learning and resampling methods, were studied to tackle the moderate imbalance of a large Caco-2 cell permeability database. Simple physicochemical molecular descriptors were used for data modeling. Support vector machine classifiers were constructed and compared using multiple comparison tests. Results showed that the models built with resampling strategies performed better than the cost-sensitive classification models, especially with oversampling, where the misclassification rates for the minority class were 0.11 and 0.14 for the training and test sets, respectively. A consensus model with an enhanced applicability domain was subsequently constructed and showed improved performance. This model was used to predict a set of randomly selected high-permeability reference drugs according to the biopharmaceutics classification system. Overall, this study compares numerous rebalancing strategies and demonstrates the effectiveness of oversampling methods for imbalanced permeability data problems.
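The two families of strategies named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' protocol: the abstract does not say which oversampling variant or cost scheme was used, so plain random duplication and inverse-frequency class weights are assumptions here, and the function names and toy data are hypothetical. The last helper computes the minority-class misclassification rate, the metric the abstract reports.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many samples as the largest one (assumed resampling variant)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def balanced_class_weights(y):
    """Inverse-frequency weights, n_samples / (n_classes * class_count),
    one common way to set misclassification costs (assumed cost scheme)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


def minority_error_rate(y_true, y_pred, minority):
    """Fraction of true minority-class samples predicted as another class."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == minority]
    if not pairs:
        raise ValueError("no minority-class samples in y_true")
    return sum(t != p for t, p in pairs) / len(pairs)


# Hypothetical toy data: 6 majority samples (label 0), 2 minority (label 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y, seed=42)  # resampling route
weights = balanced_class_weights(y)              # cost-sensitive route
```

In a resampling workflow the classifier would be trained on `X_bal, y_bal`; in a cost-sensitive workflow it would be trained on the original data with `weights` scaling the penalty for each class. Either way, resampling should be applied to the training split only, so that the test set keeps its original class distribution.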

Subject(s)

Models, Biological; Caco-2 Cells; Databases, Factual; Humans; Machine Learning; Permeability; Support Vector Machine

Keywords

ADME modeling; Biopharmaceutics classification system; Caco-2 cell permeability; Cost-sensitive learning; Resampling technique; Support vector machine

Full text: 1 | Databases: MEDLINE | Main subject: Models, Biological | Type of study: Prognostic studies | Limits: Humans | Language: English | Journal: Mol Divers | Journal subject: Molecular Biology | Year: 2016 | Document type: Article | Country of affiliation: Vietnam
