ESPDHot: An Effective Machine Learning-Based Approach for Predicting Protein-DNA Interaction Hotspots.

Tao, Lianci; Zhou, Tong; Wu, Zhixiang; Hu, Fangrui; Yang, Shuang; Kong, Xiaotian; Li, Chunhua

Affiliation

Tao L; College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China.
Zhou T; College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China.
Wu Z; College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China.
Hu F; College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China.
Yang S; College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China.
Kong X; College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China.
Li C; College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China.

J Chem Inf Model; 64(8): 3548-3557, 2024 Apr 22.

Article in English | MEDLINE | ID: mdl-38587997

ABSTRACT

Protein-DNA interactions are pivotal to many cellular processes. Precisely identifying the hotspot residues in these interactions is important for revealing the mechanisms of protein-DNA recognition and for guiding protein engineering. This work introduces ESPDHot, a method for predicting protein-DNA interaction hotspots based on a stacked ensemble machine learning framework. Here, an interface residue whose mutation leads to a binding free energy change (ΔΔG) exceeding 2 kcal/mol is defined as a hotspot. To address the imbalanced data set, adaptive synthetic sampling (ADASYN), an oversampling technique, is adopted to generate synthetic minority-class samples. As for molecular characteristics, besides traditional features, we introduce three new feature types: a residue interface preference that we propose, residue fluctuation dynamics characteristics, and coevolutionary features. Combining the Boruta method with our previously developed Random Grouping strategy, we obtain an optimal feature set. Finally, a stacking classifier is constructed to output the predictions; it integrates three classical predictors, Support Vector Machine (SVM), XGBoost, and Artificial Neural Network (ANN), as the first layer and a Logistic Regression (LR) algorithm as the second. Notably, ESPDHot outperforms the current state-of-the-art predictors on the independent test data set, with F1, MCC, and AUC reaching 0.571, 0.516, and 0.870, respectively.
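
The hotspot definition and two of the reported metrics can be illustrated with a short sketch. It is not the authors' code: the function names, the example ΔΔG values, and the example predictions are all hypothetical. It reads "exceeding 2 kcal/mol" as a strict inequality, which is an assumption about the abstract's wording. F1 and MCC are computed from the confusion matrix with their standard definitions; both are commonly reported for imbalanced data because they are not inflated by a large negative class.

```python
from math import sqrt

# Threshold taken from the abstract; treating "exceeding" as strictly
# greater than is an assumption of this sketch.
HOTSPOT_THRESHOLD = 2.0  # kcal/mol


def label_hotspots(ddg_values, threshold=HOTSPOT_THRESHOLD):
    """Return 1 (hotspot) where ddG > threshold, otherwise 0."""
    return [1 if ddg > threshold else 0 for ddg in ddg_values]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when the denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical ddG values (kcal/mol) for eight interface residues.
    ddg = [0.3, 2.5, 1.1, 3.2, 0.8, 2.0, 0.1, 1.9]
    y_true = label_hotspots(ddg)
    # Hypothetical classifier output for the same residues.
    y_pred = [0, 1, 0, 0, 0, 1, 0, 0]
    print("labels:", y_true)
    print("F1:", f1_score(y_true, y_pred))
    print("MCC:", mcc(y_true, y_pred))
```

In this toy example only two of the eight residues are hotspots, which mirrors the class imbalance that the abstract addresses with ADASYN oversampling.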

Subject(s)

DNA; Machine Learning; DNA/chemistry; DNA/metabolism; Protein Binding; Neural Networks, Computer; Proteins/chemistry; Proteins/metabolism; Thermodynamics; DNA-Binding Proteins/metabolism; DNA-Binding Proteins/chemistry; Support Vector Machine; Algorithms

Full text: 1 | Databases: MEDLINE | Main subject: DNA / Machine Learning | Language: English | Journal: J Chem Inf Model | Journal subject: MEDICAL INFORMATICS / CHEMISTRY | Year: 2024 | Document type: Article | Country of affiliation: China