DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines.

Zhu, Yi-Heng; Hu, Jun; Song, Xiao-Ning; Yu, Dong-Jun

Affiliation

Zhu YH; School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China.
Hu J; College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, P. R. China.
Song XN; School of Internet of Things, Jiangnan University, 1800 Lihu Road, Wuxi 214122, P. R. China.
Yu DJ; School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China.

J Chem Inf Model; 59(6): 3057-3071, 2019 06 24.

Article in En | MEDLINE | ID: mdl-30943723

ABSTRACT

Accurate identification of protein-DNA binding sites is important for both understanding protein function and designing drugs. Machine-learning-based methods have been extensively used for the prediction of protein-DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously limits the performance of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein-DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein-DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use.
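The idea of selecting samples by their distance to a separating hyperplane can be illustrated with a minimal sketch. This is not the authors' HD-US procedure: the abstract does not state which samples are kept or how the iterations proceed, so the rule below (keep the k negative samples closest to a linear hyperplane) is an assumption made purely for illustration. The function names, the hyperplane (w, b), and the toy data are all hypothetical, and the hyperplane is assumed to come from a previously trained linear SVM.

```python
import math

def hyperplane_distance(x, w, b):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def undersample_by_distance(negatives, w, b, k):
    """Keep the k negatives closest to the hyperplane.

    Assumed selection rule for illustration only; the source abstract
    does not specify how HD-US chooses samples.
    """
    ranked = sorted(negatives, key=lambda x: hyperplane_distance(x, w, b))
    return ranked[:k]

# Hypothetical hyperplane x0 + x1 - 1 = 0 and toy negative-class samples.
w, b = (1.0, 1.0), -1.0
negatives = [(0.0, 0.0), (0.4, 0.5), (-2.0, -2.0), (0.5, 0.3), (-1.0, 0.0)]

# Reduced negative subset that could be paired with the positive samples
# to train the next SVM in an ensemble.
subset = undersample_by_distance(negatives, w, b, k=2)
print(subset)
```

In this toy case the two retained negatives are the ones lying nearest the decision boundary, which is one plausible way a distance criterion could differ from random under-sampling.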
Computational experiments showed that our predictor achieved an average overall accuracy of 91.7% and a Matthews correlation coefficient of 0.395 on five benchmark data sets, outperforming several state-of-the-art sequence-based protein-DNA binding site predictors.
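A high accuracy alongside a moderate Matthews correlation coefficient (MCC) is typical of imbalanced data, because accuracy is dominated by the majority class. The short sketch below computes both metrics from a confusion matrix using the standard MCC formula. The counts are invented for illustration and are not taken from the paper.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced test set: 100 binding and 900 nonbinding residues.
tp, fn = 40, 60    # binding residues found / missed
fp, tn = 30, 870   # nonbinding residues misflagged / correctly rejected

acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy = {acc:.3f}, MCC = {mcc:.3f}")
```

With these made-up counts the accuracy is 0.91 while the MCC is only about 0.43, even though 60% of the binding residues are missed. This is why MCC is usually reported together with accuracy for binding-site prediction.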

Full text: 1
Collection: 01-internacional
Database: MEDLINE
Main subject: DNA / Models, Molecular / DNA-Binding Proteins / Support Vector Machine
Type of study: Diagnostic studies / Prognostic studies
Language: English
Journal: J Chem Inf Model
Journal subject: Medical Informatics / Chemistry
Year: 2019
Document type: Article
