StackDPP: a stacking ensemble based DNA-binding protein prediction model.

Ahmed, Sheikh Hasib; Bose, Dibyendu Brinto; Khandoker, Rafi; Rahman, M Saifur

Affiliation

Ahmed SH; Department of CSE, BUET, ECE Building, West Palashi, Dhaka, 1000, Bangladesh.
Bose DB; Department of CSE, BUET, ECE Building, West Palashi, Dhaka, 1000, Bangladesh.
Khandoker R; Department of CSE, BUET, ECE Building, West Palashi, Dhaka, 1000, Bangladesh.
Rahman MS; Department of CSE, BUET, ECE Building, West Palashi, Dhaka, 1000, Bangladesh. mrahman@cse.buet.ac.bd.

BMC Bioinformatics; 25(1): 111, 2024 Mar 14.

Article in English | MEDLINE | ID: mdl-38486135

ABSTRACT

BACKGROUND:

DNA-binding proteins (DNA-BPs) are proteins that bind to and interact with DNA. DNA-BPs regulate and affect numerous biological processes, such as transcription, DNA replication, DNA repair, and the organization of chromosomal DNA. Very few proteins, however, are DNA-binding in nature. Therefore, it is necessary to develop an efficient predictor for identifying DNA-BPs.

RESULT:

In this work, we propose new benchmark datasets for the DNA-binding protein prediction problem. We discovered several quality concerns with the widely used benchmark datasets, PDB1075 (for training) and PDB186 (for independent testing), which necessitated the preparation of new benchmark datasets. Our proposed datasets, UNIPROT1424 and UNIPROT356, can be used for model training and independent testing, respectively. We retrained selected state-of-the-art DNA-BP predictors on the new dataset and reported their performance results. We also trained a novel predictor using the new benchmark dataset. We extracted features from various feature categories, then used a Random Forest classifier and Recursive Feature Elimination with Cross-validation (RFECV) to select the optimal set of 452 features. We then proposed a stacking ensemble architecture as our final prediction model. Named Stacking Ensemble Model for DNA-binding Protein Prediction, or StackDPP for short, our model achieved accuracies of 0.92, 0.92 and 0.93 in 10-fold cross-validation, jackknife and independent testing, respectively.
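The two-stage idea described above (Random Forest driven RFECV for feature selection, followed by a stacking ensemble) can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins rather than UNIPROT1424/UNIPROT356, and the base learners, meta-learner, fold counts and elimination step are all hypothetical choices made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a protein feature matrix (assumption: NOT the paper's data).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=6, n_redundant=4, random_state=0
)

# Hold out an "independent test set", mirroring the training / independent-testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stage 1: recursive feature elimination with cross-validation,
# ranking features by Random Forest importances (fitted on training data only).
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=5,                       # drop 5 features per elimination round
    min_features_to_select=5,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Stage 2: stacking ensemble. Base learners and meta-learner are illustrative guesses.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base predictions are used to train the meta-learner
)
stack.fit(X_train_sel, y_train)
test_accuracy = stack.score(X_test_sel, y_test)

print("selected features:", selector.n_features_)
print("independent-test accuracy:", round(test_accuracy, 3))
```

Fitting the selector on the training split only keeps the held-out set untouched by feature selection. A jackknife-style evaluation would correspond to using scikit-learn's `LeaveOneOut` splitter with `cross_val_score` in place of a k-fold splitter, at a much higher computational cost.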

CONCLUSION:

StackDPP performed very well in cross-validation testing and outperformed all the state-of-the-art prediction models in independent testing. Its cross-validation performance scores generalized very well to the independent test set. The source code of the model is publicly available at https://github.com/HasibAhmed1624/StackDPP. Therefore, we expect that this generalized model can be adopted by researchers and practitioners to identify novel DNA-binding proteins.

Subjects

Algorithms; DNA-Binding Proteins; DNA-Binding Proteins/metabolism; Software; DNA/metabolism

Keywords

Classification; DNA-binding protein; Data imbalance; Recursive feature elimination; Sequence identity

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Algorithms / DNA-Binding Proteins Language: En Journal: BMC Bioinformatics Journal subject: MEDICAL INFORMATICS Year of publication: 2024 Document type: Article Country of affiliation: Bangladesh