DPI_CDF: druggable protein identifier using cascade deep forest.

Arif, Muhammad; Fang, Ge; Ghulam, Ali; Musleh, Saleh; Alam, Tanvir

Arif M; College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.
Fang G; State Key Laboratory for Organic Electronics and Information Displays, Institute of Advanced Materials (IAM), Nanjing 210023, P. R. China.
Ghulam A; Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.
Musleh S; Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan.
Alam T; College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.

BMC Bioinformatics ; 25(1): 145, 2024 Apr 05.

Article in English | MEDLINE | ID: mdl-38580921

ABSTRACT

BACKGROUND:

Drug targets play a pivotal role in the discovery of potential drugs. Conventional wet-lab characterization of drug targets is accurate but generally expensive, slow, and resource intensive. Computational methods are therefore highly desirable for expediting the large-scale identification of druggable proteins (DPs); however, the performance of existing in silico predictors is still unsatisfactory.

METHODS:

In this study, we developed DPI_CDF, a novel deep learning-based model for predicting DPs from protein sequence alone. DPI_CDF generates features from evolutionary (i.e., histograms of oriented gradients for position-specific scoring matrix), physicochemical (i.e., component protein sequence representation), and compositional (i.e., normalized qualitative characteristic) properties of the protein sequence. A hierarchical deep forest model then fuses these three encoding schemes to build DPI_CDF.
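The fusion-then-cascade idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. The three feature blocks are random toy numbers standing in for the evolutionary, physicochemical, and compositional encodings, `StumpForest` is a hypothetical stand-in for the forests of a real deep forest, and all names and parameters are assumptions made for illustration. The sketch also omits the cross-validated class vectors that deep forest implementations typically use to limit overfitting.

```python
import random

def fuse(*blocks):
    # Concatenate the per-protein vectors of each encoding into one vector.
    return [[v for block in blocks for v in block[i]]
            for i in range(len(blocks[0]))]

class StumpForest:
    """Toy stand-in for a random forest: bagged one-feature threshold stumps."""
    def __init__(self, n_stumps=25, seed=0):
        self.n_stumps, self.rng, self.stumps = n_stumps, random.Random(seed), []

    def fit(self, X, y):
        n = len(X)
        self.stumps = []
        for _ in range(self.n_stumps):
            idx = [self.rng.randrange(n) for _ in range(n)]  # bootstrap sample
            f = self.rng.randrange(len(X[0]))                # random feature
            thr = sum(X[i][f] for i in idx) / n              # bootstrap mean
            left = [y[i] for i in idx if X[i][f] <= thr]
            right = [y[i] for i in idx if X[i][f] > thr]
            prior = sum(y[i] for i in idx) / n
            p_left = sum(left) / len(left) if left else prior
            p_right = sum(right) / len(right) if right else prior
            self.stumps.append((f, thr, p_left, p_right))
        return self

    def predict_proba(self, X):
        # Mean over stumps of the estimated P(class = 1).
        return [sum(pl if x[f] <= t else pr for f, t, pl, pr in self.stumps)
                / len(self.stumps) for x in X]

class Cascade:
    """Each level sees the fused features plus the previous level's outputs."""
    def __init__(self, n_levels=2, forests_per_level=2):
        self.n_levels, self.forests_per_level = n_levels, forests_per_level
        self.levels = []

    def fit(self, X, y):
        self.levels, feats = [], X
        for level in range(self.n_levels):
            forests = [StumpForest(seed=10 * level + k).fit(feats, y)
                       for k in range(self.forests_per_level)]
            self.levels.append(forests)
            probs = [f.predict_proba(feats) for f in forests]
            feats = [list(x) + [p[i] for p in probs] for i, x in enumerate(X)]
        return self

    def predict_proba(self, X):
        feats, out = X, []
        for forests in self.levels:
            probs = [f.predict_proba(feats) for f in forests]
            out = [sum(p[i] for p in probs) / len(probs) for i in range(len(X))]
            feats = [list(x) + [p[i] for p in probs] for i, x in enumerate(X)]
        return out  # averaged output of the last level

# Toy data: three hypothetical feature blocks per protein, 1 = druggable.
rng = random.Random(1)
def toy_block(labels, dim, shift):
    return [[rng.gauss(shift * lab, 1.0) for _ in range(dim)] for lab in labels]

labels = [i % 2 for i in range(200)]
X = fuse(toy_block(labels, 4, 3.0), toy_block(labels, 3, 3.0), toy_block(labels, 2, 3.0))
model = Cascade().fit(X, labels)
pred = [int(p >= 0.5) for p in model.predict_proba(X)]
accuracy = sum(p == t for p, t in zip(pred, labels)) / len(labels)
```

In this sketch, fusion is plain concatenation, and each cascade level appends its forests' class probabilities to the fused vector before passing it on. Whether DPI_CDF fuses its encodings in exactly this way is not stated in the abstract.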

RESULTS:

Under 10-fold cross-validation, the proposed model achieved 99.13% accuracy and a Matthews correlation coefficient (MCC) of 0.982 on the training dataset. The generalization power of the trained model was further examined on an independent dataset, where it achieved a maximum accuracy of 95.01% and an MCC of 0.900. Compared with current state-of-the-art methods, DPI_CDF improves accuracy by 4.27% and 4.31% on the training and testing datasets, respectively. We believe DPI_CDF will help the research community identify druggable proteins and accelerate the drug discovery process.
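For readers less familiar with the two reported metrics, the following short sketch shows how accuracy and MCC are computed from a binary confusion matrix. The counts in the example are hypothetical and are not taken from the paper.

```python
import math

def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that are correct.
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical balanced test set: 50 druggable and 50 non-druggable proteins,
# with 5 errors in each class.
acc_example = accuracy(tp=45, tn=45, fp=5, fn=5)
mcc_example = mcc(tp=45, tn=45, fp=5, fn=5)
```

When the two classes are the same size and the errors are split evenly between them, the formula reduces to MCC = 2 × accuracy − 1, which is why MCC is a stricter figure than accuracy on the same predictions.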

AVAILABILITY:

The benchmark datasets and source code are available on GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF

Subject(s)

Proteins; Software; Amino Acid Sequence; Position-Specific Scoring Matrices; Biological Evolution; Computational Biology/methods

Keywords

Bioinformatics; Cascade deep forest; Druggable proteins; PSSM; Physicochemical features

Full text: 1 Database: MEDLINE Main subject: Software / Proteins Language: En Year: 2024 Document type: Article