Search | Portal Regional da BVS

Identifying DNA-binding proteins based on multi-features and LASSO feature selection.

Zhang, Shengli; Zhu, Fu; Yu, Qianhao; Zhu, Xiaoyue.

Biopolymers ; 112(2): e23419, 2021 Feb.

Article in English | MEDLINE | ID: mdl-33476047

ABSTRACT

DNA-binding proteins play an indispensable role in the maintenance and processing of genetic information, but their sheer number makes identification by traditional experimental methods inefficient. In contrast, machine learning methods, as an emerging technique, offer satisfactory speed and accuracy when used to study these molecules. This work extracts four types of features from primary and secondary sequence information: Reduced sequence and index-vectors (RS), Pseudo-amino acid components (PseAACS), Position-specific scoring matrix-Auto Cross Covariance Transform (PSSM-ACCT), and Position-specific scoring matrix-Discrete Wavelet Transform (PSSM-DWT). Using LASSO for dimension reduction, we experiment with combinations of the feature submodels to find the optimal number of top-ranked features. These features are then used to train three classifiers, Ensemble subspace discriminant, Ensemble bagged tree, and KNN, to predict DNA-binding proteins. Three datasets, PDB594, PDB1075, and PDB186, are used to evaluate the proposed approach: PDB1075 and PDB594 for five-fold cross-validation, and PDB186 for the independent test. In five-fold cross-validation, the Ensemble subspace discriminant reaches accuracies of 86.98% on PDB1075 and 88.9% on PDB594. The accuracy on the independent test with multi-classifier voting is 83.33%, which suggests that the proposed methodology can predict DNA-binding proteins effectively.
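The LASSO step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no solver, penalty value, or feature dimensions, so the coordinate-descent solver, the penalty `alpha`, the toy data, and every function name below are assumptions chosen for illustration. The idea shown is only the general one: fit an L1-penalised linear model to the class labels, discard features whose weight shrinks to zero, and rank the rest by absolute weight.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


def top_ranked_features(weights, k):
    """Indices of up to k non-zero weights, largest absolute weight first."""
    nonzero = [j for j, wj in enumerate(weights) if wj != 0.0]
    return sorted(nonzero, key=lambda j: abs(weights[j]), reverse=True)[:k]


def make_toy_data(n=500, p=6, seed=0):
    """Hypothetical data: six features, labels depend only on features 0 and 3."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1.0 if 3.0 * row[0] - 2.0 * row[3] > 0 else -1.0 for row in X]
    mean_y = sum(y) / n
    return X, [v - mean_y for v in y]  # centre labels, no intercept is fitted


X, y = make_toy_data()
weights = lasso_coordinate_descent(X, y, alpha=0.2)
selected = top_ranked_features(weights, k=2)
print("selected feature indices:", selected)
```

In a pipeline like the one the abstract describes, the selected columns would then be passed to the classifiers, and the number of retained features `k` would be tuned by cross-validation rather than fixed in advance.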

Subjects

Algorithms; DNA-Binding Proteins/chemistry; Proteomics/methods; Databases, Protein; Position-Specific Scoring Matrices

iDHS-DSAMS: Identifying DNase I hypersensitive sites based on the dinucleotide property matrix and ensemble bagged tree.

Zhang, Shengli; Yu, Qianhao; He, Haoran; Zhu, Fu; Wu, Panjing; Gu, Lingzhi; Jiang, Sijie.

Genomics ; 112(2): 1282-1289, 2020 Mar.

Article in English | MEDLINE | ID: mdl-31377426

ABSTRACT

DNase I hypersensitive sites (DHSs) are associated with DNA regulatory elements, so understanding them is of great significance for biomedical research. However, traditional experiments are poorly suited to identifying recombinant sites in the large number of DNA sequences emerging from sequencing. Several machine learning methods have been proposed to identify DHSs, but most ignore the spatial autocorrelation of the DNA sequence. In this paper, we propose a predictor called iDHS-DSAMS to identify DHSs on benchmark datasets. We develop a feature extraction method called dinucleotide-based spatial autocorrelation (DSA). We then use Min-Redundancy-Max-Relevance (mRMR) to remove irrelevant and redundant features, selecting a 100-dimensional feature vector. Finally, we use an ensemble bagged tree as the classifier, trained on datasets oversampled with SMOTE. Five-fold cross-validation tests on two benchmark datasets indicate that the proposed method outperforms its existing counterparts on accuracy (Acc), Matthews correlation coefficient (MCC), sensitivity (Sn), and specificity (Sp).
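The SMOTE oversampling mentioned in the abstract can be sketched as follows. This is a simplified illustration, not the authors' code: the abstract does not state the number of neighbours or the oversampling ratio, so `k`, the toy points, and all function names here are assumptions. The sketch picks a random minority sample, picks one of its `k` nearest minority neighbours, and places a synthetic point at a random position on the segment between them.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 4 minority and 10 majority feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority = [[float(i), float(i) + 5.0] for i in range(10)]

synthetic = smote(minority, n_synthetic=len(majority) - len(minority))
balanced_minority = minority + synthetic
print(len(balanced_minority), "minority samples after oversampling")
```

Because each synthetic point lies between two real minority samples, the oversampled class stays inside the region the original samples already occupy, which is the property that distinguishes SMOTE from simple duplication.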

Subjects

Sequence Analysis, DNA/methods; Software; Animals; Binding Sites; Deoxyribonuclease I/metabolism; Humans; Nucleotide Motifs
