Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences.
Med Biol Eng Comput; 59(11-12): 2297-2310, 2021 Nov.
Article | En | MEDLINE | ID: mdl-34545514
Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is important for investigating disease symptoms and for drug repositioning, and protein subcellular localization is essential to that characterization. Diverse feature extraction techniques are applied to protein sequences; however, a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve both sequence-order information and protein residue properties. Two bacterial protein datasets, Gram-positive (G+) and Gram-negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. On known protein sequences, C4.5 reports the highest overall accuracy: 99.57% on the G+ dataset and 97.47% on the G- dataset. For the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
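The abstract does not specify how the augmented feature vectors are built, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes amino acid composition as the sequence-induced block and a lagged Kyte-Doolittle hydropathy correlation as the physicochemical block that retains some sequence-order information. The evolutionary block is omitted because it would require an external profile. The function names `composition`, `lagged_hydropathy`, `augmented_features`, and `five_fold_indices` are hypothetical, as is the round-robin fold assignment.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (assumed here as the example physicochemical property).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def clean(seq):
    """Upper-case the sequence and drop non-standard residues (e.g. X, B, Z)."""
    residues = [r for r in seq.upper() if r in HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acid residues")
    return residues


def composition(residues):
    """Fraction of each of the 20 standard residues (sequence-induced block)."""
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def lagged_hydropathy(residues, max_lag=3):
    """Mean hydropathy product of residues `lag` positions apart, for lag = 1..max_lag."""
    values = [HYDROPATHY[r] for r in residues]
    feats = []
    for lag in range(1, max_lag + 1):
        pairs = [values[i] * values[i + lag] for i in range(len(values) - lag)]
        feats.append(sum(pairs) / len(pairs) if pairs else 0.0)
    return feats


def augmented_features(seq, max_lag=3):
    """Concatenate the two blocks into one feature vector of length 20 + max_lag."""
    residues = clean(seq)
    return composition(residues) + lagged_hydropathy(residues, max_lag)


def five_fold_indices(n, k=5):
    """Yield (train, test) index lists; sample i is assigned to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test
```

Each vector from `augmented_features` could then be passed to any of the listed classifiers, with `five_fold_indices` supplying the train and test partitions. A stratified split would likely be preferable when localization classes are imbalanced.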
Keywords
Full text: 1
Database: MEDLINE
Main subject: Neural Networks, Computer / Support Vector Machine
Study type: Prognostic_studies
Language: En
Journal: Med Biol Eng Comput
Year: 2021
Document type: Article