Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences.

Agrawal, Saurabh; Sisodia, Dilip Singh; Nagwani, Naresh Kumar

Affiliation

Agrawal S; Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India. sagrawal.phd2018.cs@nitrr.ac.in.
Sisodia DS; Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.
Nagwani NK; Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.

Med Biol Eng Comput; 59(11-12): 2297-2310, 2021 Nov.

Article in English | MEDLINE | ID: mdl-34545514

ABSTRACT

Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is important for investigating disease symptoms and for drug repositioning, and protein subcellular localization is essential to that characterization. Diverse techniques are used to extract features from protein sequences, but a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues; the augmented features preserve both sequence-order information and protein-residue properties. Two bacterial protein datasets, Gram-positive (G+) and Gram-negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. The results show that C4.5 achieves the highest overall accuracy on known protein sequences: 99.57% on the G+ dataset and 97.47% on the G- dataset. For the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
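The general idea of feature augmentation followed by fivefold cross-validation can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not name the specific descriptors, so the sketch assumes amino acid composition as the sequence-derived part and two Kyte-Doolittle hydropathy statistics as the physicochemical part (the lag-1 product is a simple stand-in for sequence-order information). The toy sequences, the labels, the function names, and the 1-nearest-neighbor classifier are all hypothetical, and the evolutionary features mentioned in the abstract are omitted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (assumed physicochemical property).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def composition(seq):
    """Fraction of each of the 20 standard residues (sequence-derived)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def physicochemical(seq):
    """Mean hydropathy and mean product of adjacent hydropathy values.

    The second value depends on residue order, so it carries a small
    amount of sequence-order information.
    """
    h = [HYDROPATHY[a] for a in seq]
    mean_h = sum(h) / len(h)
    if len(h) < 2:
        return [mean_h, 0.0]
    lag1 = sum(h[i] * h[i + 1] for i in range(len(h) - 1)) / (len(h) - 1)
    return [mean_h, lag1]

def augmented_features(seq):
    """Augmentation here simply concatenates the two feature groups."""
    return composition(seq) + physicochemical(seq)

def nearest_label(x, train_X, train_y):
    """1-nearest-neighbor prediction using squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
    return train_y[dists.index(min(dists))]

def cross_val_accuracy(X, y, k=5):
    """k-fold cross-validation; fold i holds samples i, i+k, i+2k, ..."""
    n = len(X)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            correct += nearest_label(X[i], train_X, train_y) == y[i]
    return correct / n

# Hypothetical toy data: hydrophobic vs. charged peptides.
sequences = [
    "LLIVAVLFGLIVAL", "IVLLAFGGLVIAVL", "AVLIFLLGVIAAFL",
    "FLLVIAGLLAVIVL", "VLAILLFGAVLLIA",
    "KDEERKDSEEKRDE", "EEKRDDSKEEKRDQ", "DKEENRKEDSEKRE",
    "RKEDEEKSDNEKDE", "EDKKRESEDDKREN",
]
labels = ["membrane"] * 5 + ["cytoplasm"] * 5

X = [augmented_features(s) for s in sequences]
accuracy = cross_val_accuracy(X, labels, k=5)
print(f"fivefold cross-validated accuracy: {accuracy:.2f}")
```

Concatenation is the simplest way to combine feature groups; in practice the groups would usually be scaled to comparable ranges first so that no single descriptor dominates a distance-based classifier such as k-NN.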

Subject(s)

Neural Networks, Computer; Support Vector Machine; Algorithms; Amino Acid Sequence; Bayes Theorem; Proteins

Keywords

Augmented sequence features; Evolutionary information; Functional characterization; Sequence features; Subcellular localization

Full text: 1 | Database: MEDLINE | Main subject: Neural Networks, Computer / Support Vector Machine | Type of study: Prognostic_studies | Language: En | Journal: Med Biol Eng Comput | Year: 2021 | Document type: Article
