ABSTRACT
DNA-binding proteins are a class of proteins that can interact with DNA molecules through physical and chemical interactions. Their main functions include regulating gene expression and maintaining chromosome structure and stability, among others. DNA-binding proteins play a crucial role in cellular and molecular biology, as they are essential for maintaining normal cellular physiological functions and adapting to environmental changes. The prediction of DNA-binding proteins has been a hot topic in the field of bioinformatics. The key to accurately classifying DNA-binding proteins is to find suitable feature sources and explore the information they contain. Although there are already many models for predicting DNA-binding proteins, there is still room for improvement in mining feature source information and calculation methods. In this study, we created a model called DBPboost to better identify DNA-binding proteins. The innovation of this study is threefold: the use of eight feature extraction methods; an improved feature selection step, in which some features are selected first and feature selection is performed again after feature fusion; and an optimized differential evolution algorithm that improves the performance of feature fusion. The experimental results show that the prediction accuracy of the model on the UniSwiss dataset is 89.32%, and the sensitivity is 89.01%, which is better than most existing models.
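To make the role of differential evolution in feature fusion more concrete, here is a minimal sketch of the classic DE/rand/1/bin scheme searching for fusion weights. It is not the optimized variant the abstract refers to, whose details are not given here. The function name `differential_evolution`, the toy objective `toy_loss`, the target weights, and all hyperparameters are assumptions made for illustration only.

```python
import random

def differential_evolution(objective, dim, bounds=(0.0, 1.0), pop_size=20,
                           generations=200, f=0.5, cr=0.9, seed=0):
    """Classic DE/rand/1/bin minimiser (illustrative, not the paper's variant)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    v = min(hi, max(lo, v))  # clip to the search bounds
                else:
                    v = pop[i][j]
                trial.append(v)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy one-to-one selection
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Hypothetical stand-in for "validation error of the fused feature set":
# the best fusion weights are assumed to be (0.2, 0.5, 0.8).
target = [0.2, 0.5, 0.8]

def toy_loss(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

weights, loss = differential_evolution(toy_loss, dim=3)
```

In a real pipeline the objective would presumably be a cross-validated error computed on the weighted, concatenated feature groups; the quadratic toy loss is used here only so that the sketch runs on its own.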
Subject(s)
DNA-Binding Proteins; Support Vector Machine; DNA-Binding Proteins/chemistry; Algorithms; DNA/chemistry; Computational Biology/methods
ABSTRACT
Unconventional secretory proteins (USPs) are vital for cell-to-cell communication and are necessary for proper physiological processes. Unlike classical proteins that follow the conventional secretory pathway via the Golgi apparatus, these proteins are released using unconventional pathways. The primary modes of secretion for USPs are exosomes and ectosomes, which originate from the endoplasmic reticulum. Accurate and rapid identification of exosome-mediated secretory proteins is crucial for gaining valuable insights into the regulation of non-classical protein secretion and intercellular communication, as well as for the advancement of novel therapeutic approaches. Although computational methods based on amino acid sequence prediction exist for predicting unconventional proteins secreted by exosomes (UPSEs), they suffer from significant limitations in terms of algorithmic accuracy. In this study, we propose a novel approach to predict UPSEs by combining multiple deep learning models that incorporate both protein sequences and evolutionary information. Our approach utilizes a convolutional neural network (CNN) to extract protein sequence information, while various densely connected neural networks (DNNs) are employed to capture evolutionary conservation patterns. By combining six distinct deep learning models, we have created a superior framework that surpasses previous approaches, achieving an ACC score of 77.46% and an MCC score of 0.5406 on an independent test dataset.
Subject(s)
Deep Learning; Exosomes; Exosomes/metabolism; Exosomes/chemistry; Neural Networks, Computer; Humans; Computational Biology/methods; Algorithms; Amino Acid Sequence; Proteins/metabolism; Proteins/analysis; Proteins/chemistry
ABSTRACT
The Type III secretion effectors (T3SEs) are bacterial proteins synthesized by Gram-negative pathogens and delivered into host cells via the Type III secretion system (T3SS). These effectors usually play a pivotal role in the interactions between bacteria and hosts. Hence, the precise identification of T3SEs aids researchers in exploring the pathogenic mechanisms of bacterial infections. Since the diversity and complexity of T3SE sequences often make traditional experimental methods time-consuming, it is imperative to explore more efficient and convenient computational approaches for T3SE prediction. Inspired by the promising potential exhibited by pre-trained language models in protein recognition tasks, we proposed a method called PLM-T3SE that utilizes protein language models (PLMs) for effective recognition of T3SEs. First, we utilized PLM embeddings and evolutionary features from the position-specific scoring matrix (PSSM) profiles to transform protein sequences into fixed-length vectors for model training. Second, we employed the extreme gradient boosting (XGBoost) algorithm to rank these features based on their importance. Finally, an MLP neural network model was used to predict T3SEs based on the selected optimal feature set. Experimental results from the cross-validation and independent test demonstrated that our model exhibited superior performance compared to the existing models. Specifically, our model achieved an accuracy of 98.1%, which is 1.8% to 42.4% higher than the state-of-the-art predictors on the same independent test set. These findings highlight the superiority of PLM-T3SE and the remarkable characterization ability of PLM embeddings for T3SE prediction.
ABSTRACT
In the past decade, convolutional neural networks (CNNs) have been used as powerful tools by scientists to solve visual data tasks. However, many attempts to apply convolutional neural networks to protein function prediction and to extracting useful information from protein sequences have had limitations. In this research, we propose a new method that addresses the weaknesses of the previous method. mCNN-ETC is a deep learning model that transforms protein evolutionary information into image-like data composed of 20 channels, which correspond to the 20 amino acids in the protein sequence. We constructed CNN layers with different scanning windows in parallel to enhance the ability of the proposed model to detect useful patterns. Then we filtered specific patterns through the 1-max pooling layer before inputting them into the prediction layer. In terms of application, this research attempts to solve a basic problem in biology: predicting electron transporters and classifying their corresponding complexes. The model reached an accuracy of 97.41%, which was nearly 6% higher than its predecessor. We have also published a web server at http://bio219.bioinfo.yzu.edu.tw, which can be used for research purposes free of charge.
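The idea of parallel convolutions with different scanning windows followed by 1-max pooling can be sketched in plain Python as below. This is a toy illustration and not the mCNN-ETC implementation: the helper names `conv1d_valid` and `multi_window_features` are hypothetical, the kernels are hand-written rather than learned, and the example uses 3 channels instead of 20 to keep the numbers readable. It assumes the sequence is at least as long as the largest window.

```python
def conv1d_valid(pssm, kernel):
    """Slide a (window x channels) kernel along the sequence axis.

    Returns one ReLU-activated value per valid start position.
    """
    window = len(kernel)
    out = []
    for start in range(len(pssm) - window + 1):
        s = 0.0
        for k in range(window):
            for c in range(len(kernel[k])):
                s += pssm[start + k][c] * kernel[k][c]
        out.append(max(0.0, s))  # ReLU
    return out

def multi_window_features(pssm, kernels):
    """Apply kernels of different window sizes in parallel, then 1-max pool.

    1-max pooling keeps only the strongest activation of each feature map,
    so the output length equals the number of kernels, whatever the
    sequence length.
    """
    return [max(conv1d_valid(pssm, k)) for k in kernels]

# Toy "PSSM": 4 residues x 3 channels (a real profile would have 20 channels).
pssm = [[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [1, 0, 0]]
kernel_w2 = [[1, 0, 0], [0, 1, 0]]             # window size 2
kernel_w3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # window size 3

features = multi_window_features(pssm, [kernel_w2, kernel_w3])
print(features)  # [2.0, 3.0]
```

Because pooling reduces every feature map to a single number, proteins of different lengths yield feature vectors of the same size, which is what allows them to feed a fixed-size prediction layer.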
Subject(s)
Electrons; Neural Networks, Computer; Amino Acid Sequence; Biological Evolution; Humans; Proteins/chemistry
ABSTRACT
The recognition of DNA-binding proteins (DBPs) is a crucial step toward understanding their roles in various biological processes such as genetic regulation, gene expression, cell cycle control, DNA repair, and replication within cells. However, conventional experimental methods for identifying DBPs are usually time-consuming and expensive. Therefore, there is an urgent need to develop rapid and efficient computational methods for the prediction of DBPs. In this study, we proposed a novel predictor named PreDBP-PLMs to further improve the identification accuracy of DBPs by fusing the pre-trained protein language model (PLM) ProtT5 embedding with evolutionary features as input to the classic convolutional neural network (CNN) model. First, the ProtT5 embedding was combined with different evolutionary features derived from the position-specific scoring matrix (PSSM) to represent protein sequences. Then, the optimal feature combination was selected and input to the CNN classifier for the prediction of DBPs. Finally, 5-fold cross-validation (CV), leave-one-out CV (LOOCV), and the independent set test were adopted to examine the performance of PreDBP-PLMs on the benchmark datasets. Compared to the existing state-of-the-art predictors, PreDBP-PLMs exhibits an accuracy improvement of 0.5% and 5.2% on the PDB186 and PDB2272 datasets, respectively. This demonstrates that the proposed method could serve as a useful tool for the recognition of DBPs.
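The two cross-validation schemes mentioned above differ only in the number of folds, as the following minimal sketch shows. The helper name `k_fold_indices` and the fixed shuffling seed are assumptions for illustration; the abstract does not say how the folds were actually drawn.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Leave-one-out CV is the special case k == n_samples.
    """
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

five_fold = list(k_fold_indices(n_samples=10, k=5))  # 5 splits, 2 test samples each
loocv = list(k_fold_indices(n_samples=4, k=4))       # 4 splits, 1 test sample each
```

Each sample appears in exactly one test fold, so every protein is predicted once by a model that never saw it during training.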
Subject(s)
DNA-Binding Proteins; Neural Networks, Computer; DNA-Binding Proteins/metabolism; DNA-Binding Proteins/chemistry; Computational Biology/methods; Databases, Protein; Humans
ABSTRACT
BACKGROUND: Rational design of synthetic phage-displayed libraries requires the identification of the most appropriate positions for randomization using defined amino acid sets to recapitulate the natural occurrence. The present study uses position-specific scoring matrices (PSSMs) for identifying and randomizing Camelidae nanobody (VHH) CDR3. The functionality of a synthetic VHH repertoire designed by this method was tested for discovering new VHH binders to recombinant coagulation factor VII (rfVII). METHODS: Based on PSSM analysis, the CDR3 of cAbBCII10 VHH framework was identified, and a set of amino acids for the substitution of each PSSM-CDR3 position was defined. Using the Rosetta design SwiftLib tool, the final repertoire was back-translated to a degenerate nucleotide sequence. A synthetic phage-displayed library was constructed based on this repertoire and screened for anti-rfVII binders. RESULTS: A synthetic phage-displayed VHH library with 1 × 10⁸ variants was constructed. Three VHH binders to rfVII were isolated from this library with estimated dissociation constants (KD) of 1 × 10⁻⁸ M, 5.8 × 10⁻⁸ M and 2.6 × 10⁻⁷ M. CONCLUSION: PSSM analysis is a simple and efficient way to design synthetic phage-displayed libraries.
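How a PSSM can suggest a per-position amino acid set is sketched below under simple assumptions: a gapless alignment, a uniform background, additive pseudocounts, and a score threshold of zero. The function names `build_pssm` and `allowed_residues`, the toy sequences, and these parameter choices are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Log-odds PSSM (in bits) against a uniform background.

    Assumes all sequences have the same length and contain no gaps.
    """
    n = len(aligned_seqs)
    length = len(aligned_seqs[0])
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
            row[aa] = math.log2(freq / background)
        pssm.append(row)
    return pssm

def allowed_residues(pssm, threshold=0.0):
    """Residues scoring above the threshold at each position."""
    return [sorted(aa for aa, score in row.items() if score > threshold) for row in pssm]

# Toy alignment of four 3-residue loops (invented sequences).
toy_alignment = ["ARD", "ARE", "AKD", "ARD"]
design = allowed_residues(build_pssm(toy_alignment))
print(design)  # [['A'], ['K', 'R'], ['D', 'E']]
```

Positions where only one residue passes the threshold would be kept fixed, while positions with several passing residues are candidates for randomization with that defined set.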
Subject(s)
Computational Biology; Peptide Library; Single-Domain Antibodies; Single-Domain Antibodies/genetics; Single-Domain Antibodies/chemistry; Single-Domain Antibodies/immunology; Animals; Camelidae/genetics; Camelidae/immunology; Factor VII/genetics; Factor VII/chemistry; Factor VII/immunology; Recombinant Proteins/genetics; Recombinant Proteins/chemistry; Recombinant Proteins/immunology; Amino Acid Sequence
ABSTRACT
The major histocompatibility complex (MHC) has important research value in the treatment of complex human diseases. A plethora of computational tools has been developed to predict MHC class I binders. Here, we comprehensively reviewed 27 up-to-date MHC I binding prediction tools developed over the last decade, thoroughly evaluating feature representation methods, prediction algorithms and model training strategies on a benchmark dataset from the Immune Epitope Database. The review identified a common limitation: all existing tools can only handle a fixed peptide sequence length. To overcome this limitation, we developed a bilateral and variable long short-term memory (BVLSTM)-based approach, named BVLSTM-MHC. It is the first variable-length MHC class I binding predictor. In comparison to the 10 mainstream prediction tools on an independent validation dataset, BVLSTM-MHC achieved the best performance in six out of eight evaluated metrics. A web server based on the BVLSTM-MHC model was developed to enable accurate and efficient MHC class I binder prediction in human, mouse, macaque and chimpanzee.
Subject(s)
Binding Sites; Carrier Proteins/chemistry; Computational Biology/methods; Histocompatibility Antigens Class I/chemistry; Neural Networks, Computer; Software; Amino Acid Sequence; Carrier Proteins/metabolism; Databases, Factual; Deep Learning; Epitopes/chemistry; Epitopes/immunology; Epitopes/metabolism; Histocompatibility Antigens Class I/immunology; Histocompatibility Antigens Class I/metabolism; Machine Learning; Protein Binding; ROC Curve; Reproducibility of Results; Web Browser
ABSTRACT
Protein multiple sequence alignment information has long been an important feature for inferring the functions of proteins from related sequences with known functions. It is also one of the underlying ideas of AlphaFold 2, a breakthrough model for predicting the three-dimensional structures of proteins from their primary sequences. Our study used protein multiple sequence alignment information in the form of position-specific scoring matrices as input. We also refined the use of a convolutional neural network, a well-known deep-learning architecture with impressive achievements on image and image-like data. Specifically, we revisited the prediction of adenosine triphosphate (ATP)-binding sites with more efficient convolutional neural networks. We applied multiple convolutional window scanning filters of a convolutional neural network to position-specific scoring matrices to capture as much useful information as possible. Furthermore, only the most specific motifs are retained at each feature map output through the one-max pooling layer before going to the next layer. We assumed that this would help retain the most conserved motifs, which carry discriminative information for prediction. Our experimental results show that a convolutional neural network with not too many convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining them with different hyper-parameters. Our experimental results showed that our models were superior to the traditional use of convolutional neural networks on the same datasets, as well as to other machine-learning classification algorithms.
Subject(s)
Adenosine Triphosphate; Carrier Proteins; Algorithms; Binding Sites; Machine Learning; Neural Networks, Computer; Proteins/chemistry
ABSTRACT
Protein-RNA interactions play essential roles in a wide variety of biological processes. Recognition of RNA-binding residues on proteins has been a challenging problem. Most methods utilize the position-specific scoring matrix (PSSM). It has been found that considering the evolutionary information of sequence-neighboring residues can improve the prediction. In this work, we introduce a novel method, SNB-PSSM (spatial neighbor-based PSSM), combined with the structure window scheme, where the evolutionary information of spatially neighboring residues is considered. The results show that our method consistently outperforms the standard and smoothed PSSM methods. Tested on multiple datasets, this approach shows an encouraging performance compared with RNABindRPlus, BindN+, PPRInt, xypan, Predict_RBP, SpaPF, PRNA, and KYG, although it is inferior to RNAProSite, RBscore, and aaRNA. In addition, since our method is not sensitive to protein structure changes, it can be applied well to binding site predictions of modeled structures. Thus, the result also suggests that the evolution of binding sites is spatially cooperative. The proposed method, as an effective tool for considering evolutionary information, can be widely used for nucleic acid-/protein-binding site prediction and functional motif finding.
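One simple way to bring in the evolutionary information of spatially neighboring residues is to average PSSM rows over all residues within a distance cutoff, as sketched below. The exact weighting used by SNB-PSSM is not described in the abstract, so the unweighted mean, the 8 Å cutoff, the use of one representative coordinate per residue, and the function name `spatial_neighbor_pssm` are all assumptions.

```python
import math

def spatial_neighbor_pssm(pssm, coords, cutoff=8.0):
    """Replace each residue's PSSM row by the mean over its spatial neighbours.

    pssm:   list of per-residue score rows
    coords: list of (x, y, z) positions, one per residue (for example C-alpha atoms)
    A residue always counts as its own neighbour (distance 0).
    """
    n = len(pssm)
    smoothed = []
    for i in range(n):
        neighbours = [j for j in range(n)
                      if math.dist(coords[i], coords[j]) <= cutoff]
        row = [sum(pssm[j][c] for j in neighbours) / len(neighbours)
               for c in range(len(pssm[i]))]
        smoothed.append(row)
    return smoothed

# Toy example: residues 0 and 1 are 3 Å apart, residue 2 is far from both.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_pssm = [[2, 0], [4, 2], [10, 10]]
snb = spatial_neighbor_pssm(toy_pssm, toy_coords)
print(snb)  # [[3.0, 1.0], [3.0, 1.0], [10.0, 10.0]]
```

Unlike a sequence-window smoothing, the neighbour set here depends on the folded structure, so residues far apart in sequence but close in space can share information.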
Subject(s)
Binding Sites/physiology; Protein Binding/physiology; RNA-Binding Proteins/metabolism; RNA/metabolism; Algorithms; Computational Biology/methods; Databases, Protein; Position-Specific Scoring Matrices
ABSTRACT
DNA-binding proteins perform an indispensable function in the maintenance and processing of genetic information, but they are inefficiently identified by traditional experimental methods because of their large numbers. In contrast, machine learning methods, as an emerging technique, demonstrate satisfactory speed and accuracy when used to study these molecules. This work focuses on extracting four different features from primary and secondary sequence features: Reduced sequence and index-vectors (RS), Pseudo-amino acid components (PseAACS), Position-specific scoring matrix-Auto Cross Covariance Transform (PSSM-ACCT), and Position-specific scoring matrix-Discrete Wavelet Transform (PSSM-DWT). Using the LASSO dimension reduction method, we experiment with combinations of feature submodels to obtain the optimal number of top-ranked features. These features are then input into three classifiers, Ensemble subspace discriminant, Ensemble bagged tree and KNN, to predict the DNA-binding proteins. Three different datasets, PDB594, PDB1075, and PDB186, are adopted to evaluate the performance of the proposed approach. The PDB1075 and PDB594 datasets are used for five-fold cross-validation, and PDB186 is used for the independent experiment. In the five-fold cross-validation, both PDB1075 and PDB594 show high accuracy, reaching 86.98% and 88.9% with Ensemble subspace discriminant, respectively. The accuracy of the independent experiment by multi-classifier voting is 83.33%, which suggests that the methodology proposed in this work is capable of predicting DNA-binding proteins effectively.
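The auto-covariance half of a PSSM auto cross covariance transform can be sketched as follows, using the commonly cited definition: for each column, average the product of mean-centred scores that are `lag` positions apart. The cross-covariance terms between different columns are omitted to keep the example short, and the function name `pssm_auto_covariance` and the toy matrix are assumptions, not details from the paper.

```python
def pssm_auto_covariance(pssm, max_lag):
    """Auto-covariance features of a PSSM.

    For each lag 1..max_lag and each column j, compute
        1 / (L - lag) * sum_i (P[i][j] - mean_j) * (P[i + lag][j] - mean_j)
    The output length is max_lag * n_columns, independent of sequence length L.
    Assumes L > max_lag.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum((pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                        for i in range(length - lag))
            features.append(total / (length - lag))
    return features

# Toy 4-residue, 2-column matrix: column 0 alternates, column 1 is constant.
toy = [[1, 0], [3, 0], [1, 0], [3, 0]]
ac_features = pssm_auto_covariance(toy, max_lag=2)
print(ac_features)  # [-1.0, 0.0, 1.0, 0.0]
```

The alternating column gives a negative value at lag 1 and a positive value at lag 2, which is the kind of periodic signal such features are meant to capture.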
Subject(s)
Algorithms; DNA-Binding Proteins/chemistry; Proteomics/methods; Databases, Protein; Position-Specific Scoring Matrices
ABSTRACT
Identification of drug-target interactions (DTIs) is vital for drug discovery. However, traditional biological approaches have some unavoidable shortcomings, such as being time-consuming and expensive. Therefore, there is an urgent need to develop novel and effective computational methods to predict DTIs in order to shorten the development cycles of new drugs. In this study, we present a novel computational approach to identify DTIs, which uses protein sequence information and the dual-tree complex wavelet transform (DTCWT). More specifically, a position-specific scoring matrix (PSSM) was computed for the target protein sequence to obtain its evolutionary information. Then, DTCWT was used to extract representative features from the PSSM, which were then combined with the drug fingerprint features to form the feature descriptors. Finally, these descriptors were sent to the Rotation Forest (RoF) model for classification. A 5-fold cross-validation (CV) was adopted on four datasets (Enzyme, Ion Channel, GPCRs (G-protein-coupled receptors), and NRs (Nuclear Receptors)) to validate the proposed model; our method yielded high average accuracies of 89.21%, 85.49%, 81.02%, and 74.44%, respectively. To further verify the performance of our model, we compared the RoF classifier with two state-of-the-art algorithms: the support vector machine (SVM) and the k-nearest neighbor (KNN) classifier. We also compared it with some other published methods. Moreover, the prediction results for the independent dataset further indicated that our method is effective for predicting potential DTIs. Thus, we believe that our method is suitable for facilitating drug discovery and development.
Subject(s)
Drug Development; Support Vector Machine; Wavelet Analysis; Databases, Protein; Enzymes/chemistry; Ion Channels/chemistry; Receptors, Cytoplasmic and Nuclear/chemistry; Receptors, G-Protein-Coupled/chemistry
ABSTRACT
Many Gram-negative bacteria use type IV secretion systems to deliver effector molecules to a wide range of target cells. These substrate proteins, which are called type IV secreted effectors (T4SEs), manipulate host cell processes during infection, often resulting in severe diseases or even death of the host. Therefore, identification of putative T4SEs has become a very active research topic in bioinformatics due to its vital role in understanding host-pathogen interactions. PSI-BLAST profiles have been experimentally validated to provide important and discriminatory evolutionary information for various protein classification tasks. In the present study, an accurate computational predictor termed iT4SE-EP was developed for identifying T4SEs by extracting evolutionary features from the position-specific scoring matrix and the position-specific frequency matrix profiles. First, four types of encoding strategies were designed to transform protein sequences into fixed-length feature vectors based on the two profiles. Then, the feature selection technique based on the random forest algorithm was utilized to reduce redundant or irrelevant features without much loss of information. Finally, the optimal features were input into a support vector machine classifier to carry out the prediction of T4SEs. Our experimental results demonstrated that iT4SE-EP outperformed most existing methods on the independent dataset test.
Subject(s)
Evolution, Molecular; Gram-Negative Bacteria/genetics; Host-Pathogen Interactions/genetics; Type IV Secretion Systems/genetics; Amino Acid Sequence/genetics; Bacterial Infections/drug therapy; Bacterial Infections/genetics; Bacterial Infections/microbiology; Computational Biology; Gram-Negative Bacteria/pathogenicity; Humans; Type IV Secretion Systems/chemistry
ABSTRACT
BACKGROUND: Apoptosis, also called programmed cell death, refers to the spontaneous and orderly death of cells controlled by genes in order to maintain a stable internal environment. Identifying the subcellular location of apoptosis proteins is very helpful in understanding the mechanism of apoptosis and designing drugs. Therefore, the subcellular localization of apoptosis proteins has attracted increased attention in computational biology. Effective feature extraction methods play a critical role in predicting the subcellular location of proteins. RESULTS: In this paper, we proposed two novel feature extraction methods based on evolutionary information. One obtains the evolutionary information via the transition matrix of the consensus sequence (CTM). The other extracts the evolutionary information from the PSSM based on absolute entropy correlation analysis (AECA-PSSM). After fusing the two kinds of features, linear discriminant analysis (LDA) was used to reduce the dimension of the proposed features. Finally, the support vector machine (SVM) was adopted to predict the protein subcellular locations. The proposed CTM-AECA-PSSM-LDA subcellular location prediction method was evaluated using the CL317 and ZW225 datasets. By jackknife test, the overall accuracy was 99.7% (CL317) and 95.6% (ZW225), respectively. CONCLUSIONS: The experimental results show that the proposed method, which we hope will serve as a complementary tool to existing subcellular localization methods, can effectively extract more abundant features from protein sequences and is feasible for predicting the subcellular location of apoptosis proteins.
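A plausible reading of the consensus transition matrix (CTM) feature is sketched below: take the highest-scoring amino acid at each PSSM position as the consensus sequence, then count transitions between adjacent consensus residues. The paper's exact definition may differ. The alphabetical column order (PSI-BLAST output uses a different order), the row normalisation, and the helper names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the PSSM

def consensus_sequence(pssm):
    """Highest-scoring amino acid at each position of an L x 20 PSSM."""
    return "".join(AMINO_ACIDS[max(range(20), key=row.__getitem__)] for row in pssm)

def transition_matrix(seq):
    """20 x 20 matrix of adjacent-residue transition frequencies.

    Each row is normalised to sum to 1; rows with no observations stay all zero.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    counts = [[0] * 20 for _ in range(20)]
    for a, b in zip(seq, seq[1:]):
        counts[index[a]][index[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def one_hot_row(aa):
    """Toy PSSM row that scores a single amino acid above all others."""
    return [1 if x == aa else 0 for x in AMINO_ACIDS]

toy_pssm = [one_hot_row(aa) for aa in "ACAA"]
consensus = consensus_sequence(toy_pssm)      # "ACAA"
ctm = transition_matrix(consensus)
ctm_features = [v for row in ctm for v in row]  # flattened 400-dimensional vector
```

In this toy case the transitions are A to C, C to A, and A to A, so the row for A holds 0.5 in the A and C columns and the row for C holds 1.0 in the A column.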
Subject(s)
Algorithms; Apoptosis Regulatory Proteins/metabolism; Discriminant Analysis; Evolution, Molecular; Amino Acid Sequence; Apoptosis Regulatory Proteins/chemistry; Consensus Sequence; Databases, Protein; Entropy; Position-Specific Scoring Matrices; ROC Curve; Subcellular Fractions/metabolism; Support Vector Machine
ABSTRACT
The identification of drug-target interactions has great significance for pharmaceutical research. Since traditional experimental methods for identifying drug-target interactions are costly and time-consuming, the use of machine learning methods to predict potential drug-target interactions has attracted widespread attention. This paper presents a novel drug-target interaction prediction method called LRF-DTIs. First, the pseudo-position specific scoring matrix (PsePSSM) and FP2 molecular fingerprinting were used to extract the features of drug-target pairs. Second, Lasso was used to reduce the dimension of the extracted feature information, and then the Synthetic Minority Oversampling Technique (SMOTE) was used to deal with unbalanced data. Finally, the processed feature vectors were input into a random forest (RF) classifier to predict drug-target interactions. Through 10 trials of 5-fold cross-validation, the overall prediction accuracies on the enzyme, ion channel (IC), G-protein-coupled receptor (GPCR) and nuclear receptor (NR) datasets reached 98.09%, 97.32%, 95.69%, and 94.88%, respectively, and were compared with those of other prediction methods. In addition, we have tested and verified that our method not only can be applied to predict new interactions but also obtains a satisfactory result on a new dataset. All the experimental results indicate that our method can significantly improve the prediction accuracy of drug-target interactions and play a vital role in new drug research and target protein development. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/LRF-DTIs/ for academic use.
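The core step of SMOTE, interpolating between a minority-class sample and one of its nearest minority-class neighbours, can be sketched in a few lines. This is a simplified illustration rather than the code in the linked repository; the function name `smote`, the neighbour count, the seed, and the toy points are assumptions.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation.

    For each new sample: pick a random minority point, pick one of its
    k nearest minority neighbours, and place the new point at a random
    position on the segment between them. Needs at least 2 minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with three 2-dimensional feature vectors.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority_points, n_new=5)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling fills in the region the minority class already occupies instead of duplicating existing points.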
Subject(s)
Databases, Protein; Ion Channels/genetics; Machine Learning; Receptors, Cytoplasmic and Nuclear/genetics; Receptors, G-Protein-Coupled/genetics; Software; Drug Development; Position-Specific Scoring Matrices; Protein Conformation
ABSTRACT
BACKGROUND: The electron transport chain is a series of protein complexes embedded in the process of cellular respiration, which is an important process to transfer electrons and other macromolecules throughout the cell. It is also the major process to extract energy via redox reactions in the case of oxidation of sugars. Many studies have implicated electron transport proteins in a variety of human diseases, e.g., diabetes, Parkinson's disease, and Alzheimer's disease. A few bioinformatics studies have attempted to identify electron transport proteins; however, their performance leaves considerable room for improvement. Here, we present a novel deep neural network architecture to address this problem. RESULTS: Most of the previous studies could not feed the original position-specific scoring matrix (PSSM) profiles into neural networks, leading to a loss of information, and the neural networks consequently could not achieve the best results. In this paper, we present a novel approach that uses deep gated recurrent units (GRU) on full PSSMs to resolve this problem. Our approach can precisely predict electron transporters, with cross-validation and independent test accuracies of 93.5% and 92.3%, respectively. Our approach demonstrates superior performance to all of the state-of-the-art predictors on electron transport proteins. CONCLUSIONS: Through the proposed study, we provide ET-GRU, a web server for discriminating electron transport proteins in particular and other protein functions in general. Also, our achievement could promote the use of GRU in computational biology, especially in protein function prediction.
Subject(s)
Electron Transport Chain Complex Proteins/chemistry; Neural Networks, Computer; Software; Electron Transport; Humans; Position-Specific Scoring Matrices
ABSTRACT
Protein function prediction is one of the well-known problems in proteome research, attracting the attention of numerous researchers. However, the implementation of deep neural networks, which can help improve protein function prediction, still poses a big challenge. This study proposes a deep learning approach named Fertility-GRU that incorporates gated recurrent units and position-specific scoring matrix profiles to predict the function of fertility-related proteins, which is a highly crucial biological function. Fertility-related proteins have also been proven to be important in many biological entities (e.g., bone marrow and peripheral blood, postnatal mammalian ovary) and parameters (e.g., daily sperm production). Our model achieves a cross-validation accuracy of 85.8% and an independent test accuracy of 91.1%. We also address the problem of overfitting by adding dropout layers to the deep learning model. The independent testing results showed sensitivity, specificity, and Matthews correlation coefficient (MCC) values of 90.5%, 91.7%, and 0.82, respectively. Fertility-GRU demonstrates superior performance to the state-of-the-art predictor on the same data set. In this study, we provide a method that enables more proteins to be discovered, especially proteins associated with fertility. Moreover, our achievement could promote the use of recurrent networks and gated recurrent units in proteome research. The source code and data set are freely accessible via https://github.com/khanhlee/fertility-gru .
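For readers less familiar with the metrics reported above, the sketch below shows how sensitivity, specificity, accuracy, and MCC follow from a binary confusion matrix. The counts in the example are invented for illustration and are not the counts behind the figures in the abstract; the function name `binary_metrics` is likewise an assumption.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical counts: 100 positives and 100 negatives.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
# sensitivity 0.90, specificity 0.80, accuracy 0.85, MCC about 0.70
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when classes are imbalanced.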
Subject(s)
Fertility/genetics; Proteins/genetics; Proteomics/methods; Software; Algorithms; Databases, Genetic; Deep Learning; Embryonic Development/genetics; Female; Humans; Male; Neural Networks, Computer; Oogenesis/genetics; Position-Specific Scoring Matrices; Proteins/classification; Proteins/isolation & purification; Proteomics/statistics & numerical data; Spermatogenesis/genetics
ABSTRACT
The movement of ions across the cell membrane is essential for many biological processes. This study focuses on ion channels and ion transporters (pumps), which act as border guards controlling the incessant traffic of ions across cell membranes. Ion channels and ion transporters regulate membrane potential and electrical signaling and play important roles in cell proliferation, migration, apoptosis, and differentiation. Ion channels have been found to differ significantly from ion transporters in their behavior. Therefore, a method for automatically classifying ion transporters and ion channels from membrane proteins is proposed, which trains deep neural networks using the position-specific scoring matrix profile as input. The key novelty is the three-stage approach: first, five data normalization techniques are used; next, three imbalanced-data techniques are applied to the minority classes; finally, six classifiers are compared with the proposed method. © 2019 Wiley Periodicals, Inc.
Subject(s)
Deep Learning; Ion Channels/chemistry; Ion Channels/classification; Automation; Humans; Ion Transport
ABSTRACT
Membrane proteins are pivotal constituents of a cell that exert a crucial influence on diverse biological processes. The accurate identification of membrane protein types is essential for revealing molecular mechanisms and for drug development. Initially, several traditional experimental methods were used to classify these types. However, experimental methods are laborious, time-consuming, and costly given the rapid growth of uncharacterized protein sequences generated in the postgenomic era. Hence, machine learning-based methods are indispensable for reliable and fast identification of membrane protein types. A variety of state-of-the-art approaches have been proposed to improve prediction performance, but predictive validity is still insufficient. Motivated by this, we designed a promising sequential support vector machine-based predictor called TargetHMP to predict types of membrane proteins. We captured the local informative features by exploring evolutionary profiles through a novel method called the segmentation-based pseudo position-specific scoring matrix (Seg-PsePSSM). TargetHMP attained high accuracies of 94.99%, 93.48%, and 90.36% on the S1, S2, and S3 datasets, respectively, using a rigorous leave-one-out cross-validation test. The results indicate that the proposed method outperformed prior predictors. We expect that the proposed approach will help academic research in general and pharmaceutical drug discovery in particular.
Subject(s)
Membrane Proteins/analysis; Algorithms; Computational Biology/methods; Databases, Protein; Membrane Proteins/classification; Membrane Proteins/genetics; Support Vector Machine
ABSTRACT
Motor proteins are the driving force behind muscle contraction and are responsible for the active transportation of most proteins and vesicles in the cytoplasm. There are three superfamilies of cytoskeletal motor proteins with various molecular functions and structures: dynein, kinesin, and myosin. The loss of a specific motor protein's molecular function has been linked to a variety of human diseases, e.g., Charcot-Marie-Tooth disease and kidney disease. Therefore, creating a precise model to classify motor proteins is essential for helping biologists understand their molecular functions and design drug targets according to their impact on human diseases. Here we attempt to classify cytoskeleton motor proteins using deep learning, which has been increasingly and widely used to address numerous problems in a variety of fields, with state-of-the-art results. Our deep convolutional neural network achieves independent test accuracies of 97.5%, 96.4%, and 96.1% for the three superfamilies, respectively. Compared to other state-of-the-art methods, our approach showed a significant improvement in performance across a range of evaluation metrics. Through the proposed study, we provide an effective model for classifying motor proteins and a basis for further research that can enhance the performance of protein function classification using deep learning.
Subject(s)
Cytoskeletal Proteins/physiology; Molecular Motor Proteins/physiology; Neural Networks, Computer; Algorithms; Humans; Machine Learning
ABSTRACT
RNA-protein interaction (RPI) plays an important role in the basic cellular processes of organisms. Unfortunately, due to time and cost constraints, it is difficult for biological experiments to determine the relationship between RNA and protein on a large scale. There is therefore an urgent need for reliable computational methods to quickly and accurately predict RNA-protein interactions. In this study, we propose a novel computational method, RPIFSE (predicting RPI with Feature Selection Ensemble method), based on RNA and protein sequence information to predict RPI. First, RPIFSE perturbs the features extracted by the convolutional neural network (CNN) and generates multiple data sets according to the weights of the features, and then uses an extreme learning machine (ELM) classifier to classify these data sets. Finally, the results of the classifiers are combined, and the highest score is chosen as the final prediction result by a weighted voting method. In 5-fold cross-validation experiments, RPIFSE achieved 91.87%, 89.74%, 97.76% and 98.98% accuracy on the RPI369, RPI2241, RPI488 and RPI1807 data sets, respectively. To further evaluate the performance of RPIFSE, we compared it with the state-of-the-art support vector machine (SVM) classifier and other existing methods on those data sets. Furthermore, we also predicted the RPI on the independent data set NPInter2.0 and drew the network graph based on the prediction results. These promising comparison results demonstrated the effectiveness of RPIFSE and indicated that RPIFSE could be a useful tool for predicting RPI.
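The final combination step, weighted voting over several classifiers followed by picking the highest-scoring class, might look like the following sketch. How RPIFSE derives its weights is not stated in the abstract, so the weights, the class labels, the scores, and the function name `weighted_vote` are purely illustrative.

```python
def weighted_vote(class_scores, weights):
    """Combine per-classifier class scores with a weighted sum.

    class_scores: one dict {label: score} per classifier
    weights:      one weight per classifier
    Returns the winning label and the combined score of every label.
    """
    totals = {}
    for scores, w in zip(class_scores, weights):
        for label, s in scores.items():
            totals[label] = totals.get(label, 0.0) + w * s
    winner = max(totals, key=totals.get)
    return winner, totals

# Three hypothetical classifiers scoring one RNA-protein pair.
scores = [
    {"interact": 0.6, "none": 0.4},
    {"interact": 0.3, "none": 0.7},
    {"interact": 0.8, "none": 0.2},
]
label, combined = weighted_vote(scores, weights=[0.5, 0.2, 0.3])
# combined scores: interact about 0.60, none about 0.40, so label is "interact"
```

With weights that sum to 1 and per-classifier scores that sum to 1, the combined scores also sum to 1 and can be read as an ensemble confidence.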