1.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38189542

ABSTRACT

Non-coding RNAs (ncRNAs) are a class of RNA molecules that do not have the potential to encode proteins. Nevertheless, they occupy a significant portion of the human genome and participate in gene expression regulation through various mechanisms. Gestational diabetes mellitus (GDM) is a pathologic condition of carbohydrate intolerance that begins or is first detected during pregnancy, making it one of the most common pregnancy complications. Although the exact pathogenesis of GDM remains unclear, several recent studies have shown that ncRNAs play a crucial regulatory role in GDM. Herein, we present a comprehensive review of the multiple mechanisms of ncRNAs in GDM along with their potential role as biomarkers. In addition, we investigate the contribution of deep learning-based models to discovering disease-specific ncRNA biomarkers and elucidating the underlying mechanisms of ncRNAs. This might assist community-wide efforts to obtain insights into the regulatory mechanisms of ncRNAs in disease and guide a novel approach for early diagnosis and treatment of disease.


Subject(s)
Carbohydrate Metabolism, Inborn Errors , Diabetes, Gestational , Malabsorption Syndromes , Humans , Female , Pregnancy , Diabetes, Gestational/genetics , Genome, Human , RNA, Untranslated/genetics , Biomarkers
2.
BMC Bioinformatics ; 24(1): 301, 2023 Jul 28.
Article in English | MEDLINE | ID: mdl-37507654

ABSTRACT

BACKGROUND: The identification of tumor T cell antigens (TTCAs) is crucial for providing insights into their functional mechanisms and utilizing their potential in anticancer vaccine development. In this context, TTCAs are highly promising. However, experimental technologies for discovering and characterizing new TTCAs are expensive and time-consuming. Although many machine learning (ML)-based models have been proposed for identifying new TTCAs, there is still a need to develop a robust model that can achieve higher accuracy and precision. RESULTS: In this study, we propose a new stacking ensemble learning-based framework, termed StackTTCA, for accurate and large-scale identification of TTCAs. First, we constructed 156 different baseline models by using 12 different feature encoding schemes and 13 popular ML algorithms. Second, these baseline models were trained and employed to create a new probabilistic feature vector. Finally, the optimal probabilistic feature vector was determined based on a feature selection strategy and then used for the construction of our stacked model. Comparative benchmarking experiments indicated that StackTTCA clearly outperformed several ML classifiers and the existing methods on the independent test, with an accuracy of 0.932 and a Matthews correlation coefficient of 0.866. CONCLUSIONS: In summary, the proposed stacking ensemble learning-based framework, StackTTCA, could help to precisely and rapidly identify true TTCAs for follow-up experimental verification. In addition, we developed an online web server ( http://2pmlab.camt.cmu.ac.th/StackTTCA ) to maximize user convenience for high-throughput screening of novel TTCAs.


Subject(s)
Computational Biology , Neoplasms , Humans , Computational Biology/methods , Algorithms , Machine Learning , T-Lymphocytes
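The stacking steps described in the StackTTCA abstract can be illustrated with a minimal sketch. Only the 12 x 13 = 156 grid and the idea of collecting one predicted probability per baseline model come from the abstract. The encoding and algorithm labels are placeholders (the abstract does not name them), and `toy_baseline` is a hypothetical stand-in for a trained model, not the authors' implementation.

```python
from itertools import product

# Placeholder labels: the abstract gives the counts (12 and 13) but not the names.
ENCODINGS = [f"encoding_{i}" for i in range(1, 13)]
ALGORITHMS = [f"algorithm_{j}" for j in range(1, 14)]

# One baseline model per (encoding, algorithm) pair.
BASELINE_GRID = list(product(ENCODINGS, ALGORITHMS))


def toy_baseline(encoding: str, algorithm: str, peptide: str) -> float:
    """Hypothetical stand-in for a trained baseline model.

    Returns a deterministic pseudo-probability in [0, 1] so the sketch runs
    without training data; a real baseline would return a predicted
    class probability.
    """
    seed = sum(ord(ch) for ch in encoding + algorithm + peptide)
    return (seed % 101) / 100.0


def probabilistic_feature_vector(peptide: str) -> list[float]:
    """Collect one predicted probability per baseline model."""
    return [toy_baseline(enc, alg, peptide) for enc, alg in BASELINE_GRID]


if __name__ == "__main__":
    vec = probabilistic_feature_vector("ACDEFGHIK")
    print(len(BASELINE_GRID), len(vec))
```

In a full pipeline, a feature selection step would then keep a subset of these values, and a meta-classifier would be trained on the selected subset; both steps are omitted here because the abstract does not specify them.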
3.
BMC Bioinformatics ; 24(1): 356, 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37735626

ABSTRACT

BACKGROUND: Tyrosinase is an enzyme involved in melanin production in the skin. Several hyperpigmentation disorders involve the overproduction of melanin and instability of tyrosinase activity resulting in darker, discolored patches on the skin. Therefore, discovering tyrosinase inhibitory peptides (TIPs) is of great significance for basic research and clinical treatments. However, the identification of TIPs using experimental methods is generally cost-ineffective and time-consuming. RESULTS: Herein, a stacked ensemble learning approach, called TIPred, is proposed for the accurate and quick identification of TIPs by using sequence information. TIPred explored a comprehensive set of various baseline models derived from well-known machine learning (ML) algorithms and heterogeneous feature encoding schemes from multiple perspectives, such as chemical structure properties, physicochemical properties, and composition information. Subsequently, 130 baseline models were trained and optimized to create new probabilistic features. Finally, the feature selection approach was utilized to determine the optimal feature vector for developing TIPred. Both tenfold cross-validation and independent test methods were employed to assess the predictive capability of TIPred by using the stacking strategy. Experimental results showed that TIPred significantly outperformed the state-of-the-art method in terms of the independent test, with an accuracy of 0.923, MCC of 0.757 and an AUC of 0.977. CONCLUSIONS: The proposed TIPred approach could be a valuable tool for rapidly discovering novel TIPs and effectively identifying potential TIP candidates for follow-up experimental validation. Moreover, an online webserver of TIPred is publicly available at http://pmlabstack.pythonanywhere.com/TIPred .


Subject(s)
Melanins , Monophenol Monooxygenase , Algorithms , Machine Learning , Peptides
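Several entries in this list report accuracy and the Matthews correlation coefficient (MCC). As a reminder of how the two are computed from a binary confusion matrix, here is a small sketch. The counts in the usage example are illustrative only and are not taken from any of the papers; returning 0.0 when the MCC denominator is zero is a common convention, assumed here.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Illustrative counts: 45 true positives, 45 true negatives,
    # 5 false positives, 5 false negatives.
    print(accuracy(45, 45, 5, 5), mcc(45, 45, 5, 5))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy on datasets with unequal class sizes.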
4.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-33963832

ABSTRACT

The release of interleukin (IL)-6 is stimulated by antigenic peptides from pathogens as well as by immune cells for activating aggressive inflammation. IL-6 inducing peptides are derived from pathogens and can be used as diagnostic biomarkers for predicting various stages of disease severity as well as being used as IL-6 inhibitors for the suppression of aggressive multi-signaling immune responses. Thus, the accurate identification of IL-6 inducing peptides is of great importance for investigating their mechanism of action as well as for developing diagnostic and immunotherapeutic applications. This study proposes a novel stacking ensemble model (termed StackIL6) for accurately identifying IL-6 inducing peptides. More specifically, StackIL6 was constructed from twelve different feature descriptors derived from three major groups of features (composition-based features, composition-transition-distribution-based features and physicochemical properties-based features) and five popular machine learning algorithms (extremely randomized trees, logistic regression, multi-layer perceptron, support vector machine and random forest). To enhance the utility of the baseline models, they were systematically integrated through a stacking strategy to build the final meta-based model. Extensive benchmarking experiments demonstrated that StackIL6 could achieve significantly better performance than the existing method (IL6PRED) and outperformed its constituent baseline models on both training and independent test datasets, thereby supporting its excellent discrimination and generalization abilities. To facilitate easy access to the StackIL6 model, it was established as a freely available web server accessible at http://camt.pythonanywhere.com/StackIL6. It is anticipated that StackIL6 can help to facilitate rapid screening of promising IL-6 inducing peptides for the development of diagnostic and immunotherapeutic applications in the future.


Subject(s)
Computational Biology/methods , Interleukin-6/biosynthesis , Peptides/metabolism , Algorithms , Amino Acid Sequence , Benchmarking , Chemical Phenomena , Humans , Machine Learning , Peptides/chemistry , ROC Curve , Reproducibility of Results
5.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-33975333

ABSTRACT

Neuropeptides (NPs) are the most versatile neurotransmitters in the immune systems that regulate various central anxious hormones. An efficient and effective bioinformatics tool for rapid and accurate large-scale identification of NPs is critical in immunoinformatics, which is indispensable for basic research and drug development. Although a few NP prediction tools have been developed, their prediction performance still needs to be improved. In this study, we have developed a machine learning-based meta-predictor called NeuroPred-FRL by employing the feature representation learning approach. First, we generated 66 optimal baseline models by employing 11 different encodings, six different classifiers and a two-step feature selection approach. The predicted probability scores of NPs from the 66 baseline models were combined and used as the input feature vector. Second, in order to enhance the feature representation ability, we applied the two-step feature selection approach to optimize the 66-D probability feature vector and then fed the optimal one into a random forest classifier to construct the final meta-model (NeuroPred-FRL). Benchmarking experiments based on both cross-validation and independent tests indicate that NeuroPred-FRL achieves superior NP prediction performance compared with the other state-of-the-art predictors. We believe that the proposed NeuroPred-FRL can serve as a powerful tool for large-scale identification of NPs, facilitating the characterization of their functional mechanisms and expediting their applications in clinical therapy. Moreover, we interpreted some model mechanisms of NeuroPred-FRL by leveraging the robust SHapley Additive exPlanation algorithm.


Subject(s)
Computational Biology/methods , Machine Learning , Neuropeptides/chemistry , Software , Algorithms , Consensus Sequence , Databases, Genetic , Internet-Based Intervention , Neuropeptides/metabolism , Position-Specific Scoring Matrices , Reproducibility of Results , Workflow
6.
J Chem Inf Model ; 63(22): 7239-7257, 2023 Nov 27.
Article in English | MEDLINE | ID: mdl-37947586

ABSTRACT

Understanding the pathogenicity of missense mutations (MMs) is essential for shedding light on genetic diseases, gene functions, and individual variations. In this study, we propose a novel computational approach, called MMPatho, for enhancing missense mutation pathogenicity prediction. First, we established a large-scale nonredundant MM benchmark data set based on the entire Ensembl database, complemented by a focused blind test set specifically for pathogenic GOF/LOF MMs. Based on this data set, for each mutation, we utilized Ensembl VEP v104 and dbNSFP v4.1a to extract variant-level, amino acid-level, individuals' outputs, and genome-level features. Additionally, protein sequences were generated using ENSP identifiers with the Ensembl API, and then encoded. The mutant sites' ESM-1b and ProtTrans-T5 embeddings were subsequently extracted. Then, our model group (MMPatho), which comprises ConsMM and EvoIndMM, was developed by leveraging these efforts. Specifically, ConsMM employs individuals' outputs and XGBoost with SHAP explanation analysis, while EvoIndMM investigates the potential enhancement of predictive capability by incorporating evolutionary information from ESM-1b and ProtT5-XL-U50, large protein language embeddings. Through rigorous comparative experiments, both ConsMM and EvoIndMM were capable of achieving remarkable AUROC (0.9836 and 0.9854) and AUPR (0.9852 and 0.9902) values on the blind test set devoid of overlapping variations and proteins from the training data, thus highlighting the superiority of our computational approach in the prediction of MM pathogenicity. Our Web server, available at http://csbio.njust.edu.cn/bioinf/mmpatho/, allows researchers to predict the pathogenicity (alongside the reliability index score) of MMs using the ConsMM and EvoIndMM models and provides extensive annotations for user input. Additionally, the newly constructed benchmark data set and blind test set can be accessed via the data page of our web server.


Subject(s)
Computational Biology , Mutation, Missense , Humans , Reproducibility of Results , Consensus , Proteins
7.
Methods ; 204: 189-198, 2022 08.
Article in English | MEDLINE | ID: mdl-34883239

ABSTRACT

The development of efficient and effective bioinformatics tools and pipelines for identifying peptides with dipeptidyl peptidase IV (DPP-IV) inhibitory activities from large-scale protein datasets is of great importance for the discovery and development of potential and promising antidiabetic drugs. In this study, we present a novel stacking-based ensemble learning predictor (termed StackDPPIV) designed for identification of DPP-IV inhibitory peptides. Unlike the existing method, which is based on a single-feature approach, we combined five popular machine learning algorithms in conjunction with ten different feature encodings from multiple perspectives to generate a pool of various baseline models. Subsequently, the probabilistic features derived from these baseline models were systematically integrated and treated as new feature representations. Finally, in order to improve the predictive performance, the genetic algorithm based on the self-assessment-report was utilized to determine a set of informative probabilistic features, and the optimal set was then used for developing the final meta-predictor (StackDPPIV). Experimental results demonstrated that StackDPPIV could outperform its constituent baseline models on both the training and independent datasets. Furthermore, StackDPPIV achieved an accuracy of 0.891, MCC of 0.784 and AUC of 0.961, which were 9.4%, 19.0% and 11.4%, respectively, higher than those of the existing method on the independent test. Feature analysis demonstrated that our feature representations had more discriminative ability than conventional feature descriptors, highlighting that the combination of different features was essential for the performance improvement. In order to implement the proposed predictor, we built a user-friendly online web server at http://pmlabstack.pythonanywhere.com/StackDPPIV.


Subject(s)
Dipeptidyl Peptidase 4 , Peptides , Computational Biology , Dipeptidyl Peptidase 4/metabolism , Machine Learning , Peptides/pharmacology , Proteins
8.
Bioinformatics ; 37(17): 2556-2562, 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-33638635

ABSTRACT

MOTIVATION: The identification of bitter peptides through experimental approaches is an expensive and time-consuming endeavor. Due to the huge number of newly available peptide sequences in the post-genomic era, the development of automated computational models for the identification of novel bitter peptides is highly desirable. RESULTS: In this work, we present BERT4Bitter, a bidirectional encoder representation from transformers (BERT)-based model for predicting bitter peptides directly from their amino acid sequence without using any structural information. To the best of our knowledge, this is the first time a BERT-based model has been employed to identify bitter peptides. Compared to widely used machine learning models, BERT4Bitter achieved the best performance with an accuracy of 0.861 and 0.922 for cross-validation and independent tests, respectively. Furthermore, extensive empirical benchmarking experiments on the independent dataset demonstrated that BERT4Bitter clearly outperformed the existing method with improvements of 8.0% in accuracy and 16.0% in Matthews correlation coefficient, highlighting the effectiveness and robustness of BERT4Bitter. We believe that the BERT4Bitter method proposed herein will be a useful tool for rapidly screening and identifying novel bitter peptides for drug development and nutritional research. AVAILABILITY AND IMPLEMENTATION: The user-friendly web server of the proposed BERT4Bitter is freely accessible at http://pmlab.pythonanywhere.com/BERT4Bitter. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

9.
J Comput Aided Mol Des ; 36(11): 781-796, 2022 11.
Article in English | MEDLINE | ID: mdl-36284036

ABSTRACT

The blood-brain barrier (BBB) is the primary barrier with a highly selective semipermeable border between blood vascular endothelial cells and the central nervous system. Since the BBB can prevent drugs circulating in the blood from crossing into the interstitial fluid of the brain where neurons reside, many researchers are working on developing drug delivery systems to penetrate the BBB, which currently poses a challenge. Thus, blood-brain barrier penetrating peptides (B3PPs) are an alternative neurotherapeutic for brain-related disorders since they can facilitate drug delivery into the brain. Meanwhile, developing computational methods that are effective for both the identification and characterization of B3PPs in a cost-effective manner plays an important role in basic research and in the pharmaceutical industry. Even though a few computational methods for B3PP identification have been developed, their performance might fall short in terms of generalization ability and interpretability. In this study, a novel and efficient scoring card method-based predictor (termed SCMB3PP) is presented for improving B3PP identification and characterization. To overcome the limitation of black-box computational approaches, the SCMB3PP predictor can automatically estimate the propensities of amino acids and dipeptides to be B3PPs. Both cross-validation and independent tests indicate that SCMB3PP can achieve impressive performance and outperform various popular machine learning-based methods and the existing methods on multiple independent test datasets. Furthermore, SCMB3PP-derived amino acid propensities were utilized to identify informative biophysical and biochemical properties for characterizing B3PPs. Finally, an online user-friendly web server ( http://pmlabstack.pythonanywhere.com/SCMB3PP ) is established to identify novel and potential B3PPs cost-effectively. This novel computational approach is anticipated to facilitate the large-scale identification of high potential B3PP candidates for follow-up experimental validation.


Subject(s)
Blood-Brain Barrier , Dipeptides , Dipeptides/chemistry , Dipeptides/metabolism , Propensity Score , Endothelial Cells , Peptides/metabolism , Amino Acids/chemistry
10.
Genomics ; 113(1 Pt 2): 689-698, 2021 01.
Article in English | MEDLINE | ID: mdl-33017626

ABSTRACT

Fast, accurate identification and characterization of amyloid proteins at a large scale is essential for understanding their role in therapeutic intervention strategies. In fact, there exists only one in silico model for amyloid protein identification, namely RFAmy, which uses the random forest (RF) model in conjunction with various feature types. However, it suffers from low interpretability for biologists. Thus, it is highly desirable to develop a simple and easily interpretable prediction method with robust accuracy as compared to the existing complicated model. In this study, we propose iAMY-SCM, the first scoring card method-based predictor for predicting and analyzing amyloid proteins. Herein, iAMY-SCM made use of a simple weighted-sum function in conjunction with the propensity scores of dipeptides for amyloid protein identification. Cross-validation results indicated that iAMY-SCM provided an accuracy of 0.895, which corresponded to 10-22% higher performance than that of widely used machine learning models. Furthermore, iAMY-SCM achieved an accuracy of 0.827 as evaluated by an independent test, which was found to be comparable to that of RFAmy and was approximately 9-13% higher than that of widely used machine learning models. Furthermore, the analysis of estimated propensity scores of amino acids and dipeptides was performed to provide insights into the biophysical and biochemical properties of amyloid proteins. As such, this demonstrates that the proposed iAMY-SCM is efficient and reliable in terms of simplicity, interpretability and implementation. To facilitate ease of use of the proposed iAMY-SCM, a user-friendly and publicly accessible web server at http://camt.pythonanywhere.com/iAMY-SCM has been established. We anticipate that iAMY-SCM will be an important tool for facilitating the large-scale prediction and characterization of amyloid proteins.


Subject(s)
Amyloid/chemistry , Sequence Analysis, Protein/methods , Software , Amyloid/genetics , Amyloid/metabolism , Machine Learning , Propensity Score , Protein Conformation , Protein Multimerization
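The "simple weighted-sum function in conjunction with the propensity scores of dipeptides" mentioned in the iAMY-SCM abstract can be sketched as follows. This is an illustration of the general idea only: the propensity values, the default score for unlisted dipeptides, and the decision threshold are all made-up assumptions, not values from the paper, and the abstract does not describe how the real scores are estimated.

```python
from collections import Counter

# Hypothetical propensity scores for a handful of dipeptides (illustrative only).
TOY_PROPENSITY = {"VV": 800.0, "IV": 750.0, "KE": 200.0, "EK": 150.0}
DEFAULT_SCORE = 500.0  # assumed neutral score for dipeptides not in the table
THRESHOLD = 500.0      # assumed decision threshold


def dipeptide_weights(seq: str) -> dict[str, float]:
    """Relative frequency of each overlapping dipeptide in the sequence."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {dp: n / len(pairs) for dp, n in counts.items()}


def scm_score(seq: str) -> float:
    """Weighted sum: dipeptide frequency times dipeptide propensity score."""
    weights = dipeptide_weights(seq)
    return sum(w * TOY_PROPENSITY.get(dp, DEFAULT_SCORE) for dp, w in weights.items())


def predict_positive(seq: str) -> bool:
    """Classify as positive when the score exceeds the assumed threshold."""
    return scm_score(seq) > THRESHOLD


if __name__ == "__main__":
    for peptide in ("VVVV", "KEKE"):
        print(peptide, scm_score(peptide), predict_positive(peptide))
```

Because the score is a plain sum of per-dipeptide contributions, each dipeptide's effect on the prediction can be read directly from the table, which is the interpretability advantage the abstract emphasizes.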
11.
Bioinformatics ; 36(11): 3350-3356, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32145017

ABSTRACT

MOTIVATION: The failure of therapeutic peptides in clinical trials can be attributed to their toxicity profiles, such as hemolytic activity, which hamper further progress of peptides as drug candidates. The accurate prediction of hemolytic peptides (HLPs) and their activity from the given peptides is one of the challenging tasks in immunoinformatics, which is essential for drug development and basic research. Although a few computational methods have been proposed for this purpose, none of them is able to identify HLPs and their activities simultaneously. RESULTS: In this study, we proposed a two-layer prediction framework, called HLPpred-Fuse, that can accurately and automatically predict both hemolytic peptides (HLPs or non-HLPs) and HLP activity (high or low). More specifically, a feature representation learning scheme was utilized to generate 54 probabilistic features by integrating six different machine learning classifiers and nine different sequence-based encodings. Subsequently, the 54 probabilistic features were fused to provide sufficiently converged sequence information, which was used as an input to an extremely randomized tree for the development of two final prediction models that independently identify HLPs and their activity. Performance comparisons over empirical cross-validation analysis, independent test and case study against state-of-the-art methods demonstrate that HLPpred-Fuse consistently outperformed these methods in the identification of hemolytic activity. AVAILABILITY AND IMPLEMENTATION: For the convenience of experimental scientists, a web-based tool has been established at http://thegleelab.org/HLPpred-Fuse. CONTACT: glee@ajou.ac.kr or watshara.sho@mahidol.ac.th or bala@ajou.ac.kr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Machine Learning , Peptides
12.
J Comput Aided Mol Des ; 35(3): 315-323, 2021 03.
Article in English | MEDLINE | ID: mdl-33392948

ABSTRACT

Redox-sensitive cysteine (RSC) thiol contributes to many biological processes. The identification of RSCs plays an important role in clarifying some mechanisms of redox-sensitive factors; nonetheless, experimental investigation of RSCs is expensive and time-consuming. Computational approaches that quickly and accurately identify candidate RSCs using sequence information are urgently needed. Herein, an improved and robust computational predictor named IRC-Fuse was developed to identify RSCs by fusing multiple feature representations. To enhance the performance of our model, we integrated the probability scores evaluated by random forest models implementing different encoding schemes. Cross-validation results showed that IRC-Fuse achieved an accuracy and AUC of 0.741 and 0.807, respectively. IRC-Fuse outperformed existing methods with improvements of 10% and 13% in accuracy and MCC, respectively, on independent test data. Comparative analysis suggested that IRC-Fuse was more effective and promising than the existing predictors. For the convenience of experimental scientists, the IRC-Fuse online web server was implemented and made publicly accessible at http://kurata14.bio.kyutech.ac.jp/IRC-Fuse/ .


Subject(s)
Benchmarking/methods , Cysteine/chemistry , Proteins/chemistry , Amino Acid Sequence , Computational Biology , Databases, Factual , Machine Learning , Models, Molecular , Oxidation-Reduction , Sulfhydryl Compounds/chemistry
13.
J Comput Aided Mol Des ; 35(10): 1037-1053, 2021 10.
Article in English | MEDLINE | ID: mdl-34622387

ABSTRACT

Fast and accurate identification of inhibitors with potency against HCV NS5B polymerase is currently a challenging task. Although conventional experimental methods are the gold standard for the design and development of new HCV inhibitors, they often require costly investment of time and resources. In this study, we develop a novel machine learning-based meta-predictor (termed StackHCV) for accurate and large-scale identification of HCV inhibitors. Unlike the existing method, which is based on a single-feature approach, we first constructed a pool of various baseline models by employing a wide range of heterogeneous molecular fingerprints with five popular machine learning algorithms (k-nearest neighbor, multi-layer perceptron, partial least squares, random forest and support vector machine). Second, we integrated these baseline models in order to develop the final meta-based model by means of the stacking strategy. Extensive benchmarking experiments showed that StackHCV achieved a more accurate and stable performance as compared to its constituent baseline models on the training dataset and also outperformed the existing predictor on the independent test dataset. To facilitate the high-throughput identification of HCV inhibitors, we built a web server that can be freely accessed at http://camt.pythonanywhere.com/StackHCV . It is expected that StackHCV could be a useful tool for fast and precise identification of potential drugs against HCV NS5B, particularly for liver cancer therapy and other clinical applications.


Subject(s)
Antiviral Agents/pharmacology , Enzyme Inhibitors/pharmacology , Hepacivirus/drug effects , Hepatitis C/drug therapy , Internet/statistics & numerical data , Machine Learning , RNA-Dependent RNA Polymerase/antagonists & inhibitors , Viral Nonstructural Proteins/antagonists & inhibitors , Algorithms , Antiviral Agents/isolation & purification , Enzyme Inhibitors/isolation & purification , Hepacivirus/isolation & purification , Hepatitis C/virology , Humans , Support Vector Machine
14.
Genomics ; 112(4): 2813-2822, 2020 07.
Article in English | MEDLINE | ID: mdl-32234434

ABSTRACT

In general, hydrolyzed proteins, plant-derived alkaloids and toxins display an unpleasant bitter taste. Thus, the perception of bitter taste plays a crucial role in protecting animals from poisonous plants and environmental toxins. Therapeutic peptides have attracted great attention as a new drug class. The successful identification and characterization of bitter peptides are essential for drug development and nutritional research. Owing to the large volume of peptides generated in the post-genomic era, there is an urgent need to develop computational methods for rapidly and effectively discriminating bitter peptides from non-bitter peptides. To the best of our knowledge, there is as yet no computational model for predicting and analyzing bitter peptides using sequence information. In this study, we present for the first time a computational model called iBitter-SCM that can predict the bitterness of peptides directly from their amino acid sequence without any dependence on their functional domain or structural information. iBitter-SCM is a simple and effective method that was built using the scoring card method (SCM) with estimated propensity scores of amino acids and dipeptides. Our benchmarking results demonstrated that iBitter-SCM achieved an accuracy and Matthews correlation coefficient of 84.38% and 0.688, respectively, on the independent dataset. A rigorous independent test indicated that iBitter-SCM was superior to other widely used machine-learning classifiers (e.g. k-nearest neighbor, naive Bayes, decision tree and random forest) owing to its simplicity, interpretability and implementation. Furthermore, the analysis of estimated propensity scores of amino acids and dipeptides was performed to provide a better understanding of the biophysical and biochemical properties of bitter peptides. For the convenience of experimental scientists, a web server is provided publicly at http://camt.pythonanywhere.com/iBitter-SCM. It is anticipated that iBitter-SCM can serve as an important tool to facilitate the high-throughput prediction and de novo design of bitter peptides.


Subject(s)
Dipeptides/chemistry , Sequence Analysis, Protein/methods , Software , Taste , Amino Acids/chemistry , Hydrophobic and Hydrophilic Interactions , Machine Learning , Propensity Score , Sequence Alignment
15.
Int J Mol Sci ; 22(5)2021 Mar 08.
Article in English | MEDLINE | ID: mdl-33800121

ABSTRACT

Nitrotyrosine, which is generated by numerous reactive nitrogen species, is a type of protein post-translational modification. Identification of site-specific nitration modification on tyrosine is a prerequisite to understanding the molecular function of nitrated proteins. Thanks to the progress of machine learning, computational prediction can play a vital role before the biological experimentation. Herein, we developed a computational predictor PredNTS by integrating multiple sequence features including K-mer, composition of k-spaced amino acid pairs (CKSAAP), AAindex, and binary encoding schemes. The important features were selected by the recursive feature elimination approach using a random forest classifier. Finally, we linearly combined the successive random forest (RF) probability scores generated by the different, single encoding-employing RF models. The resultant PredNTS predictor achieved an area under a curve (AUC) of 0.910 using five-fold cross validation. It outperformed the existing predictors on a comprehensive and independent dataset. Furthermore, we investigated several machine learning algorithms to demonstrate the superiority of the employed RF algorithm. The PredNTS is a useful computational resource for the prediction of nitrotyrosine sites. The web-application with the curated datasets of the PredNTS is publicly available.


Subject(s)
Computational Biology , Machine Learning , Protein Processing, Post-Translational , Proteins/genetics , Sequence Analysis, Protein , Support Vector Machine , Tyrosine/analogs & derivatives , Tyrosine/genetics
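Among the sequence features named in the PredNTS abstract, the composition of k-spaced amino acid pairs (CKSAAP) is easy to show in code. The sketch below follows the commonly used definition: count every pair of residues separated by exactly k intervening positions and normalize by the number of such pairs. The abstract does not state which k values or window length PredNTS uses, so the inputs here are illustrative assumptions, and only observed pairs are returned rather than a full 400-dimensional vector.

```python
from collections import Counter


def cksaap(seq: str, k: int) -> dict[str, float]:
    """Frequencies of residue pairs separated by k intervening residues.

    For k = 0 this reduces to the ordinary dipeptide composition.
    """
    n_pairs = len(seq) - k - 1
    if n_pairs <= 0:
        return {}
    counts = Counter(seq[i] + seq[i + k + 1] for i in range(n_pairs))
    return {pair: c / n_pairs for pair, c in counts.items()}


if __name__ == "__main__":
    # Illustrative sequence; a real predictor would use a window around a tyrosine site.
    for gap in (0, 1, 2):
        print(gap, cksaap("ACAYCA", gap))
```

Concatenating the outputs for several values of k gives one feature block, which a classifier such as a random forest can then consume.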
16.
Int J Mol Sci ; 22(23)2021 Dec 04.
Article in English | MEDLINE | ID: mdl-34884927

ABSTRACT

Umami ingredients have been identified as important factors in food seasoning and production. Traditional experimental methods for characterizing peptides exhibiting umami sensory properties (umami peptides) are time-consuming, laborious, and costly. As a result, it is preferable to develop computational tools for the large-scale identification of available sequences in order to identify novel peptides with umami sensory properties. Although a computational tool has been developed for this purpose, its predictive performance is still insufficient. In this study, we use a feature representation learning approach to create a novel machine-learning meta-predictor called UMPred-FRL for improved umami peptide identification. We combined six well-known machine learning algorithms (extremely randomized trees, k-nearest neighbor, logistic regression, partial least squares, random forest, and support vector machine) with seven different feature encodings (amino acid composition, amphiphilic pseudo-amino acid composition, dipeptide composition, composition-transition-distribution, and pseudo-amino acid composition) to develop the final meta-predictor. Extensive experimental results demonstrated that UMPred-FRL was effective and achieved more accurate performance on the benchmark dataset compared to its baseline models, and consistently outperformed the existing method on the independent test dataset. Finally, to aid in the high-throughput identification of umami peptides, the UMPred-FRL web server was established and made freely available online. It is expected that UMPred-FRL will be a powerful tool for the cost-effective large-scale screening of candidate peptides with potential umami sensory properties.


Subject(s)
Computational Biology/methods , Machine Learning , Peptides/chemistry , Algorithms , Databases, Protein , Dietary Proteins/chemistry , Internet , Support Vector Machine , Taste
17.
Int J Mol Sci ; 22(16)2021 Aug 19.
Article in English | MEDLINE | ID: mdl-34445663

ABSTRACT

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become effective approaches for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performance could be improved. In this study, we developed a new predictor (named iBitter-Fuse) for achieving more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we integrated a variety of feature encoding schemes to provide sufficient information from different aspects, namely compositional information and physicochemical properties. To enhance the predictive performance, the customized genetic algorithm utilizing self-assessment-report (GA-SAR) was employed to identify informative features, and the optimal ones were then input into a support vector machine (SVM)-based classifier to develop the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse achieved more accurate performance than state-of-the-art methods. To facilitate the high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.
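The 10-fold cross-validation protocol mentioned in this abstract can be sketched as follows. This is a generic index splitter written for illustration; the function name, the seed parameter, and the shuffling strategy are assumptions, and the sketch does not reproduce the iBitter-Fuse evaluation code.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so every peptide is predicted once by a model that never saw it during training.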


Subject(s)
Algorithms , Machine Learning , Peptide Fragments/chemistry , Software , Support Vector Machine , Taste , Benchmarking , Humans , Predictive Value of Tests
18.
Int J Mol Sci ; 22(4)2021 Feb 20.
Article in English | MEDLINE | ID: mdl-33672741

ABSTRACT

Pupylation is a type of reversible post-translational modification of proteins, which plays a key role in the cellular function of microbial organisms. Several proteomics methods have been developed for the prediction and analysis of pupylated proteins and pupylation sites. However, the traditional experimental methods are laborious and time-consuming. Hence, computational algorithms that can predict potential pupylation sites using sequence features are highly needed. In this research, a new prediction model, PUP-Fuse, has been developed for pupylation site prediction by integrating multiple sequence representations. We explored five types of feature encoding approaches and three machine learning (ML) algorithms. In the final model, we integrated the successive ML scores using a linear regression model. PUP-Fuse achieved a Matthews correlation coefficient of 0.768 in a 10-fold cross-validation test. It also outperformed existing predictors in an independent test. The web server of PUP-Fuse with curated datasets is freely available.
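The Matthews correlation coefficient reported in this abstract is computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the function name and the convention of returning 0.0 when the denominator is zero are choices made here for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives, and false negatives.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), which makes it more informative than accuracy when the positive and negative classes are unbalanced.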


Subject(s)
Algorithms , Computational Biology/methods , Protein Processing, Post-Translational , Proteins/chemistry , Proteins/metabolism , Amino Acid Sequence , Databases, Protein
19.
J Proteome Res ; 19(10): 4125-4136, 2020 Oct 02.
Article in English | MEDLINE | ID: mdl-32897718

ABSTRACT

The inhibition of dipeptidyl peptidase IV (DPP-IV, E.C.3.4.14.5) is well recognized as a new avenue for the treatment of Type 2 diabetes (T2D). Until now, peptide-like DPP-IV inhibitors have been shown to normalize the blood glucose concentration in T2D subjects. To the best of our knowledge, there is as yet no computational model for predicting and analyzing DPP-IV inhibitory peptides using sequence information. In this study, we present for the first time a simple and easily interpretable sequence-based predictor using the scoring card method (SCM) for modeling the bioactivity of DPP-IV inhibitory peptides (iDPPIV-SCM). In particular, iDPPIV-SCM was developed by employing the SCM together with the propensity scores of amino acids. Rigorous independent test results demonstrated that the proposed iDPPIV-SCM was superior to well-known machine learning (ML) classifiers (e.g., k-nearest neighbor, logistic regression, and decision tree), with improvements of 2-11, 4-22, and 7-10% for accuracy, MCC, and AUC, respectively, while also achieving results comparable to those of the support vector machine. Furthermore, the estimated propensity scores of amino acids derived from iDPPIV-SCM were analyzed to provide a more in-depth understanding of the molecular basis for enhancing DPP-IV inhibitory potency. Taken together, these results revealed that iDPPIV-SCM was superior to other well-known ML classifiers owing to its simplicity, interpretability, and validity. For the convenience of biologists, the predictive model is deployed as a publicly accessible web server at http://camt.pythonanywhere.com/iDPPIV-SCM. It is anticipated that iDPPIV-SCM can serve as an important tool for the rapid screening of promising DPP-IV inhibitory peptides prior to their synthesis.
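A scoring-card style classifier of the kind described in this abstract can be sketched as a table lookup followed by a threshold. The sketch assumes the common SCM formulation in which a peptide's score is the average propensity score of its dipeptides; the function names, the default score for unlisted dipeptides, and any score values passed in are hypothetical and are not taken from iDPPIV-SCM.

```python
def scm_score(peptide, dipeptide_scores, default=0.0):
    """Average the propensity scores of all overlapping dipeptides.

    dipeptide_scores maps two-letter strings to scores; the table itself
    is an input here, and no real propensity values are implied.
    """
    peptide = peptide.upper()
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    if not pairs:
        raise ValueError("peptide must contain at least two residues")
    return sum(dipeptide_scores.get(pair, default) for pair in pairs) / len(pairs)


def is_predicted_active(peptide, dipeptide_scores, threshold):
    """Predict activity when the peptide score exceeds the threshold."""
    return scm_score(peptide, dipeptide_scores) > threshold
```

Because the whole model is a score table plus one threshold, each prediction can be traced back to the dipeptides that raised or lowered the score, which is the interpretability the abstract emphasizes.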


Subject(s)
Diabetes Mellitus, Type 2 , Dipeptidyl Peptidase 4 , Amino Acids , Diabetes Mellitus, Type 2/drug therapy , Humans , Peptides , Support Vector Machine
20.
Plant Mol Biol ; 103(1-2): 225-234, 2020 May.
Article in English | MEDLINE | ID: mdl-32140819

ABSTRACT

DNA N6-methyladenine (6mA) is one of the most vital epigenetic modifications and is involved in controlling various gene expression levels. With the avalanche of DNA sequences deposited in numerous databases, the accurate identification of 6mA plays an essential role in understanding molecular mechanisms. Because the experimental approaches are time-consuming and costly, it is desirable to develop a computational model for rapidly and accurately identifying 6mA. To the best of our knowledge, we are the first to propose a computational model, named i6mA-Fuse, to predict 6mA sites from the Rosaceae genomes, especially in Rosa chinensis and Fragaria vesca. We implemented five encoding schemes, i.e., mononucleotide binary, dinucleotide binary, k-space spectral nucleotide, k-mer, and electron-ion interaction pseudo potential compositions, to build five single-encoding random forest (RF) models. The i6mA-Fuse uses a linear regression model to combine the predicted probability scores of the five single-encoding RF models. The resultant species-specific i6mA-Fuse achieved remarkably high performance, with AUCs of 0.982 and 0.978 and MCCs of 0.869 and 0.858 on the independent datasets of Rosa chinensis and Fragaria vesca, respectively. In the F. vesca-specific i6mA-Fuse, MBE and EIIP contributed 75% and 25% of the total prediction; in the R. chinensis-specific i6mA-Fuse, Kmer, MBE, and EIIP contributed 15%, 65%, and 20% of the total prediction. To assist high-throughput prediction for DNA 6mA identification, the i6mA-Fuse is publicly accessible at https://kurata14.bio.kyutech.ac.jp/i6mA-Fuse/.
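The fusion step described in this abstract, where a linear model combines the probability scores of single-encoding models, can be sketched as a weighted sum. The sketch also shows a mononucleotide binary (one-hot) encoding. The function names, the A/C/G/T bit order, and the reading of the reported percentage contributions as linear weights are assumptions made for illustration, and any intercept term of the regression is omitted.

```python
# Assumed one-hot layout: one bit per nucleotide in the order A, C, G, T.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def mononucleotide_binary(sequence):
    """Encode a DNA sequence as a flat list of 4 bits per nucleotide."""
    encoded = []
    for base in sequence.upper():
        encoded.extend(ONE_HOT[base])
    return encoded


def fuse_scores(probabilities, weights):
    """Combine per-model probability scores as a weighted sum."""
    if probabilities.keys() != weights.keys():
        raise ValueError("probabilities and weights must name the same models")
    return sum(weights[name] * probabilities[name] for name in weights)
```

With weights of 0.75 for MBE and 0.25 for EIIP, as in the F. vesca contributions quoted above, hypothetical model outputs of 0.8 and 0.4 would fuse to 0.7.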


Subject(s)
Adenine/analogs & derivatives , DNA, Plant/metabolism , Rosaceae/metabolism , Adenine/metabolism , Algorithms , Binding Sites , Computational Biology , Datasets as Topic , Machine Learning , Models, Genetic , Rosaceae/genetics