Results 1 - 20 of 106
1.
Comput Biol Med ; 179: 108859, 2024 Jul 18.
Article in English | MEDLINE | ID: mdl-39029431

ABSTRACT

O-linked glycosylation is a complex post-translational modification (PTM) in human proteins that plays a critical role in regulating various cellular metabolic and signaling pathways. In contrast to N-linked glycosylation, O-linked glycosylation lacks specific sequence features and maintains an unstable core structure. Identifying O-linked threonine glycosylation sites (OTGs) remains challenging, requiring extensive experimental tests. While bioinformatics tools have emerged for predicting OTGs, their reliance on limited conventional features and absence of well-defined feature selection strategies limit their effectiveness. To address these limitations, we introduced HOTGpred (Human O-linked Threonine Glycosylation predictor), employing a multi-stage feature selection process to identify the optimal feature set for accurately identifying OTGs. Initially, we assessed 25 different feature sets derived from various pretrained protein language model (PLM)-based embeddings and conventional feature descriptors using nine classifiers. Subsequently, we integrated the top five embeddings linearly and determined the most effective scoring function for ranking hybrid features, identifying the optimal feature set through a process of sequential forward search. Among the classifiers, the extreme gradient boosting (XGBT)-based model, using the optimal feature set (HOTGpred), achieved 92.03% accuracy on the training dataset and 88.25% on the balanced independent dataset. Notably, HOTGpred significantly outperformed the current state-of-the-art methods on both the balanced and imbalanced independent datasets, demonstrating its superior prediction capabilities. Additionally, SHapley Additive exPlanations (SHAP) and ablation analyses were conducted to identify the features contributing most significantly to HOTGpred. Finally, we developed an easy-to-navigate web server, accessible at https://balalab-skku.org/HOTGpred/, to support glycobiologists in their research on glycosylation structure and function.
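
The sequential forward search mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the features are already ranked by some scoring function and that a subset is grown one feature at a time, keeping the best-scoring prefix. The names `sequential_forward_search`, `score_subset`, and `toy_score` are hypothetical, and the toy scorer stands in for a cross-validated classifier; this is not the HOTGpred implementation.

```python
from typing import Callable, List, Sequence


def sequential_forward_search(
    ranked_features: Sequence[str],
    score_subset: Callable[[List[str]], float],
) -> List[str]:
    """Grow a subset along a ranked feature list and return the best prefix.

    ranked_features: feature names sorted from most to least important.
    score_subset: hypothetical callback, e.g. cross-validated accuracy.
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    current: List[str] = []
    for feature in ranked_features:
        current.append(feature)          # add the next-ranked feature
        score = score_subset(current)    # evaluate the enlarged subset
        if score > best_score:           # remember the best prefix seen so far
            best_score = score
            best_subset = list(current)
    return best_subset


# Toy scorer (an assumption for illustration): the first three features help,
# every later feature only adds noise and lowers the score.
def toy_score(subset: List[str]) -> float:
    useful = {"f1", "f2", "f3"}
    gain = sum(1.0 for f in subset if f in useful)
    penalty = sum(0.25 for f in subset if f not in useful)
    return gain - penalty


selected = sequential_forward_search(["f1", "f2", "f3", "f4", "f5"], toy_score)
print(selected)
```

In practice `score_subset` would retrain and cross-validate a classifier for every candidate subset, which is why a good initial ranking matters for cost.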

2.
PLoS One ; 19(6): e0305406, 2024.
Article in English | MEDLINE | ID: mdl-38924058

ABSTRACT

2'-O-methylation (2-OM or Nm) is a widespread RNA modification observed in various RNA types like tRNA, mRNA, rRNA, miRNA, piRNA, and snRNA, which plays a crucial role in several biological functional mechanisms and innate immunity. To comprehend its modification mechanisms and potential epigenetic regulation, it is necessary to accurately identify 2-OM sites. However, biological experiments can be tedious, time-consuming, and expensive. Furthermore, currently available computational methods face challenges due to inadequate datasets and limited classification capabilities. To address these challenges, we proposed Meta-2OM, a cutting-edge predictor that can accurately identify 2-OM sites in human RNA. In brief, we applied a meta-learning approach that considered eight conventional machine learning algorithms, including tree-based classifiers and decision boundary-based classifiers, and eighteen different feature encoding algorithms that cover physicochemical, compositional, position-specific and natural language processing information. The predicted probabilities of 2-OM sites from the baseline models are then combined and trained using logistic regression to generate the final prediction. Consequently, Meta-2OM achieved excellent performance in both 5-fold cross-validation training and independent testing, outperforming all existing state-of-the-art methods. Specifically, on the independent test set, Meta-2OM achieved an overall accuracy of 0.870, sensitivity of 0.836, specificity of 0.904, and Matthews correlation coefficient of 0.743. To facilitate its use, a user-friendly web server and standalone program have been developed and made freely available at http://kurata35.bio.kyutech.ac.jp/Meta-2OM and https://github.com/kuratahiroyuki/Meta-2OM.


Subject(s)
Algorithms , RNA , Humans , RNA/genetics , RNA/chemistry , Methylation , Machine Learning , Software , Computational Biology/methods
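
The metrics reported for Meta-2OM above can be cross-checked with a short calculation. The sketch below assumes a balanced independent test set (equal numbers of positive and negative sites), which the abstract does not state. The helper names are hypothetical. Under that assumption, the reported sensitivity and specificity reproduce the reported accuracy and give a Matthews correlation coefficient close to the reported 0.743, with the small difference attributable to rounding of the inputs.

```python
import math


def accuracy_from_rates(sensitivity: float, specificity: float,
                        pos_fraction: float = 0.5) -> float:
    """Overall accuracy from sensitivity, specificity and class balance."""
    return sensitivity * pos_fraction + specificity * (1.0 - pos_fraction)


def mcc_from_rates(sensitivity: float, specificity: float,
                   pos_fraction: float = 0.5) -> float:
    """Matthews correlation coefficient from the same three quantities."""
    # Confusion-matrix cells expressed as fractions of the whole test set.
    tp = sensitivity * pos_fraction
    fn = (1.0 - sensitivity) * pos_fraction
    tn = specificity * (1.0 - pos_fraction)
    fp = (1.0 - specificity) * (1.0 - pos_fraction)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Values taken from the abstract; pos_fraction=0.5 is the balanced-set assumption.
acc = accuracy_from_rates(0.836, 0.904)
mcc = mcc_from_rates(0.836, 0.904)
print(round(acc, 3), round(mcc, 3))
```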
3.
Int J Biol Macromol ; 273(Pt 2): 133085, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38871100

ABSTRACT

Allergy is a hypersensitive condition in which individuals develop objective symptoms when exposed to harmless substances at a dose that would cause no harm to a "normal" person. Most current computational methods for allergen identification rely on homology or conventional machine learning using a limited set of feature descriptors or validation on specific datasets, making them inefficient and inaccurate. Here, we propose SEP-AlgPro for the accurate identification of allergen proteins from sequence information. We analyzed 10 conventional protein-based features and 14 different features derived from protein language models to gauge their effectiveness in differentiating allergens from non-allergens using 15 different classifiers. However, the final optimized model employs the top 10 feature descriptors with the top seven machine learning classifiers. Results show that the features derived from protein language models exhibit superior discriminative capabilities compared to traditional feature sets. This enabled us to select the most discriminatory baseline models, whose predicted outputs were aggregated and used as input to a deep neural network for the final allergen prediction. Extensive case studies showed that SEP-AlgPro outperforms state-of-the-art predictors in accurately identifying allergens. A user-friendly web server was developed and made freely available at https://balalab-skku.org/SEP-AlgPro/, making it a powerful tool for identifying potential allergens.


Subject(s)
Allergens , Deep Learning , Machine Learning , Allergens/immunology , Allergens/chemistry , Software , Computational Biology/methods , Humans , Neural Networks, Computer
4.
Methods ; 229: 133-146, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38944134

ABSTRACT

Asparagine peptide lyase (APL) is among the seven groups of proteases, also known as proteolytic enzymes, which are classified according to their catalytic residue. APLs are synthesized as precursors or propeptides that undergo self-cleavage through an autoproteolytic reaction. At present, APLs are grouped into 10 families belonging to six different clans of proteases. Given their critical roles in many biological processes, including virus maturation and virulence, accurate identification and characterization of APLs is indispensable. Experimental identification and characterization of APLs is laborious and time-consuming. Here, we developed APLpred, a novel support vector machine (SVM) based predictor that can predict APLs from the primary sequences. APLpred was developed using Boruta-based optimal features derived from seven encodings and subsequently trained using five machine learning algorithms. After evaluating each model on an independent dataset, we selected APLpred (an SVM-based model) due to its consistent performance during cross-validation and independent evaluation. We anticipate APLpred will be an effective tool for identifying APLs. This could aid in designing inhibitors against these enzymes and exploring their functions. The APLpred web server is freely available at https://procarb.org/APLpred/.

5.
Methods ; 227: 37-47, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38729455

ABSTRACT

RNA modification serves as a pivotal component in numerous biological processes. Among the prevalent modifications, 5-methylcytosine (m5C) significantly influences mRNA export, translation efficiency and cell differentiation and is also associated with human diseases, including Alzheimer's disease, autoimmune disease, cancer, and cardiovascular diseases. Identification of m5C is critical for understanding the RNA modification mechanisms and the epigenetic regulation of associated diseases. However, the large-scale experimental identification of m5C presents significant challenges due to labor intensity and time requirements. Several computational tools, using machine learning, have been developed to supplement experimental methods, but they lack accuracy and efficiency in identifying these sites. In this study, we introduce a new predictor, MLm5C, for precise prediction of m5C sites using sequence data. Briefly, we evaluated eleven RNA sequence-derived features with four basic machine learning algorithms to generate baseline models. We ranked the resulting 44 models based on their performance and subsequently stacked the top 20 baseline models as the best model, named MLm5C. MLm5C outperformed the state-of-the-art predictors. Notably, the optimization of the sequence length surrounding the modification sites significantly improved the prediction performance. MLm5C is an invaluable tool in accelerating the detection of m5C sites within the human genome, thereby facilitating the characterization of their roles in post-transcriptional regulation.


Subject(s)
5-Methylcytosine , Machine Learning , RNA , Humans , 5-Methylcytosine/metabolism , 5-Methylcytosine/chemistry , RNA/genetics , RNA/chemistry , RNA/metabolism , Computational Biology/methods , RNA Processing, Post-Transcriptional , Algorithms
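
The baseline-model bookkeeping described for MLm5C above (eleven features times four algorithms, ranking, and stacking of the top 20) can be sketched as follows. All feature names, algorithm names, scores, and probabilities here are placeholders, since the abstract does not list them; the random numbers are not reported results. The function `build_meta_features` is a hypothetical helper showing how per-model predicted probabilities might become the input rows of a stacked model.

```python
from itertools import product
import random

# Placeholder names: the abstract does not list the 11 features or 4 algorithms.
features = [f"feature_{i}" for i in range(1, 12)]
algorithms = [f"algorithm_{j}" for j in range(1, 5)]
baseline_models = list(product(features, algorithms))  # 11 x 4 = 44 pairs

# Placeholder cross-validation scores (random numbers, NOT reported results).
rng = random.Random(0)
cv_score = {model: rng.random() for model in baseline_models}

# Rank the baseline models by score and keep the 20 best.
ranked = sorted(baseline_models, key=lambda m: cv_score[m], reverse=True)
top_models = ranked[:20]


def build_meta_features(probabilities, selected):
    """Turn per-model probability lists into one meta-feature row per sample."""
    n_samples = len(probabilities[selected[0]])
    return [[probabilities[m][i] for m in selected] for i in range(n_samples)]


# Placeholder predicted probabilities for three samples per baseline model.
probabilities = {m: [rng.random() for _ in range(3)] for m in baseline_models}
meta_features = build_meta_features(probabilities, top_models)
print(len(baseline_models), len(top_models), len(meta_features), len(meta_features[0]))
```

Each row of `meta_features` would then be passed to a second-level classifier; which classifier MLm5C uses at that level is not stated in the abstract.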
6.
Mol Ther Nucleic Acids ; 35(2): 102192, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38779332

ABSTRACT

RNA N4-acetylcytidine (ac4C) is a highly conserved RNA modification that plays a crucial role in controlling mRNA stability, processing, and translation. Consequently, accurate identification of ac4C sites across the genome is critical for understanding gene expression regulation mechanisms. In this study, we have developed ac4C-AFL, a bioinformatics tool that precisely identifies ac4C sites from primary RNA sequences. In ac4C-AFL, we identified the optimal sequence length for model building and implemented an adaptive feature representation strategy that is capable of extracting the most representative features from RNA. To identify the most relevant features, we proposed a novel ensemble feature importance scoring strategy to rank features effectively. We then used this information to conduct the sequential forward search, which individually determines the optimal feature set for each of the 16 sequence-derived feature descriptors. Utilizing these optimal feature descriptors, we constructed 176 baseline models using 11 popular classifiers. The most efficient baseline models were identified using the two-step feature selection approach, and their predicted scores were integrated and trained with the appropriate classifier to develop the final prediction model. Our rigorous cross-validations and independent tests demonstrate that ac4C-AFL surpasses contemporary tools in predicting ac4C sites. Moreover, we have developed a publicly accessible web server at https://balalab-skku.org/ac4C-AFL/.

7.
Methods ; 229: 1-8, 2024 May 18.
Article in English | MEDLINE | ID: mdl-38768932

ABSTRACT

SARS-CoV-2's global spread has instigated a critical health and economic emergency, impacting countless individuals. Understanding the virus's phosphorylation sites is vital to unravel the molecular intricacies of the infection and subsequent changes in host cellular processes. Several computational methods have been proposed to identify phosphorylation sites, typically focusing on specific residue (S/T) or Y phosphorylation sites. Unfortunately, current predictive tools perform best on these specific residues and may not extend their efficacy to other residues, emphasizing the urgent need for enhanced methodologies. In this study, we developed a novel predictor that integrates phosphorylation site information for all residues (STY). We extracted ten different feature descriptors, primarily derived from composition, evolutionary, and position-specific information, and assessed their discriminative power through five classifiers. Our results indicated that Light Gradient Boosting (LGB) showed superior performance, and five descriptors displayed excellent discriminative capabilities. Subsequently, we found that the top two integrated features have high discriminative capability and trained them with LGB to develop the final prediction model, LGB-IPs. The proposed approach shows excellent performance on 10-fold cross-validation, with ACC, MCC, and AUC values of 0.831, 0.662, and 0.907, respectively. Notably, these performances are replicated in the independent evaluation. Consequently, our approach may provide valuable insights into the phosphorylation mechanisms in SARS-CoV-2 infection for biomedical researchers.

8.
Comput Biol Med ; 171: 108229, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38447500

ABSTRACT

Conventional COVID-19 testing methods have some flaws: they are expensive and time-consuming. Chest X-ray (CXR) diagnostic approaches can alleviate these flaws to some extent. However, there is no accurate and practical automatic diagnostic framework with good interpretability. The application of artificial intelligence (AI) technology to medical radiography can help to accurately detect the disease, reduce the burden on healthcare organizations, and provide good interpretability. Therefore, this study proposes a new convolutional neural network (CNN) based on CXR for COVID-19 diagnosis, CodeNet. This method uses contrastive learning to make full use of latent image data to enhance the model's ability to extract features and generalize across different data domains. On the evaluation dataset, the proposed method achieves an accuracy as high as 94.20%, outperforming several other existing methods used for comparison. Ablation studies validate the efficacy of the proposed method, while interpretability analysis shows that the method can effectively guide clinical professionals. This work demonstrates the superior detection performance of a CNN using contrastive learning techniques on CXR images, paving the way for computer vision and artificial intelligence technologies to leverage massive medical data for disease diagnosis.


Subject(s)
COVID-19 , Deep Learning , Humans , COVID-19/diagnostic imaging , COVID-19 Testing , Artificial Intelligence , Neural Networks, Computer
9.
Comput Biol Med ; 168: 107688, 2024 01.
Article in English | MEDLINE | ID: mdl-37988788

ABSTRACT

BACKGROUND: Amyotrophic lateral sclerosis (ALS) is a serious neurodegenerative disorder affecting nerve cells in the brain and spinal cord that is caused by mutations in the superoxide dismutase 1 (SOD1) enzyme. ALS-related mutations cause misfolding, dimerisation instability, and increased formation of aggregates. The underlying allosteric mechanisms, however, remain obscure as far as details of their fundamental atomistic structure are concerned. Hence, this gap in knowledge limits the development of novel SOD1 inhibitors and the understanding of how disease-associated mutations in distal sites affect enzyme activity. METHODS: We combined microsecond-scale unbiased molecular dynamics (MD) simulation with network analysis to elucidate the local and global conformational changes and allosteric communications in SOD1 Apo (unmetallated form), Holo, Apo_CallA (mutant and unmetallated form), and Holo_CallA (mutant form) systems. To identify hotspot residues involved in SOD1 signalling and allosteric communications, we performed network centrality, community network, and path analyses. RESULTS: Structural analyses showed that unmetallated SOD1 systems and cysteine mutations displayed large structural variations in the catalytic sites, affecting structural stability. Inter- and intra-molecular H-bond analyses identified several important residues crucial for maintaining interfacial stability, structural stability, and enzyme catalysis. Dynamic motion analysis demonstrated more balanced atomic displacement and highly correlated motions in the Holo system. The rationale for the structural disparity observed in the disulfide bond formation and R143 configuration in Apo and Holo systems was elucidated using distance and dihedral probability distribution analyses. CONCLUSION: Our study highlights the efficiency of combining extensive MD simulations with network analyses to unravel the features of protein allostery.


Subject(s)
Amyotrophic Lateral Sclerosis , Molecular Dynamics Simulation , Humans , Superoxide Dismutase-1/genetics , Superoxide Dismutase-1/metabolism , Superoxide Dismutase/chemistry , Superoxide Dismutase/genetics , Superoxide Dismutase/metabolism , Amyotrophic Lateral Sclerosis/genetics , Mutation , Protein Folding
10.
Comput Biol Med ; 169: 107848, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38145601

ABSTRACT

Dihydrouridine (DHU, D) is one of the most abundant post-transcriptional uridine modifications found in tRNA, mRNA, and snoRNA, closely associated with disease pathogenesis and various biological processes in eukaryotes. Identifying D sites is important for understanding the modification mechanisms and/or epigenetic regulation. However, biological experiments for detecting D sites are time-consuming and expensive. Given these challenges, computational methods have been developed for accurately identifying the D sites in genome-wide datasets. However, existing methods have some limitations, and their prediction performance needs to be improved. In this work, we have developed a new computational predictor for accurately identifying D sites called Stack-DHUpred. Briefly, we trained 66 baseline models or single-feature models by connecting six machine learning classifiers with eleven different feature encoding methods and stacked different baseline models to build stacked ensemble learning models. Subsequently, the optimal combination of the baseline models was identified for the construction of the final stacked model. Remarkably, the Stack-DHUpred outperformed the existing predictors on our new independent dataset, indicating that the stacking approach significantly improved the prediction performance. We have made Stack-DHUpred available to the public through a web server (http://kurata35.bio.kyutech.ac.jp/Stack-DHUpred) and a standalone program (https://github.com/kuratahiroyuki/Stack-DHUpred). We believe that Stack-DHUpred will be a valuable tool for accelerating the discovery of D modifications and understanding their role in post-transcriptional regulation.


Subject(s)
Epigenesis, Genetic , Genome , RNA, Messenger , Computational Biology
11.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38058187

ABSTRACT

The worldwide appearance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has generated significant concern and posed a considerable challenge to global health. Phosphorylation is a common post-translational modification that affects many vital cellular functions and is closely associated with SARS-CoV-2 infection. Precise identification of phosphorylation sites could provide more in-depth insight into the processes underlying SARS-CoV-2 infection and help alleviate the continuing COVID-19 crisis. Currently, available computational tools for predicting these sites lack accuracy and effectiveness. In this study, we designed an innovative meta-learning model, Meta-Learning for Serine/Threonine Phosphorylation (MeL-STPhos), to precisely identify protein phosphorylation sites. We initially performed a comprehensive assessment of 29 unique sequence-derived features, establishing prediction models for each using 14 renowned machine learning methods, ranging from traditional classifiers to advanced deep learning algorithms. We then selected the most effective model for each feature by integrating the predicted values. Rigorous feature selection strategies were employed to identify the optimal base models and classifier(s) for each cell-specific dataset. To the best of our knowledge, this is the first study to report two cell-specific models and a generic model for phosphorylation site prediction by utilizing an extensive range of sequence-derived features and machine learning algorithms. Extensive cross-validation and independent testing revealed that MeL-STPhos surpasses existing state-of-the-art tools for phosphorylation site prediction. We also developed a publicly accessible platform at https://balalab-skku.org/MeL-STPhos. We believe that MeL-STPhos will serve as a valuable tool for accelerating the discovery of serine/threonine phosphorylation sites and elucidating their role in post-translational regulation.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , Phosphorylation , SARS-CoV-2/metabolism , Serine/metabolism , Threonine/metabolism
12.
Comput Biol Med ; 165: 107386, 2023 10.
Article in English | MEDLINE | ID: mdl-37619323

ABSTRACT

Diabetes mellitus has become a major public health concern associated with high mortality and reduced life expectancy and can cause blindness, heart attacks, kidney failure, lower limb amputations, and strokes. A new generation of antidiabetic peptides (ADPs) that act on β-cells or T-cells to regulate insulin production is being developed to alleviate the effects of diabetes. However, the lack of effective peptide-mining tools has hampered the discovery of these promising drugs. Hence, novel computational tools need to be developed urgently. In this study, we present ADP-Fuse, a novel two-layer prediction framework capable of accurately identifying ADPs or non-ADPs and categorizing them into type 1 and type 2 ADPs. First, we comprehensively evaluated 22 peptide sequence-derived features coupled with eight notable machine learning algorithms. Subsequently, the most suitable feature descriptors and classifiers for both layers were identified. The output of these single-feature models, embedded with multiview information, was trained with an appropriate classifier to provide the final prediction. Comprehensive cross-validation and independent tests substantiate that ADP-Fuse surpasses single-feature models and the feature fusion approach for the prediction of ADPs and their types. In addition, the SHapley Additive exPlanation method was used to elucidate the contributions of individual features to the prediction of ADPs and their types. Finally, a user-friendly web server for ADP-Fuse was developed and made publicly accessible (https://balalab-skku.org/ADP-Fuse), enabling the swift screening and identification of novel ADPs and their types. This framework is expected to contribute significantly to antidiabetic peptide identification.


Subject(s)
Diabetes Mellitus , Hypoglycemic Agents , Peptides , Amino Acid Sequence , Algorithms , Machine Learning , Computational Biology
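
The two-layer framework described for ADP-Fuse above can be read as a cascade, sketched below under stated assumptions. The function `two_layer_predict`, the 0.5 threshold, and the choice of type 1 as the positive class in the second layer are all hypothetical, and the lambda scorers are fixed-output stand-ins for trained models rather than anything from the paper.

```python
from typing import Callable


def two_layer_predict(
    peptide: str,
    layer1: Callable[[str], float],
    layer2: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Layer 1 scores ADP vs non-ADP; layer 2 scores type 1 vs type 2."""
    if layer1(peptide) < threshold:
        return "non-ADP"  # layer 2 is never consulted for predicted non-ADPs
    return "type 1 ADP" if layer2(peptide) >= threshold else "type 2 ADP"


# Stand-in scorers with fixed outputs, used only to exercise the control flow.
example = "ACDEFGHIK"  # arbitrary placeholder sequence
print(two_layer_predict(example, lambda p: 0.9, lambda p: 0.2))
print(two_layer_predict(example, lambda p: 0.1, lambda p: 0.8))
```

One consequence of this design is that second-layer errors only matter for peptides the first layer accepts, so the two layers can be tuned and evaluated separately.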
13.
Int J Biol Sci ; 19(12): 3640-3660, 2023.
Article in English | MEDLINE | ID: mdl-37564212

ABSTRACT

Both AP-1 and PRMT1 are vital molecules in a variety of cellular processes, but the interaction between these proteins in the context of cellular functions is less clear. Gastric cancer (GC) is one of the most pernicious diseases worldwide. An in-depth understanding of the molecular mode of action underlying gastric tumorigenesis is still elusive. In this study, we found that PRMT1 directly interacts with c-Fos and enhances AP-1 activation. PRMT1-mediated arginine methylation (mono- and dimethylation) of c-Fos synergistically enhances c-Fos-mediated AP-1 activity and consequently increases c-Fos protein stabilization. Consistent with this finding, PRMT1 knockdown decreases the protein level of c-Fos. We discovered that the c-Fos protein undergoes autophagic degradation and found that PRMT1-mediated methylation at R287 protects c-Fos from autophagosomal degradation and is linked to clinicopathologic variables as well as prognosis in stomach tumors. Together, our data demonstrate that PRMT1-mediated c-Fos protein stabilization promotes gastric tumorigenesis. We contend that targeting this modification could constitute a new therapeutic strategy in gastric cancer.


Subject(s)
Proto-Oncogene Proteins c-fos , Stomach Neoplasms , Humans , Methylation , Proto-Oncogene Proteins c-fos/genetics , Proto-Oncogene Proteins c-fos/metabolism , Stomach Neoplasms/genetics , Transcription Factor AP-1/metabolism , Protein-Arginine N-Methyltransferases/genetics , Protein-Arginine N-Methyltransferases/metabolism , Carcinogenesis/genetics , Cell Transformation, Neoplastic , Arginine , Repressor Proteins/genetics , Repressor Proteins/metabolism
14.
Comput Biol Med ; 162: 107065, 2023 08.
Article in English | MEDLINE | ID: mdl-37267826

ABSTRACT

The Src Homology 2 (SH2) domain plays an important role in the signal transmission mechanism in organisms. It mediates protein-protein interactions based on the binding between phosphotyrosine and motifs in the SH2 domain. In this study, we designed a method to distinguish SH2 domain-containing proteins from non-SH2 domain-containing proteins through deep learning technology. First, we collected SH2 and non-SH2 domain-containing protein sequences from multiple species. We built six deep learning models through DeepBIO after data preprocessing and compared their performance. Second, we selected the model with the strongest overall performance, trained and tested it separately again, and analyzed the results visually. We found that a 288-dimensional (288D) feature could effectively distinguish the two types of proteins. Finally, motif analysis discovered the specific motif YKIR and revealed its function in signal transduction. In summary, we successfully distinguished SH2 domain and non-SH2 domain proteins through a deep learning method and obtained the best-performing 288D features. In addition, we found a new motif, YKIR, in the SH2 domain and analyzed its function, which helps to further understand the signaling mechanisms within the organism.


Subject(s)
Deep Learning , src Homology Domains/physiology , Proteins/genetics , Proteins/metabolism , Signal Transduction/physiology , Phosphotyrosine/metabolism , Protein Binding , Binding Sites
15.
Comput Biol Med ; 161: 106946, 2023 07.
Article in English | MEDLINE | ID: mdl-37244151

ABSTRACT

Drug-target interaction (DTI) prediction is a crucial task in drug discovery. Existing computational methods accelerate drug discovery in this respect. However, most of them suffer from low feature representation ability, significantly affecting the predictive performance. To address the problem, we propose a novel neural network architecture named DrugormerDTI, which uses Graph Transformer to learn both sequential and topological information through the input molecule graph and Resudual2vec to learn the underlying relation between residues from proteins. By conducting ablation experiments, we verify the importance of each part of DrugormerDTI. We also demonstrate the good feature extraction and expression capabilities of our model by comparing the mapping results of the attention layer with molecular docking results. Experimental results show that our proposed model performs better than baseline methods on four benchmarks. We demonstrate that the introduction of Graph Transformer and the design of residue are appropriate for drug-target prediction.


Subject(s)
Drug Development , Neural Networks, Computer , Molecular Docking Simulation , Drug Development/methods , Drug Discovery/methods , Proteins/chemistry , Drug Interactions
16.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37129547

ABSTRACT

Detection and analysis of viral genomes with Nanopore sequencing has shown great promise in the surveillance of pathogen outbreaks. However, the number of virus detection pipelines supporting Nanopore sequencing is very limited. Here, we present VirPipe, a new pipeline for the detection of viral genomes from Nanopore or Illumina sequencing input featuring streamlined installation and customization. AVAILABILITY AND IMPLEMENTATION: VirPipe source code and documentation are freely available for download at https://github.com/KijinKims/VirPipe, implemented in Python and Nextflow.


Subject(s)
Nanopore Sequencing , Nanopores , Software , Genome, Viral , High-Throughput Nucleotide Sequencing
17.
Proteomics ; 23(13-14): e2200409, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37021401

ABSTRACT

Enhancers are non-coding DNA elements that play a crucial role in enhancing the transcription rate of a specific gene in the genome. Experiments for identifying enhancers can be restricted by their conditions and involve complicated, time-consuming, laborious, and costly steps. To overcome these challenges, computational platforms have been developed to complement experimental methods that enable high-throughput identification of enhancers. Over the last few years, the development of various enhancer computational tools has resulted in significant progress in predicting putative enhancers. Thus, researchers are now able to use a variety of strategies to enhance and advance enhancer study. In this review, an overview of machine learning (ML)-based prediction methods for enhancer identification and related databases has been provided. The existing enhancer-prediction methods have also been reviewed regarding their algorithms, feature selection processes, validation techniques, and software utility. In addition, the advantages and drawbacks of these ML approaches and guidelines for developing bioinformatic tools have been highlighted for a more efficient enhancer prediction. This review will serve as a useful resource for experimentalists in selecting the appropriate ML tool for their study, and for bioinformaticians in developing more accurate and advanced ML-based predictors.


Subject(s)
Enhancer Elements, Genetic , Genome, Human , Humans , Computational Biology/methods , Algorithms , Machine Learning
18.
Research (Wash D C) ; 6: 0016, 2023.
Article in English | MEDLINE | ID: mdl-36930763

ABSTRACT

Tomato yellow leaf curl virus (TYLCV) has dispersed across different countries, specifically to subtropical regions, where it is associated with more severe symptoms. Since TYLCV was first isolated in 1931, it has been a menace to tomato industrial production worldwide over the past century. Three groups were newly isolated from TYLCV-resistant tomatoes in 2022; however, their functions are unknown. The development of machine learning (ML)-based models using characterized sequences and evaluating blind predictions is one of the major challenges in interdisciplinary research. The purpose of this study was to develop an integrated computational framework for the accurate identification of symptoms (mild or severe) based on TYLCV sequences (isolated in Korea). For the development of the framework, we first extracted 11 different feature encodings and hybrid features from the training data and then explored 8 different classifiers and developed their respective prediction models by using randomized 10-fold cross-validation. Subsequently, we carried out a systematic evaluation of these 96 developed models and selected the top 90 models, whose predicted class labels were combined and considered as reduced features. On the basis of these features, a multilayer perceptron was applied to develop the final prediction model (IML-TYLCVs). We conducted blind prediction on 3 groups using IML-TYLCVs, and the results indicated that 2 groups were severe and 1 group was mild. Furthermore, we confirmed the prediction with virus-challenging experiments of tomato plant phenotypes using infectious clones from 3 groups. Plant virologists and plant breeding professionals can access the user-friendly online IML-TYLCVs web server at https://balalab-skku.org/IML-TYLCVs, which can guide them in developing new protection strategies for newly emerging viruses.

20.
Comput Biol Med ; 158: 106784, 2023 05.
Article in English | MEDLINE | ID: mdl-36989748

ABSTRACT

Quorum sensing peptides (QSPs) are microbial signaling molecules involved in several cellular processes, such as cellular communication, virulence expression, bioluminescence, and swarming, in various bacterial species. Understanding QSPs is essential for identifying novel drug targets for controlling bacterial populations and pathogenicity. In this study, we present a novel computational approach (PSRQSP) for improving the prediction and analysis of QSPs. In PSRQSP, we develop a novel propensity score representation learning (PSR) scheme. Specifically, we utilized the PSR approach to extract and learn a comprehensive set of estimated propensities of 20 amino acids, 400 dipeptides, and 400 g-gap dipeptides from a pool of scoring card method-based models. Finally, to maximize the utility of the propensity scores, we explored a set of optimal propensity scores and combined them to construct a final meta-predictor. Our experimental results showed that combining multiview propensity scores was more beneficial for identifying QSPs than the conventional feature descriptors. Moreover, extensive benchmarking experiments based on the independent test were sufficient to demonstrate the predictive capability and effectiveness of PSRQSP by outperforming the conventional ML-based and existing methods, with an accuracy of 94.44% and AUC of 0.967. PSR-derived propensity scores were employed to determine the crucial physicochemical properties for a better understanding of the functional mechanisms of QSPs. Finally, we constructed an easy-to-use web server for the PSRQSP (http://pmlabstack.pythonanywhere.com/PSRQSP). PSRQSP is anticipated to be an efficient computational tool for accelerating the data-driven discovery of potential QSPs for drug discovery and development.


Subject(s)
Peptides , Quorum Sensing , Propensity Score , Peptides/chemistry , Dipeptides/chemistry , Bacteria
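
The PSRQSP abstract above refers to propensities over 20 amino acids, 400 dipeptides, and 400 g-gap dipeptides. The sketch below shows only the underlying descriptor, namely how often each residue pair occurs in a sequence, and not the propensity score representation learning scheme itself. It assumes the common definition in which a g-gap dipeptide is a pair of residues separated by g intervening positions; the function name and the example sequence are hypothetical.

```python
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(sequence: str, gap: int = 0) -> Dict[str, float]:
    """Frequency of residue pairs separated by `gap` positions.

    gap=0 gives the ordinary dipeptide composition (adjacent residues).
    Pairs containing non-standard residues are skipped.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}  # 400 keys
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return {pair: (n / total if total else 0.0) for pair, n in counts.items()}


# Arbitrary toy sequence; with gap=1 the pairs are AD, CA and DC.
composition = g_gap_dipeptide_composition("ACDAC", gap=1)
nonzero = {pair: freq for pair, freq in composition.items() if freq > 0}
print(len(composition), sorted(nonzero))
```

A propensity-based model would then weight these frequencies by per-pair scores learned from labelled peptides; how PSRQSP learns and combines those scores is not detailed in the abstract.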