ABSTRACT
Protein phosphorylation is one of the most significant post-translational modifications (PTMs) and plays a crucial role in plant function through its impact on signaling, gene expression, enzyme kinetics, protein stability and protein interactions. Accurate prediction of plant phosphorylation sites (p-sites) is vital because abnormal regulation of phosphorylation usually leads to plant diseases. However, current experimental methods for PTM prediction suffer from high computational cost and are error-prone. The present study develops machine learning-based prediction techniques, including a high-performance, interpretable deep tabular learning network (TabNet), to improve the prediction of protein p-sites in soybean. Moreover, we use a hybrid feature set of sequence-based features, physicochemical properties and position-specific scoring matrices to predict serine (Ser/S), threonine (Thr/T) and tyrosine (Tyr/Y) p-sites in soybean for the first time. The experimentally verified p-site data for soybean proteins were collected from the Eukaryotic Phosphorylation Sites Database (EPSD) and the database of post-translational modifications (dbPTM). We then reduce redundancy among the positive and negative samples by removing protein sequences with more than 40% similarity. The developed techniques all achieve an accuracy above 70%. TabNet is the best-performing classifier: with hybrid features and a window size of 13, it reaches 78.96% sensitivity and 77.24% specificity. These results indicate that the TabNet method offers advantages in both performance and interpretability. The proposed technique can analyze the data automatically, without measurement errors or human intervention. Furthermore, it can be used to predict putative protein p-sites in plants effectively. The collected dataset and source code are publicly deposited at https://github.com/Elham-khalili/Soybean-P-sites-Prediction.
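To make the window size and the two reported metrics concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a window of 13 means the candidate S/T/Y residue plus six residues on each side, and that positions beyond the sequence ends are padded with a placeholder symbol. The function names, the `X` padding symbol and the toy sequence are all hypothetical.

```python
from typing import List, Sequence, Tuple

PAD = "X"  # assumed padding symbol for positions beyond the sequence ends


def extract_windows(sequence: str, window_size: int = 13,
                    targets: str = "STY") -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every candidate S/T/Y residue."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the candidate is centred")
    half = window_size // 2
    padded = PAD * half + sequence + PAD * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # In the padded string the candidate sits at index i + half,
            # so the slice starting at i is centred on it.
            windows.append((i + 1, padded[i:i + window_size]))
    return windows


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy sequence for illustration only, not taken from the dataset.
    for position, window in extract_windows("MASTKYLLSPEQ"):
        print(position, window)
```

Each extracted window would then be encoded (for example with sequence-based, physicochemical and PSSM features, as the abstract describes) before being passed to a classifier.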
Subject(s)
Glycine max; Protein Processing, Post-Translational; Amino Acid Sequence; Computational Biology/methods; Humans; Machine Learning; Phosphorylation; Glycine max/genetics
ABSTRACT
Given the potential of machine learning and the broad applications of binary classification, it is essential to develop platforms that are not only powerful but also transparent, interpretable, and user-friendly. We introduce alphaML, a user-friendly platform that provides clear, legible, explainable, transparent, and elucidative (CLETE) binary classification models with comprehensive customization options. The platform offers feature selection, hyperparameter search, sampling, and normalization methods, along with 15 machine learning algorithms with global and local interpretation. We have integrated a custom metric for hyperparameter search that considers both training and validation scores, safeguarding against under- and overfitting. Additionally, we employ the NegLog2RMSL scoring method, which uses both training and test scores for a thorough model evaluation. The platform has been tested on datasets from multiple domains and offers a graphical interface, removing the need for programming expertise. Consequently, alphaML is versatile and shows promising applicability across a broad range of tabular data configurations.
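The abstract does not define the custom hyperparameter-search metric, so the Python sketch below only illustrates the general idea of scoring a candidate by both its training and validation performance. It is an assumption, not alphaML's actual formula: the validation score is reduced by the size of the train/validation gap, so an overfit candidate (large gap) and an underfit one (low validation score) both rank poorly. The function names, the `penalty` parameter and the example scores are hypothetical.

```python
from typing import Dict, Tuple


def gap_penalised_score(train_score: float, val_score: float,
                        penalty: float = 1.0) -> float:
    """Hypothetical metric: validation score minus a penalty on the
    absolute gap between training and validation scores."""
    return val_score - penalty * abs(train_score - val_score)


def select_best(candidates: Dict[str, Tuple[float, float]],
                penalty: float = 1.0) -> str:
    """Return the name of the candidate with the highest penalised score.

    `candidates` maps a configuration name to (train_score, val_score).
    """
    return max(
        candidates,
        key=lambda name: gap_penalised_score(*candidates[name], penalty=penalty),
    )


if __name__ == "__main__":
    # Illustrative scores only, not results from the paper.
    search_results = {
        "overfit": (0.99, 0.80),    # high training score, large gap
        "balanced": (0.85, 0.82),   # small gap, good validation score
        "underfit": (0.62, 0.60),   # small gap, poor validation score
    }
    print(select_best(search_results))
```

With these illustrative numbers the balanced configuration is selected, even though the overfit one has the highest training score.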