Glypred: Lysine Glycation Site Prediction via CCU-LightGBM-BiLSTM Framework with Multi-Head Attention Mechanism.
Zuo, Yun; Zhang, Bangyi; Dong, Yinkang; He, Wenying; Bi, Yue; Liu, Xiangrong; Zeng, Xiangxiang; Deng, Zhaohong.
Affiliation
  • Zuo Y; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
  • Zhang B; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
  • Dong Y; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
  • He W; School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China.
  • Bi Y; Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia.
  • Liu X; Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China.
  • Zeng X; School of Information Science and Engineering, Hunan University, Changsha 410012, China.
  • Deng Z; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
J Chem Inf Model; 64(16): 6699-6711, 2024 Aug 26.
Article in En | MEDLINE | ID: mdl-39121059
ABSTRACT
Glycation, a type of posttranslational modification, preferentially occurs on lysine and arginine residues, impairing protein functionality and altering characteristics. This process is linked to diseases such as Alzheimer's, diabetes, and atherosclerosis. Traditional wet lab experiments are time-consuming, whereas machine learning has significantly streamlined the prediction of protein glycation sites. Despite promising results, challenges remain, including data imbalance, feature redundancy, and suboptimal classifier performance. This research introduces Glypred, a lysine glycation site prediction model combining ClusterCentroids Undersampling (CCU), LightGBM, and bidirectional long short-term memory network (BiLSTM) methodologies, with an additional multihead attention mechanism integrated into the BiLSTM. To achieve this, the study undertakes several key steps: selecting diverse feature types to capture comprehensive protein information, employing a cluster-based undersampling strategy to balance the data set, using LightGBM for feature selection to enhance model performance, and implementing a bidirectional LSTM network for accurate classification. Together, these approaches ensure that Glypred effectively identifies glycation sites with high accuracy and robustness. For feature encoding, five distinct feature types (AAC, KMER, DR, PWAA, and EBGW) were selected to capture a broad spectrum of protein sequence and biological information. These encoded features were integrated and validated to ensure comprehensive protein information acquisition. To address the issue of highly imbalanced positive and negative samples, various undersampling algorithms, including random undersampling, NearMiss, edited nearest neighbor rule, and CCU, were evaluated. CCU was ultimately chosen to remove redundant nonglycated training data, establishing a balanced data set that enhances the model's accuracy and robustness. For feature selection, the LightGBM ensemble learning algorithm was employed to reduce feature dimensionality by identifying the most significant features. This approach accelerates model training, enhances generalization capabilities, and ensures good transferability of the model. Finally, a bidirectional long short-term memory network was used as the classifier, with a network structure designed to capture glycation modification site features from both forward and backward directions. To prevent overfitting, appropriate regularization parameters and dropout rates were introduced, achieving efficient classification. Experimental results show that Glypred achieved optimal performance. This model provides new insights for bioinformatics and encourages the application of similar strategies in other fields.
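As an illustration of the sequence-encoding step, the following minimal sketch computes two of the five named feature types, amino acid composition (AAC) and k-mer frequency (KMER), for a peptide window. It is not taken from the Glypred code: the function names, the example window, the choice of k = 2, and the normalization are assumptions made for illustration, and the exact definitions used by the authors may differ.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each standard residue in seq."""
    counts = Counter(seq)
    n = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def kmer(seq, k=2):
    """Normalized frequency of every length-k word over the 20-letter alphabet."""
    words = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def encode(seq, k=2):
    """Concatenate AAC (20 values) and k-mer (20**k values) into one vector."""
    return aac(seq) + kmer(seq, k)


# Hypothetical peptide window around a candidate lysine (K); not real data.
window = "AGKLKEKAVLK"
vec = encode(window)
print(len(vec))  # 20 AAC values plus 400 dipeptide values
```

Concatenating several such encodings yields a high-dimensional vector per candidate site, which is why a feature selection step follows in the pipeline described above.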
A lysine glycation site prediction software tool was also developed using the PyQt5 library, offering researchers an auxiliary screening tool to reduce workload and improve efficiency. The software and data sets are available on GitHub: https://github.com/ZBYnb/Glypred.
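The cluster-based undersampling idea mentioned in the abstract can be sketched in plain Python: the majority (nonglycated) class is clustered with k-means, with k set to the number of minority samples, and the majority samples are replaced by the cluster centroids. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the deterministic initialization, and the toy data are hypothetical. The imbalanced-learn library offers a `ClusterCentroids` sampler for this technique; the abstract does not say which implementation was used.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_centroids(points, k, iters=50):
    """Lloyd's k-means with a deterministic, evenly spaced initialization."""
    step = len(points) / k
    centroids = [list(points[int(i * step)]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _dist2(p, centroids[c]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if new == centroids:  # assignments stable, so stop early
            break
        centroids = new
    return centroids


def cluster_centroid_undersample(X, y, majority=0, minority=1):
    """Replace majority samples with as many centroids as minority samples."""
    maj = [x for x, label in zip(X, y) if label == majority]
    mino = [list(x) for x, label in zip(X, y) if label == minority]
    centroids = kmeans_centroids(maj, k=len(mino))
    X_bal = centroids + mino
    y_bal = [majority] * len(centroids) + [minority] * len(mino)
    return X_bal, y_bal


# Toy data (hypothetical): six negatives in two tight groups, two positives.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
     [5.0, 5.0], [6.0, 5.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = cluster_centroid_undersample(X, y)
print(y_bal)
```

Unlike random undersampling, which discards majority samples outright, the centroids summarize each region of the majority class, so the balanced set retains a rough picture of its overall distribution.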

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Lysine Limits: Humans Language: En Journal: J Chem Inf Model Journal subject: MEDICAL INFORMATICS / CHEMISTRY Year: 2024 Document type: Article Affiliation country: China Country of publication: United States