Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction.
Pokharel, Suresh; Pratyush, Pawel; Ismail, Hamid D; Ma, Junfeng; Kc, Dukka B.
Affiliation
  • Pokharel S; Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA.
  • Pratyush P; Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA.
  • Ismail HD; Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA.
  • Ma J; Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Georgetown University, Washington, DC 20057, USA.
  • Kc DB; Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA.
Int J Mol Sci; 24(21), 2023 Nov 06.
Article in English | MEDLINE | ID: mdl-37958983
ABSTRACT
O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite great progress in experimentally mapping O-GlcNAc sites, there is an unmet need for robust prediction tools that can effectively locate O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for predicting protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of O-GlcNAc sites, and also evaluated various ensemble strategies for integrating embeddings from these models. The decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on the individual language models, the other fusion approaches, and other existing predictors on almost all of the evaluated metrics. Precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and disease. Moreover, these findings indicate the effectiveness of combining multiple protein language models for post-translational modification prediction and open avenues for further research in other downstream protein tasks. LM-OGlcNAc-Site's web server and source code are publicly available to the community.
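The decision-level fusion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which fusion rule LM-OGlcNAc-Site uses, so the code below assumes simple soft voting (averaging per-site probabilities from three independent classifiers, then thresholding). The function names `candidate_sites` and `fuse_decisions`, the 0.5 threshold, the toy sequence, and the example scores are all hypothetical and are not taken from the paper or its source code.

```python
from statistics import mean


def candidate_sites(sequence):
    """Return 1-based positions of serine (S) and threonine (T) residues,
    the only residues that can carry an O-GlcNAc modification."""
    return [i + 1 for i, aa in enumerate(sequence.upper()) if aa in "ST"]


def fuse_decisions(prob_by_model, threshold=0.5):
    """Soft voting (an assumed fusion rule): average the per-site
    probabilities across models, then threshold the averaged score."""
    models = list(prob_by_model)
    n_sites = len(prob_by_model[models[0]])
    if any(len(prob_by_model[m]) != n_sites for m in models):
        raise ValueError("all models must score the same number of sites")
    fused = [mean(prob_by_model[m][i] for m in models) for i in range(n_sites)]
    calls = [p >= threshold for p in fused]
    return fused, calls


# Toy example: made-up probabilities standing in for the outputs of three
# classifiers, each trained on embeddings from a different pLM.
sequence = "MSTAPKT"
sites = candidate_sites(sequence)  # S at 2, T at 3, T at 7
scores = {
    "Ankh": [0.9, 0.2, 0.6],
    "ESM-2": [0.8, 0.1, 0.4],
    "ProtT5": [0.7, 0.3, 0.2],
}
fused, calls = fuse_decisions(scores)
for pos, p, call in zip(sites, fused, calls):
    print(f"site {pos}: fused score {p:.2f}, predicted O-GlcNAc: {call}")
```

In this sketch each model votes with a probability rather than a hard label, so a confident model can outweigh two uncertain ones; other decision-level schemes (majority voting, or a meta-classifier over the three outputs) would fit the same interface.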

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Proteins / Protein Processing, Post-Translational Language: En Journal: Int J Mol Sci Year: 2023 Document type: Article Affiliation country: United States
...