SPRoBERTa: protein embedding learning with local fragment modeling.
Wu, Lijun; Yin, Chengcan; Zhu, Jinhua; Wu, Zhen; He, Liang; Xia, Yingce; Xie, Shufang; Qin, Tao; Liu, Tie-Yan.
Affiliation
  • Wu L; Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China.
  • Yin C; National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China.
  • Zhu J; CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China, No.96, JinZhai Road Baohe District, 230026, Hefei, Anhui Province, China.
  • Wu Z; National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China.
  • He L; Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China.
  • Xia Y; Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China.
  • Xie S; Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China.
  • Qin T; Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China.
  • Liu TY; Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China.
Brief Bioinform; 23(6), 2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36136367
ABSTRACT
Understanding protein function and structure is a central goal of computational biology and contributes to the understanding of human biology. Because few proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences to learn protein embeddings. However, a protein is usually represented as a sequence of individual amino acids drawn from a small vocabulary (e.g. 20 amino acid types), which ignores the strong local semantics present in protein sequences. In this work, we propose a novel pre-training approach, SPRoBERTa. We first present an unsupervised protein tokenizer that learns protein representations with local fragment patterns. We then introduce a novel framework for a deep pre-training model to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction (e.g. secondary structure prediction), amino acid pair-level prediction (e.g. contact prediction) and protein-level prediction (e.g. remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements on all tasks and outperforms previous methods. We also provide detailed ablation studies and analysis of our protein tokenizer and training framework.
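
The contrast between per-residue tokens and local-fragment tokens can be illustrated with a small sketch. The abstract does not specify how the tokenizer is trained, so the example below substitutes a simple byte-pair-encoding style merge procedure as a stand-in. It is not the authors' method. The function names, the toy sequences and the number of merges are all hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one fused token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_fragment_merges(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single residues.

    Assumption: a BPE-style stand-in, not the tokenizer from the paper.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, _count = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Split a sequence into residues, then apply the merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy data for illustration only (not real training sequences).
toy_sequences = ["MKVLAAGMKVL", "AAGMKVLAAG"]
toy_merges = learn_fragment_merges(toy_sequences, num_merges=5)
toy_tokens = tokenize("MKVLAAG", toy_merges)
print(toy_tokens)
```

Concatenating the resulting tokens always reproduces the input sequence, so the fragment vocabulary changes only the granularity of the representation, not its content.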

Full text: 1 Database: MEDLINE Main subject: Proteins / Computational Biology Type of study: Prognostic_studies Limits: Humans Language: En Year of publication: 2022 Document type: Article