Multiple sequence alignment-based RNA language model and its application to structural inference.
Zhang, Yikun; Lang, Mei; Jiang, Jiuhong; Gao, Zhiqiang; Xu, Fan; Litfin, Thomas; Chen, Ke; Singh, Jaswinder; Huang, Xiansong; Song, Guoli; Tian, Yonghong; Zhan, Jian; Chen, Jie; Zhou, Yaoqi.
Affiliation
  • Zhang Y; School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China.
  • Lang M; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen 518055, China.
  • Jiang J; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China.
  • Gao Z; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China.
  • Xu F; Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China.
  • Litfin T; Peng Cheng Laboratory, Shenzhen 518066, China.
  • Chen K; Peng Cheng Laboratory, Shenzhen 518066, China.
  • Singh J; Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD 4215, Australia.
  • Huang X; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China.
  • Song G; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China.
  • Tian Y; Peng Cheng Laboratory, Shenzhen 518066, China.
  • Zhan J; Peng Cheng Laboratory, Shenzhen 518066, China.
  • Chen J; Peng Cheng Laboratory, Shenzhen 518066, China.
  • Zhou Y; Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China.
Nucleic Acids Res; 52(1): e3, 2024 Jan 11.
Article in En | MEDLINE | ID: mdl-37941140
ABSTRACT
Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter-coded DNA/RNA sequences carry less information content than 20-letter-coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information in homologous sequences because, unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) that utilizes homologous sequences from an automatic pipeline, RNAcmap, which provides significantly more homologous sequences than the manually annotated Rfam database. We demonstrate that the resulting unsupervised two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information: they can be directly mapped with high accuracy to 2D base-pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques, including SPOT-RNA2 and RNAsnap2. By comparison, the embedding of RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding for base-pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned for many other tasks related to RNA structure and function.
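The abstract's central claim is that RNA-MSM's unsupervised outputs already encode structure: symmetrized 2D attention maps can be mapped to base-pairing probabilities, and 1D per-residue embeddings to solvent accessibility. The following is a minimal sketch of that probing idea, not the authors' code: the model outputs (attn, emb) and the labels are random placeholders standing in for real RNA-MSM features and known structures, and the logistic/linear probes are generic stand-ins for whatever mapping the paper actually fits.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical pre-computed RNA-MSM outputs for an RNA of length L.
L, n_layers, n_heads, d_model = 60, 12, 10, 768
attn = rng.random((n_layers, n_heads, L, L))  # 2D attention maps
emb = rng.random((L, d_model))                # 1D residue embeddings

# Symmetrize attention so pairs (i, j) and (j, i) share one feature
# vector, matching the symmetry of base pairing.
attn_sym = 0.5 * (attn + attn.transpose(0, 1, 3, 2))

# One (n_layers * n_heads)-dimensional feature vector per residue pair.
pair_feats = attn_sym.reshape(-1, L, L).transpose(1, 2, 0).reshape(L * L, -1)

# Placeholder labels standing in for a known secondary structure and
# known relative solvent accessibility (RSA) values.
pair_labels = rng.integers(0, 2, size=L * L)  # 1 = base-paired
rsa_labels = rng.random(L)

# 2D probe: logistic regression from attention features to a
# base-pairing probability for every residue pair.
pair_probe = LogisticRegression(max_iter=1000).fit(pair_feats, pair_labels)
bp_prob = pair_probe.predict_proba(pair_feats)[:, 1].reshape(L, L)

# 1D probe: linear regression from embeddings to solvent accessibility.
rsa_probe = LinearRegression().fit(emb, rsa_labels)
rsa_pred = rsa_probe.predict(emb)

print(bp_prob.shape, rsa_pred.shape)  # (60, 60) (60,)

On real features, the accuracy of such probes is the diagnostic the abstract alludes to: the better a simple probe performs, the more structural information the pre-trained model has captured without any fine-tuning.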
Subject(s)
  • RNA
  • Sequence Alignment
  • Machine Learning
Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: RNA / Sequence Alignment / Machine Learning | Language: En | Journal: Nucleic Acids Res | Year: 2024 | Type: Article | Affiliation country: China
