msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths.
Li, Yazi; Wei, Xiaoman; Yang, Qinglin; Xiong, An; Li, Xingfeng; Zou, Quan; Cui, Feifei; Zhang, Zilong.
Affiliation
  • Li Y; School of Mathematics and Statistics, Hainan University, Haikou, 570228, China.
  • Wei X; School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
  • Yang Q; School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
  • Xiong A; School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
  • Li X; School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
  • Zou Q; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China.
  • Cui F; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China.
  • Zhang Z; School of Computer Science and Technology, Hainan University, Haikou, 570228, China. feifeicui@hainanu.edu.cn.
BMC Biol ; 22(1): 126, 2024 May 30.
Article in En | MEDLINE | ID: mdl-38816885
ABSTRACT

BACKGROUND:

A promoter is a specific sequence in DNA that has transcriptional regulatory functions, playing a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means of identifying promoters, offering a more efficient alternative to labor-intensive biological approaches.

RESULTS:

In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, effectively addressing the issue of insufficient DNA sequence information extraction in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used in the DNABERT model for promoter identification and strength prediction. Our model achieves accuracy rates of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, enhancing its interpretability.
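
The two ideas named above, multi-scale tokenization and soft voting, can be illustrated with a minimal sketch. It assumes that "multi-scale" refers to overlapping k-mer tokens of different lengths (DNABERT is distributed with k = 3, 4, 5 and 6 variants); the abstract does not state which scales msBERT-Promoter uses. The function names, the class order (0 = non-promoter, 1 = promoter) and the probability values are hypothetical, and the fine-tuned models themselves are not reproduced here.

```python
from typing import List, Sequence


def kmer_tokenize(seq: str, k: int) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    if k <= 0 or k > len(seq):
        raise ValueError("k must be between 1 and len(seq)")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors produced by several models."""
    if not prob_sets:
        raise ValueError("need at least one probability vector")
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all probability vectors must have the same length")
    n_models = len(prob_sets)
    return [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]


def predict(prob_sets: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    avg = soft_vote(prob_sets)
    return max(range(len(avg)), key=avg.__getitem__)


# One sequence seen at several scales (assumed k values, for illustration only).
tokens_by_scale = {k: kmer_tokenize("ATGCGT", k) for k in (3, 4)}

# Hypothetical [non-promoter, promoter] probabilities from three models,
# each imagined as fine-tuned on a different k-mer scale.
example_probs = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
]
fused = soft_vote(example_probs)
label = predict(example_probs)
```

In this made-up example two of the three models lean slightly towards "promoter", so a hard majority vote would pick class 1, while the averaged probabilities favour class 0 because the third model is much more confident. That sensitivity to confidence is the general reason for preferring soft voting when the fused models emit calibrated probabilities.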

CONCLUSIONS:

msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Promoter Regions, Genetic Limits: Humans Language: En Journal: BMC Biol Year: 2024 Document type: Article
