TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT.
Mai, Dung Hoang Anh; Nguyen, Linh Thanh; Lee, Eun Yeol.
Affiliation
  • Mai DHA; Department of Chemical Engineering (BK21 FOUR Integrated Engineering Program), Kyung Hee University, Yongin-si, South Korea.
  • Nguyen LT; Department of Chemical Engineering (BK21 FOUR Integrated Engineering Program), Kyung Hee University, Yongin-si, South Korea.
  • Lee EY; Department of Chemical Engineering (BK21 FOUR Integrated Engineering Program), Kyung Hee University, Yongin-si, South Korea.
Front Genet ; 13: 1067562, 2022.
Article in En | MEDLINE | ID: mdl-36523764
ABSTRACT
Since the introduction of the first transformer model with its distinctive self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language with its own characteristic lexicon and grammar. NLP models may therefore provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNet, BERT, and the variant DNABERT trained on the human genome) to predict and analyze promoters in the freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest-growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for the phototrophic production of value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and an F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and an F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the "black box" issue of deep learning and to support feature analysis. The transfer-learning ability of large language models may help identify and analyze promoter regions of newly isolated strains with similar lineages.
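For readers unfamiliar with the two evaluation metrics cited in the abstract, the following is a minimal, dependency-free sketch (not the authors' pipeline; labels and scores are toy values) of how AUROC and F1 are computed for a binary promoter/non-promoter classifier:

```python
# Toy illustration of the two metrics reported above. AUROC is the
# probability that a randomly chosen promoter is scored above a randomly
# chosen non-promoter; F1 is the harmonic mean of precision and recall.

def auroc(y_true, y_score):
    """Rank-based AUROC for binary labels (1 = promoter, 0 = non-promoter)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Count pairwise wins of promoter scores over non-promoter scores
    # (ties count as half a win).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1(y_true, y_pred):
    """F1 score for the positive (promoter) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical ground-truth labels and model-predicted probabilities.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.91, 0.62, 0.85, 0.40, 0.30, 0.45, 0.66, 0.08]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

print(f"AUROC={auroc(y_true, y_score):.3f}, F1={f1(y_true, y_pred):.3f}")
# prints "AUROC=0.875, F1=0.750"
```

Note that AUROC is threshold-free (it depends only on the ranking of scores), whereas F1 depends on the chosen decision threshold, which is why papers typically report both.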
Full text: 1 Collection: 01-internacional Database: MEDLINE Type of study: Prognostic_studies / Risk_factors_studies Language: En Journal: Front Genet Year: 2022 Document type: Article Affiliation country: South Korea