DNABERT-based explainable lncRNA identification in plant genome assemblies.
Danilevicz, Monica F; Gill, Mitchell; Fernandez, Cassandria G Tay; Petereit, Jakob; Upadhyaya, Shriprabha R; Batley, Jacqueline; Bennamoun, Mohammed; Edwards, David; Bayer, Philipp E.
Affiliation
  • Danilevicz MF; School of Biological Sciences, University of Western Australia, Australia.
  • Gill M; School of Biological Sciences, University of Western Australia, Australia.
  • Fernandez CGT; School of Biological Sciences, University of Western Australia, Australia.
  • Petereit J; School of Biological Sciences, University of Western Australia, Australia.
  • Upadhyaya SR; School of Biological Sciences, University of Western Australia, Australia.
  • Batley J; School of Biological Sciences, University of Western Australia, Australia.
  • Bennamoun M; School of Physics, Mathematics and Computing, University of Western Australia, Australia.
  • Edwards D; School of Biological Sciences, University of Western Australia, Australia.
  • Bayer PE; School of Biological Sciences, University of Western Australia, Australia.
Comput Struct Biotechnol J ; 21: 5676-5685, 2023.
Article in English | MEDLINE | ID: mdl-38058296
ABSTRACT
Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences, with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, suggesting that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction, with an average accuracy of 63.1% on target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, making the approach a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction, and that these motifs frequently flanked the lncRNA sequence.
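As an illustration of how a genomic sequence can be presented to an NLP model, DNABERT-style models read DNA as a sequence of overlapping k-mer "words". The sketch below shows that tokenization step only. The abstract does not state the value of k or any preprocessing details, so the function name `kmer_tokenize`, the default `k=6`, and the example sequence are assumptions for illustration, not the authors' pipeline.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Hypothetical helper: k=6 is an assumed default, not a value
    taken from the abstract.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Toy sequence, not real genomic data.
    tokens = kmer_tokenize("ATGCGTAC", k=6)
    print(" ".join(tokens))
```

Each token overlaps its neighbour by k-1 bases, so a sequence of length n produces n-k+1 tokens that a transformer can then embed like words in a sentence.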
Full text: 1 Database: MEDLINE Language: En Publication year: 2023 Document type: Article
