RDscan: Extracting RNA-disease relationship from the literature based on pre-training model.

Zhang, Yang; Yang, Yu; Ren, Liping; Ning, Lin; Zou, Quan; Luo, Nanchao; Zhang, Yinghui; Liu, Ruijun

Affiliation

Zhang Y; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China; School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China. Electronic address: zhy1001@alu.uestc.edu.cn.
Yang Y; School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China.
Ren L; School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China.
Ning L; School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China.
Zou Q; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China.
Luo N; School of Computer Science and Technology, Aba Teachers College, WenChuan, Sichuan, 623002, China.
Zhang Y; School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China. Electronic address: zyh@nsu.edu.cn.
Liu R; School of Software, Beihang University, Beijing 100191, China. Electronic address: liuruijun@buaa.edu.cn.

Methods; 228: 48-54, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38789016

ABSTRACT

With the rapid advancements in molecular biology and genomics, a multitude of connections between RNA and diseases has been unveiled, making the efficient and accurate extraction of RNA-disease (RD) relationships from extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method based on the pre-training and fine-tuning strategy, which aims to automatically extract RD-related information from a vast corpus of literature using pre-trained biomedical large language models (LLMs). Initially, we constructed a dedicated RD corpus by manual curation of the literature, comprising 2,082 positive and 2,000 negative sentences, alongside an independent test dataset (comprising 500 positive and 500 negative sentences), for training and evaluating RDscan. Subsequently, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan demonstrated exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression and Random Forest). In addition, we have developed an accessible webserver that assists users in extracting RD relationships from text. In summary, RDscan represents the first text mining tool specifically designed for RD relationship extraction, and is poised to emerge as an invaluable tool for researchers dedicated to exploring the intricate interactions between RNA and diseases. The RDscan webserver is freely available at https://cellknowledge.com.cn/RDscan/.
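As an illustration of the 5-fold cross-validation protocol mentioned in the abstract, the following is a minimal sketch of how a corpus of 2,082 positive and 2,000 negative sentences could be divided into five folds. The abstract does not say whether the split was stratified, so stratification, the function name `stratified_k_fold`, the seed, and the placeholder labels are all assumptions made for this example, not the authors' implementation.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over k folds.

    Hypothetical helper for illustration; not taken from RDscan.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal indices round-robin so per-class fold sizes differ by at most one.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

# Placeholder labels matching the corpus sizes reported in the abstract:
# 2,082 positive (1) and 2,000 negative (0) sentences.
labels = [1] * 2082 + [0] * 2000

splits = list(stratified_k_fold(labels, k=5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    n_pos = sum(labels[i] for i in test_idx)
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} "
          f"positives_in_test={n_pos}")
```

Each sentence appears in exactly one test fold, so every model under comparison (fine-tuned or traditional) would be scored on the same five held-out subsets; the separate 500/500 independent test set described in the abstract stays outside this loop.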

Subjects

Data Mining; RNA; Data Mining/methods; RNA/genetics; Humans; Machine Learning; Disease/genetics; Support Vector Machine; Software

Keywords

Disease; Large language model; Pre-training; RNA; Text mining

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: RNA / Data Mining Limit: Humans Language: En Journal: Methods Journal subject: BIOCHEMISTRY Publication year: 2024 Document type: Article
