An improved data augmentation approach and its application in medical named entity recognition.

Chen, Hongyu; Dan, Li; Lu, Yonghe; Chen, Minghong; Zhang, Jinxia

Affiliation

Chen H; School of Information Management, Sun Yat-Sen University, Guangzhou, 510006, China.
Dan L; School of Information Management, Sun Yat-Sen University, Guangzhou, 510006, China.
Lu Y; School of Artificial Intelligence, Sun Yat-Sen University, Zhuhai, 519082, China. luyonghe@mail.sysu.edu.cn.
Chen M; School of Information Management, Sun Yat-Sen University, Guangzhou, 510006, China.
Zhang J; Department of Cardiology, General Hospital of Southern Theatre Command of PLA, Guangzhou, 510010, China. zhjinxia@foxmail.com.

BMC Med Inform Decis Mak; 24(1): 221, 2024 Aug 05.

Article in English | MEDLINE | ID: mdl-39103849

ABSTRACT

Performing data augmentation in medical named entity recognition (NER) is crucial due to the unique challenges posed by this field. Medical data is characterized by high acquisition costs, specialized terminology, imbalanced distributions, and limited training resources. These factors make achieving high performance in medical NER particularly difficult. Data augmentation methods help to mitigate these issues by generating additional training samples, thus balancing data distribution, enriching the training dataset, and improving model generalization. This paper proposes two data augmentation methods, Contextual Random Replacement based on Word2Vec Augmentation (CRR) and Targeted Entity Random Replacement Augmentation (TER), aimed at addressing the scarcity and imbalance of data in the medical domain. When combined with a deep learning-based Chinese NER model, these methods can significantly enhance performance and recognition accuracy under limited resources. Experimental results demonstrate that both augmentation methods effectively improve the recognition capability of medical named entities. Specifically, the BERT-BiLSTM-CRF model achieved the highest F1 score of 83.587%, representing a 1.49% increase over the baseline model. This validates the importance and effectiveness of data augmentation in medical NER.
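
The abstract does not spell out how the replacement methods operate, so the following is only a generic illustration of entity-level random replacement on BIO-tagged sentences, not the authors' TER procedure. The function names (`extract_entities`, `build_entity_pool`, `replace_entities`), the `DISEASE` label, and the toy English sentences are all assumptions made for the example. The idea sketched is that a mention of a chosen entity type is swapped for another mention of the same type drawn from the training set, and the tags are regenerated to match the new mention length.

```python
import random
from collections import defaultdict


def extract_entities(tokens, tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def build_entity_pool(corpus):
    """Collect every mention seen in the corpus, grouped by entity type."""
    pool = defaultdict(list)
    for tokens, tags in corpus:
        for s, e, t in extract_entities(tokens, tags):
            pool[t].append(tokens[s:e])
    return pool


def replace_entities(tokens, tags, pool, target_types, rng, p=1.0):
    """Swap mentions of the targeted types for same-type mentions from the pool.

    Assumption: each targeted mention is replaced with probability p.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for s, e, t in extract_entities(tokens, tags):
        new_tokens += tokens[cursor:s]
        new_tags += tags[cursor:s]
        if t in target_types and pool.get(t) and rng.random() < p:
            mention = rng.choice(pool[t])
        else:
            mention = tokens[s:e]
        new_tokens += mention
        # Rebuild tags so they match the (possibly different) mention length.
        new_tags += ["B-" + t] + ["I-" + t] * (len(mention) - 1)
        cursor = e
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Toy corpus with a hypothetical DISEASE label.
corpus = [
    (["patient", "has", "type", "2", "diabetes"],
     ["O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE"]),
    (["history", "of", "hypertension"],
     ["O", "O", "B-DISEASE"]),
]
pool = build_entity_pool(corpus)
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
aug_tokens, aug_tags = replace_entities(*corpus[1], pool, {"DISEASE"}, rng)
same_tokens, same_tags = replace_entities(*corpus[1], pool, set(), rng)
```

Targeting only selected types, for instance rare ones, is one way such a scheme could address class imbalance; which types the paper targets and how candidates are chosen is not stated in this record.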

Subject(s)

Deep Learning; Humans; Natural Language Processing

Keywords

Data augmentation; Deep learning; Medical named entity recognition; Replacement augmentation; Text features

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Deep Learning | Limit: Humans | Language: English | Journal: BMC Med Inform Decis Mak | Journal subject: MEDICAL INFORMATICS | Year: 2024 | Document type: Article | Affiliation country: China
