A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text.
Xiong, Ying; Wang, Zhongmin; Jiang, Dehuan; Wang, Xiaolong; Chen, Qingcai; Xu, Hua; Yan, Jun; Tang, Buzhou.
Affiliation
  • Xiong Y; Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.
  • Wang Z; Department of Information Technology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China.
  • Jiang D; Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.
  • Wang X; Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.
  • Chen Q; Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.
  • Xu H; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
  • Yan J; Yidu Cloud (Beijing) Technology Co., Ltd., Beijing, China.
  • Tang B; Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China. tangbuzhou@gmail.com.
BMC Med Inform Decis Mak; 19(Suppl 2): 66, 2019 Apr 09.
Article in English | MEDLINE | ID: mdl-30961602
ABSTRACT

BACKGROUND:

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks in Chinese text processing and are usually preliminary steps for many Chinese natural language processing (NLP) tasks. Although CWS and POS tagging have been studied extensively in various domains, few studies have addressed them in the clinical domain, where the granularity of words is not easy to determine.

METHODS:

In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-granularity level, and manually annotated a corpus. On the corpus, we compared two state-of-the-art methods, i.e., conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer. In order to validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent corpus.
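
As an illustration of how CWS and POS tagging are commonly cast as character-level sequence labelling for models such as CRF or BiLSTM-CRF, the sketch below converts a segmented sentence into per-character labels. The BMES tag scheme, the joint `position-POS` label format, the function name, the example sentence and the POS labels are all assumptions made for illustration; the abstract does not state which tag scheme or tag set the authors used.

```python
def words_to_char_tags(words, pos_tags=None):
    """Turn segmented words (and optional POS tags) into per-character labels.

    Uses the BMES scheme (an assumption, not taken from the paper):
    B = begin, M = middle, E = end of a multi-character word, S = single.
    """
    chars, labels = [], []
    for i, word in enumerate(words):
        if len(word) == 1:
            positions = ["S"]
        else:
            positions = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        for ch, position in zip(word, positions):
            chars.append(ch)
            # With POS tags, emit a joint label such as "B-n".
            if pos_tags is None:
                labels.append(position)
            else:
                labels.append(f"{position}-{pos_tags[i]}")
    return chars, labels


# Hypothetical example: "patient / no / fever"
chars, labels = words_to_char_tags(["患者", "无", "发热"])
```

A sequence labeller is then trained to predict one such label per character, and word boundaries (and POS tags, in the joint setting) are read back off the predicted labels.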

RESULTS:

When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also held an advantage over BiLSTM-CRF. CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. For the clinical NER system, CWS information brought a maximum improvement of 0.34% in F-measure, while combined CWS and POS information brought a maximum improvement of 0.74% in F-measure.
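
For readers unfamiliar with the metrics, the following sketch shows the conventional way precision, recall and F-measure are computed for word segmentation: a predicted word counts as correct only if its character span (and, in the joint setting, its POS tag) exactly matches a gold word. The abstract does not describe the authors' scoring procedure, so this is an assumed, standard definition; the function names and the example are hypothetical.

```python
def word_spans(words, tags=None):
    """Map a word sequence to a set of (start, end[, tag]) character spans."""
    spans, start = set(), 0
    for i, word in enumerate(words):
        end = start + len(word)
        spans.add((start, end) if tags is None else (start, end, tags[i]))
        start = end
    return spans


def precision_recall_f(gold_words, pred_words, gold_tags=None, pred_tags=None):
    """Span-level precision, recall and F-measure for CWS (or CWS plus POS)."""
    gold = word_spans(gold_words, gold_tags)
    pred = word_spans(pred_words, pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical example: the prediction wrongly merges the last two gold words.
p, r, f = precision_recall_f(["患者", "无", "发热"], ["患者", "无发热"])
```

Here only one of the two predicted words matches a gold span, so precision is 1/2, recall is 1/3 and the F-measure is their harmonic mean.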

CONCLUSIONS:

Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful, as the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another independent corpus.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Speech / Natural Language Processing / Information Storage and Retrieval / Electronic Health Records Limits: Humans Country/Region as subject: Asia Language: En Year of publication: 2019 Document type: Article
