Speculation detection for Chinese clinical notes: Impacts of word segmentation and embedding models.

Zhang, Shaodian; Kang, Tian; Zhang, Xingting; Wen, Dong; Elhadad, Noémie; Lei, Jianbo

Affiliation

Zhang S; Department of Biomedical Informatics, Columbia University, New York, USA.
Kang T; Department of Biomedical Informatics, Columbia University, New York, USA.
Zhang X; Center for Medical Informatics, Peking University, Beijing, China.
Wen D; Center for Medical Informatics, Peking University, Beijing, China.
Elhadad N; Department of Biomedical Informatics, Columbia University, New York, USA.
Lei J; Center for Medical Informatics, Peking University, Beijing, China. Electronic address: jblei@hsc.pku.edu.cn.

J Biomed Inform; 60: 334-41, 2016 Apr.

Article in English | MEDLINE | ID: mdl-26923634

ABSTRACT

Speculations represent uncertainty toward certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentation are worth exploiting for this task. We propose a sequence-labeling-based system for speculation detection, which relies on features from bag of characters, bag of words, character embeddings, and word embeddings. We experiment on a novel dataset of 36,828 clinical notes, of which 2,000 notes carry 5,103 gold-standard speculation annotations, and compare systems in which word embeddings are computed from word segmentations produced by a general segmenter and by a domain-specific segmenter, respectively. Our systems reach an F score as high as 92.2%. We demonstrate that word segmentation is critical to producing high-quality word embeddings that facilitate downstream information extraction applications, and suggest that a domain-dependent word segmenter can be vital to such a clinical NLP task in the Chinese language.
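
As a rough illustration of what a sequence-labeling formulation with bag-of-characters features can look like, the sketch below converts annotated speculation spans into per-character labels and builds a small character-window feature dictionary for each position. The abstract does not state the tagging scheme, window size, or learner, so the BIO scheme, the `SPEC` label, the function names `bio_tags` and `char_features`, and the example sentence are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple


def bio_tags(length: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert [start, end) character spans into per-character BIO labels."""
    tags = ["O"] * length
    for start, end in spans:
        tags[start] = "B-SPEC"          # first character of a speculation cue
        for i in range(start + 1, end):
            tags[i] = "I-SPEC"          # remaining characters of the cue
    return tags


def char_features(chars: str, i: int, window: int = 1) -> Dict[str, str]:
    """Bag-of-characters features: the character at i and its neighbours."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sentence are padded (hypothetical convention).
        feats[f"char[{offset}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


# Invented example sentence, roughly "possibly pneumonia"; the cue is "可能".
sentence = "可能为肺炎"
labels = bio_tags(len(sentence), [(0, 2)])
features = [char_features(sentence, i) for i in range(len(sentence))]
```

Feature dictionaries and label sequences of this shape could be passed to any sequence labeler; word-level and embedding features of the kind the abstract mentions would be added as further keys in the same dictionaries.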

Keywords

Chinese NLP; Clinical NLP; Natural language processing; Speculation detection; Word embedding; Word segmentation

Full text: 1
Collection: 01-internacional
Database: MEDLINE
Main subject: Natural Language Processing / Electronic Health Records / Data Mining
Type of study: Diagnostic_studies / Prognostic_studies
Limits: Humans
Country/Region as subject: Asia
Language: En
Journal: J Biomed Inform
Journal subject: Medical Informatics
Year: 2016
Document type: Article
Affiliation country: United States of America
Country of publication: United States of America