Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model

Seo-Hyun OH; Min KANG; Youngho LEE

Healthcare Informatics Research ; : 16-24, 2022.

Article in English | WPRIM | ID: wpr-914496

ABSTRACT

Objectives: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is recognizing PHI entity names in clinical documents. This study aimed to compare the performance of three pre-trained models that have recently attracted significant attention and to determine which is most suitable for PHI recognition.

Methods: We compared the PHI recognition performance of deep learning models using the i2b2 2014 dataset. Three pre-trained models were used to detect PHI: bidirectional encoder representations from transformers (BERT), robustly optimized BERT pre-training approach (RoBERTa), and XLNet (a model built on Transformer-XL). After the dataset was tokenized, it was labeled using an inside-outside-beginning (IOB) tagging scheme and WordPiece-tokenized for input to these models. The PHI recognition performance of BERT, RoBERTa, and XLNet was then evaluated.

Results: Among the three models, XLNet achieved the highest F1-score, 96.29%. In addition, in the evaluation by PHI entity, RoBERTa and XLNet showed a 30% improvement in performance compared to BERT.

Conclusions: Among the pre-trained models used in this study, XLNet exhibited superior performance because word embedding was well constructed using the two-stream self-attention method. In addition, RoBERTa and XLNet outperformed BERT, indicating that they were more effective in grasping the context.
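The preprocessing described in the Methods (IOB tagging followed by subword tokenization) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the study: the example sentence, the entity labels DOCTOR and DATE, the helper names, and in particular `toy_subword_split`, which is a hypothetical stand-in for a real WordPiece tokenizer. The rule for propagating labels to subword pieces (first piece keeps the word's tag, later pieces receive the matching I- tag) is one common convention; the abstract does not state which rule the authors used.

```python
from typing import List, Tuple


def iob_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Label word tokens with IOB tags.

    spans holds (start, end_exclusive, entity_type) in token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for WordPiece: fixed-size chunks with '##' prefixes."""
    if not word:
        return []
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(tokens: List[str], tags: List[str]) -> Tuple[List[str], List[str]]:
    """Spread word-level IOB tags over subword pieces (assumed convention)."""
    sub_tokens: List[str] = []
    sub_tags: List[str] = []
    for word, tag in zip(tokens, tags):
        for j, piece in enumerate(toy_subword_split(word)):
            sub_tokens.append(piece)
            if j == 0 or tag == "O":
                sub_tags.append(tag)              # first piece keeps the word tag
            else:
                sub_tags.append("I-" + tag[2:])   # later pieces continue the entity
    return sub_tokens, sub_tags


# Invented example sentence, already split into word tokens.
tokens = ["Seen", "by", "Dr", "Johnson", "on", "2014-03-01"]
spans = [(3, 4, "DOCTOR"), (5, 6, "DATE")]

word_tags = iob_tags(tokens, spans)
sub_tokens, sub_tags = align_labels(tokens, word_tags)

for piece, tag in zip(sub_tokens, sub_tags):
    print(f"{piece}\t{tag}")
```

With a real tokenizer, the subword boundaries would differ from this toy split, but the alignment step keeps the same shape: one label per subword piece, derived from the word-level IOB tag.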

Full text: Available Index: WPRIM (Western Pacific) Type of study: Prognostic study Language: English Journal: Healthcare Informatics Research Year: 2022 Type: Article