When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification.
BMC Med Inform Decis Mak. 2022 Apr 5; 21(Suppl 9): 377.
Article in English | MEDLINE | ID: mdl-35382811
BACKGROUND: Natural language processing (NLP) tasks in the health domain often deal with limited amounts of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to the task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at a scale that is unlikely to be available in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP.

METHOD: We conducted a learning curve analysis to study the behavior of BERT and baseline models as the training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very few observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the number of training examples per disease from small to large and measured classification performance as the macro-averaged F1 score.

RESULTS: On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. However, in extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features.

CONCLUSION: As long as the number of training documents is not extremely small, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks such as disease classification. However, in extreme cases where each class has only one or two training documents and no more will become available, simple linear models using bag-of-words features should be considered.
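To make the learning-curve setup concrete, the following is a minimal sketch of the procedure described above, using a sparse bag-of-words baseline (TF-IDF plus logistic regression in scikit-learn) in place of BERT. The toy documents, labels, and training-size grid are illustrative assumptions only, not the paper's data sets or exact configuration.

    # Minimal learning-curve sketch: a sparse bag-of-words baseline evaluated
    # at increasing numbers of training examples per class, scored with
    # macro-averaged F1. All data below are stand-in toy examples.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline

    # Hypothetical clinical notes for two "diseases" (illustrative only).
    docs = [
        "productive cough and fever for three days",
        "persistent cough with mild fever",
        "fever cough and shortness of breath",
        "dry cough low grade fever",
        "crushing chest pain radiating to left arm",
        "chest pain on exertion relieved by rest",
        "intermittent chest pain and palpitations",
        "sharp chest pain worse on inspiration",
    ]
    labels = ["pneumonia"] * 4 + ["angina"] * 4

    # Hold out the last note of each class for evaluation.
    test_idx = [3, 7]
    train_idx = [i for i in range(len(docs)) if i not in test_idx]
    X_test = [docs[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]

    # Increment the number of training examples per class and record macro-F1.
    for n_per_class in (1, 2, 3):
        sel = []
        for cls in ("pneumonia", "angina"):
            sel += [i for i in train_idx if labels[i] == cls][:n_per_class]
        model = make_pipeline(TfidfVectorizer(),
                              LogisticRegression(max_iter=1000))
        model.fit([docs[i] for i in sel], [labels[i] for i in sel])
        preds = model.predict(X_test)
        print(n_per_class, "per class -> macro-F1:",
              round(f1_score(y_test, preds, average="macro"), 3))

The paper's analysis repeats this kind of loop for each model (BERT and the baselines) and plots macro-F1 against the number of training documents per disease.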
Keywords
Full text:
1
Collections:
01-internacional
Database:
MEDLINE
Main subject:
Natural Language Processing
/
Learning Curve
Study type:
Prognostic_studies
Limit:
Humans
Language:
En
Publication year:
2022
Document type:
Article