Evaluating large language models for health-related text classification tasks with public social media data.
Guo, Yuting; Ovadje, Anthony; Al-Garadi, Mohammed Ali; Sarker, Abeed.
Affiliation
  • Guo Y; Department of Biomedical Informatics, Emory University, Atlanta, GA 30322, United States.
  • Ovadje A; Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States.
  • Al-Garadi MA; Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37235, United States.
  • Sarker A; Department of Biomedical Informatics, Emory University, Atlanta, GA 30322, United States.
Article in En | MEDLINE | ID: mdl-39121174
ABSTRACT

OBJECTIVES:

Large language models (LLMs) have demonstrated remarkable success in natural language processing (NLP) tasks. This study aimed to evaluate their performance on social media-based health-related text classification tasks.

MATERIALS AND METHODS:

We benchmarked 1 Support Vector Machine (SVM) classifier, 3 supervised pretrained language model (PLM)-based classifiers, and 2 LLM-based classifiers across 6 text classification tasks. We developed 3 approaches for leveraging LLMs: employing LLMs as zero-shot classifiers, using LLMs as data annotators, and utilizing LLMs with few-shot examples for data augmentation.
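The first two strategies above hinge on how a classification task is framed as an LLM prompt. The sketch below illustrates the general idea in Python; the function names, prompt wording, and label sets are hypothetical and are not the authors' actual implementation.

```python
# Hypothetical prompt builders for the two prompting-based strategies:
# zero-shot classification (strategy 1) and few-shot prompting, as used
# for annotation or augmentation (strategies 2-3). All names and prompt
# text here are illustrative assumptions, not the paper's exact setup.

def build_zero_shot_prompt(text: str, labels: list[str]) -> str:
    """Frame a social media post as a zero-shot classification query."""
    label_list = ", ".join(labels)
    return (
        f"Classify the following social media post into one of these "
        f"categories: {label_list}.\n"
        f"Post: {text}\n"
        f"Answer with the category name only."
    )

def build_few_shot_prompt(
    text: str, labels: list[str], examples: list[tuple[str, str]]
) -> str:
    """Prepend labeled example posts before the query, as in few-shot
    prompting for annotation or data augmentation."""
    demos = "\n".join(f"Post: {p}\nLabel: {l}" for p, l in examples)
    return demos + "\n" + build_zero_shot_prompt(text, labels)
```

The returned string would then be sent to an LLM endpoint (e.g., a GPT-3.5 or GPT-4 chat completion call), and the model's reply parsed into one of the candidate labels.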

RESULTS:

Across all tasks, the mean (SD) F1-score differences for RoBERTa, BERTweet, and SocBERT trained on human-annotated data were 0.24 (±0.10), 0.25 (±0.11), and 0.23 (±0.11), respectively, compared to those trained on data annotated using GPT-3.5, and 0.16 (±0.07), 0.16 (±0.08), and 0.14 (±0.08), respectively, compared to those trained on data annotated using GPT-4. The GPT-3.5 zero-shot classifier outperformed the SVMs in 1 of the 6 tasks, while the GPT-4 zero-shot classifier did so in 5 of the 6. When leveraging LLMs for data augmentation, RoBERTa models trained on GPT-4-augmented data performed comparably to or better than those trained on human-annotated data alone.

DISCUSSION:

The results revealed that using LLM-annotated data alone to train supervised classification models was ineffective. However, employing an LLM as a zero-shot classifier showed the potential to outperform traditional SVM models and achieved higher recall than the advanced transformer-based model RoBERTa. Additionally, our results indicated that using GPT-3.5 for data augmentation could harm model performance. In contrast, data augmentation with GPT-4 improved model performance, showcasing the potential of LLMs to reduce the need for extensive training data.

CONCLUSIONS:

By leveraging the data augmentation strategy, we can harness the power of LLMs to develop smaller, more effective domain-specific NLP models. Using LLM-annotated data without human guidance to train lightweight supervised classification models is an ineffective strategy. However, an LLM used as a zero-shot classifier shows promise in excluding false negatives and potentially reducing the human effort required for data annotation.
Full text: 1 Collection: 01-internacional Database: MEDLINE Language: En Journal: J Am Med Inform Assoc Journal subject: INFORMATICA MEDICA Year: 2024 Document type: Article Affiliation country: