Comparison of Prompt Engineering and Fine-Tuning Strategies in Large Language Models in the Classification of Clinical Notes.
Zhang, Xiaodan; Talukdar, Nabasmita; Vemulapalli, Sandeep; Ahn, Sumyeong; Wang, Jiankun; Meng, Han; Murtaza, Sardar Mehtab Bin; Leshchiner, Dmitry; Dave, Aakash Ajay; Joseph, Dimitri F; Witteveen-Lane, Martin; Chesla, Dave; Zhou, Jiayu; Chen, Bin.
Affiliation
  • Zhang X; Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.
  • Talukdar N; Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.
  • Vemulapalli S; Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.
  • Ahn S; Office of Research, Spectrum Health, Grand Rapids, MI, USA.
  • Wang J; Department of Computer Science and Engineering, College of Engineering, Michigan State University, East Lansing, MI, USA.
  • Meng H; Department of Computer Science and Engineering, College of Engineering, Michigan State University, East Lansing, MI, USA.
  • Murtaza SMB; Department of Computer Science and Engineering, College of Engineering, Michigan State University, East Lansing, MI, USA.
  • Leshchiner D; Department of Computer Science and Engineering, College of Engineering, Michigan State University, East Lansing, MI, USA.
  • Dave AA; Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.
  • Joseph DF; Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.
  • Witteveen-Lane M; Center for Bioethics and Social Justice, Michigan State University, Grand Rapids, MI, USA.
  • Chesla D; Department of Pharmacology and Toxicology, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA.
  • Zhou J; Office of Research, Spectrum Health, Grand Rapids, MI, USA.
  • Chen B; Office of Research, Spectrum Health, Grand Rapids, MI, USA.
medRxiv; 2024 Feb 08.
Article in En | MEDLINE | ID: mdl-38370673
ABSTRACT
Emerging large language models (LLMs) are being actively evaluated in various fields, including healthcare. Most studies have focused on established benchmarks and standard parameters; however, the variation and impact of prompt engineering and fine-tuning strategies have not been fully explored. This study benchmarks GPT-3.5 Turbo, GPT-4, and Llama-7B against BERT models and medical fellows' annotations in identifying patients with metastatic cancer from discharge summaries. Results revealed that clear, concise prompts incorporating reasoning steps significantly enhanced performance. GPT-4 exhibited superior performance among all models. Notably, one-shot learning and fine-tuning provided no incremental benefit. The model's accuracy was sustained even when keywords for metastatic cancer were removed or when half of the input tokens were randomly discarded. These findings underscore GPT-4's potential to substitute for specialized models, such as PubMedBERT, through strategic prompt engineering, and suggest opportunities to improve open-source models, which are better suited for use in clinical settings.
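The robustness test described in the abstract (discarding half of the input tokens at random before classification) can be sketched as a simple perturbation function. This is an illustrative reconstruction, not the authors' code; the function name, the whitespace tokenization, and the sample note are assumptions for demonstration.

```python
import random

def discard_tokens(text: str, fraction: float = 0.5, seed: int = 0) -> str:
    """Randomly drop a fraction of whitespace-delimited tokens,
    preserving the original order of the surviving tokens."""
    rng = random.Random(seed)
    tokens = text.split()
    n_keep = max(1, int(len(tokens) * (1 - fraction)))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return " ".join(tokens[i] for i in kept)

# Hypothetical discharge-summary snippet used only for illustration.
note = ("Patient with history of lung adenocarcinoma, now with "
        "lesions in liver and bone consistent with metastatic disease.")
perturbed = discard_tokens(note, fraction=0.5)
print(perturbed)
```

In the study's setup, the perturbed text would then be sent to the LLM with the same classification prompt, and accuracy compared against the unperturbed baseline.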

Full text: 1 Collection: 01-internacional Database: MEDLINE Language: En Journal: MedRxiv Year: 2024 Document type: Article Affiliation country: United States Country of publication: United States