The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant.
Zhang, Jingqing; Sun, Kai; Jagadeesh, Akshay; Falakaflaki, Parastoo; Kayayan, Elena; Tao, Guanyu; Haghighat Ghahfarokhi, Mahta; Gupta, Deepa; Gupta, Ashok; Gupta, Vibhor; Guo, Yike.
Affiliation
  • Zhang J; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
  • Sun K; Data Science Institute, Imperial College London, London, SW7 2AZ, United Kingdom.
  • Jagadeesh A; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
  • Falakaflaki P; Data Science Institute, Imperial College London, London, SW7 2AZ, United Kingdom.
  • Kayayan E; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
  • Tao G; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
  • Haghighat Ghahfarokhi M; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
  • Gupta D; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
  • Gupta A; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
  • Gupta V; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
  • Guo Y; Pangaea Data Limited, London, SE1 7LY, United Kingdom.
Article in English | MEDLINE | ID: mdl-39018498
ABSTRACT

OBJECTIVES:

This study aims to evaluate the utility of large language models (LLMs) in healthcare, focusing on their applications in enhancing patient care through improved diagnostic and decision-making processes, and as ancillary tools for healthcare professionals.

MATERIALS AND METHODS:

We evaluated ChatGPT, GPT-4, and LLaMA in identifying patients with specific diseases using gold-labeled Electronic Health Records (EHRs) from the MIMIC-III database, covering two prevalent diseases, Chronic Obstructive Pulmonary Disease (COPD) and Chronic Kidney Disease (CKD), along with the rare condition Primary Biliary Cirrhosis (PBC) and the hard-to-diagnose condition cancer cachexia.
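Few-shot patient identification of the kind evaluated here can be framed as a prompt that presents a handful of labeled example notes before the note to classify. The sketch below is a generic illustration only, not the study's actual prompt; the example notes, labels, and `build_few_shot_prompt` helper are invented for demonstration.

```python
def build_few_shot_prompt(disease: str,
                          examples: list[tuple[str, str]],
                          note: str) -> str:
    """Assemble a few-shot binary-classification prompt from labeled EHR snippets."""
    lines = [f"Decide whether each patient note indicates {disease}. Answer Yes or No."]
    for text, label in examples:
        lines.append(f"Note: {text}\nAnswer: {label}")
    # The note to classify goes last, with the answer left open for the model.
    lines.append(f"Note: {note}\nAnswer:")
    return "\n\n".join(lines)

# Hypothetical labeled examples (not from MIMIC-III)
examples = [
    ("Long-standing smoker with chronic dyspnea and airflow obstruction.", "Yes"),
    ("Routine visit; lungs clear, no respiratory complaints.", "No"),
]
prompt = build_few_shot_prompt(
    "Chronic Obstructive Pulmonary Disease (COPD)",
    examples,
    "Emphysema noted on CT; FEV1/FVC ratio reduced.",
)
```

Zero-shot evaluation corresponds to passing an empty `examples` list, so the same scaffold covers both prompting regimes compared in the study.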

RESULTS:

In patient identification, GPT-4 performed comparably to or better than the corresponding disease-specific machine learning models (F1-score ≥ 85%) on COPD, CKD, and PBC. GPT-4 excelled in the PBC use case, achieving a 4.23% higher F1-score than the disease-specific "Traditional Machine Learning" models. ChatGPT and LLaMA3 performed worse than GPT-4 across all diseases and almost all metrics. Few-shot prompts also helped ChatGPT, GPT-4, and LLaMA3 achieve higher precision and specificity but lower sensitivity and Negative Predictive Value (NPV).
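The metrics reported above (precision, sensitivity, specificity, NPV, and F1-score) all derive from the binary confusion matrix of a patient-identification run. The sketch below shows the standard definitions; the counts in the example are made-up and are not results from this study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0      # positive predictive value
    sensitivity = tp / (tp + fn) if tp + fn else 0.0    # recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0            # negative predictive value
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "npv": npv, "f1": f1}

# Example with hypothetical counts, not study data
m = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

The trade-off described above is visible in these formulas: few-shot prompting that makes a model more conservative raises precision and specificity (fewer false positives) while lowering sensitivity and NPV (more false negatives).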

DISCUSSION:

The study highlights the potential and limitations of LLMs in healthcare. Issues such as prediction errors, limited explainability, and ethical concerns around data privacy and model transparency suggest that these models should serve as supplementary tools in clinical settings. Future studies should improve training datasets and model designs so that LLMs gain better utility in healthcare.

CONCLUSION:

The study shows that LLMs have the potential to assist clinicians with tasks such as patient identification, but false positives and false negatives must be mitigated before LLMs are adequate for real-world clinical assistance.
Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: J Am Med Inform Assoc Journal subject: Medical Informatics Publication year: 2024 Document type: Article Country of affiliation: United Kingdom
