Evaluating large language models on medical evidence summarization.
Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G; Soroush, Ali; Elias, Pierre A; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin F; Weng, Chunhua; Peng, Yifan.
Affiliations
  • Tang L; School of Information, The University of Texas at Austin, Austin, TX, USA.
  • Sun Z; Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
  • Idnay B; Department of Biomedical Informatics, Columbia University, New York, NY, USA.
  • Nestor JG; Department of Medicine, Columbia University, New York, NY, USA.
  • Soroush A; Department of Medicine, Columbia University, New York, NY, USA.
  • Elias PA; Department of Biomedical Informatics, Columbia University, New York, NY, USA.
  • Xu Z; Department of Medicine, Massachusetts General Hospital, Boston, MA, USA.
  • Ding Y; School of Information, The University of Texas at Austin, Austin, TX, USA.
  • Durrett G; Department of Computer Science, The University of Texas at Austin, Austin, TX, USA.
  • Rousseau JF; Departments of Population Health and Neurology, Dell Medical School, The University of Texas at Austin, Austin, TX, USA; Department of Neurology, University of Texas Southwestern Medical Center, Dallas, TX, USA. justin.rousseau@utsouthwestern.edu.
  • Weng C; Department of Biomedical Informatics, Columbia University, New York, NY, USA. cw2384@cumc.columbia.edu.
  • Peng Y; Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
NPJ Digit Med; 6(1): 158, 2023 Aug 24.
Article in English | MEDLINE | ID: mdl-37620423
ABSTRACT
Recent advances in large language models (LLMs) have demonstrated remarkable zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not correlate strongly with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs can be susceptible to generating factually inconsistent summaries and to making overly convincing or overly uncertain statements, leading to potential harm from misinformation. Moreover, we find that models struggle to identify salient information and are more error-prone when summarizing longer textual contexts.
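The abstract's central methodological point, that automatic metrics often correlate poorly with human judgments of summary quality, can be illustrated with a short script. The sketch below is not code from the study: it assumes the third-party `rouge-score` and `scipy` packages, and the summaries and expert ratings are invented placeholders chosen only to show the mechanics of the comparison.

```python
# A minimal sketch (not the study's actual code) of a metric-vs-human
# correlation check like the one the abstract describes. Assumes the
# `rouge-score` and `scipy` packages; all data below are placeholders.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Hypothetical reference summaries and model-generated candidates.
references = [
    "The intervention reduced all-cause mortality in the treatment group.",
    "No significant difference in outcomes was observed between arms.",
    "Adverse events were more frequent with the higher dose.",
    "Evidence on long-term efficacy remains inconclusive.",
]
candidates = [
    "Mortality fell among patients receiving the intervention.",
    "The trial proved the drug works for every patient.",  # overconfident claim
    "Higher doses caused more adverse events.",
    "Long-term efficacy is uncertain.",
]

# Hypothetical expert quality ratings (e.g., on a 1-5 Likert scale).
human_quality = [4.5, 1.5, 4.0, 4.0]

# Score each candidate against its reference with ROUGE-L F1.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = [
    scorer.score(ref, cand)["rougeL"].fmeasure
    for ref, cand in zip(references, candidates)
]

# Rank correlation between the automatic metric and human judgments;
# the study's finding is that such correlations are often weak.
rho, p_value = spearmanr(rouge_l, human_quality)
print(f"ROUGE-L F1 per summary: {[round(s, 3) for s in rouge_l]}")
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```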

Full text: 1 | Database: MEDLINE | Study type: Prognostic studies | Language: English | Publication year: 2023 | Document type: Article
