Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability.
Cao, Jennie J; Kwon, Daniel H; Ghaziani, Tara T; Kwo, Paul; Tse, Gary; Kesselman, Andrew; Kamaya, Aya; Tse, Justin R.
Affiliation
  • Cao JJ; Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.
  • Kwon DH; Department of Medicine, University of California, San Francisco School of Medicine, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA.
  • Ghaziani TT; Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA.
  • Kwo P; Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA.
  • Tse G; Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles, 757 Westwood Plaza, Los Angeles, CA, 90095, USA.
  • Kesselman A; Department of Radiology, Stanford University School of Medicine, 875 Blake Wilbur Drive Palo Alto, Stanford, CA, 94304, USA.
  • Kamaya A; Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.
  • Tse JR; Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA. jrtse@stanford.edu.
Abdom Radiol (NY); 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39088019
ABSTRACT

PURPOSE:

To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

METHODS:

Twenty questions on liver cancer diagnosis and management were each asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. A response was considered accurate overall if its mean score was > 0, and a question's answer was considered reliable if the mean score was > 0 across all three responses to that question. Readability was quantified with the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level. Readability and accuracy across the 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
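For illustration, the scoring rule, readability metrics, and statistical comparisons described above can be sketched in a few lines of Python. This is a minimal sketch with hypothetical data, not the authors' analysis code; it assumes the textstat package for the Flesch metrics, scipy.stats.f_oneway for the one-way ANOVA, and statsmodels' pairwise_tukeyhsd for Tukey's multiple comparison test.

```python
# A minimal sketch of the scoring and readability pipeline described above,
# not the authors' actual analysis code. All data below are hypothetical.
import numpy as np
import textstat
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical reviewer scores for one question: 6 physicians x 3 replicate
# responses, coded 1 (accurate), 0 (inadequate), -1 (inaccurate).
scores = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
])
response_means = scores.mean(axis=0)  # mean score per replicate response
accurate = response_means > 0         # a response is accurate if mean > 0
reliable = accurate.all()             # reliable if all three responses are accurate
print(f"means={response_means}, accurate={accurate}, reliable={reliable}")

# Readability of a (hypothetical) chatbot response.
text = ("Hepatocellular carcinoma surveillance is typically performed with "
        "ultrasonography every six months in patients with cirrhosis.")
print("Flesch Reading Ease:", textstat.flesch_reading_ease(text))
print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(text))

# Compare readability across the three chatbots: one-way ANOVA followed by
# Tukey's multiple comparison test (20 hypothetical scores per chatbot).
rng = np.random.default_rng(0)
fres = {
    "ChatGPT": rng.normal(29, 5, 20),
    "Gemini": rng.normal(30, 5, 20),
    "Bing": rng.normal(40, 5, 20),
}
print(f_oneway(*fres.values()))
endog = np.concatenate(list(fres.values()))
groups = np.repeat(list(fres.keys()), 20)
print(pairwise_tukeyhsd(endog, groups, alpha=0.05))
```

Note that the reliability criterion requires every replicate response to clear the accuracy threshold, which is why a chatbot can answer a question accurately without being reliable on it.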

RESULTS:

Of the 20 questions, ChatGPT answered 9 (45%) accurately, Gemini 12 (60%), and Bing 6 (30%); however, only 6 (30%), 8 (40%), and 3 (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate level), followed by Gemini (30; college) and Bing (40; college; p < 0.001).

CONCLUSION:

Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
Full text: 1 Collection: 01-international Database: MEDLINE Language: English Journal: Abdom Radiol (NY) Year: 2024 Document type: Article Country of affiliation: United States