Urology consultants versus large language models: Potentials and hazards for medical advice in urology.
Eckrich, Johanna; Ellinger, Jörg; Cox, Alexander; Stein, Johannes; Ritter, Manuel; Blaikie, Andrew; Kuhn, Sebastian; Buhr, Christoph Raphael.
Affiliations
  • Eckrich J; Department of Urology, University Hospital Bonn, Bonn, Germany.
  • Ellinger J; Department of Urology, University Hospital Bonn, Bonn, Germany.
  • Cox A; Department of Urology, University Hospital Bonn, Bonn, Germany.
  • Stein J; Department of Urology, University Hospital Bonn, Bonn, Germany.
  • Ritter M; Department of Urology, University Hospital Bonn, Bonn, Germany.
  • Blaikie A; School of Medicine, University of St Andrews, St Andrews, UK.
  • Kuhn S; Institute of Digital Medicine, Philipps-University Marburg and University Hospital of Giessen and Marburg, Marburg, Germany.
  • Buhr CR; School of Medicine, University of St Andrews, St Andrews, UK.
BJUI Compass; 5(5): 438-444, 2024 May.
Article in English | MEDLINE | ID: mdl-38751951
ABSTRACT

Background:

Current interest surrounding large language models (LLMs) will lead to increased use of these models for medical advice. Although LLMs offer huge potential, they also pose misinformation hazards.

Objective:

This study evaluates three LLMs answering urology-themed, clinical case-based questions and compares the quality of their answers with those provided by urology consultants.

Methods:

Forty-five case-based questions were answered by consultants and LLMs (ChatGPT 3.5, ChatGPT 4, Bard). Answers were blindly rated by four consultants on a six-point Likert scale in the categories 'medical adequacy', 'conciseness', 'coherence' and 'comprehensibility'. Possible misinformation hazards were identified, a modified Turing test was included, and the character count was matched.
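
As a purely illustrative sketch (not part of the published study), a category-wise comparison of blinded Likert ratings between consultants and LLMs could be run with a nonparametric test per category; the CSV file, column names, and rating data below are assumptions, not the authors' actual analysis pipeline:

# Hypothetical sketch: compare blinded Likert ratings (1-6) between
# consultants and LLMs per category with a Mann-Whitney U test.
# File name and column names are assumptions, not from the study.
import pandas as pd
from scipy.stats import mannwhitneyu

ratings = pd.read_csv("blinded_ratings.csv")  # columns: source, category, rating

for category in ["medical adequacy", "conciseness", "coherence", "comprehensibility"]:
    subset = ratings[ratings["category"] == category]
    consultant = subset.loc[subset["source"] == "consultant", "rating"]
    llm = subset.loc[subset["source"] != "consultant", "rating"]
    stat, p = mannwhitneyu(consultant, llm, alternative="two-sided")
    print(f"{category}: U={stat:.1f}, p={p:.4f}")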

Results:

The consultants received higher ratings in every category. The LLMs' overall performance in the language-focused categories (coherence and comprehensibility) was relatively high, but their medical adequacy was significantly poorer than that of the consultants. Possible misinformation hazards were identified in 2.8% to 18.9% of LLM-generated answers compared with <1% of the consultants' answers. The LLMs also produced less concise answers with higher character counts. Among the individual LLMs, ChatGPT 4 performed best in medical accuracy (p < 0.0001) and coherence (p = 0.001), whereas Bard received the lowest scores. In the modified Turing test, responses were correctly attributed to their source in 98% of cases for the LLMs and 99% for the consultants.
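
For illustration only (the rater-level data are not reported in the abstract), the attribution accuracy from the modified Turing test is simply the fraction of blinded answers whose true source was guessed correctly; the function and toy data below are hypothetical:

# Hypothetical sketch: accuracy of raters attributing blinded answers
# to their true source (LLM vs consultant) in a modified Turing test.
def attribution_accuracy(guesses, truths):
    """Fraction of answers whose source was guessed correctly."""
    correct = sum(g == t for g, t in zip(guesses, truths))
    return correct / len(truths)

# Toy data only; the real attributions came from four blinded consultants.
truths = ["llm", "llm", "consultant", "consultant"]
guesses = ["llm", "consultant", "consultant", "consultant"]
print(f"Attribution accuracy: {attribution_accuracy(guesses, truths):.0%}")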

Conclusions:

The quality of consultant answers was superior to that of the LLMs in all categories. LLM answers received high semantic scores; however, their lack of medical accuracy creates potential misinformation hazards from LLM 'consultations'. Further investigation of newer LLM generations is necessary.

Full text: 1 Collection: 01-international Database: MEDLINE Language: English Journal: BJUI Compass Year: 2024 Document type: Article
