Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study.
Aljamaan, Fadi; Temsah, Mohamad-Hani; Altamimi, Ibraheem; Al-Eyadhy, Ayman; Jamal, Amr; Alhasan, Khalid; Mesallam, Tamer A; Farahat, Mohamed; Malki, Khalid H.
Affiliations
  • Aljamaan F; College of Medicine, King Saud University, Riyadh, Saudi Arabia.
  • Temsah MH; College of Medicine, King Saud University, Riyadh, Saudi Arabia.
  • Altamimi I; College of Medicine, King Saud University, Riyadh, Saudi Arabia.
  • Al-Eyadhy A; College of Medicine, King Saud University, Riyadh, Saudi Arabia.
  • Jamal A; College of Medicine, King Saud University, Riyadh, Saudi Arabia.
  • Alhasan K; College of Medicine, King Saud University, Riyadh, Saudi Arabia.
  • Mesallam TA; Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia.
  • Farahat M; Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia.
  • Malki KH; Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia.
JMIR Med Inform; 12: e54345, 2024 Jul 31.
Article in English | MEDLINE | ID: mdl-39083799
ABSTRACT

BACKGROUND:

Artificial intelligence (AI) chatbots have recently been adopted by health care practitioners in medical practice. However, their output has been found to contain varying degrees of hallucination, in both content and references. Such hallucinations cast doubt on chatbot output and hinder its implementation.

OBJECTIVE:

The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations.

METHODS:

Six AI chatbots were challenged with the same 10 medical prompts, each requesting 10 references. The RHS comprises 6 bibliographic items plus the reference's relevance to the prompt's keywords. The RHS was calculated for each reference, each prompt, and each prompt type (basic vs complex). The average RHS was then calculated for each AI chatbot and compared across prompt types and chatbots, as sketched below.
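The abstract does not give the exact RHS rubric, but the calculation it describes (scoring 6 bibliographic items plus keyword relevance per reference, then averaging per chatbot) can be sketched as follows. The item names and the binary per-item scoring are assumptions for illustration; the reported per-chatbot scores of 1 to 11 suggest the actual rubric uses a graded scale rather than this 0-7 binary one.

    # Hypothetical sketch of the RHS calculation described in the Methods.
    # Item names and binary scoring are assumptions, not the paper's rubric.
    BIB_ITEMS = ("authors", "title", "journal", "year", "volume_pages", "doi")

    def reference_rhs(ref: dict) -> int:
        """Score one reference: 1 point per hallucinated bibliographic item,
        plus 1 point if the reference is irrelevant to the prompt keywords."""
        score = sum(1 for item in BIB_ITEMS if ref.get(item) == "hallucinated")
        if not ref.get("relevant_to_keywords", False):
            score += 1
        return score  # 0 = fully authentic .. 7 = fully hallucinated

    def chatbot_rhs(references: list[dict]) -> float:
        """Average RHS across all references a chatbot produced."""
        return sum(reference_rhs(r) for r in references) / len(references)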

RESULTS:

Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), while Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated a middle RHS (score=7). The highest degree of hallucination was observed for reference relevancy to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (β coefficient=-0.069; P=.32), while Perplexity had significantly lower RHS than ChatGPT (β coefficient=-0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenarios or complex format prompts (β coefficient=0.486; P<.001).
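The β coefficients above contrast each chatbot against ChatGPT and complex against basic prompts, which is consistent with regressing per-reference RHS on chatbot identity and prompt type. A minimal sketch with statsmodels, assuming an ordinary least squares model and hypothetical column names (the paper may have used a different estimator):

    import pandas as pd
    import statsmodels.formula.api as smf

    # df: one row per generated reference, with its RHS, the chatbot that
    # produced it, and the prompt format. File and column names are assumed.
    df = pd.read_csv("rhs_scores.csv")

    # ChatGPT as the baseline category, so each chatbot's beta is its RHS
    # difference from ChatGPT; prompt_type contrasts complex vs basic prompts.
    model = smf.ols(
        "rhs ~ C(chatbot, Treatment(reference='ChatGPT')) + C(prompt_type)",
        data=df,
    ).fit()
    print(model.summary())  # beta coefficients and P values per term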

CONCLUSIONS:

The variation in RHS underscores the need for a robust reference evaluation tool to improve the authenticity of AI chatbots' citations, and it highlights the importance of verifying their output and references. Elicit and SciSpace showed negligible hallucination, while ChatGPT and Bing showed critical hallucination levels. The proposed RHS could contribute to ongoing efforts to enhance the general reliability of AI in medical research.

Full text: 1 Database: MEDLINE Language: English Year: 2024 Document type: Article
