
Popular large language model chatbots' accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries.

Pushpanathan, Krithi; Lim, Zhi Wei; Er Yew, Samantha Min; Chen, David Ziyou; Hui'En Lin, Hazel Anne; Lin Goh, Jocelyn Hui; Wong, Wendy Meihua; Wang, Xiaofei; Jin Tan, Marcus Chun; Chang Koh, Victor Teck; Tham, Yih-Chung.

iScience ; 26(11): 108163, 2023 Nov 17.

Article in English | MEDLINE | ID: mdl-37915603

ABSTRACT

In light of growing interest in using emerging large language models (LLMs) for self-diagnosis, we systematically assessed the performance of ChatGPT-3.5, ChatGPT-4.0, and Google Bard in delivering proficient responses to 37 common inquiries regarding ocular symptoms. Responses were masked, randomly shuffled, and then graded by three consultant-level ophthalmologists for accuracy (poor, borderline, good) and comprehensiveness. Additionally, we evaluated the self-awareness capabilities (ability to self-check and self-correct) of the LLM-Chatbots. 89.2% of ChatGPT-4.0 responses were 'good'-rated, outperforming ChatGPT-3.5 (59.5%) and Google Bard (40.5%) significantly (all p < 0.001). All three LLM-Chatbots showed optimal mean comprehensiveness scores as well (ranging from 4.6 to 4.7 out of 5). However, they exhibited subpar to moderate self-awareness capabilities. Our study underscores the potential of ChatGPT-4.0 in delivering accurate and comprehensive responses to ocular symptom inquiries. Future rigorous validation of their performance is crucial to ensure their reliability and appropriateness for actual clinical use.
