Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study.
JMIR Med Educ; 10: e52784, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39140269
ABSTRACT
Background:
With the increasing application of large language models such as ChatGPT across industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.
Objective:
The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).
Methods:
The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), the prompt's designation of a system role tailored to the medical subspecialty, and repetition of queries to assess coherence. The passing accuracy threshold was set at 60%. χ² tests and κ values were used to evaluate the models' accuracy and consistency.
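A minimal sketch of this repeated-query protocol, using the official openai Python client, might look like the following. This is not the authors' code; the model identifier, the subspecialty role text, and the repeat count are illustrative assumptions.

```python
# Illustrative sketch of the repeated-query protocol; not the authors' code.
# The model name, role text, and repeat count are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_repeatedly(question: str, model: str = "gpt-4",
                   system_role: str | None = None, repeats: int = 8) -> list[str]:
    """Send one CNMLE question `repeats` times and collect the raw answers."""
    messages = []
    if system_role:
        # e.g. "You are a cardiologist taking a medical licensing exam."
        messages.append({"role": "system", "content": system_role})
    messages.append({"role": "user", "content": question})
    answers = []
    for _ in range(repeats):
        resp = client.chat.completions.create(model=model, messages=messages)
        answers.append(resp.choices[0].message.content.strip())
    return answers
```
Results: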
GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001), and both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (by 0.3%-3.7%) and GPT-3.5 (by 1.3%-4.5%) and reduced variability by 1.7% and 1.8%, respectively, although these changes did not reach significance (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy across question types (P>.05). On the first response, GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15.
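As a hedged illustration of the reported analyses, the accuracy comparison can be run as a χ² test on a 2x2 contingency table and the repeat-to-repeat agreement summarized with Fleiss' κ. The counts below are reconstructed from the stated percentages on 500 questions (rounded, approximate), and the repeated-answer matrix is toy data.

```python
# Illustrative reanalysis sketch; counts are reconstructed from the reported
# percentages (72.7% and 54% of 500 questions) and are therefore approximate.
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# 2x2 table: rows = model, columns = (correct, incorrect)
table = [[364, 136],   # GPT-4.0: ~72.7% of 500
         [270, 230]]   # GPT-3.5: 54% of 500
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, P = {p:.2g}")

# Consistency: answers coded as option indices, one row per question,
# one column per repeated query (hypothetical toy data shown here).
repeated_answers = [[0, 0, 0, 1],
                    [2, 2, 2, 2],
                    [1, 1, 0, 1]]
counts, _ = aggregate_raters(repeated_answers)
print(f"Fleiss kappa = {fleiss_kappa(counts):.3f}")
```
Conclusions: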
GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role yielded only nonsignificant improvements in the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice and merits further study.
Key words
ChatGPT; China; Chinese; Chinese National Medical Licensing Examination; LLM; LLMs; OpenAI; accuracy; answer; answers; artificial intelligence; chatbot; chatbots; conversational agent; conversational agents; exam; examination; examinations; exams; language model; language models; large language models; medical education; performance; response; responses; system role
Database: MEDLINE
Main subject: Educational Measurement / Licensure, Medical
Limits: Humans
Country/Region as subject: Asia
Language: English
Journal: JMIR Med Educ
Year: 2024
Document type: Article
Affiliation country: China
Country of publication: Canada