Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank.
Medicine (Baltimore); 103(9): e37325, 2024 Mar 01. Article in English | MEDLINE | ID: mdl-38428889
ABSTRACT
Large language models (LLMs) have been deployed in diverse fields, and their potential for application in medicine has been explored in numerous studies. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard on the Emergency Medicine Board Examination question bank in the Korean language. Of the 2353 questions in the question bank, 150 were randomly selected, and the 27 that contained figures were excluded, leaving 123 questions. Questions requiring abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions; those answerable through recall of memorized factual information alone were classified as lower-order questions. The 123 questions were input into each LLM, and the resulting answers and explanations were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) showed higher correct response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 showed the highest correct response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions at 71.4%. The appropriateness of the explanations for the answers was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). Overall, ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard on a random selection of Emergency Medicine Board Examination questions in the Korean language.
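The percentages above are simple correct-response rates over the 123 usable questions. The following is a minimal Python sketch of that arithmetic, not the authors' code: the sampling step, the dictionary of model names, and the correct-answer counts are hypothetical, with the counts back-calculated from the percentages reported in the abstract; only the question totals (2353, 150, 27) come from the abstract itself.

```python
# Minimal sketch (not the authors' code) of the study's scoring arithmetic.
# Correct-answer counts below are hypothetical, back-calculated from the
# reported percentages; only the question counts are taken from the abstract.

import random

BANK_SIZE = 2353        # questions in the board examination question bank
SAMPLE_SIZE = 150       # questions randomly selected for the study
FIGURE_QUESTIONS = 27   # excluded because they contained figures
USABLE = SAMPLE_SIZE - FIGURE_QUESTIONS  # 123 questions actually graded

# Step 1: random sampling from the bank, as described in the abstract.
sampled_ids = random.sample(range(BANK_SIZE), SAMPLE_SIZE)

# Step 2: correct-response rate = correct answers / usable questions.
def correct_response_rate(num_correct: int, total: int = USABLE) -> float:
    return 100.0 * num_correct / total

# Hypothetical counts reproducing the reported overall rates:
# 93/123 = 75.6%, 87/123 = 70.7%, 70/123 = 56.9%, 63/123 = 51.2%.
graded = {"ChatGPT-4": 93, "Bing Chat": 87, "ChatGPT-3.5": 70, "Bard": 63}
for model, num_correct in graded.items():
    print(f"{model}: {correct_response_rate(num_correct):.1f}% correct")
```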
Full text: 1
Collection: 01-internacional
Database: MEDLINE
Main subject: Artificial Intelligence / Educational Measurement / Emergency Medicine
Country/Region as subject: Asia
Language: English
Journal: Medicine (Baltimore)
Year: 2024
Document type: Article
Country of publication: United States