Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings.
Tsai, Chung-You; Hsieh, Shang-Ju; Huang, Hung-Hsiang; Deng, Juinn-Horng; Huang, Yi-You; Cheng, Pai-Yu.
Affiliation
  • Tsai CY; Divisions of Urology, Department of Surgery, Far Eastern Memorial Hospital, No.21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City, 220, Taiwan.
  • Hsieh SJ; Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan.
  • Huang HH; Divisions of Urology, Department of Surgery, Far Eastern Memorial Hospital, No.21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City, 220, Taiwan.
  • Deng JH; Divisions of Urology, Department of Surgery, Far Eastern Memorial Hospital, No.21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City, 220, Taiwan.
  • Huang YY; Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan.
  • Cheng PY; Department of Biomedical Engineering, College of Medicine and College of Engineering, National Taiwan University, Taipei, Taiwan.
World J Urol; 42(1): 250, 2024 Apr 23.
Article in En | MEDLINE | ID: mdl-38652322
ABSTRACT

PURPOSE:

To compare the performance of ChatGPT-4 and ChatGPT-3.5 on the Taiwan urology board examination (TUBE), focusing on answer accuracy, consistency of explanations, and uncertainty-management tactics to minimize score penalties from incorrect responses across 12 urology domains.

METHODS:

450 multiple-choice questions from the TUBE (2020-2022) were presented to both models. Three urologists assessed the correctness and consistency of each response. Accuracy was defined as the proportion of correct answers; consistency as the proportion of responses whose explanations were logical and coherent. A penalty-reduction experiment with prompt variations was also conducted. Univariate logistic regression was applied for subgroup comparisons.
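The sketch below (Python) illustrates the scoring metrics described above. The exact TUBE penalty rule is not stated in the abstract, so the per-wrong-answer deduction used here is an assumed placeholder, not the official value.

def accuracy(n_correct: int, n_total: int) -> float:
    """Proportion of questions answered correctly."""
    return n_correct / n_total

def consistency(n_coherent: int, n_total: int) -> float:
    """Proportion of responses whose explanation is logical and coherent."""
    return n_coherent / n_total

def penalized_score(n_correct: int, n_wrong: int, n_total: int,
                    penalty_per_wrong: float = 0.25) -> float:
    """Exam score with a deduction per incorrect answer (assumed rule,
    not the official TUBE formula)."""
    return (n_correct - penalty_per_wrong * n_wrong) / n_total

# Example: 260 correct out of 450 gives ~57.8% raw accuracy, but the
# penalized score drops once wrong answers are deducted.
print(accuracy(260, 450))              # ~0.578
print(penalized_score(260, 190, 450))  # lower than raw accuracy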

RESULTS:

ChatGPT-4 showed strengths in urology, achieving an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (33.8%, OR = 2.68, 95% CI [2.05-3.52]). Based on accuracy alone, it could have passed the TUBE written exams, but it failed on the final score because of penalties. ChatGPT-4 displayed a declining accuracy trend over time. Accuracy varied across the 12 urological domains, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% in explanations across all domains indicates reliable delivery of coherent and logical information. The simple prompt outperformed strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency toward overconfidence, which may hinder medical decision-making.
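For illustration, the reported odds ratio and confidence interval can be reproduced from a 2x2 table. The counts below (260/450 and 152/450 correct) are inferred from the reported accuracies of 57.8% and 33.8%; they are assumptions for this sketch, not figures taken verbatim from the paper.

import math

gpt4_correct, gpt4_wrong = 260, 190    # inferred from 57.8% of 450
gpt35_correct, gpt35_wrong = 152, 298  # inferred from 33.8% of 450

odds_ratio = (gpt4_correct * gpt35_wrong) / (gpt4_wrong * gpt35_correct)

# Wald 95% confidence interval on the log odds ratio
se_log_or = math.sqrt(1/gpt4_correct + 1/gpt4_wrong +
                      1/gpt35_correct + 1/gpt35_wrong)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI [{lo:.2f}-{hi:.2f}]")
# -> OR = 2.68, 95% CI [2.05-3.52], matching the reported values.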

CONCLUSIONS:

ChatGPT-4's high accuracy and consistent explanations on the urology board examination demonstrate its potential for medical information processing. However, its limitations in self-assessment and its overconfidence call for caution in its application, especially by inexperienced users. These insights call for ongoing development of urology-specific AI tools.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Urology / Educational Measurement Limits: Humans Country/Region as subject: Asia Language: En Journal: World J Urol Year of publication: 2024 Document type: Article Country of affiliation: Taiwan
