Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education.

Sabri, Hamoun; Saleh, Muhammad H A; Hazrati, Parham; Merchant, Keith; Misch, Jonathan; Kumar, Purnima S; Wang, Hom-Lay; Barootchi, Shayan

Affiliation

Sabri H; Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA.
Saleh MHA; Center for Clinical Research and Evidence Synthesis in Oral Tissue Regeneration (CRITERION), Ann Arbor, Michigan, USA.
Hazrati P; Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA.
Merchant K; Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA.
Misch J; Naval Post-Graduate Dental School, Bethesda, Maryland, USA.
Kumar PS; Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA.
Wang HL; Private Practice, Ann Arbor, Michigan, USA.
Barootchi S; Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA.

J Periodontal Res ; 2024 Jul 18.

Article in English | MEDLINE | ID: mdl-39030766

ABSTRACT

INTRODUCTION:

The rise of novel computer technologies and automated data analytics has the potential to change the course of dental education. In line with our long-term goal of harnessing the power of AI to augment didactic teaching, the objective of this study was to quantify the accuracy of responses provided by three primary large language models (LLMs), ChatGPT (GPT-4 and GPT-3.5) and Google Gemini, to the annual in-service examination questions posed by the American Academy of Periodontology (AAP), and to compare it with that of human graduate students (control group).

METHODS:

Under a comparative cross-sectional study design, a corpus of 1312 questions from the annual AAP in-service examinations administered between 2020 and 2023 was presented to the LLMs. Their responses were analyzed using chi-square tests, and their performance was compared with the scores of periodontal residents from the corresponding years, who served as the human control group. Additionally, two sub-analyses were performed: one on the performance of the LLMs on each section of the exam, and one on their performance in answering the most difficult questions.
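As an illustration of the kind of comparison described above, the following is a minimal sketch of a Pearson chi-square test on a 2x2 table of correct and incorrect answers for two responders. It is not the authors' analysis code: the function name `chi_square_2x2`, the choice to omit a continuity correction, and the example counts are all assumptions made for illustration, and the counts are not data from the study.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (1 degree of freedom, no continuity correction)
    comparing the proportion of correct answers of two responders.

    Returns (chi2_statistic, p_value).
    """
    # Observed counts: rows are responders, columns are correct / incorrect.
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [
        observed[0][0] + observed[1][0],
        observed[0][1] + observed[1][1],
    ]
    grand_total = total_a + total_b

    if min(row_totals) <= 0:
        raise ValueError("each responder must have answered at least one question")
    if min(col_totals) == 0:
        # Every answer was correct (or every answer was incorrect):
        # the two proportions are identical, so there is nothing to test.
        return 0.0, 1.0

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper-tail probability of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, NOT taken from the study: responder A answers 80 of 100
# questions correctly, responder B answers 60 of 100 correctly.
chi2, p = chi_square_2x2(80, 100, 60, 100)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

A test of this form treats the two sets of answers as independent samples. Because the models in the study answered the same questions, a paired test such as McNemar's could also be considered; the abstract does not say how the chi-square tests were set up.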

RESULTS:

ChatGPT-4 (total average 79.57%) outperformed all human control groups as well as GPT-3.5 and Google Gemini in all exam years (p < .001), with an accuracy ranging from 78.80% to 80.98% across the exam years. Gemini consistently outperformed ChatGPT-3.5, scoring 70.65% (p = .01), 73.29% (p = .02), 75.73% (p < .01), and 72.18% (p = .0008) on the exams from 2020 to 2023, compared with 62.5%, 68.24%, 69.83%, and 59.27%, respectively, for ChatGPT-3.5. Google Gemini (72.86%) surpassed the average scores achieved by first- (63.48% ± 31.67) and second-year residents (66.25% ± 31.61) when all exam years were combined. However, it could not surpass that of third-year residents (69.06% ± 30.45).

CONCLUSIONS:

Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability, while Gemini and ChatGPT-3.5 showed weaker performance. These findings underscore the potential of deploying LLMs as an educational tool in the periodontics and oral implantology domains. However, the current limitations of these models should be considered: their inability to effectively process image-based inquiries, their propensity for generating inconsistent responses to the same prompts, and their high (80% for GPT-4) but not absolute accuracy rates. An objective comparison of their capability versus their capacity is required to further develop this field of study.

Keywords

American Academy of Periodontology; ChatGPT; ChatGPT3.5; ChatGPT4; Gemini; Google Bard; Google Gemini; artificial intelligence; dental education; periodontal education

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: J Periodontal Res Publication year: 2024 Document type: Article Country of affiliation: United States
