Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics.
Mohammad-Rahimi, Hossein; Ourang, Seyed AmirHossein; Pourhoseingholi, Mohamad Amin; Dianat, Omid; Dummer, Paul Michael Howell; Nosrat, Ali.
Affiliations
  • Mohammad-Rahimi H; Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany.
  • Ourang SA; Dentofacial Deformities Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
  • Pourhoseingholi MA; Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
  • Dianat O; Division of Endodontics, Department of Advanced Oral Sciences and Therapeutics, School of Dentistry, University of Maryland, Baltimore, Maryland, USA.
  • Dummer PMH; School of Dentistry, College of Biomedical and Life Sciences, Cardiff University, Cardiff, UK.
  • Nosrat A; Private Practice, Centreville Endodontics, Centreville, Virginia, USA.
Int Endod J; 57(3): 305-314, 2024 Mar.
Article in En | MEDLINE | ID: mdl-38117284
ABSTRACT

AIM:

This study aimed to evaluate and compare the validity and reliability of responses provided by GPT-3.5, Google Bard, and Bing to frequently asked questions (FAQs) in the field of endodontics.

METHODOLOGY:

FAQs were formulated by expert endodontists (n = 10) and collected through GPT-3.5 queries (n = 10), with every question posed to each chatbot three times. Responses (N = 180) were independently evaluated by two board-certified endodontists using a modified Global Quality Score (GQS) on a 5-point Likert scale (5 strongly agree; 4 agree; 3 neutral; 2 disagree; 1 strongly disagree). Disagreements on scoring were resolved through evidence-based discussions. The validity of responses was analysed by categorizing scores as valid or invalid at two thresholds. The low threshold was set at a score of ≥4 for all three responses, whilst the high threshold was set at a score of 5 for all three responses. Fisher's exact test was conducted to compare the validity of responses between chatbots. Cronbach's alpha was calculated to assess reliability by measuring the consistency of repeated responses for each chatbot.
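As a rough illustration of the scoring logic described above (not the authors' analysis code), the Python sketch below applies the two validity thresholds and computes Cronbach's alpha across the three repeated responses. The score matrix is a hypothetical placeholder, and the helper names (is_valid, cronbach_alpha) are assumptions for illustration; only the 20-question count, the 1-5 GQS scale, and the three-repeat design come from the abstract.

```python
# Minimal sketch, assuming hypothetical GQS scores: 20 questions x 3 repeats.
import numpy as np

def is_valid(scores, threshold):
    """A question counts as valid only if all three repeated
    responses meet the threshold (>=4 low, =5 high)."""
    return all(s >= threshold for s in scores)

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_questions x n_repeats) score matrix."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                      # number of repeated responses
    item_vars = ratings.var(axis=0, ddof=1)   # variance of each repeat
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Placeholder data: a per-question baseline plus small repeat-to-repeat noise,
# clipped to the 1-5 GQS scale.
rng = np.random.default_rng(0)
base = rng.integers(3, 6, size=(20, 1))
scores = np.clip(base + rng.integers(-1, 2, size=(20, 3)), 1, 5)

low = sum(is_valid(row, 4) for row in scores) / len(scores)
high = sum(is_valid(row, 5) for row in scores) / len(scores)
print(f"valid (low threshold):  {low:.0%}")
print(f"valid (high threshold): {high:.0%}")
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```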

RESULTS:

All three chatbots provided answers to all questions. Using the low-threshold validity test (GPT-3.5 95%; Google Bard 85%; Bing 75%), there was no significant difference between the platforms (p > .05). When using the high-threshold validity test, the chatbot scores were substantially lower (GPT-3.5 60%; Google Bard 15%; Bing 15%). The validity of GPT-3.5 responses was significantly higher than that of Google Bard and Bing (p = .008). All three chatbots achieved an acceptable level of reliability (Cronbach's alpha >0.7).
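The abstract does not spell out the exact comparison behind p = .008; purely as a hedged illustration, the snippet below runs SciPy's Fisher's exact test on a 2x2 table reconstructed by assuming the high-threshold percentages apply to all 20 questions (60% -> 12/20 valid for GPT-3.5, 15% -> 3/20 for Google Bard). The counts are therefore inferred, not taken from the paper's data tables.

```python
# Rough reconstruction (not the authors' code) of a pairwise Fisher's exact
# test at the high threshold, with counts inferred from the reported
# percentages under the assumption of 20 questions per chatbot.
from scipy.stats import fisher_exact

gpt_valid, gpt_invalid = 12, 8     # GPT-3.5: 60% of 20 questions (assumed)
bard_valid, bard_invalid = 3, 17   # Google Bard: 15% of 20 questions (assumed)

table = [[gpt_valid, gpt_invalid],
         [bard_valid, bard_invalid]]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```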

CONCLUSIONS:

GPT-3.5 provided more credible information on topics related to endodontics compared to Google Bard and Bing.

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Artificial Intelligence / Endodontics Language: En Journal: Int Endod J Year of publication: 2024 Document type: Article
