Chatbots vs andrologists: Testing 25 clinical cases.
Perrot, Ophélie; Schirmann, Aurelie; Vidart, Adrien; Guillot-Tantay, Cyrille; Izard, Vincent; Lebret, Thierry; Boillot, Bernard; Mesnard, Benoit; Lebacle, Cedric; Madec, François-Xavier.
Affiliation
  • Perrot O; Foch Hospital, Urology department, Suresnes, France. Electronic address: operro43@gmail.com.
  • Schirmann A; Foch Hospital, Urology department, Suresnes, France.
  • Vidart A; Foch Hospital, Urology department, Suresnes, France.
  • Guillot-Tantay C; Foch Hospital, Urology department, Suresnes, France.
  • Izard V; Foch Hospital, Urology department, Suresnes, France.
  • Lebret T; Foch Hospital, Urology department, Suresnes, France.
  • Boillot B; Foch Hospital, Urology department, Suresnes, France.
  • Mesnard B; Foch Hospital, Urology department, Suresnes, France.
  • Lebacle C; Kremlin-Bicetre Hospital, Urology department, Kremlin-Bicetre, France.
  • Madec FX; Foch Hospital, Urology department, Suresnes, France.
Fr J Urol; 34(5): 102636, 2024 Jun.
Article in En | MEDLINE | ID: mdl-38599321
ABSTRACT

OBJECTIVE:

AI language models are booming, and their place in medicine remains undefined. The aim of our study was to compare responses to andrology clinical cases between chatbots and andrologists, in order to assess the reliability of these technologies.

MATERIAL AND METHODS:

We analyzed the responses of 32 experts, 18 residents and three chatbots (ChatGPT v3.5, ChatGPT v4 and Bard) to 25 andrology clinical cases. Each question was scored on a 0-to-2 Likert scale (0 = incorrect or no response; 1 = partially correct response; 2 = correct response), based on the latest national recommendations or, where none existed, international recommendations. We then compared the mean scores obtained across all cases by the different groups.
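
As an illustration of this design, here is a minimal sketch in Python of how per-case scores could be aggregated and two responder groups compared. It assumes per-case total scores on the 12.4-point scale reported below and a two-sided Mann-Whitney U test; the abstract does not name the statistical test actually used, and the scores generated here are synthetic placeholders rather than the study's data.

# Minimal sketch of the group comparison described in the methods.
# Assumptions (not stated in the abstract): per-question Likert scores are
# summed into per-case totals on a 12.4-point scale, and group score
# distributions are compared with a two-sided Mann-Whitney U test.
# The scores below are synthetic placeholders, not the study's data.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
n_cases = 25        # number of clinical cases in the study
max_score = 12.4    # maximum mean score reported in the abstract

# Synthetic per-case total scores for two responder groups.
experts = rng.uniform(9.0, max_score, n_cases)
chatbot = rng.uniform(7.0, max_score, n_cases)

print(f"experts: m={experts.mean():.1f}, sigma={experts.std(ddof=1):.1f}")
print(f"chatbot: m={chatbot.mean():.1f}, sigma={chatbot.std(ddof=1):.1f}")

# Two-sided test: do the two groups' score distributions differ?
stat, p = mannwhitneyu(experts, chatbot, alternative="two-sided")
print(f"Mann-Whitney U={stat:.1f}, p={p:.4f}")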

RESULTS:

Experts obtained a higher mean score (m = 11.0/12.4, σ = 1.4) than ChatGPT v4 (m = 10.7/12.4, σ = 2.2; p = 0.6475), ChatGPT v3.5 (m = 9.5/12.4, σ = 2.1; p = 0.0062) and Bard (m = 7.2/12.4, σ = 3.3; p < 0.0001). Residents obtained a mean score (m = 9.4/12.4, σ = 1.7) higher than Bard (m = 7.2/12.4, σ = 3.3; p = 0.0053) but lower than ChatGPT v3.5 (m = 9.5/12.4, σ = 2.1; p = 0.8393), ChatGPT v4 (m = 10.7/12.4, σ = 2.2; p = 0.0183) and the experts (m = 11.0/12.4, σ = 1.4; p = 0.0009). ChatGPT v4 (m = 10.7, σ = 2.2) outperformed ChatGPT v3.5 (m = 9.5, σ = 2.1; p = 0.0476) and Bard (m = 7.2, σ = 3.3; p < 0.0001).

CONCLUSION:

Chatbots could have a relevant place in medicine, but further studies are needed before they can be integrated into clinical practice.

Full text: 1 Collection: 01-internacional Database: MEDLINE Main subject: Artificial Intelligence / Andrology Limits: Female / Humans / Male Language: En Journal: Fr J Urol Year: 2024 Document type: Article
