Decoding the NCCN Guidelines With AI: A Comparative Evaluation of ChatGPT-4.0 and Llama 2 in the Management of Thyroid Carcinoma.

Pandya, Shivam; Bresler, Tamir E; Wilson, Tyler; Htway, Zin; Fujita, Manabu

Affiliation

Pandya S; Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA.
Bresler TE; Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA.
Wilson T; Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA.
Htway Z; Department of Laboratory, Los Robles Regional Medical Center, Thousand Oaks, CA, USA.
Fujita M; Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA.

Am Surg; 31348241269430, 2024 Aug 13.

Article in English | MEDLINE | ID: mdl-39136578

ABSTRACT

INTRODUCTION:

Artificial intelligence (AI) has emerged as a promising tool in the delivery of health care. ChatGPT-4.0 (OpenAI, San Francisco, CA) and Llama 2 (Meta, Menlo Park, CA) have each gained attention for their use in various medical applications.

OBJECTIVE:

This study aims to evaluate and compare the effectiveness of ChatGPT-4.0 and Llama 2 in assisting with complex clinical decision making in the diagnosis and treatment of thyroid carcinoma.

PARTICIPANTS:

We reviewed the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for the management of thyroid carcinoma and formulated up to 3 complex clinical questions for each decision-making page. ChatGPT-4.0 and Llama 2 were queried in a reproducible manner. The answers were scored on a 5-point Likert scale: 5) correct; 4) correct, with missing information requiring clarification; 3) correct, but unable to complete answer; 2) partially incorrect; 1) absolutely incorrect. Score frequencies were compared, and subgroup analyses were conducted on Correctness (scores 1-2 vs 3-5) and Accuracy (scores 1-3 vs 4-5).
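The two subgroup cut-points can be illustrated with a minimal Python sketch. The function names and the example scores are hypothetical and are not study data. The sketch also assumes that the upper band of each split (3-5 for Correctness, 4-5 for Accuracy) is the favorable one, which the rubric suggests but the abstract does not state outright.

```python
from collections import Counter

def is_correct(score: int) -> bool:
    """Correctness split: scores 3-5 vs 1-2 (upper band assumed favorable)."""
    return score >= 3

def is_accurate(score: int) -> bool:
    """Accuracy split: scores 4-5 vs 1-3 (upper band assumed favorable)."""
    return score >= 4

def summarize(scores):
    """Return score frequencies plus counts for both dichotomized outcomes."""
    for s in scores:
        if s not in (1, 2, 3, 4, 5):
            raise ValueError(f"score outside the 1-5 rubric: {s}")
    return {
        "frequencies": dict(Counter(scores)),
        "n_correct": sum(is_correct(s) for s in scores),
        "n_accurate": sum(is_accurate(s) for s in scores),
    }

# Hypothetical scores for illustration only (not taken from the study).
example_scores = [5, 4, 3, 2, 1, 5]
print(summarize(example_scores))
```

A score of 3 is the only value that the two splits treat differently: it counts toward Correctness but not toward Accuracy.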

RESULTS:

In total, 58 pages of the NCCN Guidelines® were analyzed, generating 167 unique questions. There was no statistically significant difference between ChatGPT-4.0 and Llama 2 in terms of overall score (Mann-Whitney U test; mean rank = 160.53 vs 174.47, P = 0.123), Correctness (P = 0.177), or Accuracy (P = 0.891). [Formula: see text].
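As a reading aid, the reported mean ranks can be checked for internal consistency. In a Mann-Whitney comparison the pooled ranks 1..N always sum to N(N+1)/2, so the group sizes multiplied by their mean ranks must reproduce that total. The sketch below assumes each model answered all 167 questions (N = 334), which the abstract implies but does not state; the function names are hypothetical.

```python
def pooled_rank_sum(n_total: int) -> float:
    """Sum of the ranks 1..n_total in a pooled ranking."""
    return n_total * (n_total + 1) / 2

def mean_ranks_consistent(mean_rank_a, mean_rank_b, n_a, n_b, rounding=0.005):
    """Check that two reported mean ranks reproduce the pooled rank sum.

    `rounding` is the maximum rounding error of a mean rank reported to
    two decimals; it is scaled by the sample size to set the tolerance.
    """
    implied = n_a * mean_rank_a + n_b * mean_rank_b
    expected = pooled_rank_sum(n_a + n_b)
    tolerance = rounding * (n_a + n_b)
    return abs(implied - expected) <= tolerance

# Assumption: 167 scored answers per model, as implied by the abstract.
N_PER_MODEL = 167
print(mean_ranks_consistent(160.53, 174.47, N_PER_MODEL, N_PER_MODEL))
```

With equal group sizes the check reduces to the two mean ranks averaging (N + 1)/2 = 167.5, which matches (160.53 + 174.47)/2.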

CONCLUSION:

ChatGPT-4.0 and Llama 2 demonstrate a limited but substantial capacity to assist with complex clinical decision making relating to the management of thyroid carcinoma, with no significant difference in their effectiveness.

Keywords

endocrine; resident education; surgical oncology; thyroid

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: Am Surg Publication year: 2024 Document type: Article Affiliation country: United States Publication country: United States
