Validation of GPT-4 for clinical event classification: A comparative analysis with ICD codes and human reviewers.

Wang, Yichen; Huang, Yuting; Nimma, Induja R; Pang, Songhan; Pang, Maoyin; Cui, Tao; Kumbhari, Vivek

Affiliations

Wang Y; Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
Huang Y; Division of Gastroenterology and Hepatology, Mayo Clinic, Jacksonville, Florida, USA.
Nimma IR; Department of Medicine, Mayo Clinic, Jacksonville, Florida, USA.
Pang S; College of Arts and Sciences, University of Virginia, Charlottesville, Virginia, USA.
Pang M; Division of Gastroenterology and Hepatology, Mayo Clinic, Jacksonville, Florida, USA.
Cui T; Research Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, Florida, USA.
Kumbhari V; Division of Gastroenterology and Hepatology, Mayo Clinic, Jacksonville, Florida, USA.

J Gastroenterol Hepatol ; 39(8): 1535-1543, 2024 Aug.

Article in English | MEDLINE | ID: mdl-38627920

ABSTRACT

BACKGROUND AND AIM:

Effective clinical event classification is essential for clinical research and quality improvement. The validation of artificial intelligence (AI) models such as Generative Pre-trained Transformer 4 (GPT-4) for this task, and their comparison with conventional methods, remain unexplored.

METHODS:

We evaluated the performance of the GPT-4 model for classifying gastrointestinal (GI) bleeding episodes from 200 medical discharge summaries and compared the results with human review and an International Classification of Diseases (ICD) code-based system. The analysis included accuracy, sensitivity, and specificity evaluation, using ground truth determined by physician reviewers.
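As an illustration of the three metrics named above, the following is a minimal sketch of how accuracy, sensitivity, and specificity can be computed against a physician-determined ground truth. It is not the authors' code: the function names and the toy labels are hypothetical, and the binary encoding (1 = GI bleeding present, 0 = absent) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary classifier against ground truth (hypothetical helper)."""
    tp: int  # bleeding present, classifier says present
    fp: int  # bleeding absent, classifier says present
    tn: int  # bleeding absent, classifier says absent
    fn: int  # bleeding present, classifier says absent


def confusion(truth, predicted):
    """Tally a 2x2 confusion matrix from two equal-length lists of 0/1 labels."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return Confusion(tp, fp, tn, fn)


def accuracy(c):
    """Share of all cases classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c):
    """Share of true bleeding cases that were detected."""
    if c.tp + c.fn == 0:
        raise ValueError("no positive cases in ground truth")
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """Share of non-bleeding cases correctly ruled out."""
    if c.tn + c.fp == 0:
        raise ValueError("no negative cases in ground truth")
    return c.tn / (c.tn + c.fp)


# Toy labels for illustration only; these are NOT data from the study.
toy_truth = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
toy = confusion(toy_truth, toy_pred)
print(accuracy(toy), sensitivity(toy), specificity(toy))
```

The same functions can be applied to each classifier in turn (GPT-4, the ICD code-based system, and each human reviewer), with the physician-determined labels as `truth`.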

RESULTS:

GPT-4 exhibited an accuracy of 94.4% in identifying GI bleeding occurrences, outperforming ICD codes (accuracy 63.5%, P < 0.001). GPT-4's accuracy was slightly lower than that of Reviewer 1 (98.5%, P < 0.001) and statistically similar to that of Reviewer 2 (90.8%, P = 0.170). For location classification, GPT-4 achieved accuracies of 81.7% and 83.5% for confirmed and probable GI bleeding locations, respectively, figures that were either slightly lower than or comparable with those of human reviewers. GPT-4 was highly efficient, analyzing the dataset in 12.7 min at a cost of 21.2 USD, whereas human reviewers required 8-9 h each.
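To put the efficiency figures on a per-summary basis, the arithmetic can be sketched as follows. The totals come from the abstract (200 summaries, 12.7 min, 21.2 USD, 8-9 h per reviewer); the assumption that time and cost are spread evenly across summaries is ours, and the variable names are hypothetical.

```python
# Totals reported in the abstract.
N_SUMMARIES = 200
GPT4_MINUTES = 12.7
GPT4_COST_USD = 21.2
HUMAN_HOURS = (8, 9)  # reported range per human reviewer


def per_summary(total, n=N_SUMMARIES):
    """Average amount per summary, assuming an even spread (an assumption)."""
    return total / n


gpt4_seconds_each = per_summary(GPT4_MINUTES * 60)  # about 3.8 s
gpt4_usd_each = per_summary(GPT4_COST_USD)          # about 0.11 USD
human_minutes_each = tuple(per_summary(h * 60) for h in HUMAN_HOURS)  # about 2.4 to 2.7 min

# Ratio of human reviewing time to GPT-4 processing time, low and high end.
speedup = tuple(h * 60 / GPT4_MINUTES for h in HUMAN_HOURS)  # roughly 38x to 43x

print(gpt4_seconds_each, gpt4_usd_each, human_minutes_each, speedup)
```

On these averages, GPT-4 spent a few seconds and roughly a tenth of a dollar per summary, against a few minutes of reviewer time.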

CONCLUSION:

Our study indicates GPT-4 offers a reliable, cost-efficient, and faster alternative to current clinical event classification methods, outperforming the conventional ICD coding system and performing comparably to individual expert human reviewers. Its implementation could facilitate more accurate and granular clinical research and quality audits. Future research should explore scalability, prompt and model tuning, and ethical implications of high-performance AI models in clinical data processing.

Subjects

Artificial Intelligence; Gastrointestinal Hemorrhage; International Classification of Diseases; Humans; Gastrointestinal Hemorrhage/classification; Gastrointestinal Hemorrhage/etiology; Sensitivity and Specificity

Keywords

Artificial intelligence; Disease classification; GPT4; Medical records systems, computerized; Natural language processing


Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Artificial Intelligence / International Classification of Diseases / Gastrointestinal Hemorrhage Limits: Humans Language: En Journal: J Gastroenterol Hepatol Journal subject: GASTROENTEROLOGY Publication year: 2024 Document type: Article Country of affiliation: United States
