Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer.

Bhayana, Rajesh; Nanda, Bipin; Dehkharghanian, Taher; Deng, Yangqing; Bhambra, Nishaant; Elias, Gavin; Datta, Daksh; Kambadakone, Avinash; Shwaartz, Chaya G; Moulton, Carol-Anne; Henault, David; Gallinger, Steven; Krishna, Satheesh.

Radiology; 311(3): e233117, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38888478

ABSTRACT

Background: Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making over free-text reports, but radiologist adoption is variable. Resectability criteria are applied inconsistently.

Purpose: To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore performance in categorizing tumor resectability.

Materials and Methods: In this institutional review board-approved retrospective study, 180 consecutive PDAC staging CT reports on patients referred to the authors' European Society for Medical Oncology-designated cancer center from January to December 2018 were included. Reports were reviewed by two radiologists to establish the reference standard for 14 key findings and National Comprehensive Cancer Network (NCCN) resectability category. GPT-3.5 and GPT-4 (accessed September 18-29, 2023) were prompted to create synoptic reports from original reports with the same 14 features, and their performance was evaluated (recall, precision, F1 score). To categorize resectability, three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used for both LLMs. Hepatopancreaticobiliary surgeons reviewed original and artificial intelligence (AI)-generated reports to determine resectability, with accuracy and review time compared. The McNemar test, t test, Wilcoxon signed-rank test, and mixed effects logistic regression models were used where appropriate.

Results: GPT-4 outperformed GPT-3.5 in the creation of synoptic reports (F1 score: 0.997 vs 0.967, respectively). Compared with GPT-3.5, GPT-4 achieved equal or higher F1 scores for all 14 extracted features. GPT-4 had higher precision than GPT-3.5 for extracting superior mesenteric artery involvement (100% vs 88.8%, respectively). For categorizing resectability, GPT-4 outperformed GPT-3.5 for each prompting strategy. For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%, respectively; P = .002), which outperformed the default knowledge strategy (83% vs 67%, P < .001). Surgeons were more accurate in categorizing resectability using AI-generated reports than original reports (83% vs 76%, respectively; P = .03), while spending less time on each report (58%; 95% CI: 0.53, 0.62).

Conclusion: GPT-4 created near-perfect PDAC synoptic reports from original reports. GPT-4 with chain-of-thought prompting achieved high accuracy in categorizing resectability. Surgeons were more accurate and efficient using AI-generated reports.

© RSNA, 2024

Supplemental material is available for this article. See also the editorial by Chang in this issue.
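The abstract names its evaluation metrics (recall, precision, F1 score) and the McNemar test for paired accuracy comparisons but does not publish code. The following is a minimal sketch, assuming per-feature counts of true positives, false positives, and false negatives against the radiologists' reference standard, and assuming the exact (binomial) variant of McNemar's test. The abstract does not state which McNemar variant was used or how scores were aggregated across the 14 features, and the function names here are hypothetical.

```python
from math import comb


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one extracted report feature.

    Hypothetical inputs (not from the study):
      tp: features the model extracted that match the reference standard
      fp: features the model extracted that the reference does not support
      fn: reference features the model failed to extract
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes.

    correct_a[i] and correct_b[i] say whether method A and method B
    (e.g., two prompting strategies) categorized case i correctly.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("outcomes must be paired case by case")
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, b ~ Binomial(n, 0.5); double the smaller tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

The McNemar test fits here because both prompting strategies (or both report types) are applied to the same cases, so the outcomes are paired. Concordant cases, where both methods are right or both are wrong, do not affect the statistic.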

Subject(s)

Carcinoma, Pancreatic Ductal; Pancreatic Neoplasms; Humans; Pancreatic Neoplasms/surgery; Pancreatic Neoplasms/diagnostic imaging; Pancreatic Neoplasms/pathology; Retrospective Studies; Carcinoma, Pancreatic Ductal/surgery; Carcinoma, Pancreatic Ductal/diagnostic imaging; Carcinoma, Pancreatic Ductal/pathology; Female; Male; Aged; Middle Aged; Tomography, X-Ray Computed/methods; Natural Language Processing; Artificial Intelligence; Aged, 80 and over