Results 1 - 4 of 4
1.
Cureus ; 16(5): e60009, 2024 May.
Article in English | MEDLINE | ID: mdl-38854352

ABSTRACT

BACKGROUND: Recent studies have highlighted the diagnostic performance of ChatGPT 3.5 and GPT-4 in a text-based format, demonstrating their radiological knowledge across different areas. Our objective was to investigate the impact of prompt engineering on the diagnostic performance of ChatGPT 3.5 and GPT-4 in diagnosing thoracic radiology cases, highlighting how prompt complexity influences model performance. METHODOLOGY: We conducted a retrospective cross-sectional study using 124 publicly available Case of the Month examples from the Thoracic Society of Radiology website. We initially input the cases into both ChatGPT versions without prompting. We then employed five different prompts, ranging from basic task-oriented to complex role-specific formulations, to measure the diagnostic accuracy of each ChatGPT version. The differential diagnosis lists generated by the models were compared against the radiological diagnoses listed on the Thoracic Society of Radiology website, with a scoring system used to assess accuracy comprehensively. Diagnostic accuracy and differential diagnosis scores were analyzed using the McNemar, chi-square, Kruskal-Wallis, and Mann-Whitney U tests. RESULTS: Without any prompt, ChatGPT 3.5's accuracy was 25% (31/124), which increased to 56.5% (70/124) with the most complex prompt (P < 0.001). GPT-4 showed a high baseline accuracy of 53.2% (66/124) without prompting, which increased to 59.7% (74/124) with complex prompts (P = 0.09). Notably, there was no statistically significant difference in peak performance between ChatGPT 3.5 (70/124) and GPT-4 (74/124) (P = 0.55). CONCLUSIONS: This study emphasizes the critical influence of prompt engineering on enhancing the diagnostic performance of ChatGPT versions, especially ChatGPT 3.5.
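As a rough illustration of the paired comparison described above, a minimal Python sketch of a McNemar test on per-case outcomes (correct vs. incorrect, without prompting vs. with the most complex prompt) follows. The individual 2x2 cell counts are hypothetical and chosen only so that the margins match the reported 31/124 and 70/124 accuracies for ChatGPT 3.5; they are not the authors' data.

```python
# Minimal sketch of a McNemar test on paired per-case outcomes.
# NOTE: the concordant/discordant cell counts below are hypothetical; only the
# margins (31/124 correct without a prompt, 70/124 correct with the most
# complex prompt) come from the abstract.
from statsmodels.stats.contingency_tables import mcnemar

# Rows: outcome without a prompt; columns: outcome with the complex prompt.
#                [correct, incorrect] with the complex prompt
table = [
    [28, 3],    # correct without a prompt   (row sum 31)
    [42, 51],   # incorrect without a prompt (row sum 93)
]
# Column sums: 70 correct, 54 incorrect with the complex prompt (total 124).

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"McNemar statistic = {result.statistic}, p = {result.pvalue:.4f}")
```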

2.
Clin Neuroradiol ; 2024 May 28.
Article in English | MEDLINE | ID: mdl-38806794

ABSTRACT

PURPOSE: To compare the diagnostic performance of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V)-based ChatGPT, and radiologists in challenging neuroradiology cases. METHODS: We collected 32 consecutive "Freiburg Neuropathology Case Conference" cases from the journal Clinical Neuroradiology, published between March 2016 and December 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT, and the medical history and images into GPT-4V-based ChatGPT; each model then generated a diagnosis for every case. Six radiologists (three radiology residents and three board-certified radiologists) independently reviewed all cases and provided diagnoses. The diagnostic accuracy rates of ChatGPT and the radiologists were evaluated against the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists. RESULTS: GPT-4-based and GPT-4V-based ChatGPT achieved accuracy rates of 22% (7/32) and 16% (5/32), respectively. The three radiology residents achieved accuracy rates of 28% (9/32), 31% (10/32), and 28% (9/32), and the three board-certified radiologists achieved 38% (12/32), 47% (15/32), and 44% (14/32). GPT-4-based ChatGPT's diagnostic accuracy was lower than that of each radiologist, although not significantly (all p > 0.07). GPT-4V-based ChatGPT's diagnostic accuracy was also lower than that of each radiologist, and significantly lower than that of two board-certified radiologists (p = 0.02 and 0.03); the differences were not significant for the radiology residents and the remaining board-certified radiologist (all p > 0.09). CONCLUSION: While GPT-4-based ChatGPT demonstrated relatively higher diagnostic performance than GPT-4V-based ChatGPT, neither model reached the performance level of the radiology residents or the board-certified radiologists in challenging neuroradiology cases.
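For the group comparison above, a minimal sketch of a chi-square test on one example pairing is given below, using counts reported in the abstract (GPT-4-based ChatGPT, 7/32 correct, versus the best-performing board-certified radiologist, 15/32 correct). This illustrates the kind of test the abstract names; it is not the authors' analysis code, and the pairing shown is only one of the comparisons the study reports.

```python
# Minimal sketch: chi-square test comparing two accuracy rates from the abstract.
# Counts: GPT-4-based ChatGPT 7/32 correct vs. the best-performing
# board-certified radiologist 15/32 correct (other pairings are analogous).
from scipy.stats import chi2_contingency

#                 correct  incorrect
contingency = [
    [7, 25],    # GPT-4-based ChatGPT
    [15, 17],   # board-certified radiologist
]
chi2, p, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```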

3.
Neuroradiology ; 66(1): 73-79, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37994939

ABSTRACT

PURPOSE: The noteworthy performance of Chat Generative Pre-trained Transformer (ChatGPT), an artificial intelligence text-generation model based on the GPT-4 architecture, has been demonstrated in various fields; however, its potential applications in neuroradiology remain unexplored. This study aimed to evaluate the diagnostic performance of GPT-4-based ChatGPT in neuroradiology. METHODS: We collected 100 consecutive "Case of the Week" cases from the American Journal of Neuroradiology, published between October 2021 and September 2023. ChatGPT generated a diagnosis for each case from the patient's medical history and imaging findings, and the diagnostic accuracy rate was determined against the published ground truth. Each case was categorized by anatomical location (brain, spine, and head & neck), and brain cases were further divided into central nervous system (CNS) tumor and non-CNS tumor groups. Fisher's exact test was used to compare the accuracy rates among the three anatomical locations, as well as between the CNS tumor and non-CNS tumor groups. RESULTS: ChatGPT achieved a diagnostic accuracy rate of 50% (50/100 cases). There were no significant differences among the accuracy rates of the three anatomical locations (p = 0.89). Among the brain cases, the accuracy rate was significantly lower for the CNS tumor group than for the non-CNS tumor group (16% [3/19] vs. 62% [36/58], p < 0.001). CONCLUSION: This study demonstrated the diagnostic performance of ChatGPT in neuroradiology. ChatGPT's diagnostic accuracy varied with disease etiology and was significantly lower for CNS tumors than for non-CNS tumors.


Subject(s)
Artificial Intelligence, Neoplasms, Humans, Head, Brain, Neck
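The CNS tumor vs. non-CNS tumor comparison reported in entry 3 can be illustrated with Fisher's exact test; the sketch below uses the counts given in that abstract (3/19 correct for CNS tumor cases, 36/58 for non-CNS tumor cases) and is not the authors' code.

```python
# Minimal sketch: Fisher's exact test on the accuracy counts from the abstract.
from scipy.stats import fisher_exact

#            correct  incorrect
table = [
    [3, 16],    # CNS tumor cases      (3/19 correct)
    [36, 22],   # non-CNS tumor cases  (36/58 correct)
]
odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.4g}")
```
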
4.
Cureus ; 15(10): e46548, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37933339

ABSTRACT

This conversation with ChatGPT explores the use of lithium in pregnancy for bipolar disorder, a topic of significant importance in psychiatry. Bipolar disorder is characterized by extreme mood swings, and its prevalence varies globally. ChatGPT provides valuable information on bipolar disorder, its prevalence, age of onset, and gender differences. It also discusses the use of lithium during pregnancy, emphasizing the need for individualized decisions, close monitoring, and potential risks and benefits. However, it is essential to note that ChatGPT's responses lack specific references, raising concerns about the reliability of the information provided. Further research is needed to quantify the correctness and dependability of ChatGPT-generated answers in the healthcare context.
