Results 1 - 3 of 3
1.
Prostate ; 84(9): 807-813, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38558009

ABSTRACT

BACKGROUND: Benign prostatic hyperplasia (BPH) is a common condition, yet it is challenging for the average BPH patient to find credible and accurate information about it. Our goal was to evaluate and compare the accuracy and reproducibility of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and the New Bing Chat, in responding to a BPH frequently asked questions (FAQs) questionnaire. METHODS: A total of 45 questions related to BPH were categorized into basic and professional knowledge. Three LLMs, ChatGPT-3.5, ChatGPT-4, and New Bing Chat, were used to generate responses to these questions. Responses were graded as comprehensive, correct but inadequate, mixed with incorrect/outdated data, or completely incorrect. Reproducibility was assessed by generating two responses for each question. All responses were reviewed and judged by experienced urologists. RESULTS: All three LLMs exhibited high accuracy in generating responses, with accuracy rates ranging from 86.7% to 100%, and there was no statistically significant difference in accuracy among the three (p > 0.017 for all pairwise comparisons). Additionally, the accuracy of the LLMs' responses to the basic knowledge questions was roughly equivalent to that for the professional knowledge questions, with a difference of less than 3.5% (GPT-3.5: 90% vs. 86.7%; GPT-4: 96.7% vs. 95.6%; New Bing: 96.7% vs. 93.3%). Furthermore, all three LLMs demonstrated high reproducibility, with rates ranging from 93.3% to 97.8%. CONCLUSIONS: ChatGPT-3.5, ChatGPT-4, and New Bing Chat offer accurate and reproducible responses to BPH-related questions, establishing them as valuable resources for enhancing health literacy and supporting BPH patients in conjunction with healthcare professionals.
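The reproducibility check described above (two graded responses per question, judged for agreement) can be sketched as follows. The grading labels come from the abstract; the function name and the sample data are illustrative, not the study's actual ratings:

```python
# Each question is graded twice; a pair counts as reproducible when the two
# gradings agree. Grading labels are taken from the abstract.
GRADES = ("comprehensive", "correct but inadequate",
          "mixed with incorrect/outdated data", "completely incorrect")

def reproducibility_rate(pairs):
    """Fraction of questions whose two graded responses received the same grade."""
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)

# Illustrative data: two of three question pairs agree.
example = [
    ("comprehensive", "comprehensive"),
    ("comprehensive", "completely incorrect"),
    ("correct but inadequate", "correct but inadequate"),
]
print(reproducibility_rate(example))
```

A stricter variant could require both responses to be in an "accurate" category (comprehensive or correct but inadequate) rather than merely identical; the abstract does not specify which convention was used.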


Subjects
Prostatic Hyperplasia, Humans, Prostatic Hyperplasia/diagnosis, Male, Reproducibility of Results, Surveys and Questionnaires, Language, Patient Education as Topic/methods
2.
Eur J Radiol ; 175: 111458, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38613868

ABSTRACT

PURPOSE: The importance of structured radiology reports has been fully recognized, as they facilitate efficient data extraction and promote collaboration among healthcare professionals. Our purpose was to assess the accuracy and reproducibility of ChatGPT, a large language model, in generating structured thyroid ultrasound reports. METHODS: This retrospective study included 184 nodules in 136 thyroid ultrasound reports from 136 patients. ChatGPT-3.5 and ChatGPT-4.0 were used to structure the reports based on ACR TI-RADS guidelines. Two radiologists evaluated the responses for quality, nodule categorization accuracy, and management recommendations. Each text was submitted twice to assess the consistency of the nodule classification and management recommendations. RESULTS: On the 136 ultrasound reports from 136 patients (mean age, 52 years ± 12 [SD]; 61 male), ChatGPT-3.5 generated 202 satisfactory structured reports, while ChatGPT-4.0 produced only 69 (74.3% vs. 25.4%; odds ratio (OR) = 8.490, 95% CI: 5.775-12.481, p < 0.001). ChatGPT-4.0 outperformed ChatGPT-3.5 in categorizing thyroid nodules, with an accuracy of 69.3% compared to 34.5% (OR = 4.282, 95% CI: 3.145-5.831, p < 0.001). ChatGPT-4.0 also provided more comprehensive or correct management recommendations than ChatGPT-3.5 (OR = 1.791, 95% CI: 1.297-2.473, p < 0.001). Finally, ChatGPT-4.0 exhibited higher consistency in categorizing nodules than ChatGPT-3.5 (ICC = 0.732 vs. ICC = 0.429), and both exhibited moderate consistency in management recommendations (ICC = 0.549 vs. ICC = 0.575). CONCLUSIONS: Our study demonstrates the potential of ChatGPT to transform free-text thyroid ultrasound reports into structured formats. ChatGPT-3.5 excels at generating structured reports, while ChatGPT-4.0 shows superior accuracy in nodule categorization and management recommendations.
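The odds ratio and 95% CI reported for satisfactory reports follow from the stated counts: each of the 136 reports was submitted twice (272 submissions per model), giving 202 satisfactory vs. 70 unsatisfactory for ChatGPT-3.5 and 69 vs. 203 for ChatGPT-4.0. A minimal sketch of the standard Wald calculation on the log-odds scale, which reproduces the abstract's figures:

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) for a 2x2 table with a Wald 95% CI on the log scale."""
    or_ = (a * d) / (b * c)
    se = sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# Satisfactory/unsatisfactory counts over 272 submissions per model:
# 202 vs. 70 for ChatGPT-3.5, 69 vs. 203 for ChatGPT-4.0.
or_, lo, hi = odds_ratio_ci(202, 70, 69, 203)
print(round(or_, 3), round(lo, 3), round(hi, 3))  # → 8.49 5.775 12.481
```

The match with the reported OR = 8.490 (95% CI: 5.775-12.481) confirms the two-submission denominator of 272.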


Subjects
Radiology Information Systems, Thyroid Nodule, Ultrasonography, Humans, Middle Aged, Male, Female, Ultrasonography/methods, Thyroid Nodule/diagnostic imaging, Reproducibility of Results, Retrospective Studies, Natural Language Processing, Thyroid Gland/diagnostic imaging, Adult
3.
Endocrine ; 2024 Jul 30.
Article in English | MEDLINE | ID: mdl-39080210

ABSTRACT

BACKGROUND: Limited data exist on the performance of large language models (LLMs) taking on the role of doctors. We aimed to investigate the potential of ChatGPT-3.5 and New Bing Chat to act as doctors, using thyroid nodules as an example. METHODS: A total of 145 patients with thyroid nodules were included for generating questions. Each question was entered into the ChatGPT-3.5 and New Bing Chat chatbots five times, yielding five responses from each. These responses were compared with answers given by five junior doctors; responses from five senior doctors were regarded as the gold standard. The accuracy and reproducibility of responses from ChatGPT-3.5 and New Bing Chat were evaluated. RESULTS: The accuracy of ChatGPT-3.5 and New Bing Chat in answering Q2, Q3, and Q5 was lower than that of the junior doctors (all P < 0.05), while both LLMs were comparable to the junior doctors on Q4 and Q6. In terms of "high reproducibility and accuracy", ChatGPT-3.5 outperformed New Bing Chat on Q1 and Q5 (P < 0.001 and P = 0.008, respectively), with no significant difference on Q2, Q3, Q4, and Q6 (P > 0.05 for all). New Bing Chat achieved higher accuracy than ChatGPT-3.5 in decision-making for thyroid nodules (72.41% vs. 58.62%, P = 0.003), and both were less accurate than the junior doctors (89.66%; P < 0.001 for both). CONCLUSIONS: This exploration of ChatGPT-3.5 and New Bing Chat in the diagnosis and management of thyroid nodules illustrates that LLMs currently show potential for medical applications but do not yet reach the clinical decision-making capacity of doctors.
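The evaluation loop described above (five responses per question, each scored against the senior doctors' gold-standard answer) can be sketched as below. The function names, the equality-based judge, and the sample data are assumptions for illustration; the study's actual scoring was done by clinicians:

```python
# Each question is sent to a chatbot several times; every response is scored
# against the gold-standard answer, and accuracy is the fraction judged correct.
def accuracy(responses_per_question, gold_answers, judge):
    """Fraction of all responses judged correct against the gold standard."""
    correct = total = 0
    for responses, gold in zip(responses_per_question, gold_answers):
        for response in responses:
            correct += judge(response, gold)
            total += 1
    return correct / total

# Hypothetical judge: exact agreement with the gold-standard answer.
def judge(response, gold):
    return response == gold

# Illustrative data: five responses for each of two questions.
acc = accuracy([["A", "A", "B", "A", "A"], ["C", "C", "C", "C", "C"]],
               ["A", "C"], judge)
print(acc)  # → 0.9
```

Passing a different `judge` (e.g. a clinician's rating function) leaves the tallying logic unchanged, which is why the scoring rule is kept as a parameter here.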
