Search | VHL Regional Portal

1.

Diagnosing Glaucoma Based on the Ocular Hypertension Treatment Study Dataset Using Chat Generative Pre-Trained Transformer as a Large Language Model.

Raja, Hina; Huang, Xiaoqin; Delsoz, Mohammad; Madadi, Yeganeh; Poursoroush, Asma; Munawar, Asim; Kahook, Malik Y; Yousefi, Siamak.

Ophthalmol Sci ; 5(1): 100599, 2025.

Article in English | MEDLINE | ID: mdl-39346574

ABSTRACT

Purpose: To evaluate the capabilities of Chat Generative Pre-Trained Transformer (ChatGPT), as a large language model (LLM), for diagnosing glaucoma using the Ocular Hypertension Treatment Study (OHTS) dataset, and comparing the diagnostic capability of ChatGPT 3.5 and ChatGPT 4.0. Design: Prospective data collection study. Participants: A total of 3170 eyes of 1585 subjects from the OHTS were included in this study. Methods: We selected demographic, clinical, ocular, visual field, optic nerve head photo, and history of disease parameters of each participant and developed case reports by converting tabular data into textual format based on information from both eyes of all subjects. We then developed a procedure using the application programming interface of ChatGPT, a LLM-based chatbot, to automatically input prompts into a chat box. This was followed by querying 2 different generations of ChatGPT (versions 3.5 and 4.0) regarding the underlying diagnosis of each subject. We then evaluated the output responses based on several objective metrics. Main Outcome Measures: Area under the receiver operating characteristic curve (AUC), accuracy, specificity, sensitivity, and F1 score. Results: Chat Generative Pre-Trained Transformer 3.5 achieved AUC of 0.74, accuracy of 66%, specificity of 64%, sensitivity of 85%, and F1 score of 0.72. Chat Generative Pre-Trained Transformer 4.0 obtained AUC of 0.76, accuracy of 87%, specificity of 90%, sensitivity of 61%, and F1 score of 0.92. Conclusions: The accuracy of ChatGPT 4.0 in diagnosing glaucoma based on input data from OHTS was promising. The overall accuracy of ChatGPT 4.0 was higher than ChatGPT 3.5. However, ChatGPT 3.5 was found to be more sensitive than ChatGPT 4.0. In its current forms, ChatGPT may serve as a useful tool in exploring disease status of ocular hypertensive eyes when specific data are available for analysis. In the future, leveraging LLMs with multimodal capabilities, allowing for integration of imaging and diagnostic testing as part of the analyses, could further enhance diagnostic capabilities and enhance diagnostic accuracy. Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

2.

Analysis of ChatGPT Responses to Ophthalmic Cases: Can ChatGPT Think like an Ophthalmologist?

Chen, Jimmy S; Reddy, Akshay J; Al-Sharif, Eman; Shoji, Marissa K; Kalaw, Fritz Gerald P; Eslani, Medi; Lang, Paul Z; Arya, Malvika; Koretz, Zachary A; Bolo, Kyle A; Arnett, Justin J; Roginiel, Aliya C; Do, Jiun L; Robbins, Shira L; Camp, Andrew S; Scott, Nathan L; Rudell, Jolene C; Weinreb, Robert N; Baxter, Sally L; Granet, David B.

Ophthalmol Sci ; 5(1): 100600, 2025.

Article in English | MEDLINE | ID: mdl-39346575

ABSTRACT

Objective: Large language models such as ChatGPT have demonstrated significant potential in question-answering within ophthalmology, but there is a paucity of literature evaluating its ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of assessment and plans generated by ChatGPT and (2) evaluate ophthalmologists' abilities to distinguish between responses generated by clinicians versus ChatGPT. Design: Cross-sectional mixed-methods study. Subjects: Sixteen ophthalmologists from a single academic center, of which 10 were board-eligible and 6 were board-certified, were recruited to participate in this study. Methods: Prompt engineering was used to ensure ChatGPT output discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases where ChatGPT accurately identified the primary diagnoses were included and then paired. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists to identify the author of the discussions. Response confidence was assessed using a 5-point Likert scale score, and subjective feedback was manually reviewed. Main Outcome Measures: Accuracy of ophthalmologist identification of discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions. Results: Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison due to hallucinations or fabrications of nonuser-provided data. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to have more generic responses, irrelevant information, hallucinated more frequently, and had distinct syntactic patterns (all P < 0.01). Conclusions: Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment. Financial Disclosures: The author(s) have no proprietary or commercial interest in any materials discussed in this article.

3.

ChatGPT-Assisted Classification of Postoperative Bleeding Following Microinvasive Glaucoma Surgery Using Electronic Health Record Data.

Shaheen, Abdulla; Afflitto, Gabriele Gallo; Swaminathan, Swarup S.

Ophthalmol Sci ; 5(1): 100602, 2025.

Article in English | MEDLINE | ID: mdl-39380881

ABSTRACT

Purpose: To evaluate the performance of a large language model (LLM) in classifying electronic health record (EHR) text, and to use this classification to evaluate the type and resolution of hemorrhagic events (HEs) after microinvasive glaucoma surgery (MIGS). Design: Retrospective cohort study. Participants: Eyes from the Bascom Palmer Glaucoma Repository. Methods: Eyes that underwent MIGS between July 1, 2014 and February 1, 2022 were analyzed. Chat Generative Pre-trained Transformer (ChatGPT) was used to classify deidentified EHR anterior chamber examination text into HE categories (no hyphema, microhyphema, clot, and hyphema). Agreement between classifications by ChatGPT and a glaucoma specialist was evaluated using Cohen's Kappa and precision-recall (PR) curve. Time to resolution of HEs was assessed using Cox proportional-hazards models. Goniotomy HE resolution was evaluated by degree of angle treatment (90°-179°, 180°-269°, 270°-360°). Logistic regression was used to identify HE risk factors. Main Outcome Measures: Accuracy of ChatGPT HE classification and incidence and resolution of HEs. Results: The study included 434 goniotomy eyes (368 patients) and 528 Schlemm's canal stent (SCS) eyes (390 patients). Chat Generative Pre-trained Transformer facilitated excellent HE classification (Cohen's kappa 0.93, area under PR curve 0.968). Using ChatGPT classifications, at postoperative day 1, HEs occurred in 67.8% of goniotomy and 25.2% of SCS eyes (P < 0.001). The 270° to 360° goniotomy group had the highest HE rate (84.0%, P < 0.001). At postoperative week 1, HEs were observed in 43.4% and 11.3% of goniotomy and SCS eyes, respectively (P < 0.001). By postoperative month 1, HE rates were 13.3% and 1.3% among goniotomy and SCS eyes, respectively (P < 0.001). Time to HE resolution differed between the goniotomy angle groups (log-rank P = 0.034); median time to resolution was 10, 10, and 15 days for the 90° to 179°, 180° to 269°, and 270° to 360° groups, respectively. Risk factor analysis demonstrated greater goniotomy angle was the only significant predictor of HEs (odds ratio for 270°-360°: 4.08, P < 0.001). Conclusions: Large language models can be effectively used to classify longitudinal EHR free-text examination data with high accuracy, highlighting a promising direction for future LLM-assisted research and clinical decision support. Hemorrhagic events are relatively common self-resolving complications that occur more often in goniotomy cases and with larger goniotomy treatments. Time to HE resolution differs significantly between goniotomy groups. Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

4.

Editorial: Coping with an AI-saturated world: psychological dynamics and outcomes of AI-mediated communication.

Chen, Anfan; Evans, Richard; Zeng, Runxi.

Front Psychol ; 15: 1479981, 2024.

Article in English | MEDLINE | ID: mdl-39351112

5.

GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology.

Javan, Ramin; Kim, Theodore; Mostaghni, Navid.

Cureus ; 16(8): e68298, 2024 Aug.

Article in English | MEDLINE | ID: mdl-39350878

ABSTRACT

GPT-4 Vision (GPT-4V) represents a significant advancement in multimodal artificial intelligence, enabling text generation from images without specialized training. This marks the transformation of ChatGPT as a large language model (LLM) into GPT-4's promised large multimodal model (LMM). As these AI models continue to advance, they may enhance radiology workflow and aid with decision support. This technical note explores potential GPT-4V applications in radiology and evaluates performance for sample tasks. GPT-4V capabilities were tested using images from the web, personal and institutional teaching files, and hand-drawn sketches. Prompts evaluated scientific figure analysis, radiologic image reporting, image comparison, handwriting interpretation, sketch-to-code, and artistic expression. In this limited demonstration of GPT-4V's capabilities, it showed promise in classifying images, counting entities, comparing images, and deciphering handwriting and sketches. However, it exhibited limitations in detecting some fractures, discerning a change in size of lesions, accurately interpreting complex diagrams, and consistently characterizing radiologic findings. Artistic expression responses were coherent. WhileGPT-4V may eventually assist with tasks related to radiology, current reliability gaps highlight the need for continued training and improvement before consideration for any medical use by the general public and ultimately clinical integration. Future iterations could enable a virtual assistant to discuss findings, improve reports, extract data from images, provide decision support based on guidelines, white papers, and appropriateness criteria. Human expertise remain essential for safe practice and partnerships between physicians, researchers, and technology leaders are necessary to safeguard against risks like bias and privacy concerns.

6.

Comparing emotions in ChatGPT answers and human answers to the coding questions on Stack Overflow.

Fatahi, Somayeh; Vassileva, Julita; Roy, Chanchal K.

Front Artif Intell ; 7: 1393903, 2024.

Article in English | MEDLINE | ID: mdl-39351510

ABSTRACT

Introduction: Recent advances in generative Artificial Intelligence (AI) and Natural Language Processing (NLP) have led to the development of Large Language Models (LLMs) and AI-powered chatbots like ChatGPT, which have numerous practical applications. Notably, these models assist programmers with coding queries, debugging, solution suggestions, and providing guidance on software development tasks. Despite known issues with the accuracy of ChatGPT's responses, its comprehensive and articulate language continues to attract frequent use. This indicates potential for ChatGPT to support educators and serve as a virtual tutor for students. Methods: To explore this potential, we conducted a comprehensive analysis comparing the emotional content in responses from ChatGPT and human answers to 2000 questions sourced from Stack Overflow (SO). The emotional aspects of the answers were examined to understand how the emotional tone of AI responses compares to that of human responses. Results: Our analysis revealed that ChatGPT's answers are generally more positive compared to human responses. In contrast, human answers often exhibit emotions such as anger and disgust. Significant differences were observed in emotional expressions between ChatGPT and human responses, particularly in the emotions of anger, disgust, and joy. Human responses displayed a broader emotional spectrum compared to ChatGPT, suggesting greater emotional variability among humans. Discussion: The findings highlight a distinct emotional divergence between ChatGPT and human responses, with ChatGPT exhibiting a more uniformly positive tone and humans displaying a wider range of emotions. This variance underscores the need for further research into the role of emotional content in AI and human interactions, particularly in educational contexts where emotional nuances can impact learning and communication.

7.

Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study.

Choi, Yong K; Lin, Shih-Yin; Fick, Donna Marie; Shulman, Richard W; Lee, Sangil; Shrestha, Priyanka; Santoso, Kate.

JMIR Form Res ; 8: e51383, 2024 Oct 01.

Article in English | MEDLINE | ID: mdl-39353189

ABSTRACT

BACKGROUND: Generative artificial intelligence (AI) and large language models, such as OpenAI's ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpretating assessment results strictly following credible, published scoring criteria, have not been thoroughly studied. OBJECTIVE: This exploratory study aims to evaluate and optimize ChatGPT's capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models' interpretation and reporting accuracy through iterative prompt optimization. METHODS: We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI's processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool's criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards. RESULTS: Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models' capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive "Yes" or "No" responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire. CONCLUSIONS: Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research.

Subject(s)

Delirium , Humans , Delirium/diagnosis , Surveys and Questionnaires , Artificial Intelligence

8.

Comparative Study to Evaluate the Accuracy of Differential Diagnosis Lists Generated by Gemini Advanced, Gemini, and Bard for a Case Report Series Analysis: Cross-Sectional Study.

Hirosawa, Takanobu; Harada, Yukinori; Tokumasu, Kazuki; Ito, Takahiro; Suzuki, Tomoharu; Shimizu, Taro.

JMIR Med Inform ; 12: e63010, 2024 Oct 02.

Article in English | MEDLINE | ID: mdl-39357052

ABSTRACT

BACKGROUND: Generative artificial intelligence (GAI) systems by Google have recently been updated from Bard to Gemini and Gemini Advanced as of December 2023. Gemini is a basic, free-to-use model after a user's login, while Gemini Advanced operates on a more advanced model requiring a fee-based subscription. These systems have the potential to enhance medical diagnostics. However, the impact of these updates on comprehensive diagnostic accuracy remains unknown. OBJECTIVE: This study aimed to compare the accuracy of the differential diagnosis lists generated by Gemini Advanced, Gemini, and Bard across comprehensive medical fields using case report series. METHODS: We identified a case report series with relevant final diagnoses published in the American Journal Case Reports from January 2022 to March 2023. After excluding nondiagnostic cases and patients aged 10 years and younger, we included the remaining case reports. After refining the case parts as case descriptions, we input the same case descriptions into Gemini Advanced, Gemini, and Bard to generate the top 10 differential diagnosis lists. In total, 2 expert physicians independently evaluated whether the final diagnosis was included in the lists and its ranking. Any discrepancies were resolved by another expert physician. Bonferroni correction was applied to adjust the P values for the number of comparisons among 3 GAI systems, setting the corrected significance level at P value <.02. RESULTS: In total, 392 case reports were included. The inclusion rates of the final diagnosis within the top 10 differential diagnosis lists were 73% (286/392) for Gemini Advanced, 76.5% (300/392) for Gemini, and 68.6% (269/392) for Bard. The top diagnoses matched the final diagnoses in 31.6% (124/392) for Gemini Advanced, 42.6% (167/392) for Gemini, and 31.4% (123/392) for Bard. Gemini demonstrated higher diagnostic accuracy than Bard both within the top 10 differential diagnosis lists (P=.02) and as the top diagnosis (P=.001). In addition, Gemini Advanced achieved significantly lower accuracy than Gemini in identifying the most probable diagnosis (P=.002). CONCLUSIONS: The results of this study suggest that Gemini outperformed Bard in diagnostic accuracy following the model update. However, Gemini Advanced requires further refinement to optimize its performance for future artificial intelligence-enhanced diagnostics. These findings should be interpreted cautiously and considered primarily for research purposes, as these GAI systems have not been adjusted for medical diagnostics nor approved for clinical use.

Subject(s)

Artificial Intelligence , Humans , Diagnosis, Differential , Cross-Sectional Studies

9.

Large language models and humans converge in judging public figures' personalities.

Cao, Xubo; Kosinski, Michal.

PNAS Nexus ; 3(10): pgae418, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39359393

ABSTRACT

ChatGPT-4 and 600 human raters evaluated 226 public figures' personalities using the Ten-Item Personality Inventory. The correlation between ChatGPT-4 and aggregate human ratings ranged from r = 0.76 to 0.87, outperforming the models specifically trained to make such predictions. Notably, the model was not provided with any training data or feedback on its performance. We discuss the potential explanations and practical implications of ChatGPT-4's ability to mimic human responses accurately.

10.

David vs. Goliath: comparing conventional machine learning and a large language model for assessing students' concept use in a physics problem.

Kieser, Fabian; Tschisgale, Paul; Rauh, Sophia; Bai, Xiaoyu; Maus, Holger; Petersen, Stefan; Stede, Manfred; Neumann, Knut; Wulff, Peter.

Front Artif Intell ; 7: 1408817, 2024.

Article in English | MEDLINE | ID: mdl-39359648

ABSTRACT

Large language models have been shown to excel in many different tasks across disciplines and research sites. They provide novel opportunities to enhance educational research and instruction in different ways such as assessment. However, these methods have also been shown to have fundamental limitations. These relate, among others, to hallucinating knowledge, explainability of model decisions, and resource expenditure. As such, more conventional machine learning algorithms might be more convenient for specific research problems because they allow researchers more control over their research. Yet, the circumstances in which either conventional machine learning or large language models are preferable choices are not well understood. This study seeks to answer the question to what extent either conventional machine learning algorithms or a recently advanced large language model performs better in assessing students' concept use in a physics problem-solving task. We found that conventional machine learning algorithms in combination outperformed the large language model. Model decisions were then analyzed via closer examination of the models' classifications. We conclude that in specific contexts, conventional machine learning can supplement large language models, especially when labeled data is available.

11.

Ascle-A Python Natural Language Processing Toolkit for Medical Text Generation: Development and Evaluation Study.

Yang, Rui; Zeng, Qingcheng; You, Keen; Qiao, Yujie; Huang, Lucas; Hsieh, Chia-Chun; Rosand, Benjamin; Goldwasser, Jeremy; Dave, Amisha; Keenan, Tiarnan; Ke, Yuhe; Hong, Chuan; Liu, Nan; Chew, Emily; Radev, Dragomir; Lu, Zhiyong; Xu, Hua; Chen, Qingyu; Li, Irene.

J Med Internet Res ; 26: e60601, 2024 Oct 03.

Article in English | MEDLINE | ID: mdl-39361955

ABSTRACT

BACKGROUND: Medical texts present significant domain-specific challenges, and manually curating these texts is a time-consuming and labor-intensive process. To address this, natural language processing (NLP) algorithms have been developed to automate text processing. In the biomedical field, various toolkits for text processing exist, which have greatly improved the efficiency of handling unstructured text. However, these existing toolkits tend to emphasize different perspectives, and none of them offer generation capabilities, leaving a significant gap in the current offerings. OBJECTIVE: This study aims to describe the development and preliminary evaluation of Ascle. Ascle is tailored for biomedical researchers and clinical staff with an easy-to-use, all-in-one solution that requires minimal programming expertise. For the first time, Ascle provides 4 advanced and challenging generative functions: question-answering, text summarization, text simplification, and machine translation. In addition, Ascle integrates 12 essential NLP functions, along with query and search capabilities for clinical databases. METHODS: We fine-tuned 32 domain-specific language models and evaluated them thoroughly on 27 established benchmarks. In addition, for the question-answering task, we developed a retrieval-augmented generation (RAG) framework for large language models that incorporated a medical knowledge graph with ranking techniques to enhance the reliability of generated answers. Additionally, we conducted a physician validation to assess the quality of generated content beyond automated metrics. RESULTS: The fine-tuned models and RAG framework consistently enhanced text generation tasks. For example, the fine-tuned models improved the machine translation task by 20.27 in terms of BLEU score. In the question-answering task, the RAG framework raised the ROUGE-L score by 18% over the vanilla models. Physician validation of generated answers showed high scores for readability (4.95/5) and relevancy (4.43/5), with a lower score for accuracy (3.90/5) and completeness (3.31/5). CONCLUSIONS: This study introduces the development and evaluation of Ascle, a user-friendly NLP toolkit designed for medical text generation. All code is publicly available through the Ascle GitHub repository. All fine-tuned language models can be accessed through Hugging Face.

Subject(s)

Natural Language Processing , Humans , Algorithms , Software

12.

"Doctor ChatGPT, Can You Help Me?" The Patient's Perspective: Cross-Sectional Study.

Armbruster, Jonas; Bussmann, Florian; Rothhaas, Catharina; Titze, Nadine; Grützner, Paul Alfred; Freischmidt, Holger.

J Med Internet Res ; 26: e58831, 2024 Oct 01.

Article in English | MEDLINE | ID: mdl-39352738

ABSTRACT

BACKGROUND: Artificial intelligence and the language models derived from it, such as ChatGPT, offer immense possibilities, particularly in the field of medicine. It is already evident that ChatGPT can provide adequate and, in some cases, expert-level responses to health-related queries and advice for patients. However, it is currently unknown how patients perceive these capabilities, whether they can derive benefit from them, and whether potential risks, such as harmful suggestions, are detected by patients. OBJECTIVE: This study aims to clarify whether patients can get useful and safe health care advice from an artificial intelligence chatbot assistant. METHODS: This cross-sectional study was conducted using 100 publicly available health-related questions from 5 medical specialties (trauma, general surgery, otolaryngology, pediatrics, and internal medicine) from a web-based platform for patients. Responses generated by ChatGPT-4.0 and by an expert panel (EP) of experienced physicians from the aforementioned web-based platform were packed into 10 sets consisting of 10 questions each. The blinded evaluation was carried out by patients regarding empathy and usefulness (assessed through the question: "Would this answer have helped you?") on a scale from 1 to 5. As a control, evaluation was also performed by 3 physicians in each respective medical specialty, who were additionally asked about the potential harm of the response and its correctness. RESULTS: In total, 200 sets of questions were submitted by 64 patients (mean 45.7, SD 15.9 years; 29/64, 45.3% male), resulting in 2000 evaluated answers of ChatGPT and the EP each. ChatGPT scored higher in terms of empathy (4.18 vs 2.7; P<.001) and usefulness (4.04 vs 2.98; P<.001). Subanalysis revealed a small bias in terms of levels of empathy given by women in comparison with men (4.46 vs 4.14; P=.049). Ratings of ChatGPT were high regardless of the participant's age. The same highly significant results were observed in the evaluation of the respective specialist physicians. ChatGPT outperformed significantly in correctness (4.51 vs 3.55; P<.001). Specialists rated the usefulness (3.93 vs 4.59) and correctness (4.62 vs 3.84) significantly lower in potentially harmful responses from ChatGPT (P<.001). This was not the case among patients. CONCLUSIONS: The results indicate that ChatGPT is capable of supporting patients in health-related queries better than physicians, at least in terms of written advice through a web-based platform. In this study, ChatGPT's responses had a lower percentage of potentially harmful advice than the web-based EP. However, it is crucial to note that this finding is based on a specific study design and may not generalize to all health care settings. Alarmingly, patients are not able to independently recognize these potential dangers.

Subject(s)

Physician-Patient Relations , Humans , Cross-Sectional Studies , Male , Female , Adult , Middle Aged , Artificial Intelligence , Physicians/psychology , Internet , Empathy , Surveys and Questionnaires

13.

Validation of large language models for detecting pathologic complete response in breast cancer using population-based pathology reports.

Cheligeer, Ken; Wu, Guosong; Laws, Alison; Quan, May Lynn; Li, Andrea; Brisson, Anne-Marie; Xie, Jason; Xu, Yuan.

BMC Med Inform Decis Mak ; 24(1): 283, 2024 Oct 03.

Article in English | MEDLINE | ID: mdl-39363322

ABSTRACT

AIMS: The primary goal of this study is to evaluate the capabilities of Large Language Models (LLMs) in understanding and processing complex medical documentation. We chose to focus on the identification of pathologic complete response (pCR) in narrative pathology reports. This approach aims to contribute to the advancement of comprehensive reporting, health research, and public health surveillance, thereby enhancing patient care and breast cancer management strategies. METHODS: The study utilized two analytical pipelines, developed with open-source LLMs within the healthcare system's computing environment. First, we extracted embeddings from pathology reports using 15 different transformer-based models and then employed logistic regression on these embeddings to classify the presence or absence of pCR. Secondly, we fine-tuned the Generative Pre-trained Transformer-2 (GPT-2) model by attaching a simple feed-forward neural network (FFNN) layer to improve the detection performance of pCR from pathology reports. RESULTS: In a cohort of 351 female breast cancer patients who underwent neoadjuvant chemotherapy (NAC) and subsequent surgery between 2010 and 2017 in Calgary, the optimized method displayed a sensitivity of 95.3% (95%CI: 84.0-100.0%), a positive predictive value of 90.9% (95%CI: 76.5-100.0%), and an F1 score of 93.0% (95%CI: 83.7-100.0%). The results, achieved through diverse LLM integration, surpassed traditional machine learning models, underscoring the potential of LLMs in clinical pathology information extraction. CONCLUSIONS: The study successfully demonstrates the efficacy of LLMs in interpreting and processing digital pathology data, particularly for determining pCR in breast cancer patients post-NAC. The superior performance of LLM-based pipelines over traditional models highlights their significant potential in extracting and analyzing key clinical data from narrative reports. While promising, these findings highlight the need for future external validation to confirm the reliability and broader applicability of these methods.

Subject(s)

Breast Neoplasms , Humans , Breast Neoplasms/pathology , Female , Middle Aged , Neural Networks, Computer , Natural Language Processing , Adult , Aged , Neoadjuvant Therapy , Pathologic Complete Response

14.

Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study.

Wu, Zelin; Gan, Wenyi; Xue, Zhaowen; Ni, Zhengxin; Zheng, Xiaofei; Zhang, Yiyi.

JMIR Med Educ ; 10: e52746, 2024 Oct 03.

Article in English | MEDLINE | ID: mdl-39363539

ABSTRACT

Background: The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, which shows great potential in medical education due to its powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT's performance in handling questions for the National Nursing Licensure Examination (NNLE) in China and the United States, including the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the NNLE. Objective: This study aims to examine how well LLMs respond to the NCLEX-RN and the NNLE multiple-choice questions (MCQs) in various language inputs. To evaluate whether LLMs can be used as multilingual learning assistance for nursing, and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice. Methods: First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were inputted into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. Different LLMs were compared according to the accuracy rate, and the differences between different language inputs were compared. Results: The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Despite the statistical significance of the difference (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0. The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 in English input. English accuracy was higher when compared with ChatGPT 3.5's Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether submitted in Chinese or English, the MCQs from the NCLEX-RN and NNLE demonstrated that ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs. Conclusions: This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making.

Subject(s)

Educational Measurement , Licensure, Nursing , China , Humans , Licensure, Nursing/standards , Cross-Sectional Studies , United States , Educational Measurement/methods , Educational Measurement/standards , Artificial Intelligence

15.

ChatGPT for improving postoperative instructions in multiple fields of plastic surgery.

Zhang, Andi; Li, Cindy Xin Ran; Piper, Merisa; Rose, John; Chen, Kevin; Lin, Alexander Y.

J Plast Reconstr Aesthet Surg ; 99: 201-208, 2024 Aug 24.

Article in English | MEDLINE | ID: mdl-39383672

ABSTRACT

BACKGROUND: Clear discharge instructions are vital for patients and caregivers to manage postoperative care at home. However, they often exceed the sixth-grade reading level recommended by national associations. It was hypothesized that ChatGPT could help rewrite instructions to this level for increased accessibility and comprehension. This study aimed to assess the readability, understandability, actionability, and safety of ChatGPT rewritten postoperative instructions in four plastic surgery subspecialties: breast, craniofacial, hand, and aesthetic surgery. METHODS: Postoperative instructions from four index procedures in plastic surgery were obtained. ChatGPT was used to rewrite at the sixth- and fourth-grade reading levels. Readability was determined by seven readability indexes, understandability and actionability by the Patient Education Materials Assessment Tool for printable materials questionnaire, and safety by the primary surgeons. RESULTS: Overall, the average readability of the original postoperative instructions ranged between the seventh and eighth grade levels. Only one of the sixth-grade ChatGPT instructions was lowered to the sixth-grade level. Of the fourth-grade ChatGPT instructions, all were reduced to the sixth-grade-level or below, but none achieved the fourth-grade level. Understandability scores increased as reading levels decreased, whereas actionability scores decreased for fourth-grade rewrites. Safety was not compromised in all rewrites. CONCLUSIONS: ChatGPT can adapt postoperative instructions to a more readable sixth-grade level without compromising safety. This study suggests prompting ChatGPT to write one to two grade levels lower than the desired reading level. While understandability increased for all ChatGPT rewrites, actionability decreased for fourth-grade-level instructions. Sixth-grade remains the optimal reading level for postoperative instructions. This study demonstrates that ChatGPT can help improve patient care by improving the readability of postoperative instructions.

16.

Enhancing Orthopedic Knowledge Assessments: The Performance of Specialized Generative Language Model Optimization.

Zhou, Hong; Wang, Hong-Lin; Duan, Yu-Yu; Yan, Zi-Neng; Luo, Rui; Lv, Xiang-Xin; Xie, Yi; Zhang, Jia-Yao; Yang, Jia-Ming; Xue, Ming-di; Fang, Ying; Lu, Lin; Liu, Peng-Ran; Ye, Zhe-Wei.

Curr Med Sci ; 2024 Oct 05.

Article in English | MEDLINE | ID: mdl-39368054

ABSTRACT

OBJECTIVE: This study aimed to evaluate and compare the effectiveness of knowledge base-optimized and unoptimized large language models (LLMs) in the field of orthopedics to explore optimization strategies for the application of LLMs in specific fields. METHODS: This research constructed a specialized knowledge base using clinical guidelines from the American Academy of Orthopaedic Surgeons (AAOS) and authoritative orthopedic publications. A total of 30 orthopedic-related questions covering aspects such as anatomical knowledge, disease diagnosis, fracture classification, treatment options, and surgical techniques were input into both the knowledge base-optimized and unoptimized versions of the GPT-4, ChatGLM, and Spark LLM, with their generated responses recorded. The overall quality, accuracy, and comprehensiveness of these responses were evaluated by 3 experienced orthopedic surgeons. RESULTS: Compared with their unoptimized LLMs, the optimized version of GPT-4 showed improvements of 15.3% in overall quality, 12.5% in accuracy, and 12.8% in comprehensiveness; ChatGLM showed improvements of 24.8%, 16.1%, and 19.6%, respectively; and Spark LLM showed improvements of 6.5%, 14.5%, and 24.7%, respectively. CONCLUSION: The optimization of knowledge bases significantly enhances the quality, accuracy, and comprehensiveness of the responses provided by the 3 models in the orthopedic field. Therefore, knowledge base optimization is an effective method for improving the performance of LLMs in specific fields.

17.

Human-augmented large language model-driven selection of glutathione peroxidase 4 as a candidate blood transcriptional biomarker for circulating erythroid cells.

Subba, Bishesh; Toufiq, Mohammed; Omi, Fuadur; Yurieva, Marina; Khan, Taushif; Rinchai, Darawan; Palucka, Karolina; Chaussabel, Damien.

Sci Rep ; 14(1): 23225, 2024 10 05.

Article in English | MEDLINE | ID: mdl-39369090

ABSTRACT

The identification of optimal candidate genes from large-scale blood transcriptomic data is crucial for developing targeted assays to monitor immune responses. Here, we introduce a novel, optimized large language model (LLM)-based approach for prioritizing candidate biomarkers from blood transcriptional modules. Focusing on module M14.51 from the BloodGen3 repertoire, we implemented a multi-step LLM-driven workflow. Initial high-throughput screening used GPT-4, Claude 3, and Claude 3.5 Sonnet to score and rank the module's constituent genes across six criteria. Top candidates then underwent high-resolution scoring using Consensus GPT, with concurrent manual fact-checking and, when needed, iterative refinement of the scores based on user feedback. Qualitative assessment of literature-based narratives and analysis of reference transcriptome data further refined the selection process. This novel multi-tiered approach consistently identified Glutathione Peroxidase 4 (GPX4) as the top candidate gene for module M14.51. GPX4's role in oxidative stress regulation, its potential as a future drug target, and its expression pattern across diverse cell types supported its selection. The incorporation of reference transcriptome data further validated GPX4 as the most suitable candidate for this module. This study presents an advanced LLM-driven workflow with a novel optimized scoring strategy for candidate gene prioritization, incorporating human-in-the-loop augmentation. The approach identified GPX4 as a key gene in the erythroid cell-associated module M14.51, suggesting its potential utility for biomarker discovery and targeted assay development. By combining AI-driven literature analysis with iterative human expert validation, this method leverages the strengths of both artificial and human intelligence, potentially contributing to the development of biologically relevant and clinically informative targeted assays. Further validation studies are needed to confirm the broader applicability of this human-augmented AI approach.

Subject(s)

Biomarkers , Erythroid Cells , Phospholipid Hydroperoxide Glutathione Peroxidase , Humans , Biomarkers/blood , Erythroid Cells/metabolism , Phospholipid Hydroperoxide Glutathione Peroxidase/genetics , Phospholipid Hydroperoxide Glutathione Peroxidase/metabolism , Transcriptome , Gene Expression Profiling/methods , Oxidative Stress/genetics

18.

AI and Heart Failure: Present State and Future With Multimodal Large Language Models.

Cheema, Baljash; Pandit, Jay.

JACC Adv ; 3(9): 101029, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39372464

19.

Based on Medicine, The Now and Future of Large Language Models.

Su, Ziqing; Tang, Guozhang; Huang, Rui; Qiao, Yang; Zhang, Zheng; Dai, Xingliang.

Cell Mol Bioeng ; 17(4): 263-277, 2024 Aug.

Article in English | MEDLINE | ID: mdl-39372551

ABSTRACT

Objectives: This review explores the potential applications of large language models (LLMs) such as ChatGPT, GPT-3.5, and GPT-4 in the medical field, aiming to encourage their prudent use, provide professional support, and develop accessible medical AI tools that adhere to healthcare standards. Methods: This paper examines the impact of technologies such as OpenAI's Generative Pre-trained Transformers (GPT) series, including GPT-3.5 and GPT-4, and other large language models (LLMs) in medical education, scientific research, clinical practice, and nursing. Specifically, it includes supporting curriculum design, acting as personalized learning assistants, creating standardized simulated patient scenarios in education; assisting with writing papers, data analysis, and optimizing experimental designs in scientific research; aiding in medical imaging analysis, decision-making, patient education, and communication in clinical practice; and reducing repetitive tasks, promoting personalized care and self-care, providing psychological support, and enhancing management efficiency in nursing. Results: LLMs, including ChatGPT, have demonstrated significant potential and effectiveness in the aforementioned areas, yet their deployment in healthcare settings is fraught with ethical complexities, potential lack of empathy, and risks of biased responses. Conclusion: Despite these challenges, significant medical advancements can be expected through the proper use of LLMs and appropriate policy guidance. Future research should focus on overcoming these barriers to ensure the effective and ethical application of LLMs in the medical field.

20.

Large language models and their big bullshit potential.

Fisher, Sarah A.

Ethics Inf Technol ; 26(4): 67, 2024.

Article in English | MEDLINE | ID: mdl-39372727

ABSTRACT

Newly powerful large language models have burst onto the scene, with applications across a wide range of functions. We can now expect to encounter their outputs at rapidly increasing volumes and frequencies. Some commentators claim that large language models are bullshitting, generating convincing output without regard for the truth. If correct, that would make large language models distinctively dangerous discourse participants. Bullshitters not only undermine the norm of truthfulness (by saying false things) but the normative status of truth itself (by treating it as entirely irrelevant). So, do large language models really bullshit? I argue that they can, in the sense of issuing propositional content in response to fact-seeking prompts, without having first assessed that content for truth or falsity. However, I further argue that they need not bullshit, given appropriate guardrails. So, just as with human speakers, the propensity for a large language model to bullshit depends on its own particular make-up.

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL