Search | VHL Regional Portal

1.

Responses From ChatGPT-4 Show Limited Correlation With Expert Consensus Statement on Anterior Shoulder Instability.

Artamonov, Alexander; Bachar-Avnieli, Ira; Klang, Eyal; Lubovsky, Omri; Atoun, Ehud; Bermant, Alexander; Rosinsky, Philip J.

Arthrosc Sports Med Rehabil ; 6(3): 100923, 2024 Jun.

Article in English | MEDLINE | ID: mdl-39006799

ABSTRACT

Purpose: To compare the similarity of answers provided by Generative Pretrained Transformer-4 (GPT-4) with those of a consensus statement on diagnosis, nonoperative management, and Bankart repair in anterior shoulder instability (ASI). Methods: An expert consensus statement on ASI published by Hurley et al. in 2022 was reviewed and questions laid out to the expert panel were extracted. GPT-4, the subscription version of ChatGPT, was queried using the same set of questions. Answers provided by GPT-4 were compared with those of the expert panel and subjectively rated for similarity by 2 experienced shoulder surgeons. GPT-4 was then used to rate the similarity of its own responses to the consensus statement, classifying them as low, medium, or high. Rates of similarity as classified by the shoulder surgeons and GPT-4 were then compared and interobserver reliability calculated using weighted κ scores. Results: The degree of similarity between responses of GPT-4 and the ASI consensus statement, as defined by shoulder surgeons, was high in 25.8%, medium in 45.2%, and low 29% of questions. GPT-4 assessed similarity as high in 48.3%, medium in 41.9%, and low 9.7% of questions. Surgeons and GPT-4 reached consensus on the classification of 18 questions (58.1%) and disagreement on 13 questions (41.9%). Conclusions: The responses generated by artificial intelligence exhibit limited correlation with an expert statement on the diagnosis and treatment of ASI. Clinical Relevance: As the use of artificial intelligence becomes more prevalent, it is important to understand how closely information resembles content produced by human authors.

2.

Duration-Dependent Risk of Hypoxemia in Colonoscopy Procedures.

Klang, Eyal; Sharif, Kassem; Ukashi, Offir; Rahman, Nisim; Lahat, Adi.

J Clin Med ; 13(13)2024 Jun 24.

Article in English | MEDLINE | ID: mdl-38999246

ABSTRACT

Background and Aims: Colonoscopy is a critical diagnostic and therapeutic procedure in gastroenterology. However, it carries risks, including hypoxemia, which can impact patient safety. Understanding the factors that contribute to the incidence of severe hypoxemia, specifically the role of procedure duration, is essential for improving patient outcomes. This study aims to elucidate the relationship between the length of colonoscopy procedures and the occurrence of severe hypoxemia. Methods: We conducted a retrospective cohort study at Sheba Medical Center, Israel, including 21,524 adult patients who underwent colonoscopy from January 2020 to January 2024. The study focused on the incidence of severe hypoxemia, defined as a drop in oxygen saturation below 90%. Sedation protocols, involving a combination of Fentanyl, Midazolam, and Propofol were personalized based on the endoscopist's discretion. Data were collected from electronic health records, covering patient demographics, clinical scores, sedation and procedure details, and outcomes. Statistical analyses, including logistic regression, were used to examine the association between procedure duration and hypoxemia, adjusting for various patient and procedural factors. Results: We initially collected records of 26,569 patients who underwent colonoscopy, excluding 5045 due to incomplete data, resulting in a final cohort of 21,524 patients. Procedures under 20 min comprised 48.9% of the total, while those lasting 20-40 min made up 50.7%. Only 8.5% lasted 40-60 min, and 2.9% exceeded 60 min. Longer procedures correlated with higher hypoxemia risk: 17.3% for <20 min, 24.2% for 20-40 min, 32.4% for 40-60 min, and 36.1% for ≥60 min. Patients aged 60-80 and ≥80 had increased hypoxemia odds (aOR 1.1, 95% CI 1.0-1.2 and aOR 1.2, 95% CI 1.0-1.4, respectively). Procedure durations of 20-40 min, 40-60 min, and over 60 min had aORs of 1.5 (95% CI 1.4-1.6), 2.1 (95% CI 1.9-2.4), and 2.4 (95% CI 2.0-3.0), respectively. Conclusions: The duration of colonoscopy procedures significantly impacts the risk of severe hypoxemia, with longer durations associated with higher risks. This study underscores the importance of optimizing procedural efficiency and tailoring sedation protocols to individual patient risk profiles to enhance the safety of colonoscopy. Further research is needed to develop strategies that minimize procedure duration without compromising the quality of care, thereby reducing the risk of hypoxemia and improving patient safety.

3.

Applications of large language models in psychiatry: a systematic review.

Omar, Mahmud; Soffer, Shelly; Charney, Alexander W; Landi, Isotta; Nadkarni, Girish N; Klang, Eyal.

Front Psychiatry ; 15: 1422807, 2024.

Article in English | MEDLINE | ID: mdl-38979501

ABSTRACT

Background: With their unmatched ability to interpret and engage with human language and context, large language models (LLMs) hint at the potential to bridge AI and human cognitive processes. This review explores the current application of LLMs, such as ChatGPT, in the field of psychiatry. Methods: We followed PRISMA guidelines and searched through PubMed, Embase, Web of Science, and Scopus, up until March 2024. Results: From 771 retrieved articles, we included 16 that directly examine LLMs' use in psychiatry. LLMs, particularly ChatGPT and GPT-4, showed diverse applications in clinical reasoning, social media, and education within psychiatry. They can assist in diagnosing mental health issues, managing depression, evaluating suicide risk, and supporting education in the field. However, our review also points out their limitations, such as difficulties with complex cases and potential underestimation of suicide risks. Conclusion: Early research in psychiatry reveals LLMs' versatile applications, from diagnostic support to educational roles. Given the rapid pace of advancement, future investigations are poised to explore the extent to which these models might redefine traditional roles in mental health care.

4.

Utilizing ChatGPT to Facilitate Referrals for Fetal Echocardiography.

Gordin Kopylov, Lital; Goldrat, Itai; Maymon, Ron; Svirsky, Ran; Wiener, Yifat; Klang, Eyal.

Fetal Diagn Ther ; : 1-4, 2024 Jun 04.

Article in English | MEDLINE | ID: mdl-38834046

ABSTRACT

INTRODUCTION: OpenAI's GPT-4 (artificial intelligence [AI]) is being studied for its use as a medical decision support tool. This research examines its accuracy in refining referrals for fetal echocardiography (FE) to improve early detection and outcomes related to congenital heart defects (CHDs). METHODS: Past FE data referred to our institution were evaluated separately by pediatric cardiologist, gynecologist (human experts [experts]), and AI, according to established guidelines. We compared experts and AI's agreement on referral necessity, with experts addressing discrepancies. RESULTS: Total of 59 FE cases were addressed retrospectively. Cardiologist, gynecologist, and AI recommended performing FE in 47.5%, 49.2%, and 59.0% of cases, respectively. Comparing AI recommendations to experts indicated agreement of around 80.0% with both experts (p < 0.001). Notably, AI suggested more echocardiographies for minor CHD (64.7%) compared to experts (47.1%), and for major CHD, experts recommended performing FE in all cases (100%) while AI recommended in majority of cases (90.9%). Discrepancies between AI and experts are detailed and reviewed. CONCLUSIONS: The evaluation found moderate agreement between AI and experts. Contextual misunderstandings and lack of specialized medical knowledge limit AI, necessitating clinical guideline guidance. Despite shortcomings, AI's referrals comprised 65% of minor CHD cases versus experts 47%, suggesting its potential as a cautious decision aid for clinicians.

5.

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4.

Lahat, Adi; Sharif, Kassem; Zoabi, Narmin; Shneor Patt, Yonatan; Sharif, Yousra; Fisher, Lior; Shani, Uria; Arow, Mohamad; Levin, Roni; Klang, Eyal.

J Med Internet Res ; 26: e54571, 2024 Jun 27.

Article in English | MEDLINE | ID: mdl-38935937

ABSTRACT

BACKGROUND: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement. OBJECTIVE: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors' and residents' ratings, and specific question types. METHODS: A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications. RESULTS: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Distinctions among question types were significant, particularly for the GPT-4 mean scores in completeness across emergency, internal, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, beneficial, and completeness dimensions. CONCLUSIONS: ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.

Subject(s)

Clinical Decision-Making , Humans , Artificial Intelligence

6.

Deep learning in magnetic resonance enterography for Crohn's disease assessment: a systematic review.

Brem, Ofir; Elisha, David; Konen, Eli; Amitai, Michal; Klang, Eyal.

Abdom Radiol (NY) ; 2024 May 01.

Article in English | MEDLINE | ID: mdl-38693270

ABSTRACT

Crohn's disease (CD) poses significant morbidity, underscoring the need for effective, non-invasive inflammatory assessment using magnetic resonance enterography (MRE). This literature review evaluates recent publications on the role of deep learning in improving MRE for CD assessment. We searched MEDLINE/PUBMED for studies that reported the use of deep learning algorithms for assessment of CD activity. The study was conducted according to the PRISMA guidelines. The risk of bias was evaluated using the QUADAS-2 tool. Five eligible studies, encompassing 468 subjects, were identified. Our study suggests that diverse deep learning applications, including image quality enhancement, bowel segmentation for disease burden quantification, and 3D reconstruction for surgical planning are useful and promising for CD assessment. However, most of the studies are preliminary, retrospective studies, and have a high risk of bias in at least one category. Future research is needed to assess how deep learning can impact CD patient diagnostics, particularly when considering the increasing integration of such models into hospital systems.

7.

Obesity Is Associated with Fatty Liver and Fat Changes in the Kidneys in Humans as Assessed by MRI.

Raphael, Hadar; Klang, Eyal; Konen, Eli; Inbar, Yael; Leibowitz, Avshalom; Frenkel-Nir, Yael; Apter, Sara; Grossman, Ehud.

Nutrients ; 16(9)2024 May 03.

Article in English | MEDLINE | ID: mdl-38732633

ABSTRACT

BACKGROUND: Obesity is associated with metabolic syndrome and fat accumulation in various organs such as the liver and the kidneys. Our goal was to assess, using magnetic resonance imaging (MRI) Dual-Echo phase sequencing, the association between liver and kidney fat deposition and their relation to obesity. METHODS: We analyzed MRI scans of individuals who were referred to the Chaim Sheba Medical Center between December 2017 and May 2020 to perform a study for any indication. For each individual, we retrieved from the computerized charts data on sex, and age, weight, height, body mass index (BMI), systolic and diastolic blood pressure (BP), and comorbidities (diabetes mellitus, hypertension, dyslipidemia). RESULTS: We screened MRI studies of 399 subjects with a median age of 51 years, 52.4% of whom were women, and a median BMI 24.6 kg/m2. We diagnosed 18% of the participants with fatty liver and 18.6% with fat accumulation in the kidneys (fatty kidneys). Out of the 67 patients with fatty livers, 23 (34.3%) also had fatty kidneys, whereas among the 315 patients without fatty livers, only 48 patients (15.2%) had fatty kidneys (p < 0.01). In comparison to the patients who did not have a fatty liver or fatty kidneys (n = 267), those who had both (n = 23) were more obese, had higher systolic BP, and were more likely to have diabetes mellitus. In comparison to the patients without a fatty liver, those with fatty livers had an adjusted odds ratio of 2.91 (97.5% CI; 1.61-5.25) to have fatty kidneys. In total, 19.6% of the individuals were obese (BMI ≥ 30), and 26.1% had overweight (25 < BMI < 30). The obese and overweight individuals were older and more likely to have diabetes mellitus and hypertension and had higher rates of fatty livers and fatty kidneys. Fat deposition in both the liver and the kidneys was observed in 15.9% of the obese patients, in 8.3% of the overweight patients, and in none of those with normal weight. Obesity was the only risk factor for fatty kidneys and fatty livers, with an adjusted OR of 6.3 (97.5% CI 2.1-18.6). CONCLUSIONS: Obesity is a major risk factor for developing a fatty liver and fatty kidneys. Individuals with a fatty liver are more likely to have fatty kidneys. MRI is an accurate modality for diagnosing fatty kidneys. Reviewing MRI scans of any indication should include assessment of fat fractions in the kidneys in addition to that of the liver.

Subject(s)

Fatty Liver , Kidney , Magnetic Resonance Imaging , Obesity , Humans , Female , Male , Middle Aged , Obesity/complications , Kidney/diagnostic imaging , Kidney/physiopathology , Adult , Fatty Liver/diagnostic imaging , Fatty Liver/epidemiology , Body Mass Index , Liver/diagnostic imaging , Liver/pathology , Kidney Diseases/diagnostic imaging , Kidney Diseases/epidemiology , Aged , Risk Factors

8.

Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room.

Glicksberg, Benjamin S; Timsina, Prem; Patel, Dhaval; Sawant, Ashwin; Vaid, Akhil; Raut, Ganesh; Charney, Alexander W; Apakama, Donald; Carr, Brendan G; Freeman, Robert; Nadkarni, Girish N; Klang, Eyal.

J Am Med Inform Assoc ; 2024 May 21.

Article in English | MEDLINE | ID: mdl-38771093

ABSTRACT

BACKGROUND: Artificial intelligence (AI) and large language models (LLMs) can play a critical role in emergency room operations by augmenting decision-making about patient admission. However, there are no studies for LLMs using real-world data and scenarios, in comparison to and being informed by traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits. We compared performance to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities. METHODS: We conducted a retrospective study using electronic health records across 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4 capabilities in many scenarios: through Zero-shot, Few-shot with and without retrieval-augmented generation (RAG), and with and without ML numerical probabilities. RESULTS: The Ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72 and an accuracy of 82.9%. The naïve GPT-4's performance (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) showed substantial improvement when given limited, relevant data to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Interestingly, RAG alone boosted performance to near peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy). CONCLUSIONS: The naïve LLM had limited performance but showed significant improvement in predicting ED admissions when supplemented with real-world examples to learn from, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly lower than the pure ML model, is noteworthy given its potential for providing reasoning behind predictions. Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.

9.

Artificial intelligence for detection of effusion and lipo-hemarthrosis in X-rays and CT of the knee.

Cohen, Israel; Sorin, Vera; Lekach, Ruth; Raskin, Daniel; Segev, Maria; Klang, Eyal; Eshed, Iris; Barash, Yiftach.

Eur J Radiol ; 175: 111460, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38608501

ABSTRACT

BACKGROUND: Traumatic knee injuries are challenging to diagnose accurately through radiography and to a lesser extent, through CT, with fractures sometimes overlooked. Ancillary signs like joint effusion or lipo-hemarthrosis are indicative of fractures, suggesting the need for further imaging. Artificial Intelligence (AI) can automate image analysis, improving diagnostic accuracy and help prioritizing clinically important X-ray or CT studies. OBJECTIVE: To develop and evaluate an AI algorithm for detecting effusion of any kind in knee X-rays and selected CT images and distinguishing between simple effusion and lipo-hemarthrosis indicative of intra-articular fractures. METHODS: This retrospective study analyzed post traumatic knee imaging from January 2016 to February 2023, categorizing images into lipo-hemarthrosis, simple effusion, or normal. It utilized the FishNet-150 algorithm for image classification, with class activation maps highlighting decision-influential regions. The AI's diagnostic accuracy was validated against a gold standard, based on the evaluations made by a radiologist with at least four years of experience. RESULTS: Analysis included CT images from 515 patients and X-rays from 637 post traumatic patients, identifying lipo-hemarthrosis, simple effusion, and normal findings. The AI showed an AUC of 0.81 for detecting any effusion, 0.78 for simple effusion, and 0.83 for lipo-hemarthrosis in X-rays; and 0.89, 0.89, and 0.91, respectively, in CTs. CONCLUSION: The AI algorithm effectively detects knee effusion and differentiates between simple effusion and lipo-hemarthrosis in post-traumatic patients for both X-rays and selected CT images further studies are needed to validate these results.

Subject(s)

Artificial Intelligence , Hemarthrosis , Knee Injuries , Tomography, X-Ray Computed , Humans , Knee Injuries/diagnostic imaging , Knee Injuries/complications , Tomography, X-Ray Computed/methods , Female , Male , Retrospective Studies , Hemarthrosis/diagnostic imaging , Hemarthrosis/etiology , Middle Aged , Adult , Algorithms , Aged , Exudates and Transudates/diagnostic imaging , Aged, 80 and over , Young Adult , Adolescent , Radiographic Image Interpretation, Computer-Assisted/methods , Knee Joint/diagnostic imaging , Sensitivity and Specificity

10.

Utilizing natural language processing and large language models in the diagnosis and prediction of infectious diseases: A systematic review.

Omar, Mahmud; Brin, Dana; Glicksberg, Benjamin; Klang, Eyal.

Am J Infect Control ; 2024 Apr 06.

Article in English | MEDLINE | ID: mdl-38588980

ABSTRACT

BACKGROUND: Natural Language Processing (NLP) and Large Language Models (LLMs) hold largely untapped potential in infectious disease management. This review explores their current use and uncovers areas needing more attention. METHODS: This analysis followed systematic review procedures, registered with the Prospective Register of Systematic Reviews. We conducted a search across major databases including PubMed, Embase, Web of Science, and Scopus, up to December 2023, using keywords related to NLP, LLM, and infectious diseases. We also employed the Quality Assessment of Diagnostic Accuracy Studies-2 tool for evaluating the quality and robustness of the included studies. RESULTS: Our review identified 15 studies with diverse applications of NLP in infectious disease management. Notable examples include GPT-4's application in detecting urinary tract infections and BERTweet's use in Lyme Disease surveillance through social media analysis. These models demonstrated effective disease monitoring and public health tracking capabilities. However, the effectiveness varied across studies. For instance, while some NLP tools showed high accuracy in pneumonia detection and high sensitivity in identifying invasive mold diseases from medical reports, others fell short in areas like bloodstream infection management. CONCLUSIONS: This review highlights the yet-to-be-fully-realized promise of NLP and LLMs in infectious disease management. It calls for more exploration to fully harness AI's capabilities, particularly in the areas of diagnosis, surveillance, predicting disease courses, and tracking epidemiological trends.

11.

Navigating the vestibular maze: text-mining analysis of publication trends over five decades.

Wolfovitz, Amit; Gecel, Nir A; Gimmon, Yoav; Shivatzki, Shaked; Sorin, Vera; Barash, Yiftach; Klang, Eyal; Tessler, Idit.

Front Neurol ; 15: 1292640, 2024.

Article in English | MEDLINE | ID: mdl-38560730

ABSTRACT

Introduction: The field of vestibular science, encompassing the study of the vestibular system and associated disorders, has experienced notable growth and evolving trends over the past five decades. Here, we explore the changing landscape in vestibular science, focusing on epidemiology, peripheral pathologies, diagnosis methods, treatment, and technological advancements. Methods: Publication data was obtained from the US National Center for Biotechnology Information (NCBI) PubMed database. The analysis included epidemiological, etiological, diagnostic, and treatment-focused studies on peripheral vestibular disorders, with a particular emphasis on changes in topics and trends of publications over time. Results: Our dataset of 39,238 publications revealed a rising trend in research across all age groups. Etiologically, benign paroxysmal positional vertigo (BPPV) and Meniere's disease were the most researched conditions, but the prevalence of studies on vestibular migraine showed a marked increase in recent years. Electronystagmography (ENG)/ Videonystagmography (VNG) and Vestibular Evoked Myogenic Potential (VEMP) were the most commonly discussed diagnostic tools, while physiotherapy stood out as the primary treatment modality. Conclusion: Our study presents a unique opportunity and point of view, exploring the evolving landscape of vestibular science publications over the past five decades. The analysis underscored the dynamic nature of the field, highlighting shifts in focus and emerging publication trends in diagnosis and treatment over time.

12.

ChatGPT's adherence to otolaryngology clinical practice guidelines.

Tessler, Idit; Wolfovitz, Amit; Alon, Eran E; Gecel, Nir A; Livneh, Nir; Zimlichman, Eyal; Klang, Eyal.

Eur Arch Otorhinolaryngol ; 281(7): 3829-3834, 2024 Jul.

Article in English | MEDLINE | ID: mdl-38647684

ABSTRACT

OBJECTIVES: Large language models, including ChatGPT, has the potential to transform the way we approach medical knowledge, yet accuracy in clinical topics is critical. Here we assessed ChatGPT's performance in adhering to the American Academy of Otolaryngology-Head and Neck Surgery guidelines. METHODS: We presented ChatGPT with 24 clinical otolaryngology questions based on the guidelines of the American Academy of Otolaryngology. This was done three times (N = 72) to test the model's consistency. Two otolaryngologists evaluated the responses for accuracy and relevance to the guidelines. Cohen's Kappa was used to measure evaluator agreement, and Cronbach's alpha assessed the consistency of ChatGPT's responses. RESULTS: The study revealed mixed results; 59.7% (43/72) of ChatGPT's responses were highly accurate, while only 2.8% (2/72) directly contradicted the guidelines. The model showed 100% accuracy in Head and Neck, but lower accuracy in Rhinology and Otology/Neurotology (66%), Laryngology (50%), and Pediatrics (8%). The model's responses were consistent in 17/24 (70.8%), with a Cronbach's alpha value of 0.87, indicating a reasonable consistency across tests. CONCLUSIONS: Using a guideline-based set of structured questions, ChatGPT demonstrates consistency but variable accuracy in otolaryngology. Its lower performance in some areas, especially Pediatrics, suggests that further rigorous evaluation is needed before considering real-world clinical use.

Subject(s)

Guideline Adherence , Otolaryngology , Practice Guidelines as Topic , Otolaryngology/standards , Humans , United States

13.

AI in the ED: Assessing the efficacy of GPT models vs. physicians in medical score calculation.

Haim, Gal Ben; Braun, Adi; Eden, Haggai; Burshtein, Livnat; Barash, Yiftach; Irony, Avinoah; Klang, Eyal.

Am J Emerg Med ; 79: 161-166, 2024 May.

Article in English | MEDLINE | ID: mdl-38447503

ABSTRACT

BACKGROUND AND AIMS: Artificial Intelligence (AI) models like GPT-3.5 and GPT-4 have shown promise across various domains but remain underexplored in healthcare. Emergency Departments (ED) rely on established scoring systems, such as NIHSS and HEART score, to guide clinical decision-making. This study aims to evaluate the proficiency of GPT-3.5 and GPT-4 against experienced ED physicians in calculating five commonly used medical scores. METHODS: This retrospective study analyzed data from 150 patients who visited the ED over one week. Both AI models and two human physicians were tasked with calculating scores for NIH Stroke Scale, Canadian Syncope Risk Score, Alvarado Score for Acute Appendicitis, Canadian CT Head Rule, and HEART Score. Cohen's Kappa statistic and AUC values were used to assess inter-rater agreement and predictive performance, respectively. RESULTS: The highest level of agreement was observed between the human physicians (Kappa = 0.681), while GPT-4 also showed moderate to substantial agreement with them (Kappa values of 0.473 and 0.576). GPT-3.5 had the lowest agreement with human scorers. These results highlight the superior predictive performance of human expertise over the currently available automated systems for this specific medical outcome. Human physicians achieved a higher ROC-AUC on 3 of the 5 scores, but none of the differences were statistically significant. CONCLUSIONS: While AI models demonstrated some level of concordance with human expertise, they fell short in emulating the complex clinical judgments that physicians make. The study suggests that current AI models may serve as supplementary tools but are not ready to replace human expertise in high-stakes settings like the ED. Further research is needed to explore the capabilities and limitations of AI in emergency medicine.

Subject(s)

Artificial Intelligence , Physicians , Humans , Canada , Retrospective Studies , Emergency Service, Hospital

14.

Utilizing large language models in breast cancer management: systematic review.

Sorin, Vera; Glicksberg, Benjamin S; Artsi, Yaara; Barash, Yiftach; Konen, Eli; Nadkarni, Girish N; Klang, Eyal.

J Cancer Res Clin Oncol ; 150(3): 140, 2024 Mar 19.

Article in English | MEDLINE | ID: mdl-38504034

ABSTRACT

PURPOSE: Despite advanced technologies in breast cancer management, challenges remain in efficiently interpreting vast clinical data for patient-specific insights. We reviewed the literature on how large language models (LLMs) such as ChatGPT might offer solutions in this field. METHODS: We searched MEDLINE for relevant studies published before December 22, 2023. Keywords included: "large language models", "LLM", "GPT", "ChatGPT", "OpenAI", and "breast". The risk bias was evaluated using the QUADAS-2 tool. RESULTS: Six studies evaluating either ChatGPT-3.5 or GPT-4, met our inclusion criteria. They explored clinical notes analysis, guideline-based question-answering, and patient management recommendations. Accuracy varied between studies, ranging from 50 to 98%. Higher accuracy was seen in structured tasks like information retrieval. Half of the studies used real patient data, adding practical clinical value. Challenges included inconsistent accuracy, dependency on the way questions are posed (prompt-dependency), and in some cases, missing critical clinical information. CONCLUSION: LLMs hold potential in breast cancer care, especially in textual information extraction and guideline-driven clinical question-answering. Yet, their inconsistent accuracy underscores the need for careful validation of these models, and the importance of ongoing supervision.

Subject(s)

Breast Neoplasms , Humans , Female , Breast Neoplasms/therapy , Breast , Information Storage and Retrieval , Language

15.

Large language models for generating medical examinations: systematic review.

Artsi, Yaara; Sorin, Vera; Konen, Eli; Glicksberg, Benjamin S; Nadkarni, Girish; Klang, Eyal.

BMC Med Educ ; 24(1): 354, 2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38553693

ABSTRACT

BACKGROUND: Writing multiple choice questions (MCQs) for the purpose of medical exams is challenging. It requires extensive medical knowledge, time and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs. METHODS: The authors searched for studies published up to November 2023. Search terms focused on LLMs generated MCQs for medical examinations. Non-English, out of year range and studies not focusing on AI generated multiple-choice questions were excluded. MEDLINE was used as a search database. Risk of bias was evaluated using a tailored QUADAS-2 tool. RESULTS: Overall, eight studies published between April 2023 and October 2023 were included. Six studies used Chat-GPT 3.5, while two employed GPT 4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models. One other study compared LLM-generated questions with those written by humans. All studies presented faulty questions that were deemed inappropriate for medical exams. Some questions required additional modifications in order to qualify. CONCLUSIONS: LLMs can be used to write MCQs for medical examinations. However, their limitations cannot be ignored. Further study in this field is essential and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations. 2 studies were at high risk of bias. The study followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

Subject(s)

Knowledge , Language , Humans , Databases, Factual , Writing

16.

Artificial Intelligence for Identification of Images with Active Bleeding in Mesenteric and Celiac Arteries Angiography.

Barash, Yiftach; Livne, Adva; Klang, Eyal; Sorin, Vera; Cohen, Israel; Khaitovich, Boris; Raskin, Daniel.

Cardiovasc Intervent Radiol ; 47(6): 785-792, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38530394

ABSTRACT

PURPOSE: The purpose of this study is to evaluate the efficacy of an artificial intelligence (AI) model designed to identify active bleeding in digital subtraction angiography images for upper gastrointestinal bleeding. METHODS: Angiographic images were retrospectively collected from mesenteric and celiac artery embolization procedures performed between 2018 and 2022. This dataset included images showing both active bleeding and non-bleeding phases from the same patients. The images were labeled as normal versus images that contain active bleeding. A convolutional neural network was trained and validated to automatically classify the images. Algorithm performance was tested in terms of area under the curve, accuracy, sensitivity, specificity, F1 score, positive and negative predictive value. RESULTS: The dataset included 587 pre-labeled images from 142 patients. Of these, 302 were labeled as normal angiogram and 285 as containing active bleeding. The model's performance on the validation cohort was area under the curve 85.0 ± 10.9% (standard deviation) and average classification accuracy 77.43 ± 4.9%. For Youden's index cutoff, sensitivity and specificity were 85.4 ± 9.4% and 81.2 ± 8.6%, respectively. CONCLUSION: In this study, we explored the application of AI in mesenteric and celiac artery angiography for detecting active bleeding. The results of this study show the potential of an AI-based algorithm to accurately classify images with active bleeding. Further studies using a larger dataset are needed to improve accuracy and allow segmentation of the bleeding.

Subject(s)

Angiography, Digital Subtraction , Artificial Intelligence , Celiac Artery , Gastrointestinal Hemorrhage , Mesenteric Arteries , Humans , Celiac Artery/diagnostic imaging , Retrospective Studies , Gastrointestinal Hemorrhage/diagnostic imaging , Gastrointestinal Hemorrhage/therapy , Angiography, Digital Subtraction/methods , Male , Female , Middle Aged , Mesenteric Arteries/diagnostic imaging , Aged , Sensitivity and Specificity , Embolization, Therapeutic/methods , Algorithms , Adult , Radiographic Image Interpretation, Computer-Assisted/methods

17.

Advancing Medical Practice with Artificial Intelligence: ChatGPT in Healthcare.

Tessler, Idit; Wolfovitz, Amit; Livneh, Nir; Gecel, Nir A; Sorin, Vera; Barash, Yiftach; Konen, Eli; Klang, Eyal.

Isr Med Assoc J ; 26(2): 80-85, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38420977

ABSTRACT

BACKGROUND: Advancements in artificial intelligence (AI) and natural language processing (NLP) have led to the development of language models such as ChatGPT. These models have the potential to transform healthcare and medical research. However, understanding their applications and limitations is essential. OBJECTIVES: To present a view of ChatGPT research and to critically assess ChatGPT's role in medical writing and clinical environments. METHODS: We performed a literature review via the PubMed search engine from 20 November 2022, to 23 April 2023. The search terms included ChatGPT, OpenAI, and large language models. We included studies that focused on ChatGPT, explored its use or implications in medicine, and were original research articles. The selected studies were analyzed considering study design, NLP tasks, main findings, and limitations. RESULTS: Our study included 27 articles that examined ChatGPT's performance in various tasks and medical fields. These studies covered knowledge assessment, writing, and analysis tasks. While ChatGPT was found to be useful in tasks such as generating research ideas, aiding clinical reasoning, and streamlining workflows, limitations were also identified. These limitations included inaccuracies, inconsistencies, fictitious information, and limited knowledge, highlighting the need for further improvements. CONCLUSIONS: The review underscores ChatGPT's potential in various medical applications. Yet, it also points to limitations that require careful human oversight and responsible use to improve patient care, education, and decision-making.

Subject(s)

Artificial Intelligence , Medicine , Humans , Educational Status , Language , Delivery of Health Care

18.

The Emergence Phenomenon in Artificial Intelligence: A Warning Sign on the Path to Artificial General Intelligence.

Sorin, Vera; Klang, Eyal.

Isr Med Assoc J ; 26(2): 120-121, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38420985

Subject(s)

Artificial Intelligence , Intelligence , Humans

19.

Bidirectional Encoder Representations from Transformers in Radiology: A Systematic Review of Natural Language Processing Applications.

Gorenstein, Larisa; Konen, Eli; Green, Michael; Klang, Eyal.

J Am Coll Radiol ; 21(6): 914-941, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38302036

ABSTRACT

INTRODUCTION: Bidirectional Encoder Representations from Transformers (BERT), introduced in 2018, has revolutionized natural language processing. Its bidirectional understanding of word context has enabled innovative applications, notably in radiology. This study aimed to assess BERT's influence and applications within the radiologic domain. METHODS: Adhering to Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, we conducted a systematic review, searching PubMed for literature on BERT-based models and natural language processing in radiology from January 1, 2018, to February 12, 2023. The search encompassed keywords related to generative models, transformer architecture, and various imaging techniques. RESULTS: Of 597 results, 30 met our inclusion criteria. The remaining were unrelated to radiology or did not use BERT-based models. The included studies were retrospective, with 14 published in 2022. The primary focus was on classification and information extraction from radiology reports, with x-rays as the prevalent imaging modality. Specific investigations included automatic CT protocol assignment and deep learning applications in chest x-ray interpretation. CONCLUSION: This review underscores the primary application of BERT in radiology for report classification. It also reveals emerging BERT applications for protocol assignment and report generation. As BERT technology advances, we foresee further innovative applications. Its implementation in radiology holds potential for enhancing diagnostic precision, expediting report generation, and optimizing patient care.

Subject(s)

Natural Language Processing , Humans , Radiology , Radiology Information Systems

20.

Prevalence of common and rare ophthalmic findings in adults attending a medical survey institute.

Landau Prat, Daphna; Kapelushnik, Noa; Zloto, Ofira; Leshno, Ari; Klang, Eyal; Sina, Sigal; Segev, Shlomo; Arazi, Mattan; Soudry, Shahar; Ben Simon, Guy J.

Int Ophthalmol ; 44(1): 43, 2024 Feb 09.

Article in English | MEDLINE | ID: mdl-38334834

ABSTRACT

PURPOSE: To examine the ophthalmic data from a large database of people attending a general medical survey institute, and to investigate ophthalmic findings of the eye and its adnexa, including differences in age and sex. METHODS: Retrospective analysis including medical data of all consecutive individuals whose ophthalmic data and the prevalences of ocular pathologies were extracted from a very large database of subjects examined at a single general medical survey institute. RESULTS: Data were derived from 184,589 visits of 3676 patients (mean age 52 years, 68% males). The prevalence of the following eye pathologies were extracted. Eyelids: blepharitis (n = 4885, 13.3%), dermatochalasis (n = 4666, 12.7%), ptosis (n = 677, 1.8%), ectropion (n = 73, 0.2%), and xanthelasma (n = 160, 0.4%). Anterior segment: pinguecula (n = 3368, 9.2%), pterygium (n = 852, 2.3%), and cataract or pseudophakia (n = 9381, 27.1%). Cataract type (percentage of all phakic patients): nuclear sclerosis (n = 8908, 24.2%), posterior subcapsular (n = 846, 2.3%), and capsular anterior (n = 781, 2.1%). Pseudophakia was recorded for 697 patients (4.6%), and posterior subcapsular opacification for 229 (0.6%) patients. Optic nerve head (ONH): peripapillary atrophy (n = 4947, 13.5%), tilted disc (n = 3344, 9.1%), temporal slope (n = 410, 1.1%), ONH notch (n = 61, 0.2%), myelinated nerve fiber layer (n = 94, 0.3%), ONH drusen (n = 37, 0.1%), optic pit (n = 3, 0.0%), and ON coloboma (n = 4, 0.0%). Most pathologies were more common in males except for ONH, and most pathologies demonstrated a higher prevalence with increasing age. CONCLUSIONS: Normal ophthalmic data and the prevalences of ocular pathologies were extracted from a very large database of subjects seen at a single medical survey institute.

Subject(s)

Cataract , Pseudophakia , Adult , Male , Humans , Middle Aged , Female , Prevalence , Retrospective Studies , Optic Nerve

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL