Results 1 - 20 of 143
1.
Vis Comput Ind Biomed Art ; 7(1): 20, 2024 Aug 05.
Article in English | MEDLINE | ID: mdl-39101954

ABSTRACT

Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlations from image-text pairs, such as BLIP-2 and GPT-4, have been intensively investigated. Despite these developments, however, the application of LLMs and VLMs to image quality assessment (IQA), particularly in medical imaging, remains unexplored, even though it would be valuable for objective performance evaluation and could supplement, or even replace, radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels was professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores were converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM was fine-tuned on the CT-IQA dataset to generate quality descriptions; the captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. The results demonstrate the feasibility of assessing image quality with LLMs: the proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that rely solely on images.
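The captioning model described above fuses image and text features through cross-modal attention. As a rough illustration only, the following PyTorch sketch shows one common way such a fusion layer can be built; the module name, dimensions, and toy tensors are assumptions for illustration and do not reproduce the IQAGPT implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text tokens attend to image patch features (illustrative only)."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # Text tokens are queries; image patch embeddings are keys and values.
        fused, _ = self.attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.norm(text_tokens + fused)  # residual connection

# Toy shapes: batch of 2 reports, 16 text tokens, 196 image patches, 768-d features.
text = torch.randn(2, 16, 768)
image = torch.randn(2, 196, 768)
print(CrossModalFusion()(text, image).shape)  # torch.Size([2, 16, 768])
```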

2.
Int Dent J ; 2024 Aug 03.
Article in English | MEDLINE | ID: mdl-39098480

ABSTRACT

INTRODUCTION AND AIMS: In the face of escalating oral cancer rates, the application of large language models such as Generative Pretrained Transformer (GPT)-4 presents a novel pathway for enhancing public awareness about prevention and early detection. This research aims to explore the capabilities and possibilities of GPT-4 in addressing open-ended inquiries in the field of oral cancer. METHODS: Sixty questions with reference answers, covering concepts, causes, treatments, nutrition, and other aspects of oral cancer, were posed to GPT-4 and a customized version of the model, and evaluators from diverse backgrounds assessed the responses. A P value under .05 was considered significant. RESULTS: Analysis revealed that GPT-4 and its adaptation notably excelled in answering open-ended questions, with the majority of responses receiving high scores. Although the median score for standard GPT-4 was marginally better, statistical tests showed no significant difference in capability between the two models (P > .05). Although the evaluators' diverse backgrounds were associated with statistically significant differences in scoring (P < .05), a post hoc test and comprehensive analysis demonstrated that both editions of GPT-4 showed equivalent capability in answering questions concerning oral cancer. CONCLUSIONS: GPT-4 has demonstrated its capability to furnish responses to open-ended inquiries concerning oral cancer. Utilizing this advanced technology to boost public awareness about oral cancer is viable and has considerable potential. When GPT-4 is unable to locate pertinent information, it resorts to its inherent knowledge base or recommends consulting professionals after offering some basic information. Therefore, it cannot supplant the expertise and clinical judgment of surgical oncologists and could be used as an adjunctive evaluation tool.

4.
Am J Emerg Med ; 84: 68-73, 2024 Jul 30.
Article in English | MEDLINE | ID: mdl-39096711

ABSTRACT

INTRODUCTION: GPT-4, GPT-4o and Gemini Advanced, which are among the well-known large language models (LLMs), have the capability to recognize and interpret visual data. A review of the literature reveals only a very limited number of studies examining the ECG performance of GPT-4, and no study examining the success of Gemini or GPT-4o in ECG evaluation. The aim of our study is to evaluate the performance of GPT-4, GPT-4o, and Gemini Advanced in ECG evaluation, assess their usability in the medical field, and compare their accuracy rates in ECG interpretation with those of cardiologists and emergency medicine specialists. METHODS: The study was conducted from May 14, 2024, to June 3, 2024. The book "150 ECG Cases" served as a reference; it contains two sections, daily routine ECGs and more challenging ECGs. For this study, two emergency medicine specialists selected 20 ECG cases from each section, totaling 40 cases. In the next stage, the questions were evaluated by emergency medicine specialists and cardiologists. In the subsequent phase, a diagnostic question was entered daily into GPT-4, GPT-4o, and Gemini Advanced in separate chat interfaces. In the final phase, the responses provided by cardiologists, emergency medicine specialists, GPT-4, GPT-4o, and Gemini Advanced were statistically evaluated across three categories: routine daily ECGs, more challenging ECGs, and all ECGs combined. RESULTS: Cardiologists outperformed GPT-4, GPT-4o, and Gemini Advanced in all three groups. Emergency medicine specialists performed better than GPT-4o on routine daily ECG questions and on the total set of ECG questions (p = 0.003 and p = 0.042, respectively). When comparing GPT-4o with Gemini Advanced and GPT-4, GPT-4o performed better on the total set of ECG questions (p = 0.027 and p < 0.001, respectively). On routine daily ECG questions, GPT-4o also outperformed Gemini Advanced (p = 0.004). Weak agreement was observed in the responses given by GPT-4 (p < 0.001, Fleiss' kappa = 0.265) and Gemini Advanced (p < 0.001, Fleiss' kappa = 0.347), while moderate agreement was observed in the responses given by GPT-4o (p < 0.001, Fleiss' kappa = 0.514). CONCLUSION: While GPT-4o shows promise, especially on more challenging ECG questions, and may have potential as an assistant for ECG evaluation, its performance in routine and overall assessments still lags behind human specialists. The limited accuracy and consistency of GPT-4 and Gemini suggest that their current use in clinical ECG interpretation is risky.

5.
J Med Internet Res ; 26: e52758, 2024 Aug 16.
Article in English | MEDLINE | ID: mdl-39151163

ABSTRACT

BACKGROUND: The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers. OBJECTIVE: We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity for identifying relevant records. METHODS: We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were judged as included. RESULTS: At each layer, both GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required for screening the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively, and both screenings judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively, and both screenings judged all 9 records used for the meta-analysis as included. The sensitivities for the relevant records were in line with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second study. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second study. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while the cases incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria. CONCLUSIONS: Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
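To make the layered screening flow concrete, the sketch below shows how a record could be passed through sequential include/exclude prompts with the OpenAI chat API. The layer prompts, system message, and helper name are illustrative assumptions, not the authors' actual prompts; only the model identifiers come from the abstract.

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

# Hypothetical layer prompts; the real study used prompts tailored to each review.
LAYERS = [
    "Layer 1 (research design): Is this a randomized controlled trial? Answer INCLUDE or EXCLUDE.",
    "Layer 2 (target patients): Does the study enrol patients with bipolar disorder? Answer INCLUDE or EXCLUDE.",
    "Layer 3 (interventions/controls): Does it compare an active treatment with a control? Answer INCLUDE or EXCLUDE.",
]

def screen(title_abstract: str, model: str = "gpt-4-0125-preview") -> bool:
    """Return True only if the record passes every screening layer."""
    for layer_prompt in LAYERS:
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": "You screen records for a systematic review."},
                {"role": "user", "content": f"{layer_prompt}\n\nRecord:\n{title_abstract}"},
            ],
        )
        if "EXCLUDE" in resp.choices[0].message.content.upper():
            return False  # a single failed layer excludes the record
    return True
```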


Subject(s)
Systematic Reviews as Topic; Humans; Language
6.
JAMIA Open ; 7(3): ooae075, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39139700

ABSTRACT

Objectives: Clinical note section identification helps locate relevant information and could be beneficial for downstream tasks such as named entity recognition. However, traditional supervised methods suffer from transferability issues. This study proposes a new framework that uses large language models (LLMs) for section identification to overcome these limitations. Materials and Methods: We framed section identification as question answering and provided the section definitions in free text. We evaluated multiple LLMs off the shelf, without any training. We also fine-tuned our LLMs to investigate how the size and specificity of the fine-tuning dataset impact model performance. Results: GPT-4 achieved the highest F1 score of 0.77. The best open-source model (Tulu2-70b) achieved 0.64 and is on par with GPT-3.5 (ChatGPT). GPT-4 also obtained F1 scores greater than 0.9 for 9 of the 27 (33%) section types and greater than 0.8 for 15 of the 27 (56%) section types. For our fine-tuned models, performance plateaued as the size of the general-domain dataset increased, and adding a reasonable number of section identification examples was beneficial. Discussion: These results indicate that GPT-4 is nearly production-ready for section identification, seemingly possessing both knowledge of note structure and the ability to follow complex instructions, and that the best current open-source LLM is catching up. Conclusion: Our study shows that LLMs are promising for generalizable clinical note section identification. They have the potential to be further improved by adding section identification examples to the fine-tuning dataset.
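As a minimal illustration of framing section identification as question answering over free-text section definitions, the snippet below builds such a prompt. The section definitions, prompt wording, and function name are hypothetical, not the study's materials.

```python
# Hypothetical free-text section definitions; not the study's actual materials.
SECTION_DEFINITIONS = {
    "History of Present Illness": "Narrative of the current problem and its course.",
    "Medications": "Drugs the patient is currently taking, with doses.",
    "Assessment and Plan": "The clinician's impression and intended management.",
}

def build_section_prompt(note_segment: str) -> str:
    """Frame section identification as a question over free-text definitions."""
    defs = "\n".join(f"- {name}: {desc}" for name, desc in SECTION_DEFINITIONS.items())
    return (
        "Given the section definitions below, which section does the text belong to?\n"
        f"{defs}\n\nText:\n{note_segment}\n\nAnswer with the section name only."
    )

print(build_section_prompt("Continue lisinopril 10 mg daily; recheck blood pressure in 2 weeks."))
```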

7.
Ultrasound Med Biol ; 2024 Aug 12.
Article in English | MEDLINE | ID: mdl-39138026

ABSTRACT

OBJECTIVES: To assess the capabilities of large language models (LLMs), including OpenAI (GPT-4.0) and Microsoft Bing (GPT-4), in generating structured reports, Breast Imaging Reporting and Data System (BI-RADS) categories, and management recommendations from free-text breast ultrasound reports. MATERIALS AND METHODS: In this retrospective study, 100 free-text breast ultrasound reports from patients who underwent surgery between January and May 2023 were gathered. The capabilities of OpenAI (GPT-4.0) and Microsoft Bing (GPT-4) to convert these unstructured reports into structured ultrasound reports were studied. The quality of the structured reports, BI-RADS categories, and management recommendations generated by GPT-4.0 and Bing was evaluated by senior radiologists based on the guidelines. RESULTS: OpenAI (GPT-4.0) performed better than Microsoft Bing (GPT-4) in generating structured reports (88% vs. 55%; p < 0.001), assigning correct BI-RADS categories (54% vs. 47%; p = 0.013), and providing reasonable management recommendations (81% vs. 63%; p < 0.001). In predicting benign versus malignant characteristics, GPT-4.0 performed significantly better than Bing (AUC, 0.9317 vs. 0.8177; p < 0.001), while both performed significantly worse than senior radiologists (AUC, 0.9763; both p < 0.001). CONCLUSION: This study highlights the potential of LLMs, specifically OpenAI (GPT-4.0), in converting unstructured breast ultrasound reports into structured ones, offering accurate diagnoses and providing reasonable recommendations.

8.
J Infect Dis ; 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39136574

ABSTRACT

BACKGROUND: Surgical site infection (SSI) is a common and costly complication in spinal surgery. Identifying risk factors and preventive strategies is crucial for reducing SSIs. GPT-4 has evolved from a simple text-based tool into a sophisticated multimodal data expert, invaluable for clinicians. This study explored GPT-4's applications in SSI management across various clinical scenarios. METHODS: GPT-4 was employed in various clinical scenarios related to SSIs in spinal surgery. Researchers designed specific questions for GPT-4 to generate tailored responses. Six evaluators assessed these responses for logic and accuracy using a 5-point Likert scale. Inter-rater consistency was measured with Fleiss' kappa, and radar charts were used to visualize GPT-4's performance. RESULTS: The inter-rater consistency, measured by Fleiss' kappa, ranged from 0.62 to 0.83. The overall average scores for logic and accuracy were 24.27 ± 0.4 and 24.46 ± 0.25 on the 5-point Likert scale. Radar charts showed GPT-4's consistently high performance across the criteria. GPT-4 demonstrated high proficiency in creating personalized treatment plans tailored to diverse clinical patient records and offered interactive patient education. It markedly improved SSI management strategies and infection prediction models and identified emerging research trends. However, it had limitations in fine-tuning antibiotic treatments and customizing patient education materials. CONCLUSIONS: GPT-4 represents a significant advancement in managing SSIs in spinal surgery, promoting patient-centered care and precision medicine. Despite some limitations in antibiotic customization and patient education, GPT-4's continuous learning, attention to data privacy and security, collaboration with healthcare professionals, and patient acceptance of AI recommendations suggest its potential to revolutionize SSI management, although further development and clinical integration are required.
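For readers unfamiliar with the agreement statistic used here, the following sketch shows how Fleiss' kappa can be computed for a subjects-by-raters matrix of Likert scores using statsmodels; the toy ratings are invented and do not correspond to the study data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = GPT-4 responses being rated; columns = the six evaluators (toy 1-5 Likert scores).
ratings = np.array([
    [5, 5, 4, 5, 5, 4],
    [4, 4, 4, 5, 4, 4],
    [5, 5, 5, 5, 4, 5],
    [3, 4, 4, 3, 4, 4],
])

table, _ = aggregate_raters(ratings)   # counts of each category per response
print(round(fleiss_kappa(table), 2))   # agreement beyond chance across evaluators
```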

9.
Future Microbiol ; : 1-10, 2024 Jul 29.
Article in English | MEDLINE | ID: mdl-39069960

ABSTRACT

Aim: To assess the visual accuracy of two large language models (LLMs) in microbial classification. Materials & methods: GPT-4o and Gemini 1.5 Pro were evaluated on distinguishing Gram-positive from Gram-negative bacteria and classifying them as cocci or bacilli, using 80 Gram stain images from a labeled database. Results: GPT-4o achieved 100% accuracy in simultaneously identifying Gram stain and shape for Clostridium perfringens, Pseudomonas aeruginosa and Staphylococcus aureus. Gemini 1.5 Pro showed more variability for the same bacteria (45%, 100% and 95%, respectively). Both LLMs failed to identify both the Gram stain and the bacterial shape of Neisseria gonorrhoeae. Cumulative accuracy plots indicated that GPT-4o consistently performed equally well or better on every identification, except for the shape of Neisseria gonorrhoeae. Conclusion: These results suggest that these LLMs in their unprimed state are not ready to be implemented in clinical practice and highlight the need for more research with larger datasets to improve LLMs' effectiveness in clinical microbiology.


This study looked at how well large language models (LLMs) could identify different types of bacteria using images, without having any specific training in this area beforehand. We tested two LLMs with image analysis capabilities, GPT-4o and Gemini 1.5 Pro. These models were asked to determine whether bacteria were Gram-positive or Gram-negative and whether they were round (cocci) or rod-shaped (bacilli). We used 80 images of four stained bacteria from a labeled database as a reference for this test. GPT-4o was more accurate in identifying both the Gram stain and shape of the bacteria compared with Gemini 1.5 Pro. GPT-4o had excellent accuracy in correctly classifying the Gram stain and bacterial shape of Clostridium perfringens, Pseudomonas aeruginosa and Staphylococcus aureus. Gemini 1.5 Pro had mixed results for these bacteria. However, both models struggled with Neisseria gonorrhoeae, failing to correctly identify its Gram stain and shape. The study shows that while these LLMs have potential, they are not ready to be implemented in clinical practice. More research and larger datasets are needed to improve their accuracy in clinical microbiology.

10.
J Hand Surg Am ; 2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39066762

ABSTRACT

PURPOSE: Exploring the integration of artificial intelligence in clinical settings, this study examined the feasibility of using Generative Pretrained Transformer 4 (GPT-4), a large language model, as a consultation assistant in a hand surgery outpatient clinic. METHODS: The study involved 10 simulated patient scenarios with common hand conditions, in which GPT-4, enhanced through specific prompt engineering techniques, conducted medical history interviews and assisted in diagnostic processes. A panel of expert hand surgeons, each board-certified in hand surgery, evaluated GPT-4's responses on a Likert scale across five criteria, with scores ranging from 1 (lowest) to 5 (highest). RESULTS: GPT-4 achieved an average score of 4.6, reflecting good performance in documenting a medical history, as evaluated by the hand surgeons. CONCLUSIONS: These findings suggest that GPT-4 can document medical histories to the standards of hand surgeons in a simulated environment. The findings indicate potential for future application in patient care, but the actual performance of GPT-4 in real clinical settings remains to be investigated. CLINICAL RELEVANCE: This study provides a preliminary indication that GPT-4 could be a useful consultation assistant in a hand surgery outpatient clinic, but further research is required to explore its reliability and practicality in actual practice.

11.
PNAS Nexus ; 3(6): pgae231, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38948324

ABSTRACT

Large language models (LLMs) demonstrate increasingly human-like abilities across a wide variety of tasks. In this paper, we investigate whether LLMs like ChatGPT can accurately infer the psychological dispositions of social media users and whether their ability to do so varies across socio-demographic groups. Specifically, we test whether GPT-3.5 and GPT-4 can derive the Big Five personality traits from users' Facebook status updates in a zero-shot learning scenario. Our results show an average correlation of r = 0.29 (range = [0.22, 0.33]) between LLM-inferred and self-reported trait scores, a level of accuracy that is similar to that of supervised machine learning models specifically trained to infer personality. Our findings also highlight heterogeneity in the accuracy of personality inferences across different age groups and gender categories: predictions were found to be more accurate for women and younger individuals on several traits, suggesting a potential bias stemming from the underlying training data or differences in online self-expression. The ability of LLMs to infer psychological dispositions from user-generated text has the potential to democratize access to cheap and scalable psychometric assessments for both researchers and practitioners. On the one hand, this democratization might facilitate large-scale research of high ecological validity and spark innovation in personalized services. On the other hand, it also raises ethical concerns regarding user privacy and self-determination, highlighting the need for stringent ethical frameworks and regulation.
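The accuracy metric reported above is a Pearson correlation between LLM-inferred and self-reported trait scores, which can be computed as in the toy example below; the values are invented for illustration only.

```python
from scipy.stats import pearsonr

# One LLM-inferred and one self-reported score per user for a single trait (invented values).
self_reported = [3.2, 4.1, 2.8, 3.9, 4.5, 2.4]
llm_inferred  = [3.0, 3.8, 3.1, 4.2, 4.0, 2.9]

r, p = pearsonr(self_reported, llm_inferred)
print(f"r = {r:.2f}, p = {p:.3f}")
```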

12.
Jpn J Radiol ; 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38954192

ABSTRACT

PURPOSE: Large language models (LLMs) are rapidly advancing and demonstrate high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, there is a lack of comprehensive comparisons between LLMs from different manufacturers. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology Diagnosis Please cases, a monthly diagnostic quiz series for radiology experts. MATERIALS AND METHODS: Clinical histories and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology Diagnosis Please cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro using their respective application programming interfaces. A comparative analysis of diagnostic performance among the three LLMs was conducted using Cochran's Q and post hoc McNemar's tests. RESULTS: The respective diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for the primary diagnosis were 41.0%, 54.0%, and 33.9%, which improved to 49.4%, 62.0%, and 41.0% when any of the top three differential diagnoses was considered. Significant differences in diagnostic performance were observed among all pairs of models. CONCLUSION: Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurate and well-worded descriptions of imaging findings.
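The statistical comparison described above, an omnibus Cochran's Q test followed by pairwise McNemar tests on per-case correct/incorrect outcomes, can be sketched as follows, assuming statsmodels' cochrans_q and mcnemar helpers; the example outcome matrix is invented, not study data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

# Rows = quiz cases; columns = GPT-4o, Claude 3 Opus, Gemini 1.5 Pro (1 = correct, 0 = incorrect).
outcomes = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
])

print(cochrans_q(outcomes))  # omnibus test across the three models

# Post hoc pairwise comparison, e.g. GPT-4o vs. Claude 3 Opus.
a, b = outcomes[:, 0], outcomes[:, 1]
table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
         [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
print(mcnemar(table, exact=True))
```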

13.
Learn Health Syst ; 8(3): e10438, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39036534

ABSTRACT

Introduction: Large language models (LLMs) have a high diagnostic accuracy when they evaluate previously published clinical cases. Methods: We compared the accuracy of GPT-4's differential diagnoses for previously unpublished challenging case scenarios with the diagnostic accuracy for previously published cases. Results: For a set of previously unpublished challenging clinical cases, GPT-4 achieved 61.1% correct in its top 6 diagnoses versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes of more common clinical scenarios, GPT-4 included the correct diagnosis in its top 3 diagnoses 100% of the time versus the previously reported 84.3% for physicians. Conclusions: GPT-4 performs at a level at least as good as, if not better than, that of experienced physicians on highly challenging cases in internal medicine. The extraordinary performance of GPT-4 on diagnosing common clinical scenarios could be explained in part by the fact that these cases were previously published and may have been included in the training dataset for this LLM.

15.
Stud Health Technol Inform ; 315: 290-294, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39049270

ABSTRACT

The MAUDE database is a valuable public resource for understanding malfunctions and adverse events related to medical devices and health IT. However, its extensive data and complex structure pose challenges. To overcome this, we developed an automated analytical pipeline using GPT-4, a cutting-edge large language model. This pipeline is intended to efficiently extract, categorize, and visualize safety events with minimal human annotation. In our analysis of 4,459 colonoscopy reports from MAUDE (2011-2021), events were categorized as operational, human-factor, or device-related. Ishikawa diagrams were used to visualize a subset of events, which were stored in a vector database for easy retrieval and comparison through similarity search. This innovative approach streamlines access to vital safety insights, reduces the workload on human annotators, and holds promise for enhancing the utility of the MAUDE database.
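A minimal sketch of the vector-store step, assuming OpenAI text embeddings and an in-memory cosine-similarity search; the embedding model name, event texts, and helper functions are illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

events = [
    "Device failed to advance during colonoscopy; procedure aborted.",           # device-related
    "Operator selected the wrong preset, leading to incomplete image capture.",  # human factor
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

index = embed(events)  # in-memory stand-in for a vector database

def most_similar(query: str) -> str:
    q = embed([query])[0]
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return events[int(np.argmax(sims))]  # closest previously stored event

print(most_similar("scope would not pass the sigmoid colon"))
```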


Subject(s)
Databases, Factual; Humans; Colonoscopy; Equipment Failure; Natural Language Processing; Patient Safety
16.
Diagnostics (Basel) ; 14(14)2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39061677

ABSTRACT

BACKGROUND AND OBJECTIVES: Integrating large language models (LLMs) such as GPT-4 Turbo into diagnostic imaging faces a significant challenge, with current misdiagnosis rates ranging from 30% to 50%. This study evaluates how prompt engineering and confidence thresholds can improve diagnostic accuracy in neuroradiology. METHODS: We analyzed 751 neuroradiology cases from the American Journal of Neuroradiology using GPT-4 Turbo with customized prompts to improve diagnostic precision. RESULTS: Initially, GPT-4 Turbo achieved a baseline diagnostic accuracy of 55.1%. By reformatting responses to list five diagnostic candidates and applying a 90% confidence threshold, the precision of the primary diagnosis increased to 72.9%, with the candidate list containing the correct diagnosis in 85.9% of cases, reducing the misdiagnosis rate to 14.1%. However, this threshold reduced the number of cases for which the model provided an answer. CONCLUSIONS: Strategic prompt engineering and high confidence thresholds significantly reduce misdiagnoses and improve the precision of LLM-based diagnosis in neuroradiology. More research is needed to optimize these approaches for broader clinical implementation, balancing accuracy and utility.
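One way to realize the candidate-list format with a 90% confidence threshold is sketched below; the prompt wording, JSON schema, and parsing logic are assumptions for illustration, not the study's code.

```python
import json

# Hypothetical prompt asking for five ranked candidates with self-reported confidence.
PROMPT_TEMPLATE = (
    "List the five most likely diagnoses for the case below as a JSON array of "
    '{{"diagnosis": "...", "confidence": 0-100}} objects.\n\nCase:\n{case}'
)

def filter_by_confidence(model_json: str, threshold: int = 90):
    """Keep the top candidate only when its self-reported confidence meets the threshold."""
    candidates = json.loads(model_json)
    top = max(candidates, key=lambda c: c["confidence"])
    return top if top["confidence"] >= threshold else None  # None = withhold an answer

print(PROMPT_TEMPLATE.format(case="67-year-old with a ring-enhancing frontal lobe lesion."))
example = '[{"diagnosis": "glioblastoma", "confidence": 93}, {"diagnosis": "metastasis", "confidence": 60}]'
print(filter_by_confidence(example))  # {'diagnosis': 'glioblastoma', 'confidence': 93}
```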

17.
Sci Rep ; 14(1): 17341, 2024 07 28.
Article in English | MEDLINE | ID: mdl-39069520

ABSTRACT

This study was designed to assess how different prompt engineering techniques, specifically direct prompts, Chain of Thought (CoT), and a modified CoT approach, influence the ability of GPT-3.5 to answer clinical and calculation-based medical questions, particularly those styled like the USMLE Step 1 exam. To achieve this, we analyzed the responses of GPT-3.5 to two distinct sets of questions: a batch of 1000 questions generated by GPT-4, and another set comprising 95 real USMLE Step 1 questions. These questions spanned a range of medical calculations and clinical scenarios across various fields and difficulty levels. Our analysis revealed no significant differences in the accuracy of GPT-3.5's responses when using direct prompts, CoT, or modified CoT methods. For instance, in the USMLE sample, the success rates were 61.7% for direct prompts, 62.8% for CoT, and 57.4% for modified CoT, with a p-value of 0.734. Similar trends were observed in the responses to the GPT-4-generated questions, both clinical and calculation-based, with p-values above 0.05 indicating no significant difference between the prompt types. The conclusion drawn from this study is that CoT prompt engineering does not significantly alter GPT-3.5's effectiveness in handling medical calculations or clinical scenario questions styled like those in the USMLE exams. This finding is important because it suggests that the performance of ChatGPT remains consistent whether a CoT technique or a direct prompt is used. This consistency could be instrumental in simplifying the integration of AI tools like ChatGPT into medical education, enabling healthcare professionals to use these tools easily, without the need for complex prompt engineering.
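For clarity, the difference between the direct and Chain-of-Thought conditions amounts to how the prompt is worded, as in the hypothetical example below; the question and phrasing are invented, not the study's prompts.

```python
# Hypothetical USMLE-style calculation item; only the prompt wording differs between conditions.
QUESTION = (
    "A drug has an elimination half-life of 6 hours. Approximately how long will it take "
    "for about 94% of the drug to be eliminated? (A) 12 h (B) 18 h (C) 24 h (D) 30 h"
)

direct_prompt = f"{QUESTION}\nAnswer with the single best option letter."

cot_prompt = (
    f"{QUESTION}\n"
    "Let's think step by step: state how much drug remains after each half-life, "
    "count the half-lives needed, then give the single best option letter on the last line."
)

print(direct_prompt)
print(cot_prompt)
```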


Subject(s)
Educational Measurement; Humans; Educational Measurement/methods; Licensure, Medical; Clinical Competence; United States; Education, Medical, Undergraduate/methods
18.
JMIR Med Educ ; 10: e51282, 2024 Jul 08.
Article in English | MEDLINE | ID: mdl-38989848

ABSTRACT

Background: Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI's GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse. Objective: This study aimed to compare the medical accuracy of GPT-4 with human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses. Methods: We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio. Results: GPT-4 and human experts displayed comparable efficacy in medical accuracy ("GPT-4 is better" at 132/251, 52.6% vs "Human expert is better" at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience. Conclusions: GPT-4 has shown promising potential in automated medical consultation, with medical accuracy comparable to that of human experts. However, challenges remain, particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions.
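The linguistic measures mentioned above (word count and type-token ratio) are straightforward to compute, as in this toy sketch assuming simple whitespace tokenization; the example answer is invented.

```python
# Toy computation of word count and type-token ratio with whitespace tokenization.
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

answer = "Take the medication with food and check your blood pressure at the same time each day."
print(len(answer.split()), round(type_token_ratio(answer), 2))  # word count, TTR
```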


Subject(s)
Artificial Intelligence; Cardiology; Humans; Cardiology/standards
19.
Oral Maxillofac Surg ; 2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39060850

ABSTRACT

BACKGROUND: This research aimed to investigate the concordance between clinical impressions and histopathologic diagnoses made by clinicians and artificial intelligence tools for odontogenic keratocysts (OKC) and odontogenic tumours (OT) in a New Zealand population from 2008 to 2023. METHODS: Histopathological records from the Oral Pathology Centre, University of Otago (2008-2023) were examined to identify OKCs and OTs. Specimen referral details, histopathologic reports, and clinician differential diagnoses, as well as those provided by ORAD and Chat-GPT4, were documented. Data were analyzed using SPSS, and concordance between provisional and histopathologic diagnoses was ascertained. RESULTS: Of the 34,225 biopsies, 302 and 321 samples were identified as OTs and OKCs, respectively. Concordance rates were 43.2% for clinicians, 45.6% for ORAD, and 41.4% for Chat-GPT4; the corresponding kappa values against the histological diagnosis were 0.23, 0.13, and 0.14. Surgeons achieved a higher concordance rate (47.7%) than non-surgeons (29.82%). The odds ratios of a concordant diagnosis using Chat-GPT4 and ORAD were between 1.4 and 2.8 (p < 0.05). ROC-AUC and PR-AUC were similar between the groups for ameloblastoma (clinicians 0.62/0.42, ORAD 0.58/0.28, Chat-GPT4 0.63/0.37) and for OKC (clinicians 0.64/0.78, ORAD 0.66/0.77, Chat-GPT4 0.60/0.71). CONCLUSION: Clinicians with surgical training achieved a higher concordance rate for OT and OKC. Chat-GPT4 and the Bayesian approach (ORAD) have shown potential for enhancing diagnostic capabilities.
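A small sketch of how concordance and a kappa statistic against the histological diagnosis can be computed, assuming diagnoses coded as label strings and scikit-learn's cohen_kappa_score; the toy labels are invented, not study data.

```python
# Toy concordance and kappa computation; labels are invented, not study data.
from sklearn.metrics import cohen_kappa_score

histology   = ["OKC", "ameloblastoma", "OKC", "OKC", "ameloblastoma"]
provisional = ["OKC", "OKC", "OKC", "odontogenic cyst", "ameloblastoma"]

concordance = sum(h == p for h, p in zip(histology, provisional)) / len(histology)
kappa = cohen_kappa_score(histology, provisional)
print(round(concordance, 2), round(kappa, 2))  # raw agreement vs. chance-corrected agreement
```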
