Results 1 - 20 of 396
1.
Data Brief ; 55: 110591, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38966662

ABSTRACT

This data paper introduces a comprehensive dataset tailored for word sense disambiguation tasks, explicitly focusing on one hundred polysemous words frequently employed in Modern Standard Arabic. The dataset encompasses a diverse set of senses for each word, ranging from 3 to 8, resulting in 367 unique senses. Each word sense is accompanied by ten contextual sentence examples that feature the polysemous word in various contexts, yielding a dataset of 3670 samples. Significantly, the dataset is in Arabic, which is known for its rich morphology, complex syntax, and extensive polysemy. The data was meticulously collected from various web sources spanning domains such as news, medicine, and finance. This inclusivity ensures the dataset's applicability across diverse fields, positioning it as a pivotal resource for Arabic Natural Language Processing (NLP) applications. The data collection timeframe spans from 1 April 2023 to 1 May 2023. The dataset supports comprehensive model learning by including all senses of each frequently used Arabic polysemous term, even rare senses that are infrequently used in real-world contexts, thereby mitigating biases. The dataset also comprises synthetic sentences generated by GPT-3.5-turbo, addressing instances where rare senses lack sufficient real-world data. The collection process involved initial web scraping, followed by manual sorting to distinguish word senses, supplemented by thorough searches by a human expert to fill in missing contextual sentences. Finally, in instances where online data for rare word senses was lacking or insufficient, synthetic samples were generated. Beyond its primary utility in word sense disambiguation, this dataset holds considerable value for scientists and researchers across various domains, extending its relevance to sentiment analysis applications.
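As a rough illustration of how a dataset with this structure (100 words, 3 to 8 senses each, 10 context sentences per sense) might be loaded and sanity-checked, the sketch below uses pandas; the column names and example values are assumptions for illustration, not the published schema.

```python
import pandas as pd

# Hypothetical schema: one row per (word, sense, context sentence).
records = [
    {"word": "عين", "sense_id": 1, "sense_gloss": "eye (organ)",
     "sentence": "...", "source": "web"},
    {"word": "عين", "sense_id": 2, "sense_gloss": "water spring",
     "sentence": "...", "source": "synthetic (GPT-3.5-turbo)"},
]
df = pd.DataFrame(records)

# Sanity checks mirroring the figures reported in the abstract.
senses_per_word = df.groupby("word")["sense_id"].nunique()
print(senses_per_word.describe())  # expected to range from 3 to 8 in the full data
print(len(df))                     # 3670 rows in the full dataset
```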

2.
Jpn J Radiol ; 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38954192

ABSTRACT

PURPOSE: Large language models (LLMs) are rapidly advancing and demonstrating high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, there is a lack of comprehensive comparisons between LLMs from different manufacturers. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology Diagnosis Please Cases, a monthly diagnostic quiz series for radiology experts. MATERIALS AND METHODS: Clinical history and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology Diagnosis Please cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, using their respective application programming interfaces. A comparative analysis of diagnostic performance among these three LLMs was conducted using Cochran's Q and post hoc McNemar's tests. RESULTS: The respective diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for the primary diagnosis were 41.0%, 54.0%, and 33.9%, which improved to 49.4%, 62.0%, and 41.0% when considering the accuracy of any of the top three differential diagnoses. Significant differences in diagnostic performance were observed among all pairs of models. CONCLUSION: Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurate evaluations and clearly worded descriptions of imaging findings.
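A minimal sketch of the statistical comparison described above, assuming per-case binary correctness for each of the three models: Cochran's Q as the omnibus test, followed by pairwise McNemar tests. The correctness arrays are randomly generated placeholders, not the study data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
n_cases = 324
# Columns: GPT-4o, Claude 3 Opus, Gemini 1.5 Pro (1 = correct primary diagnosis).
correct = rng.binomial(1, [0.41, 0.54, 0.34], size=(n_cases, 3))

print(cochrans_q(correct))  # omnibus test across the three models

pairs = [(0, 1, "GPT-4o vs Claude 3 Opus"),
         (1, 2, "Claude 3 Opus vs Gemini 1.5 Pro"),
         (0, 2, "GPT-4o vs Gemini 1.5 Pro")]
for i, j, label in pairs:
    # 2x2 agreement/disagreement table between the two models.
    table = np.array([
        [np.sum((correct[:, i] == 1) & (correct[:, j] == 1)),
         np.sum((correct[:, i] == 1) & (correct[:, j] == 0))],
        [np.sum((correct[:, i] == 0) & (correct[:, j] == 1)),
         np.sum((correct[:, i] == 0) & (correct[:, j] == 0))],
    ])
    print(label, mcnemar(table, exact=True).pvalue)
```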

3.
J Imaging Inform Med ; 2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39020157

ABSTRACT

Structured reports (SR) in radiology have many advantages over free-text (FT) reports, but the wide implementation of SR is still lagging. A powerful tool such as GPT-4 can address this issue. We aimed to employ a web-based reporting tool powered by GPT-4, capable of converting FT to SR, and to evaluate its impact on reporting time and report quality. Thirty abdominopelvic CT scans were reported by two radiologists across two sessions (15 scans each): a control session using traditional reporting methods and an AI-assisted session employing a GPT-4-powered web application to structure free text into structured reports. For each radiologist, the output included 15 control finalized reports, 15 AI-assisted pre-edits, and 15 post-edit finalized reports. Reporting turnaround times were assessed, including total reporting time (TRT) and case reporting time (TATc). Quality assessments were conducted by two blinded radiologists. TRT and TATc decreased with the use of the AI-assisted reporting tool, although the differences were not statistically significant (p > 0.05). Mean TATc for RAD-1 decreased from 00:20:08 to 00:16:30 (hours:minutes:seconds) and TRT decreased from 05:02:00 to 04:08:00. Mean TATc for RAD-2 decreased from 00:12:04 to 00:10:04 and TRT decreased from 03:01:00 to 02:31:00. Quality scores of the finalized reports with and without AI assistance were comparable, with no significant differences. Adjusting the AI-assisted TATc by removing the editing time yielded statistically significant differences compared to the control for both radiologists (p < 0.05). The AI-assisted reporting tool can generate SR while reducing TRT and TATc without sacrificing report quality. Editing time is a potential area for further improvement.
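The paper's web application internals are not described here; the sketch below shows one way free-text findings could be restructured with a GPT-4-class model through the OpenAI Python SDK. The model name, section headings, and prompt wording are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def structure_report(free_text_findings: str) -> str:
    """Return a structured abdominopelvic CT report drafted from free-text findings."""
    prompt = (
        "Reorganize the following abdominopelvic CT findings into a structured "
        "report with these sections: Liver, Biliary, Pancreas, Spleen, "
        "Kidneys/Ureters, Bladder, Bowel, Vasculature, Lymph Nodes, Bones, Other. "
        "Do not add, infer, or omit any finding.\n\n" + free_text_findings
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic restructuring, no creative additions
    )
    return response.choices[0].message.content
```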

4.
JMIR Med Educ ; 10: e51282, 2024 Jul 08.
Article in English | MEDLINE | ID: mdl-38989848

ABSTRACT

Background: Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI's GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse. Objective: This study aimed to compare the medical accuracy of GPT-4 with human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses. Methods: We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio. Results: GPT-4 and human experts displayed comparable efficacy in medical accuracy ("GPT-4 is better" at 132/251, 52.6% vs "Human expert is better" at 119/251, 47.4%). In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience. Conclusions: GPT-4 has shown promising potential in automated medical consultation, with medical accuracy comparable to that of human experts. However, challenges remain, particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions.
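A minimal sketch of the linguistic measures named above, word count and type-token ratio (unique tokens divided by total tokens), using a simple regex tokenizer; the tokenization scheme is an assumption, not the study's exact method.

```python
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"\b\w+\b", text.lower())

def word_count(text: str) -> int:
    return len(tokenize(text))

def type_token_ratio(text: str) -> float:
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

answer = ("Beta-blockers lower heart rate and blood pressure, and beta-blockers "
          "are commonly prescribed after myocardial infarction.")
print(word_count(answer), round(type_token_ratio(answer), 2))
```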


Subjects
Artificial Intelligence; Cardiology; Humans; Cardiology/standards
5.
Neuropathol Appl Neurobiol ; 50(4): e12997, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39010256

ABSTRACT

AIMS: Recent advances in artificial intelligence, particularly with large language models like GPT-4 Vision (GPT-4V), a derivative feature of ChatGPT, have expanded the potential for medical image interpretation. This study evaluates the accuracy of GPT-4V in image classification tasks of histopathological images and compares its performance with a traditional convolutional neural network (CNN). METHODS: We utilised 1520 images, including haematoxylin and eosin staining and tau immunohistochemistry, from patients with various neurodegenerative diseases, such as Alzheimer's disease (AD), progressive supranuclear palsy (PSP) and corticobasal degeneration (CBD). We assessed GPT-4V's performance using multi-step prompts to determine how textual context influences image interpretation. We also employed few-shot learning to improve GPT-4V's diagnostic performance in classifying three specific tau lesions (astrocytic plaques, neuritic plaques and tufted astrocytes) and compared the outcomes with the CNN model YOLOv8. RESULTS: GPT-4V accurately recognised staining techniques and tissue origin but struggled with specific lesion identification. The interpretation of images was notably influenced by the provided textual context, which sometimes led to diagnostic inaccuracies. For instance, when presented with images of the motor cortex, the diagnosis shifted inappropriately from AD to CBD or PSP. However, few-shot learning markedly improved GPT-4V's diagnostic capabilities, enhancing accuracy from 40% in zero-shot learning to 90% with 20-shot learning, matching the performance of YOLOv8, which required 100-shot learning to achieve the same accuracy. CONCLUSIONS: Although GPT-4V faces challenges in independently interpreting histopathological images, few-shot learning significantly improves its performance. This approach is especially promising for neuropathology, where acquiring extensive labelled datasets is often challenging.
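A hedged sketch of few-shot multimodal prompting in the spirit described above: labelled example images precede the query image in a single chat request to a vision-capable model via the OpenAI Python SDK. The model name, file paths, and prompt wording are illustrative assumptions, not the study's code.

```python
import base64
from openai import OpenAI

client = OpenAI()

def image_part(path: str) -> dict:
    """Encode a local image as a data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

few_shot = [("astrocytic_plaque_example.png", "astrocytic plaque"),
            ("neuritic_plaque_example.png", "neuritic plaque"),
            ("tufted_astrocyte_example.png", "tufted astrocyte")]

content = [{"type": "text", "text": "Classify each tau-stained lesion as astrocytic "
                                    "plaque, neuritic plaque, or tufted astrocyte. "
                                    "Labelled examples follow."}]
for path, label in few_shot:
    content += [image_part(path), {"type": "text", "text": f"Label: {label}"}]
content += [{"type": "text", "text": "Now classify this image:"}, image_part("query.png")]

response = client.chat.completions.create(
    model="gpt-4o",  # a current vision-capable model; the study used GPT-4V
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```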


Subjects
Neural Networks, Computer; Neurodegenerative Diseases; Humans; Neurodegenerative Diseases/pathology; Image Interpretation, Computer-Assisted/methods; Alzheimer Disease/pathology
6.
JAMIA Open ; 7(3): ooae060, 2024 Oct.
Article in English | MEDLINE | ID: mdl-38962662

ABSTRACT

Objective: Accurately identifying clinical phenotypes from Electronic Health Records (EHRs) provides additional insights into patients' health, especially when such information is unavailable in structured data. This study evaluates the application of OpenAI's Generative Pre-trained Transformer (GPT)-4 model to identify clinical phenotypes from EHR text in non-small cell lung cancer (NSCLC) patients. The goal was to identify disease stages, treatments, and progression utilizing GPT-4, and compare its performance against GPT-3.5-turbo, Flan-T5-xl, Flan-T5-xxl, Llama-3-8B, and 2 rule-based and machine learning-based methods, namely, scispaCy and medspaCy. Materials and Methods: Phenotypes such as initial cancer stage, initial treatment, evidence of cancer recurrence, and affected organs during recurrence were identified from 13,646 clinical notes for 63 NSCLC patients from Washington University in St. Louis, Missouri. The performance of the GPT-4 model was evaluated against GPT-3.5-turbo, Flan-T5-xxl, Flan-T5-xl, Llama-3-8B, medspaCy, and scispaCy by comparing precision, recall, and micro-F1 scores. Results: GPT-4 achieved higher F1 score, precision, and recall than the Flan-T5-xl, Flan-T5-xxl, Llama-3-8B, medspaCy, and scispaCy models. GPT-3.5-turbo performed similarly to GPT-4. The GPT, Flan-T5, and Llama models were not constrained by explicit rule requirements for contextual pattern recognition. The spaCy-based models relied on predefined patterns, leading to suboptimal performance. Discussion and Conclusion: GPT-4 improves clinical phenotype identification due to its robust pre-training and remarkable pattern recognition capability on the embedded tokens. It demonstrates data-driven effectiveness even with limited context in the input. While rule-based models remain useful for some tasks, GPT models offer improved contextual understanding of the text and robust clinical phenotype extraction.
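A minimal sketch of the evaluation metrics named above, precision, recall, and micro-F1, computed with scikit-learn over extracted phenotype labels; the label values are placeholders.

```python
from sklearn.metrics import precision_recall_fscore_support

# Gold-standard vs model-extracted phenotype labels (placeholder values).
y_true = ["stage III", "chemoradiation", "recurrence", "no recurrence", "stage I"]
y_pred = ["stage III", "chemotherapy",   "recurrence", "no recurrence", "stage II"]

precision, recall, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} micro-F1={micro_f1:.2f}")
```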

8.
Article in English | MEDLINE | ID: mdl-39045737

ABSTRACT

OBJECTIVE: To investigate the consistency of Chatbot Generative Pretrained Transformer (ChatGPT)-4 in the analysis of clinical pictures of common laryngological conditions. STUDY DESIGN: Prospective uncontrolled study. SETTING: Multicenter study. METHODS: Patient history and clinical videolaryngostroboscopic images were presented to ChatGPT-4 for differential diagnoses, management, and treatment(s). ChatGPT-4 responses were assessed by 3 blinded laryngologists with the artificial intelligence performance instrument (AIPI). The complexity of cases and the consistency between practitioners and ChatGPT-4 for interpreting clinical images were evaluated with a 5-point Likert scale. The intraclass correlation coefficient (ICC) was used to measure the strength of interrater agreement. RESULTS: Forty patients with a mean complexity score of 2.60 ± 1.15 were included. The mean consistency score for ChatGPT-4 image interpretation was 2.46 ± 1.42. ChatGPT-4 perfectly analyzed the clinical images in 6 cases (15%; 5/5), while the consistency between GPT-4 and judges was high in 5 cases (12.5%; 4/5). Judges reported an ICC of 0.965 for the consistency score (P = .001). ChatGPT-4 erroneously documented vocal fold irregularity (mass or lesion), glottic insufficiency, and vocal cord paralysis in 21 (52.5%), 2 (5%), and 5 (12.5%) cases, respectively. ChatGPT-4 and practitioners indicated 153 and 63 additional examinations, respectively (P = .001). The ChatGPT-4 primary diagnosis was correct in 20.0% to 25.0% of cases. The clinical image consistency score was significantly associated with the AIPI score (rs = 0.830; P = .001). CONCLUSION: ChatGPT-4 is more effective at primary diagnosis than at image analysis or at selecting the most appropriate additional examinations and treatments.
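A hedged sketch of an intraclass correlation computed over judges' 5-point consistency ratings, using the pingouin package; the ratings are placeholders, and since the abstract does not specify which ICC form was used, the sketch prints all of them.

```python
import pandas as pd
import pingouin as pg

# Long-format table: each case rated by each of three judges (placeholder scores).
ratings = pd.DataFrame({
    "case":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "judge": ["A", "B", "C"] * 4,
    "score": [3, 3, 4, 1, 2, 1, 5, 5, 5, 2, 2, 3],
})
icc = pg.intraclass_corr(data=ratings, targets="case", raters="judge", ratings="score")
print(icc[["Type", "ICC", "pval", "CI95%"]])
```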

9.
JMIR Med Educ ; 10: e52818, 2024 Jul 23.
Article in English | MEDLINE | ID: mdl-39042876

ABSTRACT

BACKGROUND: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. OBJECTIVE: This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students. METHODS: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. RESULTS: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple-choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination, whereas V2 passed it, outperforming 139 (62.1%) of the medical students. CONCLUSIONS: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources.


Subjects
Educational Measurement; Pulmonary Medicine; Students, Medical; Humans; Cross-Sectional Studies; Pulmonary Medicine/education; Students, Medical/statistics & numerical data; Educational Measurement/methods; Education, Medical, Undergraduate/methods; Male; Aptitude; Female; Clinical Competence
10.
PNAS Nexus ; 3(6): pgae231, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38948324

ABSTRACT

Large language models (LLMs) demonstrate increasingly human-like abilities across a wide variety of tasks. In this paper, we investigate whether LLMs like ChatGPT can accurately infer the psychological dispositions of social media users and whether their ability to do so varies across socio-demographic groups. Specifically, we test whether GPT-3.5 and GPT-4 can derive the Big Five personality traits from users' Facebook status updates in a zero-shot learning scenario. Our results show an average correlation of r = 0.29 (range = [0.22, 0.33]) between LLM-inferred and self-reported trait scores, a level of accuracy similar to that of supervised machine learning models specifically trained to infer personality. Our findings also highlight heterogeneity in the accuracy of personality inferences across different age groups and gender categories: predictions were found to be more accurate for women and younger individuals on several traits, suggesting a potential bias stemming from the underlying training data or differences in online self-expression. The ability of LLMs to infer psychological dispositions from user-generated text has the potential to democratize access to cheap and scalable psychometric assessments for both researchers and practitioners. On the one hand, this democratization might facilitate large-scale research of high ecological validity and spark innovation in personalized services. On the other hand, it also raises ethical concerns regarding user privacy and self-determination, highlighting the need for stringent ethical frameworks and regulation.
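A minimal sketch of the accuracy metric described above: Pearson's r between LLM-inferred and self-reported scores, computed per Big Five trait and then averaged. The score arrays are synthetic placeholders, not the study data.

```python
import numpy as np
from scipy.stats import pearsonr

traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
rng = np.random.default_rng(1)

# Placeholder scores: self-reports and noisy LLM inferences for 200 users.
self_reported = {t: rng.normal(3.0, 1.0, 200) for t in traits}
llm_inferred = {t: 0.3 * self_reported[t] + rng.normal(0, 1.0, 200) for t in traits}

per_trait_r = {t: pearsonr(self_reported[t], llm_inferred[t])[0] for t in traits}
for trait, r in per_trait_r.items():
    print(f"{trait}: r = {r:.2f}")
print("mean r =", round(float(np.mean(list(per_trait_r.values()))), 2))
```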

11.
Front Ophthalmol (Lausanne) ; 4: 1387190, 2024.
Article in English | MEDLINE | ID: mdl-38984105

ABSTRACT

Overview: This study evaluates the diagnostic accuracy of a multimodal large language model (LLM), ChatGPT-4, in recognizing glaucoma using color fundus photographs (CFPs) with a benchmark dataset and without prior training or fine-tuning. Methods: The publicly accessible Retinal Fundus Glaucoma Challenge "REFUGE" dataset was utilized for analyses. The input data consisted of the entire 400-image testing set. The task involved classifying fundus images as either 'Likely Glaucomatous' or 'Likely Non-Glaucomatous'. We constructed a confusion matrix to visualize the results of predictions from ChatGPT-4, focusing on the accuracy of binary classification (glaucoma vs non-glaucoma). Results: ChatGPT-4 demonstrated an accuracy of 90% with a 95% confidence interval (CI) of 87.06%-92.94%. The sensitivity was found to be 50% (95% CI: 34.51%-65.49%), while the specificity was 94.44% (95% CI: 92.08%-96.81%). The precision was recorded at 50% (95% CI: 34.51%-65.49%), and the F1 score was 0.50. Conclusion: ChatGPT-4 achieved relatively high diagnostic accuracy without prior fine-tuning on CFPs. Considering the scarcity of data in specialized medical fields, including ophthalmology, the use of advanced AI techniques, such as LLMs, might require less data for training compared to other forms of AI, with potential savings in time and financial resources. It may also pave the way for the development of innovative tools to support specialized medical care, particularly those dependent on multimodal data for diagnosis and follow-up, irrespective of resource constraints.
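A worked check of the reported metrics: with 400 test images, a sensitivity of 50%, a specificity of 94.44%, and a precision of 50% jointly imply a confusion matrix of TP=20, FN=20, FP=20, TN=340 (i.e., 40 glaucomatous and 360 non-glaucomatous images). This split is inferred from the numbers, not stated in the abstract.

```python
# Confusion-matrix counts implied by the reported metrics (inferred, see above).
TP, FN, FP, TN = 20, 20, 20, 340

accuracy = (TP + TN) / (TP + TN + FP + FN)                    # 0.90
sensitivity = TP / (TP + FN)                                  # 0.50
specificity = TN / (TN + FP)                                  # ~0.9444
precision = TP / (TP + FP)                                    # 0.50
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # 0.50

print(accuracy, sensitivity, round(specificity, 4), precision, f1)
```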

12.
Cureus ; 16(6): e61955, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38978943

ABSTRACT

BACKGROUND: In reconstructive plastic surgery, the need for comprehensive research and systematic reviews is apparent due to the field's intricacies, influencing the evidence supporting specific procedures. Although ChatGPT's knowledge is limited to September 2021, its integration into research proves valuable for efficiently identifying knowledge gaps. Therefore, this tool becomes a potent asset, directing researchers to focus on conducting systematic reviews where they are most necessary. METHODS: ChatGPT 3.5 was prompted to generate 10 unpublished, innovative research topics on breast reconstruction surgery, followed by 10 additional subtopics. Results were filtered for systematic reviews in PubMed, and novel ideas were identified. To evaluate ChatGPT's ability to generate improved responses, two additional searches were conducted using search terms generated by ChatGPT. RESULTS: ChatGPT produced 83 novel ideas, an accuracy rate of 83%. The number of novel ideas varied widely across topics: for example, transgender women generated 10 ideas, whereas acellular dermal matrix (ADM) generated five. ChatGPT increased the total number of manuscripts generated by a factor of 2.3, 3.9, and 4.0 in the first, second, and third trials, respectively. While the search results were consistent with our manual searches (95.2% accuracy), the greater number of manuscripts potentially diluted the quality of articles, resulting in fewer novel systematic review ideas. CONCLUSION: ChatGPT proves valuable in identifying gaps in the literature and offering insights into areas lacking research in breast reconstruction surgery. While it displays high sensitivity, refining its specificity is imperative. Prudent practice involves evaluating accomplished work and conducting a comprehensive review of all components involved.

13.
Front Psychiatry ; 15: 1422807, 2024.
Article in English | MEDLINE | ID: mdl-38979501

ABSTRACT

Background: With their unmatched ability to interpret and engage with human language and context, large language models (LLMs) hint at the potential to bridge AI and human cognitive processes. This review explores the current application of LLMs, such as ChatGPT, in the field of psychiatry. Methods: We followed PRISMA guidelines and searched through PubMed, Embase, Web of Science, and Scopus, up until March 2024. Results: From 771 retrieved articles, we included 16 that directly examine LLMs' use in psychiatry. LLMs, particularly ChatGPT and GPT-4, showed diverse applications in clinical reasoning, social media, and education within psychiatry. They can assist in diagnosing mental health issues, managing depression, evaluating suicide risk, and supporting education in the field. However, our review also points out their limitations, such as difficulties with complex cases and potential underestimation of suicide risks. Conclusion: Early research in psychiatry reveals LLMs' versatile applications, from diagnostic support to educational roles. Given the rapid pace of advancement, future investigations are poised to explore the extent to which these models might redefine traditional roles in mental health care.

14.
Learn Health Syst ; 8(3): e10438, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39036534

ABSTRACT

Introduction: Large language models (LLMs) have a high diagnostic accuracy when they evaluate previously published clinical cases. Methods: We compared the accuracy of GPT-4's differential diagnoses for previously unpublished challenging case scenarios with the diagnostic accuracy for previously published cases. Results: For a set of previously unpublished challenging clinical cases, GPT-4 achieved 61.1% correct in its top 6 diagnoses versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes of more common clinical scenarios, GPT-4 included the correct diagnosis in its top 3 diagnoses 100% of the time versus the previously reported 84.3% for physicians. Conclusions: GPT-4 performs at a level at least as good as, if not better than, that of experienced physicians on highly challenging cases in internal medicine. The extraordinary performance of GPT-4 on diagnosing common clinical scenarios could be explained in part by the fact that these cases were previously published and may have been included in the training dataset for this LLM.

15.
Int J Low Extrem Wounds ; : 15347346241262942, 2024 Jun 18.
Article in English | MEDLINE | ID: mdl-38887081

ABSTRACT

In response to the commentary by Daungsupawong and Wiwanitkit (doi: 10.1177/15347346241247914), we authored a reply letter addressing their concerns regarding our previous publication (doi: 10.1177/15347346241236811). Daungsupawong and Wiwanitkit highlighted that while the advancements in generative artificial intelligence (AI) chatbots show promise, several challenges remain in their application to diabetic foot ulcer (DFU) management. In our reply, we emphasized the recent improvements in chatbots' capabilities, particularly in image interpretation and non-English language communication. We posit that these challenges will be overcome in the near future, enabling the clinical implementation of AI chatbots for DFU management.

16.
Health Care Sci ; 3(1): 3-18, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38939167

ABSTRACT

Background: Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT-4 and PaLM2. Small-scale studies to evaluate the diagnostic ability of LLMs have shown promising results, with GPT-4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates. Methods: To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology and vital sign information as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we will refer to as ground truth diagnoses. We then designed carefully-written prompts to get patient diagnostic predictions from the LLMs and compared this to the ground truth diagnoses in a random sample of 1000 patients. Results: Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT-4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1000 randomly selected EHRs, GPT-4 correctly identified 1116 unique diagnoses. Conclusion: The results suggest that artificial intelligence (AI) has the potential when working alongside clinicians to reduce cognitive errors which lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, a significant number of challenges in incorporating AI into health care exist, including ethical, liability and regulatory barriers.

17.
Sensors (Basel) ; 24(12)2024 Jun 17.
Article in English | MEDLINE | ID: mdl-38931689

ABSTRACT

Traffic flow prediction can provide important reference data for traffic managers to maintain traffic order and can also support optimal route selection for personal travel planning. With the development of sensors and data collection technology, large-scale historical road network data can be used effectively, but its high non-linearity makes building effective prediction models challenging. In this regard, this paper proposes a dual-stream cross AGFormer-GPT network with prompt engineering for traffic flow prediction, which integrates traffic occupancy and speed as two prompts into traffic flow via cross-attention and uniquely mines spatial and temporal correlation information through the dual-stream cross structure, effectively combining the advantages of adaptive graph neural networks and large language models to improve prediction accuracy. Experimental results on two PeMS road network data sets verify that the model improves traffic prediction accuracy by about 1.2% across different road networks.
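A hedged sketch of the cross-attention idea described above, in which traffic-flow features attend to occupancy and speed "prompt" features; this is a generic PyTorch illustration of prompt-style cross-attention, not the paper's AGFormer-GPT architecture.

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, flow: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # flow:   (batch, time_steps, d_model) traffic-flow embeddings (queries)
        # prompt: (batch, time_steps, d_model) occupancy/speed embeddings (keys/values)
        attended, _ = self.attn(query=flow, key=prompt, value=prompt)
        return self.norm(flow + attended)  # residual connection around the attention

flow = torch.randn(8, 12, 64)    # e.g. 12 past time steps per sample
prompt = torch.randn(8, 12, 64)  # fused occupancy + speed features
out = PromptCrossAttention()(flow, prompt)
print(out.shape)  # torch.Size([8, 12, 64])
```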

18.
Int J Med Inform ; 189: 105531, 2024 Jun 26.
Article in English | MEDLINE | ID: mdl-38943806

ABSTRACT

BACKGROUND: PRISMA-based literature reviews require meticulous scrutiny of extensive textual data by multiple reviewers, which is associated with considerable human effort. OBJECTIVE: To evaluate the feasibility and reliability of using the GPT-4 API as a complementary reviewer in systematic literature reviews based on the PRISMA framework. METHODOLOGY: A systematic literature review on the role of natural language processing and Large Language Models (LLMs) in automatic patient-trial matching was conducted using human reviewers and an AI-based reviewer (the GPT-4 API). A retrieval-augmented generation (RAG) methodology with LangChain integration was used to process full-text articles. Agreement levels between two human reviewers and the GPT-4 API for abstract screening, and between a single reviewer and the GPT-4 API for full-text parameter extraction, were evaluated. RESULTS: An almost perfect GPT-human reviewer agreement was observed in the abstract screening process (Cohen's kappa > 0.9), with lower agreement in the full-text parameter extraction. CONCLUSION: As GPT-4 performed on a par with human reviewers in abstract screening, we conclude that GPT-4 has exciting potential as a main screening tool for systematic literature reviews, replacing at least one of the human reviewers.
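A minimal sketch of the agreement metric named above: Cohen's kappa between a human reviewer's include/exclude screening decisions and GPT-4's, computed with scikit-learn; the decision lists are placeholders.

```python
from sklearn.metrics import cohen_kappa_score

# Abstract-screening decisions for the same records (placeholder values).
human_reviewer = ["include", "exclude", "exclude", "include", "exclude", "include"]
gpt4_reviewer  = ["include", "exclude", "exclude", "include", "include", "include"]

print(cohen_kappa_score(human_reviewer, gpt4_reviewer))
```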

19.
JMIR Form Res ; 8: e59267, 2024 Jun 26.
Article in English | MEDLINE | ID: mdl-38924784

ABSTRACT

BACKGROUND: The potential of artificial intelligence (AI) chatbots, particularly ChatGPT with GPT-4 (OpenAI), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential diagnosis lists. OBJECTIVE: This study aims to assess the capability of GPT-4 in identifying the final diagnosis from differential-diagnosis lists and to compare its performance with that of physicians for case report series. METHODS: We used a database of differential-diagnosis lists from case reports in the American Journal of Case Reports, corresponding to final diagnoses. These lists were generated by 3 AI systems: GPT-4, Google Bard (currently Google Gemini), and Large Language Models by Meta AI 2 (LLaMA2). The primary outcome was focused on whether GPT-4's evaluations identified the final diagnosis within these lists. None of these AIs received additional medical training or reinforcement. For comparison, 2 independent physicians also evaluated the lists, with any inconsistencies resolved by another physician. RESULTS: The 3 AIs generated a total of 1176 differential diagnosis lists from 392 case descriptions. GPT-4's evaluations concurred with those of the physicians in 966 out of 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating a fair to good agreement between GPT-4 and the physicians' evaluations. CONCLUSIONS: GPT-4 demonstrated a fair to good agreement in identifying the final diagnosis from differential-diagnosis lists, comparable to physicians for case report series. Its ability to compare differential diagnosis lists with final diagnoses suggests its potential to aid clinical decision-making support through diagnostic feedback. While GPT-4 showed a fair to good agreement for evaluation, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.

20.
JMIR AI ; 3: e58342, 2024 05 31.
Article in English | MEDLINE | ID: mdl-38875669

ABSTRACT

BACKGROUND: The integration of artificial intelligence (AI), particularly deep learning models, has transformed the landscape of medical technology, especially in the field of diagnosis using imaging and physiological data. In otolaryngology, AI has shown promise in image classification for middle ear diseases. However, existing models often lack patient-specific data and clinical context, limiting their universal applicability. The emergence of GPT-4 Vision (GPT-4V) has enabled a multimodal diagnostic approach, integrating language processing with image analysis. OBJECTIVE: In this study, we investigated the effectiveness of GPT-4V in diagnosing middle ear diseases by integrating patient-specific data with otoscopic images of the tympanic membrane. METHODS: The design of this study was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images. In total, 305 otoscopic images of 4 middle ear diseases (acute otitis media, middle ear cholesteatoma, chronic otitis media, and otitis media with effusion) were obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The optimized GPT-4V settings were established using prompts and patients' data, and the model created with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images. To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images. RESULTS: The multimodal AI approach achieved an accuracy of 82.1%, which is superior to that of certified pediatricians at 70.6%, but trailing behind that of otolaryngologists at more than 95%. The model's disease-specific accuracy rates were 89.2% for acute otitis media, 76.5% for chronic otitis media, 79.3% for middle ear cholesteatoma, and 85.7% for otitis media with effusion, which highlights the need for disease-specific optimization. Comparisons with physicians revealed promising results, suggesting the potential of GPT-4V to augment clinical decision-making. CONCLUSIONS: Despite its advantages, challenges such as data privacy and ethical considerations must be addressed. Overall, this study underscores the potential of multimodal AI for enhancing diagnostic accuracy and improving patient care in otolaryngology. Further research is warranted to optimize and validate this approach in diverse clinical settings.
