Results 1-20 of 58
1.
Article in English | MEDLINE | ID: mdl-38729387

ABSTRACT

BACKGROUND & AIMS: Large language models including Chat Generative Pretrained Transformers version 4 (ChatGPT4) improve access to artificial intelligence, but their impact on the clinical practice of gastroenterology is undefined. This study compared the accuracy, concordance, and reliability of ChatGPT4 colonoscopy recommendations for colorectal cancer rescreening and surveillance with contemporary guidelines and real-world gastroenterology practice. METHODS: History of present illness, colonoscopy data, and pathology reports from patients undergoing procedures at 2 large academic centers were entered into ChatGPT4 and it was queried for the next recommended colonoscopy follow-up interval. Using the McNemar test and inter-rater reliability, we compared the recommendations made by ChatGPT4 with the actual surveillance interval provided in the endoscopist's procedure report (gastroenterology practice) and the appropriate US Multisociety Task Force (USMSTF) guidance. The latter was generated for each case by an expert panel using the clinical information and guideline documents as reference. RESULTS: Text input of de-identified data into ChatGPT4 from 505 consecutive patients undergoing colonoscopy between January 1 and April 30, 2023, elicited a successful follow-up recommendation in 99.2% of the queries. ChatGPT4 recommendations were in closer agreement with the USMSTF Panel (85.7%) than gastroenterology practice recommendations with the USMSTF Panel (75.4%) (P < .001). Of the 14.3% discordant recommendations between ChatGPT4 and the USMSTF Panel, recommendations were for later screening in 26 (5.1%) and for earlier screening in 44 (8.7%) cases. The inter-rater reliability was good for ChatGPT4 vs USMSTF Panel (Fleiss κ, 0.786; 95% CI, 0.734-0.838; P < .001). CONCLUSIONS: Initial real-world results suggest that ChatGPT4 can define routine colonoscopy screening intervals accurately based on verbatim input of clinical data. Large language models have potential for clinical applications, but further training is needed for broad use.
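The comparison above hinges on a McNemar test for paired proportions plus an inter-rater reliability statistic. A minimal sketch of that kind of analysis, using synthetic agreement labels rather than the study's data, and Cohen's kappa in place of the reported Fleiss κ for simplicity:

```python
# Sketch of a paired concordance analysis in the spirit of the abstract above.
# All data below are synthetic placeholders, not the study's records.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n = 505  # number of colonoscopy cases

# 1 = recommendation agrees with the USMSTF expert panel, 0 = disagrees
gpt_agrees = rng.binomial(1, 0.86, n)
practice_agrees = rng.binomial(1, 0.75, n)

# 2x2 table of paired outcomes: rows = ChatGPT agrees/disagrees,
# columns = practice agrees/disagrees
table = np.array([
    [np.sum((gpt_agrees == 1) & (practice_agrees == 1)),
     np.sum((gpt_agrees == 1) & (practice_agrees == 0))],
    [np.sum((gpt_agrees == 0) & (practice_agrees == 1)),
     np.sum((gpt_agrees == 0) & (practice_agrees == 0))],
])

# McNemar asks whether the discordant cells are balanced, i.e. whether the two
# sources agree with the panel equally often.
result = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi2 = {result.statistic:.2f}, p = {result.pvalue:.4f}")

# Agreement on the recommended interval itself (categorical), as a stand-in
# for the abstract's inter-rater reliability analysis.
intervals = [1, 3, 5, 7, 10]  # years
panel = rng.choice(intervals, n)
gpt = np.where(rng.random(n) < 0.85, panel, rng.choice(intervals, n))
print(f"kappa = {cohen_kappa_score(panel, gpt):.3f}")
```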

2.
Article in English | MEDLINE | ID: mdl-38648756

ABSTRACT

OBJECTIVES: The efficacy of artificial intelligence (AI)-driven chatbots like ChatGPT4 in specialized medical consultations, particularly in rheumatology, remains underexplored. This study compares the proficiency of ChatGPT4's responses with those of practicing rheumatologists to inquiries from patients with systemic lupus erythematosus (SLE). METHODS: In this cross-sectional study, we curated 95 frequently asked questions (FAQs), including 55 in Chinese and 40 in English. Responses to the FAQs from ChatGPT4 and 5 rheumatologists were scored separately by a panel of rheumatologists and a group of patients with SLE across 6 domains (scientific validity, logical consistency, comprehensibility, completeness, satisfaction level, and empathy) on a 0-10 scale (a score of 0 indicates entirely incorrect responses, while 10 indicates accurate and comprehensive answers). RESULTS: Rheumatologists' scoring revealed that ChatGPT4-generated responses outperformed those from rheumatologists in satisfaction level and empathy, with mean differences of 0.537 (95% CI, 0.252-0.823; p < 0.01) and 0.460 (95% CI, 0.227-0.693; p < 0.01), respectively. From the SLE patients' perspective, ChatGPT4-generated responses were comparable to the rheumatologist-provided answers in all 6 domains. Subgroup analysis revealed that ChatGPT4 responses were more logically consistent and complete regardless of language, and exhibited greater comprehensibility, satisfaction, and empathy in Chinese; however, ChatGPT4 responses were less comprehensible for English FAQs. CONCLUSION: ChatGPT4 demonstrated a comparable, and in certain domains possibly better, ability to address FAQs from patients with SLE compared with the answers provided by specialists. This study shows the potential of applying ChatGPT4 to improve consultations for patients with SLE.
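The domain-level comparisons above are mean paired differences with 95% confidence intervals. A sketch of how such a difference could be computed, on simulated 0-10 ratings rather than the study's scores:

```python
# Paired mean-difference with a 95% CI for one scoring domain.
# Scores are synthetic placeholders, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_questions = 95

# 0-10 ratings of chatbot answers and rheumatologist answers for one domain
chatbot_scores = np.clip(rng.normal(8.0, 1.2, n_questions), 0, 10)
physician_scores = np.clip(rng.normal(7.5, 1.2, n_questions), 0, 10)

diff = chatbot_scores - physician_scores
mean_diff = diff.mean()
sem = stats.sem(diff)
ci_low, ci_high = stats.t.interval(0.95, df=n_questions - 1,
                                   loc=mean_diff, scale=sem)
t_stat, p_value = stats.ttest_rel(chatbot_scores, physician_scores)

print(f"mean difference = {mean_diff:.3f} "
      f"(95% CI {ci_low:.3f} to {ci_high:.3f}), p = {p_value:.4f}")
```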

3.
World J Urol ; 42(1): 20, 2024 Jan 10.
Article in English | MEDLINE | ID: mdl-38197996

ABSTRACT

PURPOSE: This study is a comparative analysis of three Large Language Models (LLMs) evaluating their rate of correct answers (RoCA) and the reliability of generated answers on a set of urological knowledge-based questions spanning different levels of complexity. METHODS: ChatGPT-3.5, ChatGPT-4, and Bing AI underwent two testing rounds, with a 48-h gap in between, using the 100 multiple-choice questions from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA). For conflicting responses, an additional consensus round was conducted to establish conclusive answers. RoCA was compared across various question complexities. Ten weeks after the consensus round, a subsequent testing round was conducted to assess potential knowledge gain and improvement in RoCA. RESULTS: Over three testing rounds, ChatGPT-3.5 achieved RoCA scores of 58%, 62%, and 59%. In contrast, ChatGPT-4 achieved RoCA scores of 63%, 77%, and 77%, while Bing AI yielded scores of 81%, 73%, and 77%, respectively. Agreement rates between rounds 1 and 2 were 84% (κ = 0.67, p < 0.001) for ChatGPT-3.5, 74% (κ = 0.40, p < 0.001) for ChatGPT-4, and 76% (κ = 0.33, p < 0.001) for Bing AI. In the consensus round, ChatGPT-4 and Bing AI significantly outperformed ChatGPT-3.5 (77% and 77% vs. 59%, both p = 0.010). All LLMs demonstrated decreasing RoCA scores with increasing question complexity (p < 0.001). In the fourth round, no significant improvement in RoCA was observed in any of the three LLMs. CONCLUSIONS: The performance of the tested LLMs in addressing urological specialist inquiries warrants further refinement. Moreover, their limited response reliability adds to existing concerns about their current utility for educational purposes.
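Round-to-round reliability of this kind is typically summarized as percent agreement plus Cohen's kappa. A small illustration on simulated multiple-choice answers (the option labels, item count, and retention rate are assumptions, not the study's data):

```python
# Test-retest agreement between two answering rounds on the same exam.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(2)
options = list("ABCD")
n_items = 100  # EBU in-service assessment length

round1 = rng.choice(options, n_items)
# Re-answer: keep ~80% of round-1 choices, change the rest at random
round2 = np.where(rng.random(n_items) < 0.8, round1, rng.choice(options, n_items))

agreement = np.mean(round1 == round2)
kappa = cohen_kappa_score(round1, round2)
print(f"agreement = {agreement:.0%}, Cohen's kappa = {kappa:.2f}")
```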


Subjects
Artificial Intelligence, Urology, Humans, Reproducibility of Results, Physical Examination, Language
4.
Br J Clin Pharmacol ; 90(3): 662-674, 2024 03.
Article in English | MEDLINE | ID: mdl-37949663

ABSTRACT

AIMS: The aim of this study was to compare clinical decision-making for benzodiazepine deprescribing between a healthcare provider (HCP) and an artificial intelligence (AI) chatbot, GPT-4 (ChatGPT-4). METHODS: We analysed real-world data from a Croatian cohort of community-dwelling benzodiazepine patients (n = 154) within the EuroAgeism H2020 ESR 7 project. HCPs evaluated the data using pre-established deprescribing criteria to assess benzodiazepine discontinuation potential. The research team devised and tested AI prompts to ensure consistency with HCP judgements. An independent researcher employed ChatGPT-4 with predetermined prompts to simulate clinical decisions for each patient case. Decisions from the human HCPs and ChatGPT-4 were compared for agreement rates and Cohen's kappa. RESULTS: Both the HCPs and ChatGPT-4 identified patients for benzodiazepine deprescribing (96.1% and 89.6%, respectively), with an agreement rate of 95% (κ = .200, P = .012). Agreement on the four deprescribing criteria ranged from 74.7% to 91.3% (lack of indication κ = .352, P < .001; prolonged use κ = .088, P = .280; safety concerns κ = .123, P = .006; incorrect dosage κ = .264, P = .001). Important limitations of GPT-4 responses were identified, including ambiguous outputs in 22.1% of cases, generic answers, and inaccuracies, which pose risks of inappropriate decision-making. CONCLUSIONS: Although AI-HCP agreement is substantial, relying on AI alone poses a risk of unsuitable clinical decision-making. The findings reveal both strengths and areas for improvement in ChatGPT-4's deprescribing recommendations within a real-world sample. Our study underscores the need for additional research on chatbot functionality in patient therapy decision-making, further fostering the advancement of AI for optimal performance.


Subjects
Artificial Intelligence, Deprescriptions, Humans, Benzodiazepines/adverse effects, Clinical Decision-Making, Health Personnel
5.
J Periodontal Res ; 2024 Jul 18.
Article in English | MEDLINE | ID: mdl-39030766

ABSTRACT

INTRODUCTION: The emerging rise of novel computer technologies and automated data analytics has the potential to change the course of dental education. In line with our long-term goal of harnessing the power of AI to augment didactic teaching, the objective of this study was to quantify and compare the accuracy of responses to the annual in-service examination questions of the American Academy of Periodontology (AAP) provided by ChatGPT (GPT-4 and GPT-3.5) and Google Gemini, the three primary large language models (LLMs), against that of human graduate students (control group). METHODS: Under a comparative cross-sectional study design, a corpus of 1312 questions from the annual AAP in-service examinations administered between 2020 and 2023 was presented to the LLMs. Their responses were analyzed using chi-square tests, and performance was juxtaposed with the scores of periodontal residents from the corresponding years as the human control group. Additionally, two sub-analyses were performed: one on the performance of the LLMs on each section of the exam, and one on their performance in answering the most difficult questions. RESULTS: ChatGPT-4 (total average: 79.57%) outperformed all human control groups as well as GPT-3.5 and Google Gemini in all exam years (p < .001). This chatbot showed an accuracy range between 78.80% and 80.98% across the exam years. Gemini consistently recorded superior performance, with scores of 70.65% (p = .01), 73.29% (p = .02), 75.73% (p < .01), and 72.18% (p = .0008) for the 2020-2023 exams, compared with ChatGPT-3.5, which achieved 62.5%, 68.24%, 69.83%, and 59.27%, respectively. With all exam years combined, Google Gemini (72.86%) surpassed the average scores achieved by first- (63.48% ± 31.67) and second-year residents (66.25% ± 31.61), but not that of third-year residents (69.06% ± 30.45). CONCLUSIONS: Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability, while Gemini and ChatGPT-3.5 showed weaker performance. These findings underscore the potential of deploying LLMs as an educational tool in the periodontics and oral implantology domains. However, the current limitations of these models, such as the inability to effectively process image-based inquiries, the propensity for generating inconsistent responses to the same prompts, and high (80% by GPT-4) but not absolute accuracy rates, should be considered. An objective comparison of their capability versus their capacity is required to further develop this field of study.
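The accuracy comparisons above rely on chi-square tests of correct versus incorrect counts between responders. A brief sketch with illustrative counts (not the study's figures):

```python
# Chi-square comparison of correct-answer counts between two responders on the
# same exam year, analogous to the model-vs-model contrasts reported above.
from scipy.stats import chi2_contingency

n_questions = 328          # hypothetical number of items in one exam year
gpt4_correct = 262         # ~80% correct
gemini_correct = 239       # ~73% correct

# Rows: responder; columns: correct vs incorrect
table = [
    [gpt4_correct, n_questions - gpt4_correct],
    [gemini_correct, n_questions - gemini_correct],
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```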

6.
J Med Internet Res ; 26: e54419, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38648636

ABSTRACT

BACKGROUND: Medical documentation plays a crucial role in clinical practice, facilitating accurate patient management and communication among health care professionals. However, inaccuracies in medical notes can lead to miscommunication and diagnostic errors. Additionally, the demands of documentation contribute to physician burnout. Although intermediaries like medical scribes and speech recognition software have been used to ease this burden, they have limitations in terms of accuracy and addressing provider-specific metrics. The integration of ambient artificial intelligence (AI)-powered solutions offers a promising way to improve documentation while fitting seamlessly into existing workflows. OBJECTIVE: This study aims to assess the accuracy and quality of Subjective, Objective, Assessment, and Plan (SOAP) notes generated by ChatGPT-4, an AI model, using established transcripts of History and Physical Examination as the gold standard. We seek to identify potential errors and evaluate the model's performance across different categories. METHODS: We conducted simulated patient-provider encounters representing various ambulatory specialties and transcribed the audio files. Key reportable elements were identified, and ChatGPT-4 was used to generate SOAP notes based on these transcripts. Three versions of each note were created and compared to the gold standard via chart review; errors generated from the comparison were categorized as omissions, incorrect information, or additions. We compared the accuracy of data elements across versions, transcript length, and data categories. Additionally, we assessed note quality using the Physician Documentation Quality Instrument (PDQI) scoring system. RESULTS: Although ChatGPT-4 consistently generated SOAP-style notes, there were, on average, 23.6 errors per clinical case, with errors of omission (86%) being the most common, followed by addition errors (10.5%) and inclusion of incorrect facts (3.2%). There was significant variance between replicates of the same case, with only 52.9% of data elements reported correctly across all 3 replicates. The accuracy of data elements varied across cases, with the highest accuracy observed in the "Objective" section. Consequently, the measure of note quality, assessed by PDQI, demonstrated intra- and intercase variance. Finally, the accuracy of ChatGPT-4 was inversely correlated to both the transcript length (P=.05) and the number of scorable data elements (P=.05). CONCLUSIONS: Our study reveals substantial variability in errors, accuracy, and note quality generated by ChatGPT-4. Errors were not limited to specific sections, and the inconsistency in error types across replicates complicated predictability. Transcript length and data complexity were inversely correlated with note accuracy, raising concerns about the model's effectiveness in handling complex medical cases. The quality and reliability of clinical notes produced by ChatGPT-4 do not meet the standards required for clinical use. Although AI holds promise in health care, caution should be exercised before widespread adoption. Further research is needed to address accuracy, variability, and potential errors. ChatGPT-4, while valuable in various applications, should not be considered a safe alternative to human-generated clinical documentation at this time.
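The reported inverse relationship between transcript length and note accuracy is a correlation analysis. A minimal example on simulated lengths and accuracies (all values invented):

```python
# Correlation between transcript length and note accuracy, mirroring the
# inverse relationship described above. Values are synthetic placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_cases = 14  # simulated ambulatory encounters

transcript_length = rng.integers(2_000, 12_000, n_cases)  # characters
# Accuracy drifts downward with longer transcripts, plus noise
accuracy = 0.95 - 2.5e-5 * transcript_length + rng.normal(0, 0.03, n_cases)

rho, p_value = stats.spearmanr(transcript_length, accuracy)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```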


Subjects
Physician-Patient Relations, Humans, Documentation/methods, Electronic Health Records, Artificial Intelligence
7.
Med Teach ; 46(3): 366-372, 2024 03.
Article in English | MEDLINE | ID: mdl-37839017

ABSTRACT

PURPOSE: ChatGPT-4 is an upgraded version of an artificial intelligence chatbot. The performance of ChatGPT-4 on the United States Medical Licensing Examination (USMLE) has not been independently characterized. We aimed to assess the performance of ChatGPT-4 at responding to USMLE Step 1, Step 2CK, and Step 3 practice questions. METHOD: Practice multiple-choice questions for the USMLE Step 1, Step 2CK, and Step 3 were compiled. Of 376 available questions, 319 (85%) were analyzed by ChatGPT-4 on March 21st, 2023. Our primary outcome was the performance of ChatGPT-4 for the practice USMLE Step 1, Step 2CK, and Step 3 examinations, measured as the proportion of multiple-choice questions answered correctly. Our secondary outcomes were the mean length of questions and responses provided by ChatGPT-4. RESULTS: ChatGPT-4 responded to 319 text-based multiple-choice questions from USMLE practice test material. ChatGPT-4 answered 82 of 93 (88%) questions correctly on USMLE Step 1, 91 of 106 (86%) on Step 2CK, and 108 of 120 (90%) on Step 3. ChatGPT-4 provided explanations for all questions. ChatGPT-4 spent 30.8 ± 11.8 s on average responding to practice questions for USMLE Step 1, 23.0 ± 9.4 s per question for Step 2CK, and 23.1 ± 8.3 s per question for Step 3. The mean length of practice USMLE multiple-choice questions that were answered correctly and incorrectly by ChatGPT-4 was similar (difference = 17.48 characters, SE = 59.75, 95%CI = [-100.09,135.04], t = 0.29, p = 0.77). The mean length of ChatGPT-4's correct responses to practice questions was significantly shorter than the mean length of incorrect responses (difference = 79.58 characters, SE = 35.42, 95%CI = [9.89,149.28], t = 2.25, p = 0.03). CONCLUSIONS: ChatGPT-4 answered a remarkably high proportion of practice questions correctly for USMLE examinations. ChatGPT-4 performed substantially better at USMLE practice questions than previous models of the same AI chatbot.
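The length analysis above amounts to a two-sample t-test on response lengths for correctly versus incorrectly answered items. A sketch with simulated character counts (the group sizes roughly follow the abstract's 281 correct and 38 incorrect answers; the lengths themselves are invented):

```python
# Two-sample comparison of response lengths for correctly vs incorrectly
# answered items. Lengths are simulated, not the study's measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
len_correct = rng.normal(650, 180, 281)    # characters, correct responses
len_incorrect = rng.normal(730, 180, 38)   # characters, incorrect responses

t_stat, p_value = stats.ttest_ind(len_correct, len_incorrect, equal_var=False)
diff = len_correct.mean() - len_incorrect.mean()
print(f"mean difference = {diff:.1f} characters, t = {t_stat:.2f}, p = {p_value:.3f}")
```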


Subjects
Artificial Intelligence, Software, Humans, Licensure, Physical Examination
8.
BMC Med Educ ; 24(1): 536, 2024 May 15.
Article in English | MEDLINE | ID: mdl-38750546

ABSTRACT

BACKGROUND: An illness script is a specific script format geared to represent patient-oriented clinical knowledge organized around enabling conditions, faults (i.e., pathophysiological process), and consequences. Generative artificial intelligence (AI) stands out as an educational aid in continuing medical education. The effortless creation of a typical illness script by generative AI could help the comprehension of key features of diseases and increase diagnostic accuracy. No systematic summary of specific examples of illness scripts has been reported since illness scripts are unique to each physician. OBJECTIVE: This study investigated whether generative AI can generate illness scripts. METHODS: We utilized ChatGPT-4, a generative AI, to create illness scripts for 184 diseases based on the diseases and conditions integral to the National Model Core Curriculum in Japan for undergraduate medical education (2022 revised edition) and primary care specialist training in Japan. Three physicians applied a three-tier grading scale: "A" denotes that the content of each disease's illness script proves sufficient for training medical students, "B" denotes that it is partially lacking but acceptable, and "C" denotes that it is deficient in multiple respects. RESULTS: By leveraging ChatGPT-4, we successfully generated each component of the illness script for 184 diseases without any omission. The illness scripts received "A," "B," and "C" ratings of 56.0% (103/184), 28.3% (52/184), and 15.8% (29/184), respectively. CONCLUSION: Useful illness scripts were seamlessly and instantaneously created using ChatGPT-4 by employing prompts appropriate for medical students. The technology-driven illness script is a valuable tool for introducing medical students to key features of diseases.


Subjects
Clinical Competence, Undergraduate Medical Education, Humans, Japan, Artificial Intelligence, Curriculum, Educational Measurement, Medical Students
9.
BMC Oral Health ; 24(1): 605, 2024 May 24.
Article in English | MEDLINE | ID: mdl-38789962

ABSTRACT

BACKGROUND: The use of artificial intelligence in the field of health sciences is becoming widespread. It is known that patients benefit from artificial intelligence applications on various health issues, especially after the pandemic period. One of the most important issues in this regard is the accuracy of the information provided by artificial intelligence applications. OBJECTIVE: The purpose of this study was to pose the frequently asked questions about dental amalgam, as determined by the United States Food and Drug Administration (FDA), one of these information resources, to Chat Generative Pre-trained Transformer version 4 (ChatGPT-4) and to compare the content of the answers given by the application with the answers of the FDA. METHODS: The questions were directed to ChatGPT-4 on May 8th and May 16th, 2023, and the responses were recorded and compared at the word and meaning levels using ChatGPT. The answers from the FDA webpage were also recorded. ChatGPT-4's and the FDA's responses were compared for content similarity in terms of "Main Idea", "Quality Analysis", "Common Ideas", and "Inconsistent Ideas". RESULTS: ChatGPT-4 provided similar responses at the one-week interval. In comparison with FDA guidance, it provided answers with similar information content for the frequently asked questions. However, although the recommendations regarding amalgam removal shared some general aspects, the two texts were not the same and offered different perspectives on the replacement of fillings. CONCLUSIONS: The findings of this study indicate that ChatGPT-4, an artificial intelligence-based application, provides current and accurate information regarding dental amalgam and its removal to individuals seeking access to such information. Nevertheless, we believe that numerous studies are required to assess the validity and reliability of ChatGPT-4 across diverse subjects.


Subjects
Dental Amalgam, United States Food and Drug Administration, United States, Humans, Artificial Intelligence, Surveys and Questionnaires
10.
J Med Internet Res ; 25: e49771, 2023 12 14.
Article in English | MEDLINE | ID: mdl-38096014

ABSTRACT

BACKGROUND: The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has necessitated reliable and authoritative information for public guidance. The World Health Organization (WHO) has been a primary source of such information, disseminating it through a question and answer format on its official website. Concurrently, ChatGPT 3.5 and 4.0, a deep learning-based natural language generation system, has shown potential in generating diverse text types based on user input. OBJECTIVE: This study evaluates the accuracy of COVID-19 information generated by ChatGPT 3.5 and 4.0, assessing its potential as a supplementary public information source during the pandemic. METHODS: We extracted 487 COVID-19-related questions from the WHO's official website and used ChatGPT 3.5 and 4.0 to generate corresponding answers. These generated answers were then compared against the official WHO responses for evaluation. Two clinical experts scored the generated answers on a scale of 0-5 across 4 dimensions-accuracy, comprehensiveness, relevance, and clarity-with higher scores indicating better performance in each dimension. The WHO responses served as the reference for this assessment. Additionally, we used the BERT (Bidirectional Encoder Representations from Transformers) model to generate similarity scores (0-1) between the generated and official answers, providing a dual validation mechanism. RESULTS: The mean (SD) scores for ChatGPT 3.5-generated answers were 3.47 (0.725) for accuracy, 3.89 (0.719) for comprehensiveness, 4.09 (0.787) for relevance, and 3.49 (0.809) for clarity. For ChatGPT 4.0, the mean (SD) scores were 4.15 (0.780), 4.47 (0.641), 4.56 (0.600), and 4.09 (0.698), respectively. All differences were statistically significant (P<.001), with ChatGPT 4.0 outperforming ChatGPT 3.5. The BERT model verification showed mean (SD) similarity scores of 0.83 (0.07) for ChatGPT 3.5 and 0.85 (0.07) for ChatGPT 4.0 compared with the official WHO answers. CONCLUSIONS: ChatGPT 3.5 and 4.0 can generate accurate and relevant COVID-19 information to a certain extent. However, compared with official WHO responses, gaps and deficiencies exist. Thus, users of ChatGPT 3.5 and 4.0 should also reference other reliable information sources to mitigate potential misinformation risks. Notably, ChatGPT 4.0 outperformed ChatGPT 3.5 across all evaluated dimensions, a finding corroborated by BERT model validation.
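The dual-validation step above scores semantic similarity between generated and official answers with a BERT-based model. A minimal sketch using the sentence-transformers library and the all-MiniLM-L6-v2 model as a stand-in (the model choice and example texts are assumptions, not the study's setup):

```python
# Semantic similarity between a generated answer and an official reference
# answer, as a rough analogue of the BERT-based scoring described above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

official = ("COVID-19 spreads mainly through respiratory droplets and aerosols "
            "produced when an infected person coughs, sneezes, or talks.")
generated = ("The virus is transmitted primarily via droplets and airborne "
             "particles released when infected people speak, cough, or sneeze.")

emb = model.encode([official, generated], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity = {similarity:.2f}")  # values near 1 indicate close agreement
```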


Subjects
COVID-19, Humans, SARS-CoV-2, Pandemics, Language, World Health Organization
11.
Aesthetic Plast Surg ; 2023 Sep 28.
Article in English | MEDLINE | ID: mdl-37770637

ABSTRACT

BACKGROUND: With the increasing integration of artificial intelligence (AI) in health care, AI chatbots like ChatGPT-4 are being used to deliver health information. OBJECTIVES: This study aimed to assess the capability of ChatGPT-4 in answering common questions related to abdominoplasty, evaluating its potential as an adjunctive tool in patient education and preoperative consultation. METHODS: A variety of common questions about abdominoplasty were submitted to ChatGPT-4. These questions were sourced from a question list provided by the American Society of Plastic Surgery to ensure their relevance and comprehensiveness. An experienced plastic surgeon meticulously evaluated the responses generated by ChatGPT-4 in terms of informational depth, response articulation, and competency to determine the proficiency of the AI in providing patient-centered information. RESULTS: The study showed that ChatGPT-4 can give clear answers, making it useful for answering common queries. However, it struggled with personalized advice and sometimes provided incorrect or outdated references. Overall, ChatGPT-4 can effectively share abdominoplasty information, which may help patients better understand the procedure. Despite these positive findings, the AI needs more refinement, especially in providing personalized and accurate information, to fully meet patient education needs in plastic surgery. CONCLUSIONS: Although ChatGPT-4 shows promise as a resource for patient education, continuous improvements and rigorous checks are essential for its beneficial integration into healthcare settings. The study emphasizes the need for further research, particularly focused on improving the personalization and accuracy of AI responses. LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .

12.
J Orthop Traumatol ; 24(1): 61, 2023 Nov 28.
Article in English | MEDLINE | ID: mdl-38015298

ABSTRACT

BACKGROUND: The spread of artificial intelligence (AI) has led to transformative advancements in diverse sectors, including healthcare. Specifically, generative writing systems have shown potential in various applications, but their effectiveness in clinical settings has barely been investigated. In this context, we evaluated the proficiency of ChatGPT-4 in diagnosing gonarthrosis and coxarthrosis and recommending appropriate treatments compared with orthopaedic specialists. METHODS: A retrospective review was conducted using anonymized medical records of 100 patients previously diagnosed with either knee or hip arthrosis. ChatGPT-4 was employed to analyse these historical records, formulating both a diagnosis and potential treatment suggestions. Subsequently, a comparative analysis was conducted to assess the concordance between the AI's conclusions and the original clinical decisions made by the physicians. RESULTS: In diagnostic evaluations, ChatGPT-4 consistently aligned with the conclusions previously drawn by physicians. In terms of treatment recommendations, there was an 83% agreement between the AI and orthopaedic specialists. The therapeutic concordance was verified by a Cohen's kappa coefficient of 0.580 (p < 0.001), indicating a moderate-to-good level of agreement. In recommendations pertaining to surgical treatment, the AI demonstrated a sensitivity and specificity of 78% and 80%, respectively. Multivariable logistic regression demonstrated that reduced quality of life (OR 49.97, p < 0.001) and start-up pain (OR 12.54, p = 0.028) influence ChatGPT-4's recommendation for surgery. CONCLUSION: This study emphasises ChatGPT-4's notable potential in diagnosing conditions such as gonarthrosis and coxarthrosis and in aligning its treatment recommendations with those of orthopaedic specialists. However, it is crucial to acknowledge that AI tools such as ChatGPT-4 are not meant to replace the nuanced expertise and clinical judgment of seasoned orthopaedic surgeons, particularly in complex decision-making scenarios regarding treatment indications. Due to the exploratory nature of the study, further research with larger patient populations and more complex diagnoses is necessary to validate the findings and explore the broader potential of AI in healthcare. LEVEL OF EVIDENCE: Level III evidence.
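The abstract reports odds ratios from multivariable logistic regression plus sensitivity and specificity for the surgery recommendation. A compact sketch of both computations on simulated patient-level data (predictor names, effect sizes, and reference decisions are placeholders):

```python
# Logistic regression with odds ratios, and sensitivity/specificity of an AI
# surgery recommendation against a surgeon's decision. All data are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(5)
n = 100
df = pd.DataFrame({
    "reduced_qol": rng.binomial(1, 0.5, n),   # reduced quality of life
    "startup_pain": rng.binomial(1, 0.5, n),  # start-up pain
})
# Simulated AI recommendation: more likely with either predictor present
logit = -1.5 + 2.5 * df["reduced_qol"] + 1.2 * df["startup_pain"]
df["ai_recommends_surgery"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(df[["reduced_qol", "startup_pain"]])
model = sm.Logit(df["ai_recommends_surgery"], X).fit(disp=False)
print(np.exp(model.params))  # odds ratios per predictor

# Sensitivity/specificity against a (simulated) surgeon decision
surgeon_decides_surgery = rng.binomial(1, 0.5, n)
tn, fp, fn, tp = confusion_matrix(surgeon_decides_surgery,
                                  df["ai_recommends_surgery"]).ravel()
print(f"sensitivity = {tp / (tp + fn):.2f}, specificity = {tn / (tn + fp):.2f}")
```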


Subjects
Hip Osteoarthritis, Knee Osteoarthritis, Humans, Hip Osteoarthritis/diagnosis, Hip Osteoarthritis/therapy, Artificial Intelligence, Quality of Life, Knee Osteoarthritis/diagnosis, Knee Osteoarthritis/therapy, Knee Joint
13.
Diagnosis (Berl) ; 2024 May 07.
Article in English | MEDLINE | ID: mdl-38709491

ABSTRACT

OBJECTIVES: To validate the diagnostic accuracy of the artificial intelligence large language model ChatGPT4 by comparing the diagnosis lists it produced with those of Isabel Pro. METHODS: This study used 201 cases, comparing ChatGPT4 to Isabel Pro. System inputs were identical. Mean Reciprocal Rank (MRR) compares the rank of the correct diagnosis between systems. Isabel Pro ranks by the frequency with which the symptoms appear in the reference dataset; the mechanism ChatGPT4 uses to rank its diagnoses is unknown. A Wilcoxon signed-rank test was used to compare the systems. RESULTS: Both systems produced comprehensive differential diagnosis lists. Isabel Pro's list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced 175 (87.1%) correct diagnoses and ChatGPT4 165 (82.1%). The MRR for ChatGPT4 was 0.428 (rank 2.31) and for Isabel Pro 0.389 (rank 2.57), an average rank of approximately three for each. ChatGPT4 outperformed on Recall at Rank 1, 5, and 10, with Isabel Pro outperforming at 20, 30, and 40. The Wilcoxon signed-rank test failed to reject the null hypothesis and confirmed that the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9%) but only 52 correct DOIs (31.5%). CONCLUSIONS: This study validates the promise of clinical diagnostic decision support systems, including the large language model form of artificial intelligence (AI). Until the issue of hallucination of references, and perhaps diagnoses, is resolved in favor of absolute accuracy, clinicians will make cautious use of large language model systems in diagnosis, if at all.
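Mean Reciprocal Rank and Recall at Rank k are straightforward to compute once the position of the correct diagnosis in each case's differential list is known. A self-contained sketch with invented ranks:

```python
# MRR and recall-at-k for a diagnostic differential list, matching the metrics
# named above. The ranks below are invented examples: each value is the
# position of the correct diagnosis in a system's list, and None means the
# correct diagnosis never appeared.
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, counting missed diagnoses as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of cases whose correct diagnosis appears within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

ranks = [1, 3, 2, None, 1, 5, 12, 2, None, 4]  # one entry per case

print(f"MRR = {mean_reciprocal_rank(ranks):.3f}")
for k in (1, 5, 10, 20):
    print(f"recall@{k} = {recall_at_k(ranks, k):.2f}")
```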

14.
Diagnostics (Basel) ; 14(11)2024 May 22.
Article in English | MEDLINE | ID: mdl-38893609

ABSTRACT

This study assesses the efficacy of ChatGPT-4, an advanced artificial intelligence (AI) language model, in delivering precise and comprehensive answers to inquiries regarding the management of polycystic ovary syndrome (PCOS)-related infertility. The research team, comprising experienced gynecologists, formulated 460 structured queries encompassing a wide range of common and intricate PCOS scenarios. The queries comprised true/false (170), open-ended (165), and multiple-choice (125) questions and were further classified as 'easy', 'moderate', and 'hard'. For true/false questions, ChatGPT-4 achieved a flawless accuracy rate of 100% both initially and upon reassessment after 30 days. In the open-ended category, there was a noteworthy enhancement in accuracy, with scores increasing from 5.53 ± 0.89 initially to 5.88 ± 0.43 at the 30-day mark (p < 0.001). Completeness scores for open-ended queries also improved significantly, rising from 2.35 ± 0.58 to 2.92 ± 0.29 (p < 0.001). In the multiple-choice category, the accuracy score exhibited a minor, nonsignificant decline from 5.96 ± 0.44 to 5.92 ± 0.63 after 30 days (p > 0.05). Completeness scores for multiple-choice questions remained consistent, with initial and 30-day means of 2.98 ± 0.18 and 2.97 ± 0.25, respectively (p > 0.05). ChatGPT-4 demonstrated exceptional performance on true/false queries and significantly improved its handling of open-ended questions over the 30 days. These findings emphasize the potential of AI, particularly ChatGPT-4, in enhancing decision-making support for healthcare professionals managing PCOS-related infertility.

15.
Otolaryngol Head Neck Surg ; 170(6): 1519-1526, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38591726

ABSTRACT

OBJECTIVE: To study the performance of Chatbot Generative Pretrained Transformer-4 (ChatGPT-4) in the management of cases in otolaryngology-head and neck surgery. STUDY DESIGN: Prospective case series. SETTING: Multicenter University Hospitals. METHODS: The history, clinical and physical findings, and additional examinations of adult outpatients consulting in the otolaryngology departments of CHU Saint-Pierre and Dour Medical Center were presented to ChatGPT-4, which was interrogated for differential diagnoses, management, and treatment(s). According to specialty, the ChatGPT-4 responses were assessed by 2 distinct, blinded board-certified otolaryngologists with the Artificial Intelligence Performance Instrument. RESULTS: One hundred cases were presented to ChatGPT-4. ChatGPT-4 indicated a mean of 3.34 (95% confidence interval [CI]: 3.09, 3.59) additional examinations per patient versus 2.10 (95% CI: 1.76, 2.34; P = .001) for the practitioners. There was strong consistency (κ > 0.600) between otolaryngologists and ChatGPT-4 for the indication of upper aerodigestive tract endoscopy, positron emission tomography and computed tomography, audiometry, tympanometry, and psychophysical evaluations. The primary diagnosis was correctly made by ChatGPT-4 in 38% to 86% of cases depending on subspecialty. Additional examinations indicated by ChatGPT-4 were pertinent and necessary in 8% to 31% of cases, while the treatment regimen was pertinent in 12% to 44% of cases. The performance of ChatGPT-4 was not influenced by the human-reported level of difficulty of the clinical cases. CONCLUSION: ChatGPT-4 may be a promising adjunctive tool in otolaryngology, providing extensive documentation about additional examinations, primary and differential diagnoses, and treatments. ChatGPT-4 is more effective at providing a primary diagnosis and less effective in the selection of additional examinations and treatments.


Subjects
Otolaryngology, Humans, Prospective Studies, Male, Female, Adult, Otolaryngologists, Middle Aged, Otorhinolaryngologic Diseases/therapy, Otorhinolaryngologic Diseases/diagnosis, Differential Diagnosis
16.
J Stomatol Oral Maxillofac Surg ; : 101842, 2024 Mar 21.
Article in English | MEDLINE | ID: mdl-38521243

ABSTRACT

The attainment of academic superiority relies heavily upon the accessibility of scholarly resources and the expression of research findings through faultless language usage. Although modern tools, such as the Publish or Perish software program, are proficient in sourcing academic papers based on specific keywords, they often fall short of extracting comprehensive content, including crucial references. The challenge of linguistic precision remains a prominent issue, particularly for research papers composed by non-native English speakers who may encounter word usage errors. This manuscript serves a twofold purpose: firstly, it reassesses the effectiveness of ChatGPT-4 in the context of retrieving pertinent references tailored to specific research topics. Secondly, it introduces a suite of language editing services that are skilled in rectifying word usage errors, ensuring the refined presentation of research outcomes. The article also provides practical guidelines for formulating precise queries to mitigate the risks of erroneous language usage and the inclusion of spurious references. In the ever-evolving realm of academic discourse, leveraging the potential of advanced AI, such as ChatGPT-4, can significantly enhance the quality and impact of scientific publications.

17.
Sci Rep ; 14(1): 9330, 2024 04 23.
Article in English | MEDLINE | ID: mdl-38654011

ABSTRACT

While there are data assessing the test performance of artificial intelligence (AI) chatbots, including the Generative Pre-trained Transformer 4.0 (GPT-4) chatbot (ChatGPT 4.0), data on its diagnostic accuracy for clinical cases are scarce. We assessed the large language model (LLM) ChatGPT 4.0 on its ability to answer questions from the United States Medical Licensing Exam (USMLE) Step 2, as well as its ability to generate a differential diagnosis based on corresponding clinical vignettes from published case reports. A total of 109 Step 2 Clinical Knowledge (CK) practice questions were inputted into both ChatGPT 3.5 and ChatGPT 4.0, asking each model to pick the correct answer. Compared with its previous version, ChatGPT 3.5, we found improved accuracy of ChatGPT 4.0 when answering these questions, from 47.7% to 87.2% (p = 0.035). Utilizing the topics tested on Step 2 CK questions, we additionally found 63 corresponding published case report vignettes and asked ChatGPT 4.0 to come up with its top three differential diagnoses. ChatGPT 4.0 accurately created a shortlist of differential diagnoses in 47 of the 63 case reports (74.6%). We analyzed ChatGPT 4.0's confidence in its diagnosis by asking it to rank its top three differentials from most to least likely. Of the 47 correct diagnoses, 33 were first (70.2%) on the differential diagnosis list, 11 were second (23.4%), and 3 were third (6.4%). Our study shows the continued iterative improvement in ChatGPT's ability to answer standardized USMLE questions accurately and provides insights into ChatGPT's clinical diagnostic accuracy.


Subjects
Artificial Intelligence, Humans, United States, Differential Diagnosis, Medical Licensure, Clinical Competence, Educational Measurement/methods
18.
F1000Res ; 13: 308, 2024.
Article in English | MEDLINE | ID: mdl-38845823

ABSTRACT

Background: Although artificial intelligence technologies are still in their infancy, they evoke both hope and anxiety about the future. This research examines ChatGPT-4, one of the best-known artificial intelligence applications and one claimed to have self-learning capability, in the context of business establishment processes. Methods: The assessment questions in the Entrepreneurship Handbook, published open access by the Small and Medium Enterprises Development Organization of Turkey, which guides entrepreneurial processes in Turkey and shapes perceptions of entrepreneurship, were posed to the artificial intelligence model ChatGPT-4 and analysed in three stages. The model's approach to solving the questions and the answers it provided could then be compared with the entrepreneurship literature. Results: ChatGPT-4, itself a striking example of entrepreneurship, answered the questions posed across the handbook's 16 modules in an original way, analysing them in depth. Conclusion: The model also proved quite creative in developing new alternatives to the correct answers specified in the entrepreneurship handbook. The original aspect of the research is that it is one of the pioneering studies on artificial intelligence and entrepreneurship in the literature.


Subjects
Artificial Intelligence, Entrepreneurship, Humans, Theoretical Models
19.
Int Forum Allergy Rhinol ; 14(6): 1123-1130, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38268099

ABSTRACT

KEY POINTS: Chatbot Generative Pre-trained Transformer (ChatGPT)-4 indicated more than twice as many additional examinations as practitioners in the management of clinical cases in rhinology. The consistency between ChatGPT-4 and practitioners in the indication of additional examinations may vary significantly from one examination to another. ChatGPT-4 proposed a plausible and correct primary diagnosis in 62.5% of cases, while pertinent and necessary additional examinations and a therapeutic regimen were indicated in 7.5%-30.0% and 7.5%-32.5% of cases, respectively. The stability of ChatGPT-4 responses is moderate-to-high. The performance of ChatGPT-4 was not influenced by the human-reported level of difficulty of the clinical cases.


Subjects
Otolaryngology, Humans, Rhinitis/diagnosis, Rhinitis/therapy, Rhinitis/drug therapy, Male, Female
20.
Clin Neurol Neurosurg ; 239: 108238, 2024 04.
Article in English | MEDLINE | ID: mdl-38507989

ABSTRACT

OBJECTIVE: Assess the capabilities of ChatGPT-3.5 and 4 to provide accurate diagnoses, treatment options, and treatment plans for brain tumors in example neuro-oncology cases. METHODS: ChatGPT-3.5 and 4 were provided with twenty example neuro-oncology cases of brain tumors, all selected from medical textbooks. The artificial intelligence programs were asked to give a diagnosis, treatment option, and treatment plan for each of these twenty example cases. Team members first determined in which cases ChatGPT-3.5 and 4 provided the correct diagnosis or treatment plan. Twenty neurosurgeons from the researchers' institution then independently rated the diagnoses, treatment options, and treatment plans provided by both artificial intelligence programs for each of the twenty example cases, on a scale of one to ten, with ten being the highest score. To determine whether the difference between the scores of ChatGPT-3.5 and 4 was statistically significant, a paired t-test was conducted for the average scores given to the programs for each example case. RESULTS: In the initial analysis of correct responses, ChatGPT-4 had an accuracy of 85% for its diagnoses of example brain tumors and an accuracy of 75% for its provided treatment plans, while ChatGPT-3.5 only had an accuracy of 65% and 10%, respectively. The average scores given by the twenty independent neurosurgeons to ChatGPT-4 for its accuracy of diagnosis, provided treatment options, and provided treatment plan were 8.3, 8.4, and 8.5 out of 10, respectively, while ChatGPT-3.5's average scores for these categories of assessment were 5.9, 5.7, and 5.7. These differences in average score are statistically significant on a paired t-test, with a p-value of less than 0.001 for each difference. CONCLUSIONS: ChatGPT-4 demonstrates great promise as a diagnostic tool for brain tumors in neuro-oncology, as attested to by the program's performance in this study and its assessment by surveyed neurosurgeon reviewers.


Subjects
Artificial Intelligence, Brain Neoplasms, Humans, Brain Neoplasms/diagnosis, Brain Neoplasms/therapy, Neurosurgeons, Researchers, Machine Learning