ABSTRACT
BACKGROUND: ChatGPT, a publicly available artificial intelligence large language model, has made sophisticated artificial intelligence technology available on demand. Indeed, use of ChatGPT has already begun to make its way into medical research. However, the medical community has yet to understand the capabilities and ethical considerations of artificial intelligence within this context, and unknowns exist regarding ChatGPT's writing abilities, accuracy, and implications for authorship. OBJECTIVE: We hypothesize that human reviewers and artificial intelligence detection software differ in their ability to correctly identify original published abstracts and artificial intelligence-written abstracts in the subjects of Gynecology and Urogynecology. We also suspect that concrete differences in writing errors, readability, and perceived writing quality exist between original and artificial intelligence-generated text. STUDY DESIGN: Twenty-five articles published in high-impact medical journals and a collection of Gynecology and Urogynecology journals were selected. ChatGPT was prompted to write 25 corresponding artificial intelligence-generated abstracts after being provided with the abstract title, the journal-dictated abstract requirements, and select original results. The original and artificial intelligence-generated abstracts were reviewed by blinded Gynecology and Urogynecology faculty and fellows to identify the writing as original or artificial intelligence-generated. All abstracts were analyzed by the publicly available artificial intelligence detection software GPTZero, Originality, and Copyleaks, and were assessed for writing errors and quality by the artificial intelligence writing assistant Grammarly. RESULTS: A total of 157 reviews of 25 original and 25 artificial intelligence-generated abstracts were conducted by 26 faculty and 4 fellows; 57% of original abstracts and 42.3% of artificial intelligence-generated abstracts were correctly identified, yielding an average accuracy of 49.7% across all abstracts. All 3 artificial intelligence detectors rated the original abstracts as less likely to be artificial intelligence-written than the ChatGPT-generated abstracts (GPTZero, 5.8% vs 73.3%; P<.001; Originality, 10.9% vs 98.1%; P<.001; Copyleaks, 18.6% vs 58.2%; P<.001). The performance of the 3 artificial intelligence detection programs differed when analyzing all abstracts (P=.03), original abstracts (P<.001), and artificial intelligence-generated abstracts (P<.001). Grammarly text analysis identified more writing issues and correctness errors in original than in artificial intelligence abstracts, including a lower Grammarly score, reflecting poorer writing quality (82.3 vs 88.1; P=.006), more total writing issues (19.2 vs 12.8; P<.001), critical issues (5.4 vs 1.3; P<.001), confusing words (0.8 vs 0.1; P=.006), misspelled words (1.7 vs 0.6; P=.02), incorrect determiner use (1.2 vs 0.2; P=.002), and comma misuse (0.3 vs 0.0; P=.005). CONCLUSION: Human reviewers are unable to detect the subtle differences between human and ChatGPT-generated scientific writing because of artificial intelligence's ability to generate tremendously realistic text. Artificial intelligence detection software improves the identification of artificial intelligence-generated writing, but still lacks complete accuracy and requires programmatic improvements to achieve optimal detection.
Given that reviewers and editors may be unable to reliably detect artificial intelligence-generated texts, clear guidelines for reporting artificial intelligence use by authors and implementing artificial intelligence detection software in the review process will need to be established as artificial intelligence chatbots gain more widespread use.
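The group-level comparison reported above (mean detector-assigned AI likelihood for original versus ChatGPT-generated abstracts, each with a P value) can be illustrated with a minimal Python sketch. The detector names come from the abstract, but the score values, group sizes, and the choice of an independent-samples t-test are purely illustrative assumptions, not the study's data or its actual statistical method.

```python
# Illustrative sketch only: comparing AI-detector likelihood scores between
# original and AI-generated abstracts. All score values below are placeholders,
# and the t-test is an assumed choice, not necessarily the study's test.
from scipy import stats

# Hypothetical GPTZero-style outputs: probability (%) that each abstract is AI-written
original_scores = [4.1, 7.2, 2.5, 9.8, 5.3]
ai_generated_scores = [71.0, 80.4, 66.2, 75.9, 73.1]

t_stat, p_value = stats.ttest_ind(original_scores, ai_generated_scores)
print(f"mean original: {sum(original_scores) / len(original_scores):.1f}%")
print(f"mean AI-generated: {sum(ai_generated_scores) / len(ai_generated_scores):.1f}%")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```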
Subjects
Artificial Intelligence, Gynecology, Urology, Humans, Abstracting and Indexing, Periodicals as Topic, Software, Writing, Authorship
ABSTRACT
OBJECTIVES: To evaluate and compare the performance of ChatGPT™ (Open AI®) and Bing AI™ (Microsoft®) for responding to kidney stone treatment-related questions in accordance with the American Urological Association (AUA) guidelines and assess factors such as appropriateness, emphasis on consulting healthcare providers, references, and adherence to guidelines by each chatbot. METHODS: We developed 20 kidney stone evaluation and treatment-related questions based on the AUA Surgical Management of Stones guideline. The questions were posed to the ChatGPT and Bing AI chatbots. We compared their responses utilizing the brief DISCERN tool as well as response appropriateness. RESULTS: ChatGPT significantly outperformed Bing AI for questions 1-3, which evaluate clarity, achievement, and relevance of responses (12.77 ± 1.71 vs. 10.17 ± 3.27; p < 0.01). In contrast, Bing AI always incorporated references, whereas ChatGPT never did. Consequently, the results for questions 4-6, which evaluated the quality of sources, consistently favored Bing AI over ChatGPT (10.8 vs. 4.28; p < 0.01). Notably, neither chatbot offered guidance against guidelines for pre-operative testing. However, recommendations against guidelines were notable for specific scenarios: 30.5% for the treatment of adults with ureteral stones, 52.5% for adults with renal stones, and 20.5% for all patient treatment. CONCLUSIONS: ChatGPT significantly outperformed Bing AI in providing responses with a clear aim, achieving that aim, and offering relevant and appropriate responses based on the AUA surgical stone management guidelines. However, Bing AI provides references, allowing information quality assessment. Additional studies are needed to further evaluate these chatbots and their potential use by clinicians and patients for urologic healthcare-related questions.
Subjects
Kidney Calculi, Humans, Kidney Calculi/therapy, Kidney Calculi/surgery, Practice Guidelines as Topic, Surveys and Questionnaires, Guideline Adherence, Urology
ABSTRACT
BACKGROUND AND OBJECTIVES: The potential impacts of artificial intelligence (AI) chatbots on care for patients with bone sarcoma are poorly understood. Elucidating potential risks and benefits would allow surgeons to define appropriate roles for these tools in clinical care. METHODS: Eleven questions on bone sarcoma diagnosis, treatment, and recovery were submitted to three AI chatbots. Answers were assessed on a 5-point Likert scale for five clinical accuracy metrics: relevance to the question, balance and lack of bias, basis on established data, factual accuracy, and completeness in scope. Responses were quantitatively assessed for empathy and readability. Understandability and actionability were assessed with the Patient Education Materials Assessment Tool (PEMAT). RESULTS: Chatbots scored highly on relevance (4.24) and balance/lack of bias (4.09) but lower on basing responses on established data (3.77), completeness (3.68), and factual accuracy (3.66). Responses generally scored well on understandability (84.30%), while actionability scores were low for questions on treatment (64.58%) and recovery (60.64%). GPT-4 exhibited the highest empathy (4.12). Mean readability scores ranged from 10.28 for diagnosis questions to 11.65 for recovery questions. CONCLUSIONS: While AI chatbots are promising tools, current limitations in factual accuracy and completeness, as well as concerns of inaccessibility to populations with lower health literacy, may significantly limit their clinical utility.
ABSTRACT
BACKGROUND: Assessment of artificial intelligence (AI)-based models across languages is crucial to ensure equitable access and accuracy of information in multilingual contexts. This study aimed to compare AI model performance in English and Arabic for infectious disease queries. METHODS: The study employed the METRICS checklist for the design and reporting of AI-based studies in healthcare. The AI models tested included ChatGPT-3.5, ChatGPT-4, Bing, and Bard. The queries comprised 15 questions on HIV/AIDS, tuberculosis, malaria, COVID-19, and influenza. The AI-generated content was assessed by two bilingual experts using the validated CLEAR tool. RESULTS: Variability was noted when comparing the AI models' performance in English and Arabic for infectious disease queries. English queries showed consistently superior performance, with Bard leading, followed by Bing, ChatGPT-4, and ChatGPT-3.5 (P = .012). The same trend was observed in Arabic, albeit without statistical significance (P = .082). Stratified analysis revealed higher scores for English in most CLEAR components, notably in completeness, accuracy, appropriateness, and relevance, especially with ChatGPT-3.5 and Bard. Across the five infectious disease topics, English outperformed Arabic, except for flu queries in Bing and Bard. The four AI models' performance in English was rated as "excellent", significantly outperforming their "above-average" Arabic counterparts (P = .002). CONCLUSIONS: A disparity in AI model performance was observed between English and Arabic in response to infectious disease queries. This language variation can negatively impact the quality of health content delivered by AI models to native speakers of Arabic. AI developers should address this issue, with the ultimate goal of enhancing health outcomes.
Subjects
Artificial Intelligence, Communicable Diseases, Language, Humans, COVID-19
ABSTRACT
Background: Medical practitioners are increasingly using artificial intelligence (AI) chatbots for easier and faster access to information. To our knowledge, the accuracy and availability of AI-generated chemotherapy protocols have not yet been studied. Methods: Nine simulated cancer patient cases were designed, and the AI chatbots ChatGPT version 3.5 (OpenAI) and Bing (Microsoft) were used to generate chemotherapy protocols for each case. Results: Generated chemotherapy protocols were compared with the original protocols for nine simulated cancer patients. ChatGPT's overall performance was 5 out of 9 on protocol generation, and Bing's was 4 out of 9; this difference was statistically nonsignificant (p = 1). Conclusion: AI chatbots show both potential and limitations in generating chemotherapy protocols. The overall performance is low, and they should be used carefully in oncological practice.
ABSTRACT
INTRODUCTION: The rapid advancement of artificial intelligence (AI), particularly in large language models like ChatGPT and Google's Gemini AI, marks a transformative era in technological innovation. This study explores the potential of AI in ophthalmology, focusing on the capabilities of ChatGPT and Gemini AI. While these models hold promise for medical education and clinical support, their integration requires comprehensive evaluation. This research aims to bridge a gap in the literature by comparing Gemini AI and ChatGPT, assessing their performance against ophthalmology residents using a dataset derived from ophthalmology board exams. METHODS: A dataset comprising 600 questions across 12 subspecialties was curated from Israeli ophthalmology residency exams, encompassing text and image-based formats. Four AI models - ChatGPT-3.5, ChatGPT-4, Gemini, and Gemini Advanced - underwent testing on this dataset. The study includes a comparative analysis with Israeli ophthalmology residents, employing specific metrics for performance assessment. RESULTS: Gemini Advanced demonstrated superior performance with a 66% accuracy rate. Notably, ChatGPT-4 exhibited improvement at 62%, Gemini at 58%, and ChatGPT-3.5 served as the reference at 46%. Comparative analysis with residents offered insights into AI models' performance relative to human-level medical knowledge. Further analysis delved into yearly performance trends, topic-specific variations, and the impact of images on chatbot accuracy. CONCLUSION: The study unveils nuanced AI model capabilities in ophthalmology, emphasizing domain-specific variations. The superior performance of Gemini Advanced indicates significant advancements, while ChatGPT-4's improvement is noteworthy. Both Gemini and ChatGPT-3.5 demonstrated commendable performance. The comparative analysis underscores AI's evolving role as a supplementary tool in medical education. This research contributes vital insights into AI effectiveness in ophthalmology, highlighting areas for refinement. As AI models evolve, targeted improvements can enhance adaptability across subspecialties, making them valuable tools for medical professionals and enriching patient care. KEY MESSAGES: What is known: AI breakthroughs, like ChatGPT and Google's Gemini AI, are reshaping healthcare. In ophthalmology, AI integration has overhauled clinical workflows, particularly in analyzing images for diseases like diabetic retinopathy and glaucoma. What is new: This study presents a pioneering comparison between Gemini AI and ChatGPT, evaluating their performance against ophthalmology residents using a meticulously curated dataset derived from real-world ophthalmology board exams. Notably, Gemini Advanced demonstrates superior performance, showcasing substantial advancements, while the evolution of ChatGPT-4 also merits attention. Both models exhibit commendable capabilities. These findings offer crucial insights into the efficacy of AI in ophthalmology, shedding light on areas ripe for further enhancement and optimization.
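As a rough illustration of how exam-style accuracy figures like those above can be computed and compared, the Python sketch below scores hypothetical model answers against an answer key and contrasts two models' correct/incorrect counts with a chi-square test. The question IDs, answers, and counts are invented for illustration and are not the study's dataset or analysis.

```python
# Illustrative sketch: per-model accuracy on a multiple-choice exam and a
# chi-square comparison of two models. All data below are hypothetical.
from scipy.stats import chi2_contingency

answer_key = {"q1": "B", "q2": "D", "q3": "A"}            # placeholder key
model_answers = {
    "gemini_advanced": {"q1": "B", "q2": "D", "q3": "C"},
    "chatgpt_4":       {"q1": "B", "q2": "A", "q3": "A"},
}

def accuracy(answers: dict, key: dict) -> float:
    correct = sum(answers[q] == key[q] for q in key)
    return correct / len(key)

for model, answers in model_answers.items():
    print(f"{model}: {accuracy(answers, answer_key):.0%}")

# Correct vs. incorrect counts out of 600 questions (illustrative, roughly
# matching the 66% and 62% accuracy rates reported above)
table = [[396, 204],   # hypothetical Gemini Advanced counts
         [372, 228]]   # hypothetical ChatGPT-4 counts
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```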
ABSTRACT
INTRODUCTION AND OBJECTIVES: Autoimmune liver diseases (AILDs) are rare and require precise evaluation, which is often challenging for medical providers. Chatbots are innovative solutions to assist healthcare professionals in clinical management. In our study, ten liver specialists systematically evaluated four chatbots to determine their utility as clinical decision support tools in the field of AILDs. MATERIALS AND METHODS: We constructed a 56-question questionnaire focusing on AILD evaluation, diagnosis, and management of Autoimmune Hepatitis (AIH), Primary Biliary Cholangitis (PBC), and Primary Sclerosing Cholangitis (PSC). Four chatbots (ChatGPT 3.5, Claude, Microsoft Copilot, and Google Bard) were presented with the questions in their free tiers in December 2023. Responses underwent critical evaluation by ten liver specialists using a standardized 1 to 10 Likert scale. The analysis included mean scores, the number of highest-rated replies, and the identification of common shortcomings in the chatbots' performance. RESULTS: Among the assessed chatbots, specialists rated Claude highest with a mean score of 7.37 (SD = 1.91), followed by ChatGPT (7.17, SD = 1.89), Microsoft Copilot (6.63, SD = 2.10), and Google Bard (6.52, SD = 2.27). Claude also excelled with 27 best-rated replies, outperforming ChatGPT (20), while Microsoft Copilot and Google Bard lagged with only 6 and 9, respectively. Common deficiencies included a tendency to list details rather than provide specific advice, limited dosing options, inaccuracies for pregnant patients, insufficient recent data, over-reliance on CT and MRI imaging, and inadequate discussion of off-label use and fibrates in PBC treatment. Notably, internet access for Microsoft Copilot and Google Bard did not enhance precision compared to pre-trained models. CONCLUSIONS: Chatbots hold promise in AILD support, but our study underscores key areas for improvement. Refinement is needed in providing specific advice, accuracy, and focused up-to-date information. Addressing these shortcomings is essential for enhancing the utility of chatbots in AILD management, guiding future development, and ensuring their effectiveness as clinical decision-support tools.
ABSTRACT
Mental health chatbots (MHCBs) designed to support individuals in coping with mental health issues are rapidly advancing. Currently, these MHCBs are predominantly used in commercial rather than clinical contexts, but this might change soon. The question is whether this use is ethically desirable. This paper addresses a critical yet understudied concern: assuming that MHCBs cannot have genuine emotions, how this assumption may affect psychotherapy, and consequently the quality of treatment outcomes. We argue that if MHCBs lack emotions, they cannot have genuine (affective) empathy or utilise countertransference. Consequently, this gives reason to worry that MHCBs are (a) more liable to harm and (b) less likely to benefit patients than human therapists. We discuss some responses to this worry and conclude that further empirical research is necessary to determine whether these worries are valid. We conclude that, even if these worries are valid, it does not mean that we should never use MHCBs. By discussing the broader ethical debate on the clinical use of chatbots, we point towards how further research can help us establish ethical boundaries for how we should use mental health chatbots.
Subjects
Emotions, Empathy, Psychotherapists, Psychotherapy, Humans, Psychotherapy/ethics, Countertransference, Mental Disorders/therapy, Mental Health, Psychological Adaptation
ABSTRACT
The chapter provides an in-depth analysis of digital therapeutics (DTx) as a revolutionary approach to managing major depressive disorder (MDD). It discusses the evolution and definition of DTx, their application across various medical fields, regulatory considerations, and their benefits and limitations. This chapter extensively covers DTx for MDD, including smartphone applications, virtual reality interventions, cognitive-behavioral therapy (CBT) platforms, artificial intelligence (AI) and chatbot therapies, biofeedback, wearable technologies, and serious games. It evaluates the effectiveness of these digital interventions, comparing them with traditional treatments and examining patient perspectives, compliance, and engagement. The integration of DTx into clinical practice is also explored, along with the challenges and barriers to their adoption, such as technological limitations, data privacy concerns, ethical considerations, reimbursement issues, and the need for improved digital literacy. This chapter concludes by looking at the future direction of DTx in mental healthcare, emphasizing the need for personalized treatment plans, integration with emerging modalities, and the expansion of access to these innovative solutions globally.
Subjects
Artificial Intelligence, Cognitive Behavioral Therapy, Major Depressive Disorder, Humans, Major Depressive Disorder/therapy, Cognitive Behavioral Therapy/methods, Telemedicine/trends, Mobile Applications, Psychological Biofeedback/methods, Smartphone, Wearable Electronic Devices, Video Games
ABSTRACT
While digital innovation in health was already rapidly evolving, the COVID-19 pandemic has accelerated the generation of digital technology tools, such as chatbots, to help increase access to crucial health information and services to those who were cut off or had limited contact with health services. This theme issue titled "Chatbots and COVID-19" presents articles from researchers and practitioners across the globe, describing the development, implementation, and evaluation of chatbots designed to address a wide range of health concerns and services. In this editorial, we present some of the key challenges and lessons learned arising from the content of this theme issue. Most notably, we note that a stronger evidence base is needed to ensure that chatbots and other digital tools are developed to best serve the needs of population health.
Subjects
COVID-19, Population Health, Humans, Pandemics/prevention & control, Digital Technology
ABSTRACT
BACKGROUND: In recent years, there has been an increase in the use of conversational agents for health promotion and service delivery. To date, health professionals' views on the use of this technology have received limited attention in the literature. OBJECTIVE: The purpose of this study was to gain a better understanding of how health professionals view the use of conversational agents for health care. METHODS: Physicians, nurses, and regulated mental health professionals were recruited using various web-based methods. Participants were interviewed individually using the Zoom (Zoom Video Communications, Inc) videoconferencing platform. Interview questions focused on the potential benefits and risks of using conversational agents for health care, as well as the best way to integrate conversational agents into the health care system. Interviews were transcribed verbatim and uploaded to NVivo (version 12; QSR International, Inc) for thematic analysis. RESULTS: A total of 24 health professionals participated in the study (19 women, 5 men; mean age 42.75, SD 10.71 years). Participants said that the use of conversational agents for health care could have certain benefits, such as greater access to care for patients or clients and workload support for health professionals. They also discussed potential drawbacks, such as an added burden on health professionals (eg, program familiarization) and the limited capabilities of these programs. Participants said that conversational agents could be used for routine or basic tasks, such as screening and assessment, providing information and education, and supporting individuals between appointments. They also said that health professionals should have some oversight in terms of the development and implementation of these programs. CONCLUSIONS: The results of this study provide insight into health professionals' views on the use of conversational agents for health care, particularly in terms of the benefits and drawbacks of these programs and how they should be integrated into the health care system. These collective findings offer useful information and guidance to stakeholders who have an interest in the development and implementation of this technology.
Subjects
Health Personnel, Qualitative Research, Humans, Female, Male, Adult, Health Personnel/psychology, Middle Aged, Communication, Attitude of Health Personnel, Videoconferencing, Delivery of Health Care
ABSTRACT
This editorial explores the evolving and transformative role of large language models (LLMs) in enhancing the capabilities of virtual assistants (VAs) in the health care domain, highlighting recent research on the performance of VAs and LLMs in health care information sharing. Focusing on recent research, this editorial unveils the marked improvement in the accuracy and clinical relevance of responses from LLMs, such as GPT-4, compared to current VAs, especially in addressing complex health care inquiries, like those related to postpartum depression. The improved accuracy and clinical relevance with LLMs mark a paradigm shift in digital health tools and VAs. Furthermore, such LLM applications have the potential to dynamically adapt and be integrated into existing VA platforms, offering cost-effective, scalable, and inclusive solutions. These suggest a significant increase in the applicable range of VA applications, as well as the increased value, risk, and impact in health care, moving toward more personalized digital health ecosystems. However, alongside these advancements, it is necessary to develop and adhere to ethical guidelines, regulatory frameworks, governance principles, and privacy and safety measures. We need a robust interdisciplinary collaboration to navigate the complexities of safely and effectively integrating LLMs into health care applications, ensuring that these emerging technologies align with the diverse needs and ethical considerations of the health care domain.
Subjects
Postpartum Depression, Ecosystem, Female, Humans, Digital Health, Information Dissemination, Language
ABSTRACT
BACKGROUND: The COVID-19 pandemic has profoundly affected mental health, leading to an increased prevalence of depression and insomnia. Currently, artificial intelligence (AI) and deep learning have thoroughly transformed health care-related mobile apps, offered more effective mental health support, and alleviated the psychological stress that may have emerged during the pandemic. Early reviews outlined the use of mobile apps for dealing with depression and insomnia separately. However, there is now an urgent need for a systematic evaluation of mobile apps that address both depression and insomnia to reveal new applications and research gaps. OBJECTIVE: This study aims to systematically review and evaluate mobile apps targeting depression and insomnia, highlighting their features, effectiveness, and gaps in the current research. METHODS: We systematically searched PubMed, Scopus, and Web of Science for peer-reviewed journal articles published between 2017 and 2023. The inclusion criteria were studies that (1) focused on mobile apps addressing both depression and insomnia, (2) involved young people or adult participants, and (3) provided data on treatment efficacy. Data extraction was independently conducted by 2 reviewers. Title and abstract screening, as well as full-text screening, were completed in duplicate. Data were extracted by a single reviewer and verified by a second reviewer, and risk of bias assessments were completed accordingly. RESULTS: Of the initial 383 studies we found, 365 were excluded after title, abstract screening, and removal of duplicates. Eventually, 18 full-text articles met our criteria and underwent full-text screening. The analysis revealed that mobile apps related to depression and insomnia were primarily utilized for early detection, assessment, and screening (n=5 studies); counseling and psychological support (n=3 studies); and cognitive behavioral therapy (CBT; n=10 studies). Among the 10 studies related to depression, our findings showed that chatbots demonstrated significant advantages in improving depression symptoms, a promising development in the field. Additionally, 2 studies evaluated the effectiveness of mobile apps as alternative interventions for depression and sleep, further expanding the potential applications of this technology. CONCLUSIONS: The integration of AI and deep learning into mobile apps, particularly chatbots, is a promising avenue for personalized mental health support. Through innovative features, such as early detection, assessment, counseling, and CBT, these apps significantly contribute toward improving sleep quality and addressing depression. The reviewed chatbots leveraged advanced technologies, including natural language processing, machine learning, and generative dialog, to provide intelligent and autonomous interactions. Compared with traditional face-to-face therapies, their feasibility, acceptability, and potential efficacy highlight their user-friendly, cost-effective, and accessible nature with the aim of enhancing sleep and mental health outcomes.
Subjects
Depression, Mobile Applications, Sleep Initiation and Maintenance Disorders, Humans, Artificial Intelligence, COVID-19/psychology, Depression/complications, Depression/therapy, Sleep Initiation and Maintenance Disorders/etiology, Sleep Initiation and Maintenance Disorders/psychology, Sleep Initiation and Maintenance Disorders/therapy, Telemedicine
ABSTRACT
BACKGROUND: Family health history (FHx) is an important predictor of a person's genetic risk but is not collected by many adults in the United States. OBJECTIVE: This study aims to test and compare the usability, engagement, and report usefulness of 2 web-based methods to collect FHx. METHODS: This mixed methods study compared FHx data collection using a flow-based chatbot (KIT; the curious interactive test) and a form-based method. KIT's design was optimized to reduce user burden. We recruited and randomized individuals from 2 crowdsourced platforms to 1 of the 2 FHx methods. All participants were asked to complete a questionnaire to assess the method's usability, the usefulness of a report summarizing their experience, user-desired chatbot enhancements, and general user experience. Engagement was studied using log data collected by the methods. We used qualitative findings from analyzing free-text comments to supplement the primary quantitative results. RESULTS: Participants randomized to KIT reported higher usability than those randomized to the form, with a mean System Usability Scale score of 80.2 versus 61.9 (P<.001), respectively. The engagement analysis reflected design differences in the onboarding process. KIT users spent less time entering FHx information and reported more conditions than form users (mean 5.90 vs 7.97 min; P=.04; and mean 7.8 vs 10.1 conditions; P=.04). Both KIT and form users somewhat agreed that the report was useful (Likert scale ratings of 4.08 and 4.29, respectively). Among desired enhancements, personalization was the highest-rated feature (188/205, 91.7% rated medium- to high-priority). Qualitative analyses revealed positive and negative characteristics of both KIT and the form-based method. Among respondents randomized to KIT, most indicated it was easy to use and navigate and that they could respond to and understand user prompts. Negative comments addressed KIT's personality, conversational pace, and ability to manage errors. For KIT and form respondents, qualitative results revealed common themes, including a desire for more information about conditions and a mutual appreciation for the multiple-choice button response format. Respondents also said they wanted to report health information beyond KIT's prompts (eg, personal health history) and for KIT to provide more personalized responses. CONCLUSIONS: We showed that KIT provided a usable way to collect FHx. We also identified design considerations to improve chatbot-based FHx data collection: First, the final report summarizing the FHx collection experience should be enhanced to provide more value for patients. Second, the onboarding chatbot prompt may impact data quality and should be carefully considered. Finally, we highlighted several areas that could be improved by moving from a flow-based chatbot to a large language model implementation strategy.
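The usability comparison above rests on the System Usability Scale (SUS), a standard 10-item instrument scored on a 0-100 range. The sketch below shows the conventional SUS scoring rule applied to a made-up set of responses; it is illustrative only and does not use the study's data.

```python
# Illustrative sketch: conventional System Usability Scale (SUS) scoring.
# Odd-numbered items contribute (score - 1), even-numbered items contribute
# (5 - score); the sum is multiplied by 2.5 to yield a 0-100 score.
def sus_score(responses):
    assert len(responses) == 10, "SUS has exactly ten 1-5 items"
    total = 0
    for item, score in enumerate(responses, start=1):
        total += (score - 1) if item % 2 == 1 else (5 - score)
    return total * 2.5

example_respondent = [5, 2, 4, 1, 5, 2, 5, 1, 4, 2]   # placeholder responses
print(sus_score(example_respondent))  # 87.5 for this hypothetical respondent
```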
Subjects
Medical History Taking, Humans, Female, Male, Medical History Taking/methods, Medical History Taking/statistics & numerical data, Adult, Family Health, Surveys and Questionnaires, Middle Aged, Data Collection/methods, Internet
ABSTRACT
BACKGROUND: Artificial intelligence and the language models derived from it, such as ChatGPT, offer immense possibilities, particularly in the field of medicine. It is already evident that ChatGPT can provide adequate and, in some cases, expert-level responses to health-related queries and advice for patients. However, it is currently unknown how patients perceive these capabilities, whether they can derive benefit from them, and whether potential risks, such as harmful suggestions, are detected by patients. OBJECTIVE: This study aims to clarify whether patients can get useful and safe health care advice from an artificial intelligence chatbot assistant. METHODS: This cross-sectional study was conducted using 100 publicly available health-related questions from 5 medical specialties (trauma, general surgery, otolaryngology, pediatrics, and internal medicine) from a web-based platform for patients. Responses generated by ChatGPT-4.0 and by an expert panel (EP) of experienced physicians from the aforementioned web-based platform were packed into 10 sets consisting of 10 questions each. The blinded evaluation was carried out by patients regarding empathy and usefulness (assessed through the question: "Would this answer have helped you?") on a scale from 1 to 5. As a control, evaluation was also performed by 3 physicians in each respective medical specialty, who were additionally asked about the potential harm of the response and its correctness. RESULTS: In total, 200 sets of questions were submitted by 64 patients (mean 45.7, SD 15.9 years; 29/64, 45.3% male), resulting in 2000 evaluated answers each from ChatGPT and the EP. ChatGPT scored higher in terms of empathy (4.18 vs 2.7; P<.001) and usefulness (4.04 vs 2.98; P<.001). Subanalysis revealed a small bias, with female participants giving higher empathy ratings than male participants (4.46 vs 4.14; P=.049). Ratings of ChatGPT were high regardless of the participant's age. The same highly significant results were observed in the evaluations by the respective specialist physicians. ChatGPT significantly outperformed the EP in correctness (4.51 vs 3.55; P<.001). Specialists rated the usefulness (3.93 vs 4.59) and correctness (4.62 vs 3.84) significantly lower in potentially harmful responses from ChatGPT (P<.001). This was not the case among patients. CONCLUSIONS: The results indicate that ChatGPT is capable of supporting patients in health-related queries better than physicians, at least in terms of written advice through a web-based platform. In this study, ChatGPT's responses had a lower percentage of potentially harmful advice than the web-based EP. However, it is crucial to note that this finding is based on a specific study design and may not generalize to all health care settings. Alarmingly, patients are not able to independently recognize these potential dangers.
Subjects
Physician-Patient Relations, Humans, Cross-Sectional Studies, Male, Female, Adult, Middle Aged, Artificial Intelligence, Physicians/psychology, Internet, Empathy, Surveys and Questionnaires
ABSTRACT
BACKGROUND: Artificial intelligence is increasingly being applied to many workflows. Large language models (LLMs) are publicly accessible platforms trained to understand, interact with, and produce human-readable text; their ability to deliver relevant and reliable information is also of particular interest to health care providers and patients. Hematopoietic stem cell transplantation (HSCT) is a complex medical field requiring extensive knowledge, background, and training to practice successfully and can be challenging for the nonspecialist audience to comprehend. OBJECTIVE: We aimed to test the applicability of 3 prominent LLMs, namely ChatGPT-3.5 (OpenAI), ChatGPT-4 (OpenAI), and Bard (Google AI), in guiding nonspecialist health care professionals and advising patients seeking information regarding HSCT. METHODS: We submitted 72 open-ended HSCT-related questions of variable difficulty to the LLMs and rated their responses based on consistency (defined as replicability of the response), response veracity, language comprehensibility, specificity to the topic, and the presence of hallucinations. We then rechallenged the 2 best performing chatbots by resubmitting the most difficult questions and prompting them to respond as if communicating with either a health care professional or a patient and to provide verifiable sources of information. Responses were then rerated with the additional criterion of language appropriateness, defined as language adaptation for the intended audience. RESULTS: ChatGPT-4 outperformed both ChatGPT-3.5 and Bard in terms of response consistency (66/72, 92%; 54/72, 75%; and 63/69, 91%, respectively; P=.007), response veracity (58/66, 88%; 40/54, 74%; and 16/63, 25%, respectively; P<.001), and specificity to the topic (60/66, 91%; 43/54, 80%; and 27/63, 43%, respectively; P<.001). Both ChatGPT-4 and ChatGPT-3.5 outperformed Bard in terms of language comprehensibility (64/66, 97%; 53/54, 98%; and 52/63, 83%, respectively; P=.002). All displayed episodes of hallucinations. ChatGPT-3.5 and ChatGPT-4 were then rechallenged with a prompt to adapt their language to the audience and to provide sources of information, and responses were rated. ChatGPT-3.5 showed better ability to adapt its language to a nonmedical audience than ChatGPT-4 (17/21, 81% and 10/22, 46%, respectively; P=.03); however, both failed to consistently provide correct and up-to-date information resources, reporting either out-of-date materials, incorrect URLs, or unfocused references, making their output not verifiable by the reader. CONCLUSIONS: In conclusion, despite LLMs' potential capability in confronting challenging medical topics such as HSCT, the presence of mistakes and lack of clear references make them not yet appropriate for routine, unsupervised clinical use or patient counseling. Implementation of LLMs' ability to access and reference current and updated websites and research papers, as well as development of LLMs trained on specialized domain knowledge data sets, may offer potential solutions for their future clinical application.
Subjects
Health Personnel, Hematopoietic Stem Cell Transplantation, Humans, Artificial Intelligence, Language
ABSTRACT
BACKGROUND: Recent surveys indicate that 48% of consumers actively use generative artificial intelligence (AI) for health-related inquiries. Despite widespread adoption and the potential to improve health care access, scant research examines the performance of AI chatbot responses regarding emergency care advice. OBJECTIVE: We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from 4 free-access AI chatbots, for 10 different serious and benign emergency conditions. METHODS: We created 10 emergency care questions that we fed into the free-access versions of ChatGPT 3.5 (OpenAI), Google Bard, Bing AI Chat (Microsoft), and Claude AI (Anthropic) on November 26, 2023. Each response was graded by 5 board-certified emergency medicine (EM) faculty across 8 domains: percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. We determined the correct, complete response to the 10 questions from reputable and scholarly emergency medical references; these were compiled by an EM resident physician. For the readability of the chatbot responses, we used the Flesch-Kincaid Grade Level of each response, obtained from the readability statistics embedded in Microsoft Word. Differences between chatbots were determined by the chi-square test. RESULTS: Each of the 4 chatbots' responses to the 10 clinical questions were scored across 8 domains by 5 EM faculty, for 400 assessments for each chatbot. Together, the 4 chatbots had the best performance in clarity and understandability (both 85%), intermediate performance in accuracy and completeness (both 50%), and poor performance (10%) for source relevance and reliability (mostly unreported). Chatbots contained dangerous information in 5% to 35% of responses, with no statistical difference between chatbots on this metric (P=.24). ChatGPT, Google Bard, and Claude AI had similar performances across 6 out of 8 domains. Only Bing AI performed better, with more identified or relevant sources (40%; the others had 0%-10%). The Flesch-Kincaid reading level was grade 7.7-8.9 for all chatbots except ChatGPT, at grade 10.8; all were too advanced for average emergency patients. Responses included both dangerous (eg, starting cardiopulmonary resuscitation with no pulse check) and generally inappropriate advice (eg, loosening the collar to improve breathing without evidence of airway compromise). CONCLUSIONS: AI chatbots, though ubiquitous, have significant deficiencies in EM patient advice, despite relatively consistent performance. Information for when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of misinformation. Sources are not generally provided. Patients who use AI to guide health care decisions assume potential risks. AI chatbots for health should be subject to further research, refinement, and regulation. We strongly recommend proper medical consultation to prevent potential adverse outcomes.
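The Flesch-Kincaid Grade Level cited above follows a published formula based on sentence length and syllable density. The Python sketch below applies that formula with a crude vowel-group syllable heuristic; it is for illustration only and will not exactly match the Microsoft Word readability statistics used in the study.

```python
# Illustrative sketch of the Flesch-Kincaid Grade Level:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
# The syllable counter is a rough vowel-group heuristic, not a dictionary-based one.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = "Call emergency services if chest pain lasts more than a few minutes."
print(f"approximate grade level: {flesch_kincaid_grade(sample):.1f}")
```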
Subjects
Emergency Medical Services, Humans, Emergency Medical Services/methods, Emergency Medical Services/standards, Prospective Studies, Artificial Intelligence, Language
ABSTRACT
BACKGROUND: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement. OBJECTIVE: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors' and residents' ratings, and specific question types. METHODS: A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications. RESULTS: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Distinctions among question types were significant, particularly for the GPT-4 mean scores in completeness across emergency, internal, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, beneficial, and completeness dimensions. CONCLUSIONS: ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.
Subjects
Clinical Decision-Making, Humans, Artificial Intelligence
ABSTRACT
BACKGROUND: Artificial intelligence chatbots such as ChatGPT (OpenAI) have garnered excitement about their potential for delegating writing tasks ordinarily performed by humans. Many of these tasks (eg, writing recommendation letters) have social and professional ramifications, making the potential social biases in ChatGPT's underlying language model a serious concern. OBJECTIVE: Three preregistered studies used the text analysis program Linguistic Inquiry and Word Count to investigate gender bias in recommendation letters written by ChatGPT in human-use sessions (N=1400 total letters). METHODS: We conducted analyses using 22 existing Linguistic Inquiry and Word Count dictionaries, as well as 6 newly created dictionaries based on systematic reviews of gender bias in recommendation letters, to compare recommendation letters generated for the 200 most historically popular "male" and "female" names in the United States. Study 1 used 3 different letter-writing prompts intended to accentuate professional accomplishments associated with male stereotypes, female stereotypes, or neither. Study 2 examined whether lengthening each of the 3 prompts while holding the between-prompt word count constant modified the extent of bias. Study 3 examined the variability within letters generated for the same name and prompts. We hypothesized that when prompted with gender-stereotyped professional accomplishments, ChatGPT would evidence gender-based language differences replicating those found in systematic reviews of human-written recommendation letters (eg, more affiliative, social, and communal language for female names; more agentic and skill-based language for male names). RESULTS: Significant differences in language between letters generated for female versus male names were observed across all prompts, including the prompt hypothesized to be neutral, and across nearly all language categories tested. Historically female names received significantly more social referents (5/6, 83% of prompts), communal or doubt-raising language (4/6, 67% of prompts), personal pronouns (4/6, 67% of prompts), and clout language (5/6, 83% of prompts). Contradicting the study hypotheses, some gender differences (eg, achievement language and agentic language) were significant in both the hypothesized and nonhypothesized directions, depending on the prompt. Heteroscedasticity between male and female names was observed in multiple linguistic categories, with greater variance for historically female names than for historically male names. CONCLUSIONS: ChatGPT reproduces many gender-based language biases that have been reliably identified in investigations of human-written reference letters, although these differences vary across prompts and language categories. Caution should be taken when using ChatGPT for tasks that have social consequences, such as reference letter writing. The methods developed in this study may be useful for ongoing bias testing among progressive generations of chatbots across a range of real-world scenarios. TRIAL REGISTRATION: OSF Registries osf.io/ztv96; https://osf.io/ztv96.
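To make the dictionary-based text analysis above concrete, the sketch below counts category words in a recommendation letter, in the spirit of Linguistic Inquiry and Word Count. The category word lists are illustrative stand-ins; they are not the proprietary LIWC dictionaries or the custom dictionaries developed in the study.

```python
# Illustrative sketch: dictionary-based category counts over a letter, in the
# spirit of LIWC-style analysis. The word lists are invented for illustration.
import re

dictionaries = {
    "communal": {"supportive", "helpful", "warm", "caring", "team"},
    "agentic":  {"driven", "independent", "assertive", "confident", "leader"},
}

def category_counts(letter: str) -> dict:
    words = re.findall(r"[a-z']+", letter.lower())
    return {category: sum(word in vocabulary for word in words)
            for category, vocabulary in dictionaries.items()}

letter = "She is a warm, supportive, and caring member of the team."
print(category_counts(letter))  # {'communal': 4, 'agentic': 0}
```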
Subjects
Artificial Intelligence, Sexism, Humans, Female, Male, Systematic Reviews as Topic, Language, Linguistics
ABSTRACT
This paper examines the use of text message (SMS) interventions for health-related behavioral support. It first outlines the historical progress in SMS intervention research publications and the variety of funding from US government agencies. A narrative review follows, highlighting the effectiveness of SMS interventions in key health areas, such as physical activity, diet and weight loss, mental health, and substance use, based on published meta-analyses. It then outlines advantages of text messaging compared to other digital modalities, including the real-time capability to collect information and deliver microdoses of intervention support. Crucial design elements are proposed to optimize effectiveness and longitudinal engagement across communication strategies, psychological foundations, and behavior change tactics. We then discuss advanced functionalities, such as the potential for generative artificial intelligence to improve user interaction. Finally, major challenges to implementation are highlighted, including the absence of a dedicated commercial platform, privacy and security concerns with SMS technology, difficulties integrating SMS interventions with medical informatics systems, and concerns about user engagement. Proposed solutions aim to facilitate the broader application and effectiveness of SMS interventions. Our hope is that these insights can assist researchers and practitioners in using SMS interventions to improve health outcomes and reduce disparities.