ABSTRACT
PURPOSE: The purpose of this study was to evaluate the knowledge level of the Chat Generative Pre-trained Transformer (ChatGPT), Bard, and Bing artificial intelligence (AI) chatbots regarding ocular inflammation, uveal diseases, and treatment modalities, and to compare their performance with one another. MATERIALS AND METHODS: Thirty-six questions related to ocular inflammation, uveal diseases, and treatment modalities were posed to the ChatGPT, Bard, and Bing AI chatbots, and correct and incorrect responses were recorded. Accuracy rates were compared using the Chi-squared test. RESULTS: ChatGPT answered 52.8% of the questions correctly, Bard 38.9%, and Bing 44.4%. All three chatbots provided identical responses to 20 (55.6%) of the questions, of which 45% were correct and 55% incorrect. No significant difference in accuracy was observed among the three AI chatbots (P = 0.654). CONCLUSION: AI chatbots should be developed to provide widespread access to accurate information about ocular inflammation, uveal diseases, and treatment modalities. Future research could explore ways to enhance the performance of these chatbots.
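A minimal sketch of the kind of comparison reported above: the Python snippet below runs a Chi-squared test of independence on a chatbot-by-correctness contingency table. The counts are reconstructed from the reported accuracy rates (36 questions per chatbot) purely for illustration, so the resulting p-value will not necessarily match the published one.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts reconstructed from the reported accuracy rates
# (36 questions each: 52.8%, 38.9%, 44.4% correct) -- illustrative only.
table = [
    [19, 17],  # ChatGPT: correct, incorrect
    [14, 22],  # Bard
    [16, 20],  # Bing
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
```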
ABSTRACT
Aim: To evaluate the appropriateness and readability of the medical knowledge provided by ChatGPT-3.5 and Google Bard, artificial-intelligence-powered conversational search engines, regarding surgical treatment for glaucoma. Methods: In this retrospective, cross-sectional study, 25 common questions related to the surgical management of glaucoma were posed to ChatGPT-3.5 and Google Bard. Glaucoma specialists graded the appropriateness of the responses, and readability was assessed with several standard readability scores. Results: Appropriate answers to the posed questions were obtained in 68% of the responses from Google Bard and 96% from ChatGPT-3.5. On average, the responses generated by Google Bard contained a significantly lower proportion of sentences with more than 30 syllables (23% vs. 66%) and more than 20 syllables (52% vs. 82%) than those generated by ChatGPT-3.5. Google Bard also had significantly (p<0.0001) lower readability grade scores and a significantly higher Flesch Reading Ease score, implying greater ease of readability of the answers generated by Google Bard. Discussion: Many patients and their families turn to LLM chatbots for information, necessitating clear and accurate content. Assessments of online glaucoma information have shown variability in quality and readability, with institutional websites generally performing better than private ones. We found that ChatGPT-3.5, while precise, has lower readability than Google Bard, which is more accessible but less precise. For example, the Flesch Reading Ease score was 57.6 for Google Bard and 22.6 for ChatGPT-3.5, indicating that Google Bard's content is easier to read. Moreover, the Gunning Fog Index scores suggested that Google Bard's text is suitable for a broader audience. ChatGPT-3.5's knowledge is limited to data up to 2021, whereas Google Bard, trained with real-time data, offers more current information. Further research is needed to evaluate these tools across various medical topics. Conclusion: The answers generated by ChatGPT-3.5 are more accurate than those given by Google Bard. However, ChatGPT-3.5's answers may be difficult for members of the public with glaucoma to comprehend. This study emphasized the importance of verifying the accuracy and clarity of the online information that glaucoma patients rely on to make informed decisions about their ocular health. This is an exciting new area for patient education and health literacy.
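The readability metrics cited above (Flesch Reading Ease, Gunning Fog Index) can be computed with standard formulas; a minimal sketch using the textstat Python package is shown below, applied to a hypothetical sample sentence rather than an actual chatbot response. Higher Flesch scores indicate easier text, while the Gunning Fog Index approximates the years of schooling needed to understand a passage on first reading.

```python
# pip install textstat
import textstat

# Hypothetical sample text -- not an actual ChatGPT-3.5 or Google Bard response.
sample = (
    "Trabeculectomy creates a new drainage channel so that aqueous humour "
    "can leave the eye and lower the pressure inside it."
)

print("Flesch Reading Ease:", textstat.flesch_reading_ease(sample))
print("Gunning Fog Index:  ", textstat.gunning_fog(sample))
```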
Subject(s)
Comprehension, Glaucoma, Humans, Glaucoma/surgery, Cross-Sectional Studies, Retrospective Studies, Search Engine, Internet, Health Literacy
ABSTRACT
Women with germline BRCA1 mutations face an increased risk of developing breast and ovarian cancers. BARD1 (BRCA1-associated RING domain 1) is an essential heterodimeric partner of BRCA1, and mutations in BARD1 are also associated with these cancers. While BARD1 mutations are recognized for conferring cancer susceptibility, the exact roles of numerous BARD1 missense mutations remain unclear. In this study, we conducted functional assays to assess the homology-directed DNA repair (HDR) activity of all BARD1 missense substitutions identified in 55 breast and ovarian cancer samples, using real-world data from the COSMIC and cBioPortal databases. Seven BARD1 variants (V85M, P187A, G491R, R565C, P669L, T719R, and Q730L) were confirmed to impair DNA damage repair. Furthermore, cells harboring these BARD1 variants exhibited increased sensitivity to cisplatin and olaparib compared to cells expressing wild-type BARD1. These findings collectively suggest that these seven missense BARD1 variants are likely pathogenic and may respond well to cisplatin-olaparib combination therapy. This study not only enhances our understanding of BARD1's role in DNA damage repair but also offers valuable insights into predicting therapy responses in patients with specific BARD1 missense mutations.
ABSTRACT
PURPOSE: The utility of machine learning, specifically large language models (LLMs), in the medical field has gained considerable attention. However, there is a scarcity of studies that focus on the application of LLMs in generating custom subspecialty radiology impressions. The primary objective of this study is to evaluate and compare the performance of multiple LLMs in generating specialized, accurate, and clinically useful radiology impressions for degenerative cervical spine MRI reports. MATERIALS AND METHODS: The study, performed during January-February 2024, employed a comparative analysis of multiple LLMs, including OpenAI's ChatGPT-3.5 and GPT-4 (OpenAI, San Francisco, CA), Anthropic's Claude 2 (Anthropic PBC, San Francisco, CA), Google's Bard (Google Inc., Mountain View, CA), and Meta's Llama 2 (Meta Platforms, Inc., Menlo Park, CA). These models were evaluated using a few-shot learning approach, with 10 examples drawn from a set of 50 synthetically generated MRI reports. Performance metrics evaluated were diagnostic accuracy, stylistic accuracy, and redundancy. RESULTS: While Claude 2 maintained consistently high performance across 40 cases, GPT-4 required midway re-training to improve its declining scores. Both Claude 2 and GPT-4 demonstrated the ability to generate structured impressions, but Claude 2's specialized summarization capabilities provided an edge in maintaining accuracy without continuous feedback. The performance of the other LLMs was subpar. CONCLUSION: The findings of this study suggest that LLMs can be a valuable tool in automating the generation of radiology impressions. Claude 2, in particular, exhibited promising results, indicating its potential for clinical implementation. However, the study also points to the necessity for further research, especially in optimizing model performance and evaluating real-world applicability.
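A few-shot approach of the kind described in the methods can be sketched as a prompt-assembly step: worked report/impression pairs are prepended to the new case before it is sent to a chat-style LLM. The example pair and instruction wording below are hypothetical and are not taken from the study's dataset.

```python
# Hypothetical findings/impression pair -- not from the study's dataset.
FEW_SHOT_EXAMPLES = [
    {
        "findings": "C5-C6: broad-based disc osteophyte complex with moderate "
                    "canal narrowing and mild bilateral foraminal stenosis.",
        "impression": "1. Moderate degenerative canal stenosis at C5-C6. "
                      "2. Mild bilateral C5-C6 neural foraminal narrowing.",
    },
]

def build_messages(new_findings: str) -> list[dict]:
    """Assemble a few-shot chat prompt: instructions, worked examples, new case."""
    messages = [{
        "role": "system",
        "content": "You are a neuroradiologist. Write a concise, numbered "
                   "impression for the cervical spine MRI findings provided.",
    }]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["findings"]})
        messages.append({"role": "assistant", "content": ex["impression"]})
    messages.append({"role": "user", "content": new_findings})
    return messages  # pass to whichever chat-completion API is being evaluated

msgs = build_messages("C6-C7: small central disc protrusion without cord compression.")
for m in msgs:
    print(m["role"].upper(), "->", m["content"])
```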
ABSTRACT
Large language models (LLMs) have significantly impacted various fields with their ability to understand and generate human-like text. This study explores the potential benefits and limitations of integrating LLMs, such as ChatGPT, into haematology practice. Using systematic review methodology, we analysed studies published after 1 December 2022 from databases including PubMed, Web of Science and Scopus, and assessed each for bias with the QUADAS-2 tool. We reviewed 10 studies that applied LLMs in various haematology contexts. These models demonstrated proficiency in specific tasks, such as achieving 76% diagnostic accuracy for haemoglobinopathies. However, the research highlighted inconsistencies in performance and reference accuracy, indicating variability in reliability across different uses. Additionally, the limited scope of these studies and constraints on their datasets may limit the generalizability of our findings. The findings suggest that, while LLMs provide notable advantages in enhancing diagnostic processes and educational resources within haematology, their integration into clinical practice requires careful consideration. Before implementing them in haematology, rigorous testing and specific adaptation are essential. This involves validating their accuracy and reliability across different scenarios. Given the field's complexity, it is also critical to continuously monitor these models and adapt them responsively.
ABSTRACT
Male breast cancer is an uncommon diagnosis with limited research on management and prognosis due to its rarity. We discuss the case of a 55-year-old male with a non-contributory past medical history who presented with an enlarging palpable mass of the right breast at the 10:00 position. Ultrasound of the right breast showed a 2.8 cm heterogeneous mass with irregular borders, highly suspicious for malignancy. A follow-up sonogram-guided core biopsy was performed, and pathology of the mass confirmed high-grade infiltrating ductal carcinoma. A modified radical mastectomy of the right breast with extensive axillary lymph node excision was performed. Genetic testing of the excised tumor revealed a MUTYH gene mutation and a BARD1 (BRCA1-associated RING domain 1) gene mutation of unknown significance. Histopathological analysis confirmed a Grade 2, ER/PR-positive, Ki-67-positive, and HER2-negative tumor.
ABSTRACT
Artificial intelligence chatbots based on large language models have recently emerged as an alternative to traditional online searches and are also entering the nutrition space. In this study, we investigated whether the artificial intelligence chatbots ChatGPT and Bard (now Gemini) can create meal plans that meet the dietary reference intake (DRI) for different dietary patterns. We further hypothesized that nutritional adequacy could be improved by modifying the prompts used. Meal plans were generated by 3 accounts for different dietary patterns (omnivorous, vegetarian, and vegan) using 2 distinct prompts, resulting in 108 meal plans in total. The nutrient content of the plans was subsequently analyzed and compared to the DRIs. On average, the meal plans contained less energy and fewer carbohydrates than recommended but mostly exceeded the DRI for protein. Vitamin D and fluoride fell below the DRI in all plans, whereas only the vegan plans contained insufficient vitamin B12. ChatGPT suggested using vitamin B12 supplements in 5 of 18 instances, whereas Bard never recommended supplements. There were no significant differences between the prompts or the tools. Although the meal plans generated by ChatGPT and Bard met most DRIs, there were some exceptions, particularly for vegan diets. These tools may be useful for individuals looking for general dietary inspiration, but they should not be relied on to create nutritionally adequate meal plans, especially for individuals with restrictive dietary needs.
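The nutrient-adequacy check described above amounts to comparing each plan's nutrient totals against the corresponding DRI values; a minimal sketch is shown below. The DRI targets and meal-plan totals are placeholder numbers chosen for illustration, not values from the study.

```python
# Placeholder DRI targets and a hypothetical one-day meal-plan total --
# illustrative values only, not data from the study.
DRI = {"energy_kcal": 2000, "protein_g": 50, "vitamin_d_ug": 15,
       "vitamin_b12_ug": 2.4, "fluoride_mg": 3.0}

meal_plan_totals = {"energy_kcal": 1750, "protein_g": 82, "vitamin_d_ug": 4.1,
                    "vitamin_b12_ug": 0.0, "fluoride_mg": 0.9}

for nutrient, target in DRI.items():
    actual = meal_plan_totals.get(nutrient, 0.0)
    status = "meets" if actual >= target else "below"
    print(f"{nutrient:15s}: {actual:7.1f} / {target:7.1f} -> {status} DRI")
```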
Subject(s)
Artificial Intelligence, Dietary Patterns, Energy Intake, Meals, Humans, Dietary Carbohydrates/administration & dosage, Dietary Supplements, Nutrients, Nutritive Value, Recommended Dietary Allowances
ABSTRACT
Large language models (LLMs) have been widely used to provide information in many fields, including obstetrics and gynecology. Which model performs best in answering commonly asked pregnancy questions is unknown. A qualitative analysis of Chat Generative Pre-Trained Transformer Version 3.5 (ChatGPT-3.5) (OpenAI, Inc., San Francisco, California, United States) and Bard, recently renamed Google Gemini (Google LLC, Mountain View, California, United States), was performed in August 2023. Each LLM was queried on 12 commonly asked pregnancy questions and asked for its references. The responses and references from both LLMs were reviewed and graded by the co-authors individually and then as a group to formulate a consensus. Query responses were graded as "acceptable" or "not acceptable" based on correctness and completeness in comparison with American College of Obstetricians and Gynecologists (ACOG) publications, PubMed-indexed evidence, and clinical experience. References were classified as "verified," "broken," "irrelevant," "non-existent," or "no references." Grades of "acceptable" were given to 58% of ChatGPT-3.5 responses (seven out of 12) and 83% of Bard responses (10 out of 12). Regarding references, ChatGPT-3.5 had issues with 100% of the references it provided, whereas Bard had discrepancies in 8% of its references (one out of 12). When ChatGPT-3.5 responses from May 2023 and August 2023 were compared, a change in "acceptable" responses was noted: 50% versus 58%, respectively. Bard answered more questions correctly than ChatGPT-3.5 when queried on a small sample of commonly asked pregnancy questions. ChatGPT-3.5 performed poorly in terms of reference verification. The overall performance of ChatGPT-3.5 remained stable over time, with approximately one-half of responses being "acceptable" in both May and August 2023. Both LLMs need further evaluation and vetting before being accepted as accurate and reliable sources of information for pregnant women.
ABSTRACT
BACKGROUND: Qualitative methods are incredibly beneficial to the dissemination and implementation of new digital health interventions; however, these methods can be time intensive and slow down dissemination when timely knowledge from the data sources is needed in ever-changing health systems. Recent advancements in generative artificial intelligence (GenAI) and their underlying large language models (LLMs) may provide a promising opportunity to expedite the qualitative analysis of textual data, but their efficacy and reliability remain unknown. OBJECTIVE: The primary objectives of our study were to evaluate the consistency in themes, reliability of coding, and time needed for inductive and deductive thematic analyses between GenAI (ie, ChatGPT and Bard) and human coders. METHODS: The qualitative data for this study consisted of 40 brief SMS text message reminder prompts used in a digital health intervention for promoting antiretroviral medication adherence among people with HIV who use methamphetamine. Inductive and deductive thematic analyses of these SMS text messages were conducted by 2 independent teams of human coders. An independent human analyst conducted analyses following both approaches using ChatGPT and Bard. The consistency in themes (or the extent to which the themes were the same) and reliability (or agreement in coding of themes) between methods were compared. RESULTS: The themes generated by GenAI (both ChatGPT and Bard) were consistent with 71% (5/7) of the themes identified by human analysts following inductive thematic analysis. The consistency in themes was lower between humans and GenAI following a deductive thematic analysis procedure (ChatGPT: 6/12, 50%; Bard: 7/12, 58%). The percentage agreement (or intercoder reliability) for these congruent themes between human coders and GenAI ranged from fair to moderate (ChatGPT, inductive: 31/66, 47%; ChatGPT, deductive: 22/59, 37%; Bard, inductive: 20/54, 37%; Bard, deductive: 21/58, 36%). In general, ChatGPT and Bard performed similarly to each other across both types of qualitative analyses in terms of consistency of themes (inductive: 6/6, 100%; deductive: 5/6, 83%) and reliability of coding (inductive: 23/62, 37%; deductive: 22/47, 47%). On average, GenAI required significantly less overall time than human coders when conducting qualitative analysis (20, SD 3.5 min vs 567, SD 106.5 min). CONCLUSIONS: The promising consistency in the themes generated by human coders and GenAI suggests that these technologies hold promise in reducing the resource intensiveness of qualitative thematic analysis; however, the relatively lower reliability in coding between them suggests that hybrid approaches are necessary. Human coders appeared to be better than GenAI at identifying nuanced and interpretative themes. Future studies should consider how these powerful technologies can be best used in collaboration with human coders to improve the efficiency of qualitative research in hybrid approaches while also mitigating potential ethical risks that they may pose.
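Percentage agreement and chance-corrected agreement between a human coder and a GenAI-assisted analyst can be computed as sketched below. The binary theme codes are hypothetical; the study's reported figures are percentage agreement, and Cohen's kappa is shown only as one common chance-corrected alternative.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary codes (theme present = 1 / absent = 0) for 12 SMS prompts,
# as assigned by a human coder and by a GenAI-assisted analyst.
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
genai = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

percent_agreement = sum(h == g for h, g in zip(human, genai)) / len(human)
kappa = cohen_kappa_score(human, genai)  # agreement corrected for chance

print(f"Percent agreement: {percent_agreement:.0%}")
print(f"Cohen's kappa:     {kappa:.2f}")
```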
ABSTRACT
A large language model (LLM) is an artificial intelligence (AI) model that uses natural language processing (NLP) to understand, interpret, and generate human-like language responses from unstructured text input. Its real-time response capabilities and eloquent dialogue enhance the interactive user experience in human-AI communication like never before. By gathering several sources on the internet, LLM chatbots can interact and respond to a wide range of queries, including problem solving, text summarization, and creating informative notes. Since ophthalmology is one of the medical fields integrating image analysis, telemedicine, AI, and other technologies, LLMs are likely to play an important role in eye care in the near future. This review summarizes the performance and potential applicability of LLMs in ophthalmology according to currently available publications.
ABSTRACT
This study explores disparities and opportunities in healthcare information provided by AI chatbots. We focused on recommendations for adjuvant therapy in endometrial cancer, analyzing responses across four regions (Indonesia, Nigeria, Taiwan, USA) and three platforms (Bard, Bing, ChatGPT-3.5). Utilizing previously published cases, we asked identical questions to chatbots from each location within a 24-h window. Responses were evaluated in a double-blinded manner on relevance, clarity, depth, focus, and coherence by ten experts in endometrial cancer. Our analysis revealed significant variations across different countries/regions (p < 0.001). Interestingly, Bing's responses in Nigeria consistently outperformed others (p < 0.05), excelling in all evaluation criteria (p < 0.001). Bard also performed better in Nigeria compared to other regions (p < 0.05), consistently surpassing them across all categories (p < 0.001, with relevance reaching p < 0.01). Notably, Bard's overall scores were significantly higher than those of ChatGPT-3.5 and Bing in all locations (p < 0.001). These findings highlight disparities and opportunities in the quality of AI-powered healthcare information based on user location and platform. This emphasizes the necessity for more research and development to guarantee equal access to trustworthy medical information through AI technologies.
Subject(s)
Artificial Intelligence, Female, Humans, Nigeria, Taiwan, United States, Endometrial Neoplasms/diagnosis, Endometrial Neoplasms/therapy
ABSTRACT
AIMS: This study compared three artificial intelligence (AI) platforms' potential to identify drug therapy communication competencies expected of a graduating medical doctor. METHODS: We presented three AI platforms, namely, Poe Assistant©, ChatGPT© and Google Bard©, with structured queries to generate communication skill competencies and case scenarios appropriate for graduating medical doctors. These case scenarios comprised 15 prototypical medical conditions that required drug prescriptions. Two authors independently evaluated the AI-enhanced clinical encounters, which integrated a diverse range of information to create patient-centred care plans. Through a consensus-based approach using a checklist, the communication components generated for each scenario were assessed. The instructions and warnings provided for each case scenario were evaluated by referencing the British National Formulary. RESULTS: AI platforms demonstrated overlap in competency domains generated, albeit with variations in wording. The domains of knowledge (basic and clinical pharmacology, prescribing, communication and drug safety) were unanimously recognized by all platforms. A broad consensus among Poe Assistant© and ChatGPT© on drug therapy-related communication issues specific to each case scenario was evident. The consensus primarily encompassed salutation, generic drug prescribed, treatment goals and follow-up schedules. Differences were observed in patient instruction clarity, listed side effects, warnings and patient empowerment. Google Bard did not provide guidance on patient communication issues. CONCLUSIONS: AI platforms recognized competencies with variations in how these were stated. Poe Assistant© and ChatGPT© exhibited alignment of communication issues. However, significant discrepancies were observed in specific skill components, indicating the necessity of human intervention to critically evaluate AI-generated outputs.
ABSTRACT
OBJECTIVE: In the digital age, patients turn to online sources for lumbar spine fusion information, necessitating a careful study of large language models (LLMs) like chat generative pre-trained transformer (ChatGPT) for patient education. METHODS: Our study aimed to assess the quality of responses from OpenAI's ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones found via Google search, which were then presented to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated using a 5-point Likert scale. RESULTS: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: k = 0.041, p = 0.622; Bard: k = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism. CONCLUSION: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.
ABSTRACT
INTRODUCTION: The emerging rise of novel computer technologies and automated data analytics has the potential to change the course of dental education. In line with our long-term goal of harnessing the power of AI to augment didactic teaching, the objective of this study was to quantify the accuracy of responses provided by ChatGPT (GPT-4 and GPT-3.5) and Google Gemini, three primary large language models (LLMs), to the annual in-service examination questions posed by the American Academy of Periodontology (AAP), and to compare it with that of human graduate students (control group). METHODS: Under a comparative cross-sectional study design, a corpus of 1312 questions from the annual in-service examinations of the AAP administered between 2020 and 2023 was presented to the LLMs. Their responses were analyzed using chi-square tests, and their performance was compared with the scores of periodontal residents from the corresponding years, who served as the human control group. Additionally, two sub-analyses were performed: one on the performance of the LLMs on each section of the exam, and one on their performance in answering the most difficult questions. RESULTS: ChatGPT-4 (total average: 79.57%) outperformed all human control groups as well as GPT-3.5 and Google Gemini in all exam years (p < .001). This chatbot showed an accuracy range between 78.80% and 80.98% across the various exam years. Gemini consistently recorded superior performance, with scores of 70.65% (p = .01), 73.29% (p = .02), 75.73% (p < .01), and 72.18% (p = .0008) for the exams from 2020 to 2023, compared to ChatGPT-3.5, which achieved 62.5%, 68.24%, 69.83%, and 59.27%, respectively. Google Gemini (72.86%) surpassed the average scores achieved by first-year (63.48% ± 31.67) and second-year residents (66.25% ± 31.61) when all exam years were combined. However, it could not surpass that of third-year residents (69.06% ± 30.45). CONCLUSIONS: Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability, while Gemini and ChatGPT-3.5 showed weaker performance. These findings underscore the potential of deploying LLMs as an educational tool in the periodontics and oral implantology domains. However, the current limitations of these models, such as the inability to effectively process image-based inquiries, the propensity for generating inconsistent responses to the same prompts, and high (80% by GPT-4) but not absolute accuracy rates, should be considered. An objective comparison of their capability versus their capacity is required to further develop this field of study.
ABSTRACT
Background: Many patients use artificial intelligence (AI) chatbots as a rapid source of health information. This raises important questions about the reliability and effectiveness of AI chatbots in delivering accurate and understandable information. Purpose: To evaluate and compare the accuracy, conciseness, and readability of responses from OpenAI ChatGPT-4 and Google Bard to patient inquiries concerning the novel 177Lu-PSMA-617 therapy for prostate cancer. Materials and methods: Two experts listed the 12 most commonly asked questions by patients on 177Lu-PSMA-617 therapy. These twelve questions were prompted to OpenAI ChatGPT-4 and Google Bard. AI-generated responses were distributed using an online survey platform (Qualtrics) and blindly rated by eight experts. The performances of the AI chatbots were evaluated and compared across three domains: accuracy, conciseness, and readability. Additionally, potential safety concerns associated with AI-generated answers were also examined. The Mann-Whitney U and chi-square tests were utilized to compare the performances of AI chatbots. Results: Eight experts participated in the survey, evaluating 12 AI-generated responses across the three domains of accuracy, conciseness, and readability, resulting in 96 assessments (12 responses x 8 experts) for each domain per chatbot. ChatGPT-4 provided more accurate answers than Bard (2.95 ± 0.671 vs 2.73 ± 0.732, p=0.027). Bard's responses had better readability than ChatGPT-4 (2.79 ± 0.408 vs 2.94 ± 0.243, p=0.003). Both ChatGPT-4 and Bard achieved comparable conciseness scores (3.14 ± 0.659 vs 3.11 ± 0.679, p=0.798). Experts categorized the AI-generated responses as incorrect or partially correct at a rate of 16.6% for ChatGPT-4 and 29.1% for Bard. Bard's answers contained significantly more misleading information than those of ChatGPT-4 (p = 0.039). Conclusion: AI chatbots have gained significant attention, and their performance is continuously improving. Nonetheless, these technologies still need further improvements to be considered reliable and credible sources for patients seeking medical information on 177Lu-PSMA-617 therapy.
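A Mann-Whitney U comparison of expert ratings, as used in the study above, can be sketched as follows; the rating vectors below are hypothetical and are not the study's raw survey data.

```python
from scipy.stats import mannwhitneyu

# Hypothetical expert accuracy ratings (higher = better) for 12 responses
# from each chatbot -- illustrative only.
chatgpt4_scores = [3, 3, 4, 3, 2, 3, 4, 3, 3, 2, 3, 4]
bard_scores     = [3, 2, 3, 2, 3, 3, 2, 4, 2, 3, 2, 3]

u_stat, p_value = mannwhitneyu(chatgpt4_scores, bard_scores, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```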
ABSTRACT
PURPOSE: Large language models (LLMs) are a form of artificial intelligence (AI) that uses deep learning techniques to understand, summarize and generate content. The potential benefits of LLMs in healthcare are predicted to be immense. The objective of this study was to examine the quality of patient information leaflets (PILs) produced by 3 LLMs on urological topics. METHODS: Prompts were created to generate PILs from 3 LLMs: ChatGPT-4, PaLM 2 (Google Bard) and Llama 2 (Meta), across four urology topics (circumcision, nephrectomy, overactive bladder syndrome, and transurethral resection of the prostate [TURP]). PILs were evaluated using a quality assessment checklist, and PIL readability was assessed with the Average Reading Level Consensus Calculator. RESULTS: PILs generated by PaLM 2 had the highest overall average quality score (3.58), followed by Llama 2 (3.34) and ChatGPT-4 (3.08). PaLM 2-generated PILs were of the highest quality for all topics except TURP, and PaLM 2 was the only LLM to include images. Medical inaccuracies were present in all generated content, including instances of significant error. Readability analysis identified PaLM 2-generated PILs as the simplest (age 14-15 average reading level); Llama 2 PILs were the most difficult (age 16-17 average). CONCLUSION: While LLMs can generate PILs that may help reduce healthcare professional workload, the generated content requires clinician input to ensure accuracy and the inclusion of health literacy aids, such as images. LLM-generated PILs were above the average reading level for adults, necessitating improvement in LLM algorithms and/or prompt design. How satisfied patients are with LLM-generated PILs remains to be evaluated.
Subject(s)
Artificial Intelligence, Urology, Humans, Patient Education as Topic/methods, Language, Urologic Diseases/surgery
ABSTRACT
BACKGROUND: In recent years, the integration of artificial intelligence (AI) into various fields of medicine, including gynaecology, has shown promising potential. The surgical treatment of fibroids is myomectomy when uterine preservation and fertility are the primary aims. AI usage begins with the involvement of large language models (LLMs) from the point when a patient visits a gynaecologist: from identifying signs and symptoms to reaching a diagnosis, providing treatment plans, and counselling the patient. OBJECTIVE: To compare the use of AI (ChatGPT versus Google Bard) in the surgical management of fibroids. STUDY DESIGN: The patient's problems were identified using LLMs (ChatGPT and Google Bard), which were asked to give a treatment option in 8 clinical scenarios of fibroids. Data entry was done using MS Excel, and data were statistically analyzed using the Statistical Package for the Social Sciences (SPSS version 26) for MS Windows 2010. All results are presented in tabular form. Data were analyzed using nonparametric tests (Chi-square test or Fisher exact test); p values < 0.05 were considered statistically significant. The sensitivity of both techniques was calculated, and Cohen's kappa was used to assess the degree of agreement. RESULTS: We found that on the first attempt, ChatGPT gave general answers in 62.5% of cases and specific answers in 37.5% of cases. ChatGPT showed improved sensitivity on successive prompts, from 37.5% to 62.5% by the third prompt. Google Bard could not identify the clinical question in 50% of cases and gave incorrect answers in 12.5% of cases (p = 0.04). Google Bard showed the same sensitivity of 25% on all prompts. CONCLUSION: AI helps reduce the time needed to diagnose and plan a treatment strategy for fibroids and acts as a powerful tool in the hands of a gynaecologist. However, use of AI by patients for self-treatment should be avoided; it should be used only for education and counselling about fibroids.
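For small 2x2 comparisons such as "clinical question identified vs. not identified" across 8 scenarios per chatbot, Fisher's exact test is the usual nonparametric choice; a minimal sketch with hypothetical counts (not the study's data) is shown below.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table for 8 fibroid scenarios per chatbot:
# rows = ChatGPT, Google Bard; columns = (question identified, not identified).
table = [[5, 3],
         [2, 6]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```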
Subject(s)
Artificial Intelligence, Gynecology, Leiomyoma, Uterine Neoplasms, Humans, Female, Leiomyoma/surgery, Uterine Neoplasms/surgery, Uterine Myomectomy/methods, Adult
ABSTRACT
This study assessed the potential of large language models (OpenAI's ChatGPT 3.5 and 4.0, Google Bard, Meta Llama2, and Anthropic Claude2) in addressing 30 common migraine-related queries, providing a foundation to advance artificial intelligence-assisted patient education and insights for a holistic approach to migraine management.
Subject(s)
Benchmarking, Migraine Disorders, Patient Education as Topic, Humans, Migraine Disorders/therapy, Patient Education as Topic/methods, Artificial Intelligence
ABSTRACT
PURPOSE: The Sequential Organ Failure Assessment (SOFA) score plays a crucial role in intensive care units (ICUs) by providing a reliable measure of a patient's organ function or extent of failure. However, precise assessment is time-consuming, and daily assessment in clinical practice in the ICU can be challenging. METHODS: Realistic scenarios in an ICU setting were created, and the data-mining precision of ChatGPT 4.0 Plus, Bard, and Perplexity AI was assessed using Spearman's correlation coefficient and the intraclass correlation coefficient (ICC) with respect to accuracy in determining the SOFA score. RESULTS: The strongest correlation was observed between the actual SOFA score and the score calculated by ChatGPT 4.0 Plus (r = 0.92, p < 0.001). In contrast, the correlation between the actual SOFA score and that calculated by Bard was moderate (r = 0.59, p = 0.070), while the correlation with Perplexity AI was substantial (r = 0.89, p < 0.001). The intraclass correlation coefficient of the actual SOFA scores with those of ChatGPT 4.0 Plus, Bard, and Perplexity AI was ICC = 0.94. CONCLUSION: Artificial intelligence (AI) tools, particularly ChatGPT 4.0 Plus, show significant promise in assisting with automated SOFA score calculation via AI data mining in ICU settings. They offer a pathway to reduce the manual workload and increase the efficiency of continuous patient monitoring and assessment. However, further development and validation are necessary to ensure accuracy and reliability in a critical care environment.
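Agreement between clinician-calculated and chatbot-extracted SOFA scores can be sketched with Spearman's rank correlation as shown below; the score pairs are hypothetical, and an intraclass correlation coefficient could likewise be computed from a long-format (scenario, rater, score) table, for example with pingouin.intraclass_corr.

```python
from scipy.stats import spearmanr

# Hypothetical SOFA scores for ten ICU scenarios: clinician-calculated
# reference values vs. values extracted by a chatbot -- illustrative data.
reference_sofa = [4, 7, 10, 2, 12, 6, 9, 3, 15, 8]
chatbot_sofa   = [4, 8, 10, 2, 11, 6, 9, 4, 14, 8]

rho, p_value = spearmanr(reference_sofa, chatbot_sofa)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```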
ABSTRACT
BACKGROUND: Due to recent advances in artificial intelligence (AI), language model applications can generate logical text output that is difficult to distinguish from human writing. ChatGPT (OpenAI) and Bard (subsequently rebranded as "Gemini"; Google AI) were developed using distinct approaches, but little has been studied about differences in their capability to generate abstracts. The use of AI to write scientific abstracts in the field of spine surgery is at the center of much debate and controversy. OBJECTIVE: The objective of this study is to assess the reproducibility of structured abstracts generated by ChatGPT and Bard compared with human-written abstracts in the field of spine surgery. METHODS: In total, 60 abstracts dealing with spine topics were randomly selected from 7 reputable journals, and ChatGPT and Bard were prompted with the corresponding paper titles to generate abstracts. A total of 174 abstracts, divided into human-written abstracts, ChatGPT-generated abstracts, and Bard-generated abstracts, were evaluated for compliance with the structured format of the journal guidelines and for consistency of content. The likelihood of plagiarism and of AI output was assessed using the iThenticate and ZeroGPT programs, respectively. A total of 8 reviewers in the spinal field evaluated 30 randomly extracted abstracts to determine whether they were produced by AI or by human authors. RESULTS: The proportion of abstracts that met journal formatting guidelines was greater among ChatGPT abstracts (34/60, 56.6%) than among those generated by Bard (6/54, 11.1%; P<.001). However, a higher proportion of Bard abstracts (49/54, 90.7%) had word counts that met journal guidelines compared with ChatGPT abstracts (30/60, 50%; P<.001). The similarity index was significantly lower among ChatGPT-generated abstracts (20.7%) than among Bard-generated abstracts (32.1%; P<.001). The AI-detection program predicted that 21.7% (13/60) of the human group, 63.3% (38/60) of the ChatGPT group, and 87% (47/54) of the Bard group were possibly generated by AI, with an area under the curve value of 0.863 (P<.001). The mean detection rate by human reviewers was 53.8% (SD 11.2%), achieving a sensitivity of 56.3% and a specificity of 48.4%. A total of 56.3% (63/112) of the actual human-written abstracts and 55.9% (62/128) of the AI-generated abstracts were recognized as human-written and AI-generated by the human reviewers, respectively. CONCLUSIONS: Both ChatGPT and Bard can be used to help write abstracts, but most AI-generated abstracts are currently considered unethical due to high plagiarism and AI-detection rates. ChatGPT-generated abstracts appear to be superior to Bard-generated abstracts in meeting journal formatting guidelines. Because humans are unable to accurately distinguish abstracts written by humans from those produced by AI programs, it is crucial to exercise special caution and examine the ethical boundaries of using AI programs, including ChatGPT and Bard.
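Detector performance metrics of the kind reported above (area under the ROC curve, sensitivity, specificity) can be computed as sketched below; the labels and detector scores are hypothetical, not ZeroGPT output or reviewer judgments from the study.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = AI-generated abstract) and detector scores.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
scores = [0.10, 0.35, 0.20, 0.55, 0.80, 0.60, 0.90, 0.45, 0.75, 0.30]

auc = roc_auc_score(y_true, scores)

# Sensitivity and specificity at a 0.5 decision threshold.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))
tn = sum(p == 0 and t == 0 for p, t in zip(y_pred, y_true))
fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))
fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))

print(f"AUC = {auc:.3f}")
print(f"Sensitivity = {tp / (tp + fn):.2f}, Specificity = {tn / (tn + fp):.2f}")
```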