Pesquisa | Biblioteca Virtual em Saúde

1.

[Healthcare 4.0-Medicine in transition]. / Healthcare 4.0 Medizin im Wandel.

Roßkopf, Steffen; Meder, Benjamin.

Herz ; 2024 Aug 08.

Artigo em Alemão | MEDLINE | ID: mdl-39115627

RESUMO

Healthcare 4.0 describes the future transformation of the healthcare sector driven by the combination of digital technologies, such as artificial intelligence (AI), big data and the Internet of Medical Things, enabling the advancement of precision medicine. This overview article addresses various areas such as large language models (LLM), diagnostics and robotics, shedding light on the positive aspects of Healthcare 4.0 and showcasing exciting methods and application examples in cardiology. It delves into the broad knowledge base and enormous potential of LLMs, highlighting their immediate benefits as digital assistants or for administrative tasks. In diagnostics, the increasing usefulness of wearables is emphasized and an AI for predicting heart filling pressures based on cardiac magnetic resonance imaging (MRI) is introduced. Additionally, it discusses the revolutionary methodology of a digital simulation of the physical heart (digital twin). Finally, it addresses both regulatory frameworks and a brief vision of data-driven healthcare delivery, explaining the need for investments in technical personnel and infrastructure to achieve a more effective medicine.

2.

(Ir)rationality and cognitive biases in large language models.

Macmillan-Scott, Olivia; Musolesi, Mirco.

R Soc Open Sci ; 11(6): 240255, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-39100158

RESUMO

Do large language models (LLMs) display rational reasoning? LLMs have been shown to contain human biases due to the data they have been trained on; whether this is reflected in rational reasoning remains less clear. In this paper, we answer this question by evaluating seven language models using tasks from the cognitive psychology literature. We find that, like humans, LLMs display irrationality in these tasks. However, the way this irrationality is displayed does not reflect that shown by humans. When incorrect answers are given by LLMs to these tasks, they are often incorrect in ways that differ from human-like biases. On top of this, the LLMs reveal an additional layer of irrationality in the significant inconsistency of the responses. Aside from the experimental results, this paper seeks to make a methodological contribution by showing how we can assess and compare different capabilities of these types of models, in this case with respect to rational reasoning.

3.

The Opportunities and Risks of Large Language Models in Mental Health.

Lawrence, Hannah R; Schneider, Renee A; Rubin, Susan B; Mataric, Maja J; McDuff, Daniel J; Jones Bell, Megan.

JMIR Ment Health ; 11: e59479, 2024 Jul 29.

Artigo em Inglês | MEDLINE | ID: mdl-39105570

RESUMO

Unlabelled: Global rates of mental health concerns are rising, and there is increasing realization that existing models of mental health care will not adequately expand to meet the demand. With the emergence of large language models (LLMs) has come great optimism regarding their promise to create novel, large-scale solutions to support mental health. Despite their nascence, LLMs have already been applied to mental health-related tasks. In this paper, we summarize the extant literature on efforts to use LLMs to provide mental health education, assessment, and intervention and highlight key opportunities for positive impact in each area. We then highlight risks associated with LLMs' application to mental health and encourage the adoption of strategies to mitigate these risks. The urgent need for mental health support must be balanced with responsible development, testing, and deployment of mental health LLMs. It is especially critical to ensure that mental health LLMs are fine-tuned for mental health, enhance mental health equity, and adhere to ethical standards and that people, including those with lived experience with mental health concerns, are involved in all stages from development through deployment. Prioritizing these efforts will minimize potential harms to mental health and maximize the likelihood that LLMs will positively impact mental health globally.

Assuntos

Serviços de Saúde Mental , Humanos , Idioma , Transtornos Mentais/epidemiologia , Saúde Mental

4.

Fact Check: Assessing the Response of ChatGPT to Alzheimer's Disease Myths.

Huang, Sean S; Song, Qingyuan; Beiting, Kimberly J; Duggan, Maria C; Hines, Kristin; Murff, Harvey; Leung, Vania; Powers, James; Harvey, T S; Malin, Bradley; Yin, Zhijun.

J Am Med Dir Assoc ; : 105178, 2024 Aug 03.

Artigo em Inglês | MEDLINE | ID: mdl-39106968

RESUMO

There are many myths regarding Alzheimer's disease (AD) that have been circulated on the internet, each exhibiting varying degrees of accuracy, inaccuracy, and misinformation. Large language models, such as ChatGPT, may be a valuable tool to help assess these myths for veracity and inaccuracy; however, they can induce misinformation as well. This study assesses ChatGPT's ability to identify and address AD myths with reliable information. We conducted a cross-sectional study of attending geriatric medicine clinicians' evaluation of ChatGPT (GPT 4.0) responses to 16 selected AD myths. We prompted ChatGPT to express its opinion on each myth and implemented a survey using REDCap to determine the degree to which clinicians agreed with the accuracy of each of ChatGPT's explanations. We also collected their explanations of any disagreements with ChatGPT's responses. We used a 5-category Likert-type scale with a score ranging from -2 to 2 to quantify clinicians' agreement in each aspect of the evaluation. The clinicians (n = 10) were generally satisfied with ChatGPT's explanations. Among the 16 myths, the clinicians were generally satisfied with these explanations, with (mean [SD] score of 1.1[±0.3]). Most clinicians selected "Agree" or "Strongly Agree" for each statement. Some statements obtained a small number of "Disagree" responses. There were no "Strongly Disagree" responses. Most surveyed health care professionals acknowledged the potential value of ChatGPT in mitigating AD misinformation; however, the need for more refined and detailed explanations of the disease's mechanisms and treatments was highlighted.

5.

Do large language models have a legal duty to tell the truth?

Wachter, Sandra; Mittelstadt, Brent; Russell, Chris.

R Soc Open Sci ; 11(8): 240197, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-39113763

RESUMO

Careless speech is a new type of harm created by large language models (LLM) that poses cumulative, long-term risks to science, education and shared social truth in democratic societies. LLMs produce responses that are plausible, helpful and confident, but that contain factual inaccuracies, misleading references and biased information. These subtle mistruths are poised to cumulatively degrade and homogenize knowledge over time. This article examines the existence and feasibility of a legal duty for LLM providers to create models that 'tell the truth'. We argue that LLM providers should be required to mitigate careless speech and better align their models with truth through open, democratic processes. We define careless speech against 'ground truth' in LLMs and related risks including hallucinations, misinformation and disinformation. We assess the existence of truth-related obligations in EU human rights law and the Artificial Intelligence Act, Digital Services Act, Product Liability Directive and Artificial Intelligence Liability Directive. Current frameworks contain limited, sector-specific truth duties. Drawing on duties in science and academia, education, archives and libraries, and a German case in which Google was held liable for defamation caused by autocomplete, we propose a pathway to create a legal truth duty for providers of narrow- and general-purpose LLMs.

6.

Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments.

Tang, Xiaoyi; Chen, Hongwei; Lin, Daoyu; Li, Kexin.

Heliyon ; 10(14): e34262, 2024 Jul 30.

Artigo em Inglês | MEDLINE | ID: mdl-39113951

RESUMO

Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving 112% and 114% increase in scoring accuracy under the criteria and sample-referenced justification prompt. Temperature settings also influence the output consistency of LLMs, with lower temperatures producing scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. Regarding multi-dimensional writing assessment, results indicate that GPT-4 performs well in dimensions regarding Ideas (QWK=0.551) and Organization (QWK=0.584) under well-crafted prompt engineering. These findings pave the way for a comprehensive exploration of LLMs' broader educational implications, offering insights into their capability to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in the AI-powered educational age. While this study attached importance to the reliability and alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample from varied backgrounds.

7.

Assessing ChatGPT as a Medical Consultation Assistant for Chronic Hepatitis B: Cross-Language Study of English and Chinese.

Wang, Yijie; Chen, Yining; Sheng, Jifang.

JMIR Med Inform ; 12: e56426, 2024 Aug 08.

Artigo em Inglês | MEDLINE | ID: mdl-39115930

RESUMO

BACKGROUND: Chronic hepatitis B (CHB) imposes substantial economic and social burdens globally. The management of CHB involves intricate monitoring and adherence challenges, particularly in regions like China, where a high prevalence of CHB intersects with health care resource limitations. This study explores the potential of ChatGPT-3.5, an emerging artificial intelligence (AI) assistant, to address these complexities. With notable capabilities in medical education and practice, ChatGPT-3.5's role is examined in managing CHB, particularly in regions with distinct health care landscapes. OBJECTIVE: This study aimed to uncover insights into ChatGPT-3.5's potential and limitations in delivering personalized medical consultation assistance for CHB patients across diverse linguistic contexts. METHODS: Questions sourced from published guidelines, online CHB communities, and search engines in English and Chinese were refined, translated, and compiled into 96 inquiries. Subsequently, these questions were presented to both ChatGPT-3.5 and ChatGPT-4.0 in independent dialogues. The responses were then evaluated by senior physicians, focusing on informativeness, emotional management, consistency across repeated inquiries, and cautionary statements regarding medical advice. Additionally, a true-or-false questionnaire was employed to further discern the variance in information accuracy for closed questions between ChatGPT-3.5 and ChatGPT-4.0. RESULTS: Over half of the responses (228/370, 61.6%) from ChatGPT-3.5 were considered comprehensive. In contrast, ChatGPT-4.0 exhibited a higher percentage at 74.5% (172/222; P<.001). Notably, superior performance was evident in English, particularly in terms of informativeness and consistency across repeated queries. However, deficiencies were identified in emotional management guidance, with only 3.2% (6/186) in ChatGPT-3.5 and 8.1% (15/154) in ChatGPT-4.0 (P=.04). ChatGPT-3.5 included a disclaimer in 10.8% (24/222) of responses, while ChatGPT-4.0 included a disclaimer in 13.1% (29/222) of responses (P=.46). When responding to true-or-false questions, ChatGPT-4.0 achieved an accuracy rate of 93.3% (168/180), significantly surpassing ChatGPT-3.5's accuracy rate of 65.0% (117/180) (P<.001). CONCLUSIONS: In this study, ChatGPT demonstrated basic capabilities as a medical consultation assistant for CHB management. The choice of working language for ChatGPT-3.5 was considered a potential factor influencing its performance, particularly in the use of terminology and colloquial language, and this potentially affects its applicability within specific target populations. However, as an updated model, ChatGPT-4.0 exhibits improved information processing capabilities, overcoming the language impact on information accuracy. This suggests that the implications of model advancement on applications need to be considered when selecting large language models as medical consultation assistants. Given that both models performed inadequately in emotional guidance management, this study highlights the importance of providing specific language training and emotional management strategies when deploying ChatGPT for medical purposes. Furthermore, the tendency of these models to use disclaimers in conversations should be further investigated to understand the impact on patients' experiences in practical applications.

8.

Harnessing Large Language Models for Structured Reporting in Breast Ultrasound: A Comparative Study of Open AI (GPT-4.0) and Microsoft Bing (GPT-4).

Liu, ChaoXu; Wei, MinYan; Qin, Yu; Zhang, MeiXiang; Jiang, Huan; Xu, JiaLe; Zhang, YuNing; Hua, Qing; Hou, YiQing; Dong, YiJie; Xia, ShuJun; Li, Ning; Zhou, JianQiao.

Ultrasound Med Biol ; 2024 Aug 12.

Artigo em Inglês | MEDLINE | ID: mdl-39138026

RESUMO

OBJECTIVES: To assess the capabilities of large language models (LLMs), including Open AI (GPT-4.0) and Microsoft Bing (GPT-4), in generating structured reports, the Breast Imaging Reporting and Data System (BI-RADS) categories, and management recommendations from free-text breast ultrasound reports. MATERIALS AND METHODS: In this retrospective study, 100 free-text breast ultrasound reports from patients who underwent surgery between January and May 2023 were gathered. The capabilities of Open AI (GPT-4.0) and Microsoft Bing (GPT-4) to convert these unstructured reports into structured ultrasound reports were studied. The quality of structured reports, BI-RADS categories, and management recommendations generated by GPT-4.0 and Bing were evaluated by senior radiologists based on the guidelines. RESULTS: Open AI (GPT-4.0) was better than Microsoft Bing (GPT-4) in terms of performance in generating structured reports (88% vs. 55%; p < 0.001), giving correct BI-RADS categories (54% vs. 47%; p = 0.013) and providing reasonable management recommendations (81% vs. 63%; p < 0.001). As the ability to predict benign and malignant characteristics, GPT-4.0 performed significantly better than Bing (AUC, 0.9317 vs. 0.8177; p < 0.001), while both performed significantly inferior to senior radiologists (AUC, 0.9763; both p < 0.001). CONCLUSION: This study highlights the potential of LLMs, specifically Open AI (GPT-4.0), in converting unstructured breast ultrasound reports into structured ones, offering accurate diagnoses and providing reasonable recommendations.

9.

Integrating vision-based AI and large language models for real-time water pollution surveillance.

Samuel, Dinesh Jackson; Sermet, Yusuf; Cwiertny, David; Demir, Ibrahim.

Water Environ Res ; 96(8): e11092, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-39129273

RESUMO

Water pollution has become a major concern in recent years, affecting over 2 billion people worldwide, according to UNESCO. This pollution can occur by either naturally, such as algal blooms, or man-made when toxic substances are released into water bodies like lakes, rivers, springs, and oceans. To address this issue and monitor surface-level water pollution in local water bodies, an informative real-time vision-based surveillance system has been developed in conjunction with large language models (LLMs). This system has an integrated camera connected to a Raspberry Pi for processing input frames and is further linked to LLMs for generating contextual information regarding the type, causes, and impact of pollutants on both human health and the environment. This multi-model setup enables local authorities to monitor water pollution and take necessary steps to mitigate it. To train the vision model, seven major types of pollutants found in water bodies like algal bloom, synthetic foams, dead fishes, oil spills, wooden logs, industrial waste run-offs, and trashes were used for achieving accurate detection. ChatGPT API has been integrated with the model to generate contextual information about pollution detected. Thus, the multi-model system can conduct surveillance over water bodies and autonomously alert local authorities to take immediate action, eliminating the need for human intervention. PRACTITIONER POINTS: Combines cameras and LLMs with Raspberry Pi for processing and generating pollutant information. Uses YOLOv5 to detect algal blooms, synthetic foams, dead fish, oil spills, and industrial waste. Supports various modules and environments, including drones and mobile apps for broad monitoring. Educates on environmental healthand alerts authorities about water pollution.

Assuntos

Monitoramento Ambiental , Poluição da Água , Monitoramento Ambiental/métodos , Poluição da Água/análise , Inteligência Artificial , Modelos Teóricos

10.

A Review of Ophthalmology Education in the Era of Generative Artificial Intelligence.

Heinke, Anna; Radgoudarzi, Niloofar; Huang, Bonnie B; Baxter, Sally L.

Asia Pac J Ophthalmol (Phila) ; : 100089, 2024 Aug 10.

Artigo em Inglês | MEDLINE | ID: mdl-39134176

RESUMO

PURPOSE: To explore the integration of generative AI, specifically large language models (LLMs), in ophthalmology education and practice, addressing their applications, benefits, challenges, and future directions. DESIGN: A literature review and analysis of current AI applications and educational programs in ophthalmology. METHODS: Analysis of published studies, reviews, articles, websites, and institutional reports on AI use in ophthalmology. Examination of educational programs incorporating AI, including curriculum frameworks, training methodologies, and evaluations of AI performance on medical examinations and clinical case studies. RESULTS: Generative AI, particularly LLMs, shows potential to improve diagnostic accuracy and patient care in ophthalmology. Applications include aiding in patient, physician, and medical students' education. However, challenges such as AI hallucinations, biases, lack of interpretability, and outdated training data limit clinical deployment. Studies revealed varying levels of accuracy of LLMs on ophthalmology board exam questions, underscoring the need for more reliable AI integration. Several educational programs nationwide provide AI and data science training relevant to clinical medicine and ophthalmology. CONCLUSIONS: Generative AI and LLMs offer promising advancements in ophthalmology education and practice. Addressing challenges through comprehensive curricula that include fundamental AI principles, ethical guidelines, and updated, unbiased training data is crucial. Future directions include developing clinically relevant evaluation metrics, implementing hybrid models with human oversight, leveraging image-rich data, and benchmarking AI performance against ophthalmologists. Robust policies on data privacy, security, and transparency are essential for fostering a safe and ethical environment for AI applications in ophthalmology.

11.

Clinical artificial intelligence: teaching a large language model to generate recommendations that align with guidelines for the surgical management of GERD.

Huo, Bright; Marfo, Nana; Sylla, Patricia; Calabrese, Elisa; Kumar, Sunjay; Slater, Bethany J; Walsh, Danielle S; Vosburg, Wesley.

Surg Endosc ; 2024 Aug 12.

Artigo em Inglês | MEDLINE | ID: mdl-39134725

RESUMO

BACKGROUND: Large Language Models (LLMs) provide clinical guidance with inconsistent accuracy due to limitations with their training dataset. LLMs are "teachable" through customization. We compared the ability of the generic ChatGPT-4 model and a customized version of ChatGPT-4 to provide recommendations for the surgical management of gastroesophageal reflux disease (GERD) to both surgeons and patients. METHODS: Sixty patient cases were developed using eligibility criteria from the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) & United European Gastroenterology (UEG)-European Association of Endoscopic. Surgery (EAES) guidelines for the surgical management of GERD. Standardized prompts were engineered for physicians as the end-user, with separate layperson prompts for patients. A customized GPT was developed to generate recommendations based on guidelines, called the GERD Tool for Surgery (GTS). Both the GTS and generic ChatGPT-4 were queried July 21st, 2024. Model performance was evaluated by comparing responses to SAGES & UEG-EAES guideline recommendations. Outcome data was presented using descriptive statistics including counts and percentages. RESULTS: The GTS provided accurate recommendations for the surgical management of GERD for 60/60 (100.0%) surgeon inquiries and 40/40 (100.0%) patient inquiries based on guideline recommendations. The Generic ChatGPT-4 model generated accurate guidance for 40/60 (66.7%) surgeon inquiries and 19/40 (47.5%) patient inquiries. The GTS produced recommendations based on the 2021 SAGES & UEG-EAES guidelines on the surgical management of GERD, while the generic ChatGPT-4 model generated guidance without citing evidence to support its recommendations. CONCLUSION: ChatGPT-4 can be customized to overcome limitations with its training dataset to provide recommendations for the surgical management of GERD with reliable accuracy and consistency. The training of LLM models can be used to help integrate this efficient technology into the creation of robust and accurate information for both surgeons and patients. Prospective data is needed to assess its effectiveness in a pragmatic clinical environment.

12.

A shared model-based linguistic space for transmitting our thoughts from brain to brain in natural conversations.

Zada, Zaid; Goldstein, Ariel; Michelmann, Sebastian; Simony, Erez; Price, Amy; Hasenfratz, Liat; Barham, Emily; Zadbood, Asieh; Doyle, Werner; Friedman, Daniel; Dugan, Patricia; Melloni, Lucia; Devore, Sasha; Flinker, Adeen; Devinsky, Orrin; Nastase, Samuel A; Hasson, Uri.

Neuron ; 2024 Aug 02.

Artigo em Inglês | MEDLINE | ID: mdl-39096896

RESUMO

Effective communication hinges on a mutual understanding of word meaning in different contexts. We recorded brain activity using electrocorticography during spontaneous, face-to-face conversations in five pairs of epilepsy patients. We developed a model-based coupling framework that aligns brain activity in both speaker and listener to a shared embedding space from a large language model (LLM). The context-sensitive LLM embeddings allow us to track the exchange of linguistic information, word by word, from one brain to another in natural conversations. Linguistic content emerges in the speaker's brain before word articulation and rapidly re-emerges in the listener's brain after word articulation. The contextual embeddings better capture word-by-word neural alignment between speaker and listener than syntactic and articulatory models. Our findings indicate that the contextual embeddings learned by LLMs can serve as an explicit numerical model of the shared, context-rich meaning space humans use to communicate their thoughts to one another.

13.

The Potential Impact of Large Language Models on Doctor-Patient Communication: A Case Study in Prostate Cancer.

Geanta, Marius; Badescu, Daniel; Chirca, Narcis; Nechita, Ovidiu Catalin; Radu, Cosmin George; Rascu, Stefan; Radavoi, Daniel; Sima, Cristian; Toma, Cristian; Jinga, Viorel.

Healthcare (Basel) ; 12(15)2024 Aug 05.

Artigo em Inglês | MEDLINE | ID: mdl-39120251

RESUMO

BACKGROUND: In recent years, the integration of large language models (LLMs) into healthcare has emerged as a revolutionary approach to enhancing doctor-patient communication, particularly in the management of diseases such as prostate cancer. METHODS: Our paper evaluated the effectiveness of three prominent LLMs-ChatGPT (3.5), Gemini (Pro), and Co-Pilot (the free version)-against the official Romanian Patient's Guide on prostate cancer. Employing a randomized and blinded method, our study engaged eight medical professionals to assess the responses of these models based on accuracy, timeliness, comprehensiveness, and user-friendliness. RESULTS: The primary objective was to explore whether LLMs, when operating in Romanian, offer comparable or superior performance to the Patient's Guide, considering their potential to personalize communication and enhance the informational accessibility for patients. Results indicated that LLMs, particularly ChatGPT, generally provided more accurate and user-friendly information compared to the Guide. CONCLUSIONS: The findings suggest a significant potential for LLMs to enhance healthcare communication by providing accurate and accessible information. However, variability in performance across different models underscores the need for tailored implementation strategies. We highlight the importance of integrating LLMs with a nuanced understanding of their capabilities and limitations to optimize their use in clinical settings.

14.

MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering.

Alonso, Iñigo; Oronoz, Maite; Agerri, Rodrigo.

Artif Intell Med ; 155: 102938, 2024 Jul 31.

Artigo em Inglês | MEDLINE | ID: mdl-39121544

RESUMO

Large Language Models (LLMs) have the potential of facilitating the development of Artificial Intelligence technology to assist medical experts for interactive decision support. This potential has been illustrated by the state-of-the-art performance obtained by LLMs in Medical Question Answering, with striking results such as passing marks in licensing medical exams. However, while impressive, the required quality bar for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks to assess medical knowledge lack reference gold explanations which means that it is not possible to evaluate the reasoning of LLMs predictions. Finally, the situation is particularly grim if we consider benchmarking LLMs for languages other than English which remains, as far as we know, a totally neglected topic. In order to address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes for the first time reference gold explanations, written by medical doctors, of the correct and incorrect options in the exams. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches show that performance of LLMs, with best results around 75 accuracy for English, still has large room for improvement, especially for languages other than English, for which accuracy drops 10 points. Therefore, despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that may positively impact results on downstream evaluations for Medical Question Answering. Data, code, and fine-tuned models will be made publicly available.1.

15.

Distilling mathematical reasoning capabilities into Small Language Models.

Zhu, Xunyu; Li, Jian; Liu, Yong; Ma, Can; Wang, Weiping.

Neural Netw ; 179: 106594, 2024 Aug 02.

Artigo em Inglês | MEDLINE | ID: mdl-39121788

RESUMO

This work addresses the challenge of democratizing advanced Large Language Models (LLMs) by compressing their mathematical reasoning capabilities into sub-billion parameter Small Language Models (SLMs) without compromising performance. We introduce Equation-of-Thought Distillation (EoTD), a novel technique that encapsulates the reasoning process into equation-based representations to construct an EoTD dataset for fine-tuning SLMs. Additionally, we propose the Ensemble Thoughts Distillation (ETD) framework to enhance the reasoning performance of SLMs. This involves creating a reasoning dataset with multiple thought processes, including Chain-of-Thought (CoT), Program-of-Thought (PoT), and Equation-of-Thought (EoT), and using it for fine-tuning. Our experimental performance demonstrates that EoTD significantly boosts the reasoning abilities of SLMs, while ETD enables these models to achieve state-of-the-art reasoning performance.

16.

Is ChatGPT a Reliable Source of Patient Information on Asthma?

Alabdulmohsen, Dalal M; Almahmudi, Mesa A; Alhashim, Jehad N; Almahdi, Mohammed H; Alkishy, Eman F; Almossabeh, Modhahir J; Alkhalifah, Saleh A.

Cureus ; 16(7): e64114, 2024 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-39119408

RESUMO

INTRODUCTION: ChatGPT (OpenAI, San Francisco, CA, USA) is a novel artificial intelligence (AI) application that is used by millions of people, and the numbers are growing by the day. Because it has the potential to be a source of patient information, the study aimed to evaluate the ability of ChatGPT to answer frequently asked questions (FAQs) about asthma with consistent reliability, acceptability, and easy readability. METHODS: We collected 30 FAQs about asthma from the Global Initiative for Asthma website. ChatGPT was asked each question twice, by two different users, to assess for consistency. The responses were evaluated by five board-certified internal medicine physicians for reliability and acceptability. The consistency of responses was determined by the differences in evaluation between the two answers to the same question. The readability of all responses was measured using the Flesch Reading Ease Scale (FRES), the Flesch-Kincaid Grade Level (FKGL), and the Simple Measure of Gobbledygook (SMOG). RESULTS: Sixty responses were collected for evaluation. Fifty-six (93.33%) of the responses were of good reliability. The average rating of the responses was 3.65 out of 4 total points. 78.3% (n=47) of the responses were found acceptable by the evaluators to be the only answer for an asthmatic patient. Only two (6.67%) of the 30 questions had inconsistent answers. The average readability of all responses was determined to be 33.50±14.37 on the FRES, 12.79±2.89 on the FKGL, and 13.47±2.38 on the SMOG. CONCLUSION: Compared to online websites, we found that ChatGPT can be a reliable and acceptable source of information for asthma patients in terms of information quality. However, all responses were of difficult readability, and none followed the recommended readability levels. Therefore, the readability of this AI application requires improvement to be more suitable for patients.

17.

Evaluating large language models for health-related text classification tasks with public social media data.

Guo, Yuting; Ovadje, Anthony; Al-Garadi, Mohammed Ali; Sarker, Abeed.

J Am Med Inform Assoc ; 2024 Aug 09.

Artigo em Inglês | MEDLINE | ID: mdl-39121174

RESUMO

OBJECTIVES: Large language models (LLMs) have demonstrated remarkable success in natural language processing (NLP) tasks. This study aimed to evaluate their performances on social media-based health-related text classification tasks. MATERIALS AND METHODS: We benchmarked 1 Support Vector Machine (SVM), 3 supervised pretrained language models (PLMs), and 2 LLMs-based classifiers across 6 text classification tasks. We developed 3 approaches for leveraging LLMs: employing LLMs as zero-shot classifiers, using LLMs as data annotators, and utilizing LLMs with few-shot examples for data augmentation. RESULTS: Across all tasks, the mean (SD) F1 score differences for RoBERTa, BERTweet, and SocBERT trained on human-annotated data were 0.24 (±0.10), 0.25 (±0.11), and 0.23 (±0.11), respectively, compared to those trained on the data annotated using GPT3.5, and were 0.16 (±0.07), 0.16 (±0.08), and 0.14 (±0.08) using GPT4, respectively. The GPT3.5 and GPT4 zero-shot classifiers outperformed SVMs in a single task and in 5 out of 6 tasks, respectively. When leveraging LLMs for data augmentation, the RoBERTa models trained on GPT4-augmented data demonstrated superior or comparable performance compared to those trained on human-annotated data alone. DISCUSSION: The results revealed that using LLM-annotated data only for training supervised classification models was ineffective. However, employing the LLM as a zero-shot classifier exhibited the potential to outperform traditional SVM models and achieved a higher recall than the advanced transformer-based model RoBERTa. Additionally, our results indicated that utilizing GPT3.5 for data augmentation could potentially harm model performance. In contrast, data augmentation with GPT4 demonstrated improved model performances, showcasing the potential of LLMs in reducing the need for extensive training data. CONCLUSIONS: By leveraging the data augmentation strategy, we can harness the power of LLMs to develop smaller, more effective domain-specific NLP models. Using LLM-annotated data without human guidance for training lightweight supervised classification models is an ineffective strategy. However, LLM, as a zero-shot classifier, shows promise in excluding false negatives and potentially reducing the human effort required for data annotation.

18.

Performance of Large Language Models in Patient Complaint Resolution: Web-Based Cross-Sectional Survey.

Yong, Lorraine Pei Xian; Tung, Joshua Yi Min; Lee, Zi Yao; Kuan, Win Sen; Chua, Mui Teng.

J Med Internet Res ; 26: e56413, 2024 Aug 09.

Artigo em Inglês | MEDLINE | ID: mdl-39121468

RESUMO

BACKGROUND: Patient complaints are a perennial challenge faced by health care institutions globally, requiring extensive time and effort from health care workers. Despite these efforts, patient dissatisfaction remains high. Recent studies on the use of large language models (LLMs) such as the GPT models developed by OpenAI in the health care sector have shown great promise, with the ability to provide more detailed and empathetic responses as compared to physicians. LLMs could potentially be used in responding to patient complaints to improve patient satisfaction and complaint response time. OBJECTIVE: This study aims to evaluate the performance of LLMs in addressing patient complaints received by a tertiary health care institution, with the goal of enhancing patient satisfaction. METHODS: Anonymized patient complaint emails and associated responses from the patient relations department were obtained. ChatGPT-4.0 (OpenAI, Inc) was provided with the same complaint email and tasked to generate a response. The complaints and the respective responses were uploaded onto a web-based questionnaire. Respondents were asked to rate both responses on a 10-point Likert scale for 4 items: appropriateness, completeness, empathy, and satisfaction. Participants were also asked to choose a preferred response at the end of each scenario. RESULTS: There was a total of 188 respondents, of which 115 (61.2%) were health care workers. A majority of the respondents, including both health care and non-health care workers, preferred replies from ChatGPT (n=164, 87.2% to n=183, 97.3%). GPT-4.0 responses were rated higher in all 4 assessed items with all median scores of 8 (IQR 7-9) compared to human responses (appropriateness 5, IQR 3-7; empathy 4, IQR 3-6; quality 5, IQR 3-6; satisfaction 5, IQR 3-6; P<.001) and had higher average word counts as compared to human responses (238 vs 76 words). Regression analyses showed that a higher word count was a statistically significant predictor of higher score in all 4 items, with every 1-word increment resulting in an increase in scores of between 0.015 and 0.019 (all P<.001). However, on subgroup analysis by authorship, this only held true for responses written by patient relations department staff and not those generated by ChatGPT which received consistently high scores irrespective of response length. CONCLUSIONS: This study provides significant evidence supporting the effectiveness of LLMs in resolution of patient complaints. ChatGPT demonstrated superiority in terms of response appropriateness, empathy, quality, and overall satisfaction when compared against actual human responses to patient complaints. Future research can be done to measure the degree of improvement that artificial intelligence generated responses can bring in terms of time savings, cost-effectiveness, patient satisfaction, and stress reduction for the health care system.

Assuntos

Satisfação do Paciente , Humanos , Estudos Transversais , Satisfação do Paciente/estatística & dados numéricos , Feminino , Inquéritos e Questionários , Masculino , Adulto , Internet , Idioma , Pessoa de Meia-Idade , Correio Eletrônico

19.

Is AI my co-author? The ethics of using artificial intelligence in scientific publishing.

Moffatt, Barton; Hall, Alicia.

Account Res ; : 1-17, 2024 Aug 07.

Artigo em Inglês | MEDLINE | ID: mdl-39109816

RESUMO

The recent emergence of Large Language Models (LLMs) and other forms of Artificial Intelligence (AI) has led people to wonder whether they could act as an author on a scientific paper. This paper argues that AI systems should not be included on the author by-line. We agree with current commentators that LLMs are incapable of taking responsibility for their work and thus do not meet current authorship guidelines. We identify other problems with responsibility and authorship. In addition, the problems go deeper as AI tools also do not write in a meaningful sense nor do they have persistent identities. From a broader publication ethics perspective, adopting AI authorship would have detrimental effects on an already overly competitive and stressed publishing ecosystem. Deterrence is possible as backward-looking tools will likely be able to identify past AI usage. Finally, we question the value of using AI to produce more research simply for publication's sake.

20.

Leveraging a large language model to predict protein phase transition: A physical, multiscale, and interpretable approach.

Frank, Mor; Ni, Pengyu; Jensen, Matthew; Gerstein, Mark B.

Proc Natl Acad Sci U S A ; 121(33): e2320510121, 2024 Aug 13.

Artigo em Inglês | MEDLINE | ID: mdl-39110734

RESUMO

Protein phase transitions (PPTs) from the soluble state to a dense liquid phase (forming droplets via liquid-liquid phase separation) or to solid aggregates (such as amyloids) play key roles in pathological processes associated with age-related diseases such as Alzheimer's disease. Several computational frameworks are capable of separately predicting the formation of droplets or amyloid aggregates based on protein sequences, yet none have tackled the prediction of both within a unified framework. Recently, large language models (LLMs) have exhibited great success in protein structure prediction; however, they have not yet been used for PPTs. Here, we fine-tune a LLM for predicting PPTs and demonstrate its usage in evaluating how sequence variants affect PPTs, an operation useful for protein design. In addition, we show its superior performance compared to suitable classical benchmarks. Due to the "black-box" nature of the LLM, we also employ a classical random forest model along with biophysical features to facilitate interpretation. Finally, focusing on Alzheimer's disease-related proteins, we demonstrate that greater aggregation is associated with reduced gene expression in Alzheimer's disease, suggesting a natural defense mechanism.

Assuntos

Doença de Alzheimer , Transição de Fase , Doença de Alzheimer/metabolismo , Humanos , Amiloide/metabolismo , Amiloide/química , Proteínas/química , Proteínas/metabolismo

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA