ABSTRACT
Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. However, while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.
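As a rough illustration of the automated, iterative prompting and record-pruning steps described in this abstract, the sketch below queries a locally fine-tuned causal language model once per question about a pruned SRA record. The checkpoint path, prompt format, and pruning heuristic are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: iteratively prompting a fine-tuned causal LM to extract
# ChIP target and cell line from a pruned SRA record.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/fine-tuned-llama"   # assumption: a locally fine-tuned checkpoint
QUESTIONS = [
    "What protein was targeted by the ChIP experiment?",
    "What cell line or tissue was used?",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def prune_record(record: str, max_chars: int = 4000) -> str:
    """Crude pre-processing: drop empty/boilerplate lines and truncate long records."""
    lines = [ln for ln in record.splitlines() if ln.strip() and not ln.startswith("##")]
    return "\n".join(lines)[:max_chars]

def ask(record: str, question: str, max_new_tokens: int = 64) -> str:
    prompt = f"Record:\n{prune_record(record)}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return text.split("Answer:")[-1].strip()

# One record, several questions asked one at a time (iterative prompting):
# answers = {q: ask(sra_record_text, q) for q in QUESTIONS}
```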
Subjects
Medicine; Humans; Cell Line; Chromatin Immunoprecipitation; Databases, Factual; Language
ABSTRACT
The performance of deep Neural Networks (NNs) in the text (ChatGPT) and image (DALL-E2) domains has attracted worldwide attention. Convolutional NNs (CNNs), Large Language Models (LLMs), Denoising Diffusion Probabilistic Models (DDPMs)/Noise Conditional Score Networks (NCSNs), and Graph NNs (GNNs) have impacted computer vision, language editing and translation, automated conversation, image generation, and social network management. Proteins can be viewed as texts written with the alphabet of amino acids, as images, or as graphs of interacting residues. Each of these perspectives suggests the use of tools from a different area of deep learning for protein structural biology. Here, I review how CNNs, LLMs, DDPMs/NCSNs, and GNNs have led to major advances in protein structure prediction, inverse folding, protein design, and small molecule design. This review is primarily intended as a deep learning primer for practicing experimental structural biologists. However, extensive references to the deep learning literature should also make it relevant to readers who have a background in machine learning, physics or statistics, and an interest in protein structural biology.
ABSTRACT
The ability of recent Large Language Models (LLMs) such as GPT-3.5 and GPT-4 to generate human-like texts suggests that social scientists could use these LLMs to construct measures of semantic similarity that match human judgment. In this article, we provide an empirical test of this intuition. We use GPT-4 to construct a measure of typicality, the similarity of a text document to a concept. We evaluate its performance against other model-based typicality measures in terms of the correlation with human typicality ratings. We conduct this comparative analysis in two domains: the typicality of books in literary genres (using an existing dataset of book descriptions) and the typicality of tweets authored by US Congress members in the Democratic and Republican parties (using a novel dataset). The typicality measure produced with GPT-4 meets or exceeds the performance of the previous state-of-the-art typicality measure we introduced in a recent paper [G. Le Mens, B. Kovács, M. T. Hannan, G. Pros Rius, Sociol. Sci. 2023, 82-117 (2023)]. It accomplishes this without any training on the research data (it is zero-shot learning). This is a breakthrough because the previous state-of-the-art measure required fine-tuning an LLM on hundreds of thousands of text documents to achieve its performance.
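A minimal sketch of the evaluation logic described above: correlate model-derived typicality scores with human ratings. Rank correlation is used here as one plausible choice, and the arrays are placeholder values rather than the study's data.

```python
# Compare model-derived typicality scores with human ratings via rank correlation.
from scipy.stats import spearmanr

human_ratings = [4.5, 2.0, 3.5, 1.0, 5.0]       # hypothetical human typicality judgments
model_scores  = [0.82, 0.35, 0.60, 0.20, 0.91]  # hypothetical LLM-based typicality scores

rho, p_value = spearmanr(human_ratings, model_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```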
ABSTRACT
In recent years, there has been a surge in the publication of clinical trial reports, making it challenging to conduct systematic reviews. Automatically extracting Population, Intervention, Comparator, and Outcome (PICO) elements from clinical trial studies can alleviate the traditionally time-consuming process of manually scrutinizing systematic reviews. Existing approaches to PICO frame extraction are supervised and rely on manually annotated data points in the form of BIO label tagging. More recent approaches, such as In-Context Learning (ICL), which has been shown to be effective for a number of downstream NLP tasks, still require labeled examples. In this work, we adopt an ICL strategy that exploits the knowledge gathered by Large Language Models (LLMs) during pretraining to automatically extract PICO-related terminology from clinical trial documents in an unsupervised setup, bypassing the need for a large number of annotated data instances. Additionally, to showcase the effectiveness of LLMs in an oracle scenario where a large number of annotated samples is available, we adopt an instruction-tuning strategy, employing Low-Rank Adaptation (LoRA) to train a very large model in a low-resource environment for the PICO frame extraction task. More specifically, both proposed frameworks use AlpaCare as the base LLM, applying few-shot in-context learning and instruction tuning, respectively, to extract PICO-related terms from clinical trial reports. We applied these approaches to widely used coarse-grained datasets (EBM-NLP, EBM-COMET) and fine-grained datasets (EBM-NLPrev, EBM-NLPh). Our empirical results show that the proposed ICL-based framework produces comparable results across the EBM-NLP dataset versions, while the instruction-tuned version achieves state-of-the-art results on all of them. Our project is available at https://github.com/shrimonmuke0202/AlpaPICO.git.
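The instruction-tuning arm described above relies on LoRA to adapt a large base model cheaply. The sketch below shows how such an adapter can be attached with the Hugging Face peft library; the checkpoint path, hyperparameters, and prompt format are illustrative assumptions, not the paper's settings.

```python
# Illustrative LoRA setup for instruction tuning a causal LM on PICO extraction.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "path/to/alpacare-checkpoint"       # assumption: an AlpaCare-style base model
model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections typically adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # only the LoRA adapters are trainable

# Hypothetical instruction-style training example:
# "Extract the Population, Intervention, Comparator and Outcome spans from: <abstract>"
```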
Subjects
Clinical Trials as Topic; Natural Language Processing; Humans; Clinical Trials as Topic/methods; Data Mining/methods; Machine Learning
ABSTRACT
BACKGROUND: There has been considerable advancement in AI technologies such as LLMs and machine learning to support biomedical knowledge discovery. MAIN BODY: We propose a novel biomedical neural search service called 'VAIV Bio-Discovery', which supports enhanced knowledge discovery and document search on unstructured text such as PubMed. It mainly handles information related to chemical compounds/drugs, genes/proteins, diseases, and their interactions (chemical compound/drug-protein/gene, including drug-target, drug-drug, and drug-disease). To provide comprehensive knowledge, the system offers four search options: basic search, entity search, interaction search, and natural language search. We employ T5slim_dec, which adapts the autoregressive generation task of the T5 (text-to-text transfer transformer) to the interaction extraction task by removing the self-attention layer in the decoder block. The system also assists in interpreting research findings by summarizing the retrieved search results for a given natural language query with Retrieval Augmented Generation (RAG). The search engine is built with a hybrid method that combines neural search with probabilistic BM25 search. CONCLUSION: As a result, our system can better understand the context, semantics, and relationships between terms within a document, enhancing search accuracy. This research contributes to the rapidly evolving biomedical field by introducing a new service for accessing and discovering relevant knowledge.
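A minimal sketch of the hybrid retrieval idea mentioned above, combining a BM25 score with a dense-embedding similarity through a weighted sum. The libraries, encoder model, normalization, and mixing weight are assumptions for illustration and not the VAIV Bio-Discovery implementation.

```python
# Hybrid ranking: normalized BM25 (lexical) mixed with cosine similarity (neural).
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "Imatinib inhibits BCR-ABL tyrosine kinase activity.",
    "Metformin is used to treat type 2 diabetes.",
]
query = "Which drug targets BCR-ABL?"

bm25 = BM25Okapi([d.lower().split() for d in docs])
scores = bm25.get_scores(query.lower().split())
lo, hi = float(min(scores)), float(max(scores))
lex = [(s - lo) / ((hi - lo) or 1.0) for s in scores]   # min-max normalize BM25 scores

encoder = SentenceTransformer("all-MiniLM-L6-v2")       # assumed general-purpose encoder
doc_emb = encoder.encode(docs, convert_to_tensor=True)
q_emb = encoder.encode(query, convert_to_tensor=True)
dense = util.cos_sim(q_emb, doc_emb)[0].tolist()

alpha = 0.5  # assumed mixing weight between lexical and neural evidence
hybrid = [alpha * l + (1 - alpha) * d for l, d in zip(lex, dense)]
print(docs[max(range(len(docs)), key=lambda i: hybrid[i])])
```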
Subjects
Natural Language Processing; Data Mining/methods; Knowledge Discovery/methods; PubMed; Search Engine; Machine Learning; Information Storage and Retrieval/methods; Neural Networks, Computer
ABSTRACT
PURPOSE: Large language models (LLMs) are a form of artificial intelligence (AI) that uses deep learning techniques to understand, summarize, and generate content. The potential benefits of LLMs in healthcare are predicted to be immense. The objective of this study was to examine the quality of patient information leaflets (PILs) produced by 3 LLMs on urological topics. METHODS: Prompts were created to generate PILs from 3 LLMs: ChatGPT-4, PaLM 2 (Google Bard), and Llama 2 (Meta) across four urology topics (circumcision, nephrectomy, overactive bladder syndrome, and transurethral resection of the prostate [TURP]). PILs were evaluated using a quality assessment checklist. PIL readability was assessed with the Average Reading Level Consensus Calculator. RESULTS: PILs generated by PaLM 2 had the highest overall average quality score (3.58), followed by Llama 2 (3.34) and ChatGPT-4 (3.08). PaLM 2-generated PILs were of the highest quality in all topics except TURP, and PaLM 2 was the only LLM to include images. Medical inaccuracies were present in all generated content, including instances of significant error. Readability analysis identified PaLM 2-generated PILs as the simplest (age 14-15 average reading level) and Llama 2 PILs as the most difficult (age 16-17 average). CONCLUSION: While LLMs can generate PILs that may help reduce healthcare professional workload, generated content requires clinician input for accuracy and for the inclusion of health literacy aids, such as images. LLM-generated PILs were above the average reading level for adults, necessitating improvement in LLM algorithms and/or prompt design. Patient satisfaction with LLM-generated PILs remains to be evaluated.
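The readability assessment above used the Average Reading Level Consensus Calculator. As a rough, openly available stand-in, the sketch below estimates a reading grade level for a leaflet excerpt with the textstat package; the excerpt is invented for illustration.

```python
# Approximate reading-level estimation for a generated leaflet excerpt.
import textstat

leaflet = (
    "A circumcision is an operation to remove the foreskin. "
    "You may feel sore for a few days and should rest at home."
)
print(textstat.flesch_kincaid_grade(leaflet))   # approximate US school grade level
print(textstat.text_standard(leaflet))          # consensus of several readability formulas
```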
Subjects
Artificial Intelligence; Urology; Humans; Patient Education as Topic/methods; Language; Urologic Diseases/surgery
ABSTRACT
PURPOSE: This study is a comparative analysis of three Large Language Models (LLMs), evaluating their rate of correct answers (RoCA) and the reliability of generated answers on a set of urological knowledge-based questions spanning different levels of complexity. METHODS: ChatGPT-3.5, ChatGPT-4, and Bing AI underwent two testing rounds, with a 48-h gap in between, using the 100 multiple-choice questions from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA). For conflicting responses, an additional consensus round was conducted to establish conclusive answers. RoCA was compared across various question complexities. Ten weeks after the consensus round, a subsequent testing round was conducted to assess potential knowledge gain and improvement in RoCA. RESULTS: Over three testing rounds, ChatGPT-3.5 achieved RoCA scores of 58%, 62%, and 59%. In contrast, ChatGPT-4 achieved RoCA scores of 63%, 77%, and 77%, while Bing AI yielded scores of 81%, 73%, and 77%, respectively. Agreement rates between rounds 1 and 2 were 84% (κ = 0.67, p < 0.001) for ChatGPT-3.5, 74% (κ = 0.40, p < 0.001) for ChatGPT-4, and 76% (κ = 0.33, p < 0.001) for Bing AI. In the consensus round, ChatGPT-4 and Bing AI significantly outperformed ChatGPT-3.5 (77% and 77% vs. 59%, both p = 0.010). All LLMs demonstrated decreasing RoCA scores with increasing question complexity (p < 0.001). In the fourth round, no significant improvement in RoCA was observed across all three LLMs. CONCLUSIONS: The performance of the tested LLMs in addressing urological specialist inquiries warrants further refinement. Moreover, the deficiency in response reliability contributes to existing challenges related to their current utility for educational purposes.
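A minimal sketch of the round-to-round reliability analysis described above: raw agreement and Cohen's kappa between two testing rounds. The answer lists are placeholders, not the study's responses.

```python
# Agreement between two testing rounds of multiple-choice answers.
from sklearn.metrics import cohen_kappa_score

round1 = ["A", "C", "B", "D", "A", "B", "C", "A"]   # hypothetical answers, round 1
round2 = ["A", "C", "B", "B", "A", "B", "C", "D"]   # hypothetical answers, round 2

agreement = sum(a == b for a, b in zip(round1, round2)) / len(round1)
kappa = cohen_kappa_score(round1, round2)
print(f"raw agreement = {agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```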
Subjects
Artificial Intelligence; Urology; Humans; Reproducibility of Results; Physical Examination; Language
ABSTRACT
Drug-drug interactions (DDIs) present a significant health burden, compounded by clinician time constraints and poor patient health literacy. We assessed the ability of ChatGPT (a generative artificial intelligence-based large language model) to predict DDIs in a real-world setting. Demographics, diagnoses, and prescribed medicines for 120 hospitalized patients were input through three standardized prompts to ChatGPT version 3.5 and compared against pharmacist DDI evaluation to estimate diagnostic accuracy. The area under the receiver operating characteristic curve and inter-rater reliability (Cohen's and Fleiss' kappa coefficients) were calculated. ChatGPT's responses differed based on prompt wording style, with higher sensitivity for prompts mentioning 'drug interaction'. Confusion matrices displayed low true positive and high true negative rates, and there was minimal agreement between ChatGPT and pharmacists (Cohen's kappa values 0.077-0.143). Low sensitivity values suggest a lack of success in identifying DDIs by ChatGPT, and further development is required before it can reliably assess potential DDIs in real-world scenarios.
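A short sketch of the diagnostic-accuracy calculation described above: building a confusion matrix of the model's DDI calls against the pharmacist reference and deriving sensitivity and specificity. The labels are placeholder data.

```python
# Confusion-matrix-based sensitivity and specificity for DDI detection.
from sklearn.metrics import confusion_matrix

pharmacist = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = clinically relevant DDI present (reference)
chatgpt    = [0, 0, 1, 0, 0, 1, 1, 0]   # 1 = DDI flagged by the model (hypothetical)

tn, fp, fn, tp = confusion_matrix(pharmacist, chatgpt).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```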
ABSTRACT
With the advancement of artificial intelligence (AI), platforms like ChatGPT have gained traction in different fields, including medicine. This study aims to evaluate the potential of ChatGPT in addressing questions related to HIV prevention and to assess its accuracy, completeness, and inclusivity. A team consisting of 15 physicians, six members from HIV communities, and three experts in gender and queer studies designed an assessment of ChatGPT. Queries were categorized into five thematic groups: general HIV information, behaviors increasing HIV acquisition risk, HIV and pregnancy, HIV testing, and prophylaxis use. A team of medical doctors was in charge of developing the questions to be submitted to ChatGPT. The other members critically assessed the generated responses regarding level of expertise, accuracy, completeness, and inclusivity. The median accuracy score was 5.5 out of 6, with 88.4% of responses achieving a score ≥ 5. Completeness had a median of 3 out of 3, while the median for inclusivity was 2 out of 3. Some thematic groups, such as behaviors associated with HIV transmission and prophylaxis, exhibited higher accuracy, indicating variable performance across different topics. Issues of inclusivity were identified, notably the use of outdated terms and a lack of representation for some communities. ChatGPT demonstrates significant potential in providing accurate information on HIV-related topics. However, while responses were often scientifically accurate, they sometimes lacked the socio-political context and inclusivity essential for effective health communication. This underlines the importance of aligning AI-driven platforms with contemporary health communication strategies and ensuring a balance of accuracy and inclusivity.
Subjects
HIV Infections; Humans; HIV Infections/prevention & control; Female; Male; Communication; Artificial Intelligence; HIV Testing; Health Communication/methods; Health Knowledge, Attitudes, Practice
ABSTRACT
We compared different LLMs, namely ChatGPT 3.5, GPT-4, and Google Bard, and tested whether their performance differs across subspecialty domains by having them take examinations from four different courses of the European Society of Neuroradiology (ESNR): anatomy/embryology, neuro-oncology, head and neck, and pediatrics. Written ESNR exams were used as input data, covering anatomy/embryology (30 questions), neuro-oncology (50 questions), head and neck (50 questions), and pediatrics (50 questions). All exams together, and each exam separately, were presented to the three LLMs: ChatGPT 3.5, GPT-4, and Google Bard. Statistical analyses included a group-wise Friedman test followed by pair-wise Wilcoxon tests with multiple-comparison corrections. Overall, there was a significant difference between the three LLMs (p < 0.0001), with GPT-4 having the highest accuracy (70%), followed by ChatGPT 3.5 (54%) and Google Bard (36%). The pair-wise comparisons showed significant differences between ChatGPT 3.5 vs GPT-4 (p < 0.0001), ChatGPT 3.5 vs Bard (p < 0.0023), and GPT-4 vs Bard (p < 0.0001). Analyses per subspecialty showed the largest difference between the best LLM (GPT-4, 70%) and the worst LLM (Google Bard, 24%) in the head and neck exam, while the difference was least pronounced in neuro-oncology (GPT-4, 62% vs Google Bard, 48%). We observed significant differences in the performance of the three LLMs on official exams organized by the ESNR. Overall, GPT-4 performed best and Google Bard worst. This difference varied by subspecialty and was most pronounced in the head and neck subspecialty.
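A sketch of the statistical workflow described above: a group-wise Friedman test over per-question correctness for the three models, followed by pair-wise Wilcoxon tests with a multiple-comparison correction. The scores are randomly generated placeholders, and Bonferroni is used here as one possible correction.

```python
# Friedman test across three models, then pairwise Wilcoxon with correction.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
gpt4    = rng.integers(0, 2, 50)   # 1 = correct answer on a question (hypothetical)
chatgpt = rng.integers(0, 2, 50)
bard    = rng.integers(0, 2, 50)

print(friedmanchisquare(gpt4, chatgpt, bard))

pairs = [(gpt4, chatgpt), (gpt4, bard), (chatgpt, bard)]
raw_p = [wilcoxon(a, b, zero_method="zsplit").pvalue for a, b in pairs]
print(multipletests(raw_p, method="bonferroni")[1])   # corrected p-values
```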
Subjects
Societies, Medical; Humans; Europe; Educational Measurement; Radiology/education; Neuroradiography
ABSTRACT
BACKGROUND: Artificial intelligence (AI) large language models (LLMs) such as ChatGPT have demonstrated the ability to pass standardized exams. These models are not trained for a specific task, but instead trained to predict sequences of text from large corpora of documents sourced from the internet. It has been shown that even models trained on this general task can pass exams in a variety of domain-specific fields, including the United States Medical Licensing Examination. We asked whether large language models would perform as well on much narrower subdomain tests designed for medical specialists. Furthermore, we wanted to better understand how progressive generations of GPT (generative pre-trained transformer) models may be evolving in the completeness and sophistication of their responses even while generational training remains general. In this study, we evaluated the performance of two versions of GPT (GPT-3 and GPT-4) on their ability to pass the certification exam given to physicians to work as osteoporosis specialists and become certified clinical densitometrists (CCDs). The CCD exam has a possible score range of 150 to 400; a score of 300 is required to pass. METHODS: A 100-question multiple-choice practice exam was obtained from a third-party exam preparation website that mimics the accredited certification tests given by the ISCD (International Society for Clinical Densitometry). The exam was administered to two versions of GPT, the free version (GPT Playground) and ChatGPT+, which are based on GPT-3 and GPT-4, respectively (OpenAI, San Francisco, CA). The systems were prompted with the exam questions verbatim. If the response was purely textual and did not specify which of the multiple-choice answers to select, the authors matched the text to the closest answer. Each exam was graded, and an estimated ISCD score was provided by the exam website. In addition, each response was evaluated by a rheumatologist CCD and ranked for accuracy using a 5-level scale. The two GPT versions were compared in terms of response accuracy and length. RESULTS: The average response length was 11.6 ± 19 words for GPT-3 and 50.0 ± 43.6 words for GPT-4. GPT-3 answered 62 questions correctly, resulting in a failing ISCD score of 289. However, GPT-4 answered 82 questions correctly, with a passing score of 342. GPT-3 scored highest on the "Overview of Low Bone Mass and Osteoporosis" category (72% correct), while GPT-4 scored well above 80% accuracy on all categories except "Imaging Technology in Bone Health" (65% correct). Regarding subjective accuracy, GPT-3 answered 23 questions with nonsensical or totally wrong responses, while GPT-4 had no responses in that category. CONCLUSION: If this had been an actual certification exam, GPT-4 would now have a CCD suffix to its name, even after being trained using general internet knowledge. Clearly, more goes into physician training than can be captured in this exam. However, GPT algorithms may prove to be valuable physician aids in the diagnosis and monitoring of osteoporosis and other diseases.
Subjects
Artificial Intelligence; Certification; Humans; Osteoporosis/diagnosis; Clinical Competence; Educational Measurement/methods; United States
ABSTRACT
PURPOSE: The aim of this study was to define the capability of ChatGPT-4 and Google Gemini in analyzing detailed glaucoma case descriptions and suggesting an accurate surgical plan. METHODS: A retrospective analysis of 60 medical records of surgical glaucoma cases was divided into "ordinary" (n = 40) and "challenging" (n = 20) scenarios. Case descriptions were entered into the ChatGPT and Bard interfaces with the question "What kind of surgery would you perform?" and repeated three times to analyze the consistency of the answers. After collecting the answers, we assessed the level of agreement with the unified opinion of three glaucoma surgeons. Moreover, we graded the quality of the responses with scores from 1 (poor quality) to 5 (excellent quality) according to the Global Quality Score (GQS) and compared the results. RESULTS: ChatGPT's surgical choice was consistent with that of the glaucoma specialists in 35/60 cases (58%), compared with 19/60 (32%) for Gemini (p = 0.0001). Gemini was not able to complete the task in 16 cases (27%). Trabeculectomy was the most frequent choice for both chatbots (53% and 50% for ChatGPT and Gemini, respectively). In "challenging" cases, ChatGPT agreed with the specialists in 9/20 choices (45%), outperforming Google Gemini (4/20, 20%). Overall, GQS scores were 3.5 ± 1.2 and 2.1 ± 1.5 for ChatGPT and Gemini, respectively (p = 0.002). This difference was even more marked when focusing only on "challenging" cases (1.5 ± 1.4 vs. 3.0 ± 1.5, p = 0.001). CONCLUSION: ChatGPT-4 showed good analysis performance for glaucoma surgical cases, whether ordinary or challenging. In contrast, Google Gemini showed strong limitations in this setting, presenting high rates of imprecise or missing answers.
Subjects
Glaucoma; Humans; Retrospective Studies; Glaucoma/surgery; Glaucoma/physiopathology; Female; Male; Trabeculectomy/methods; Intraocular Pressure/physiology; Aged; Middle Aged
ABSTRACT
Adverse drug events (ADEs) are one of the major causes of hospital admissions and are associated with increased morbidity and mortality. Post-marketing ADE identification is one of the most important phases of drug safety surveillance. Traditionally, data sources for post-marketing surveillance mainly come from spontaneous reporting systems such as the Food and Drug Administration Adverse Event Reporting System (FAERS). Social media data such as posts on X (formerly Twitter) contain rich patient and medication information and could potentially accelerate drug surveillance research. However, ADE information in social media data is usually locked in the text, making it difficult to employ with traditional statistical approaches. In recent years, large language models (LLMs) have shown promise in many natural language processing tasks. In this study, we developed several LLMs to perform ADE classification on X data. We fine-tuned various LLMs including BERT-base, Bio_ClinicalBERT, RoBERTa, and RoBERTa-large. We also experimented with ChatGPT few-shot prompting and with ChatGPT fine-tuned on the whole training data. We then evaluated model performance based on sensitivity, specificity, negative predictive value, positive predictive value, accuracy, F1-measure, and area under the ROC curve. Our results showed that RoBERTa-large achieved the best F1-measure (0.8) among all models, followed by the fine-tuned ChatGPT model with an F1-measure of 0.75. Our feature importance analysis, based on 1200 random samples and RoBERTa-large, showed that the most important features were "withdrawals"/"withdrawal", "dry", "dealing", "mouth", and "paralysis". The good model performance and clinically relevant features show the potential of LLMs to augment ADE detection for post-marketing drug safety surveillance.
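A condensed sketch of fine-tuning a RoBERTa-style classifier for binary ADE detection, in the spirit of the encoder models compared above. The toy dataset and hyperparameters are illustrative assumptions rather than the study's configuration.

```python
# Fine-tune a sequence classifier on a toy ADE dataset.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

data = Dataset.from_dict({
    "text": ["this med gave me terrible dry mouth", "feeling great after the new dose"],
    "label": [1, 0],                                  # 1 = post mentions an ADE
})

tok = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=64)

data = data.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ade-roberta", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```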
ABSTRACT
Healthcare 4.0 describes the future transformation of the healthcare sector driven by the combination of digital technologies, such as artificial intelligence (AI), big data and the Internet of Medical Things, enabling the advancement of precision medicine. This overview article addresses various areas such as large language models (LLM), diagnostics and robotics, shedding light on the positive aspects of Healthcare 4.0 and showcasing exciting methods and application examples in cardiology. It delves into the broad knowledge base and enormous potential of LLMs, highlighting their immediate benefits as digital assistants or for administrative tasks. In diagnostics, the increasing usefulness of wearables is emphasized and an AI for predicting heart filling pressures based on cardiac magnetic resonance imaging (MRI) is introduced. Additionally, it discusses the revolutionary methodology of a digital simulation of the physical heart (digital twin). Finally, it addresses both regulatory frameworks and a brief vision of data-driven healthcare delivery, explaining the need for investments in technical personnel and infrastructure to achieve a more effective medicine.
Subjects
Artificial Intelligence; Artificial Intelligence/trends; Delivery of Health Care/trends; Humans; Forecasting; Telemedicine/trends; Germany; Cardiology/trends; Precision Medicine/trends; Precision Medicine/methods; Robotics/trends
ABSTRACT
OBJECTIVES: The rising diversity of food preferences and the desire to provide better personalized care pose challenges to renal dietitians working in dialysis clinics. To address this situation, we explored the use of a large language model, specifically ChatGPT using the GPT-4 model (openai.com), to support nutritional advice given to dialysis patients. METHODS: We tasked ChatGPT-4 with generating a personalized daily meal plan, including nutritional information. Virtual "patients" were generated through Monte Carlo simulation; data from a randomly selected virtual patient were presented to ChatGPT. We provided ChatGPT with patient demographics, food preferences, laboratory data, clinical characteristics, and available budget to generate a one-day sample menu with recipes and nutritional analyses. The resulting daily recipe recommendations, cooking instructions, and nutritional analyses were reviewed and rated on a five-point Likert scale by an experienced renal dietitian. In addition, the generated content was rated by a renal dietitian and compared with U.S. Department of Agriculture-approved nutrient analysis software. ChatGPT also analyzed the nutrition information of two recipes published online. We also requested a translation of the output into Spanish, Mandarin, Hungarian, German, and Dutch. RESULTS: ChatGPT generated a daily menu with five recipes. The renal dietitian rated the recipes at 3 (3, 3) [median (Q1, Q3)], the cooking instructions at 5 (5, 5), and the nutritional analysis at 2 (2, 2) on the five-point Likert scale. ChatGPT's nutritional analysis underestimated calories by 36% (95% CI: 44-88%), protein by 28% (25-167%), fat by 48% (29-81%), phosphorus by 54% (15-102%), potassium by 49% (40-68%), and sodium by 53% (14-139%). The nutritional analysis of recipes available online differed by only 0 to 35%. The translations were rated as reliable by native speakers (4 on the five-point Likert scale). CONCLUSION: While ChatGPT-4 shows promise in providing personalized nutritional guidance for diverse dialysis patients, improvements are necessary. This study highlights the importance of thorough qualitative and quantitative evaluation of artificial intelligence-generated content, especially regarding medical use cases.
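A minimal sketch of the Monte Carlo step described above: drawing a virtual dialysis "patient" with random demographics, laboratory values, preferences, and budget, and assembling them into a prompt. Distributions, ranges, and the prompt wording are assumptions, not the study's simulation.

```python
# Generate one virtual patient and a corresponding meal-plan prompt.
import numpy as np

rng = np.random.default_rng(42)

def virtual_patient():
    return {
        "age": int(rng.normal(65, 12)),
        "serum_potassium_mmol_l": round(float(rng.normal(5.0, 0.6)), 1),
        "serum_phosphorus_mg_dl": round(float(rng.normal(5.5, 1.2)), 1),
        "food_preference": rng.choice(["vegetarian", "Mediterranean", "low-budget"]),
        "daily_budget_usd": int(rng.uniform(8, 25)),
    }

p = virtual_patient()
prompt = (f"Create a one-day renal diet menu for a {p['age']}-year-old dialysis patient "
          f"(K {p['serum_potassium_mmol_l']} mmol/L, P {p['serum_phosphorus_mg_dl']} mg/dL), "
          f"preference: {p['food_preference']}, budget ${p['daily_budget_usd']}/day.")
print(prompt)
```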
ABSTRACT
BACKGROUND: Allergy disorders caused by biological particles, such as the proteins in some airborne pollen grains, are currently considered one of the most common chronic diseases, and European Academy of Allergy and Clinical Immunology forecasts indicate that within 15 years 50% of Europeans will have some kind of allergy as a consequence of urbanization, industrialization, pollution, and climate change. OBJECTIVE: The aim of this study was to monitor and analyze the dissemination of information about pollen symptoms from December 2006 to January 2022. By conducting a comprehensive evaluation of public comments and trends on Twitter, the research sought to provide valuable insights into the impact of pollen on sensitive individuals, ultimately enhancing our understanding of how pollen-related information spreads and its implications for public health awareness. METHODS: Using a blend of large language models, dimensionality reduction, unsupervised clustering, and term frequency-inverse document frequency (TF-IDF), alongside visual representations such as word clouds and semantic interaction graphs, our study analyzed Twitter data to uncover insights on respiratory allergies. This methodology enabled the extraction of significant themes and patterns, offering a deep dive into public knowledge and discussions surrounding respiratory allergies on Twitter. RESULTS: The months between March and August had the highest volume of messages. The percentage of patient tweets appeared to increase notably during the later years, and there was also a potential increase in the prevalence of symptoms, mainly in the morning hours, indicating a potential rise in pollen allergies and related discussions on social media. While pollen allergy is a global issue, specific sociocultural, political, and economic contexts mean that patients experience symptoms at a localized level, requiring appropriately localized responses. CONCLUSIONS: The interpretation of tweet information represents a valuable tool for taking preventive measures that mitigate the impact of pollen allergy on sensitive patients, helping to achieve equity in living conditions and to enhance access to health information and services.
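A compact sketch of the analysis pipeline described in the Methods above: embed short posts with a language model, reduce dimensionality, cluster, and use TF-IDF to surface characteristic terms per cluster. The encoder, parameters, and toy tweets are assumptions, not the study's exact setup.

```python
# Embed -> reduce -> cluster -> TF-IDF keywords per cluster.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "my eyes are itchy and i keep sneezing this morning",
    "birch pollen count is brutal today",
    "antihistamines are the only thing getting me through spring",
    "another sleepless night with a blocked nose",
]

emb = SentenceTransformer("all-MiniLM-L6-v2").encode(tweets)
reduced = PCA(n_components=2).fit_transform(emb)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(tweets)
terms = np.array(tfidf.get_feature_names_out())
for c in set(labels):
    weights = np.asarray(X[labels == c].mean(axis=0)).ravel()
    print(c, terms[weights.argsort()[-3:]])   # top terms per cluster
```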
Subjects
Pollen; Social Media; Social Media/statistics & numerical data; Pollen/adverse effects; Humans; Retrospective Studies; Rhinitis, Allergic, Seasonal/epidemiology; Information Dissemination/methods; Allergens
ABSTRACT
BACKGROUND: Artificial intelligence and the language models derived from it, such as ChatGPT, offer immense possibilities, particularly in the field of medicine. It is already evident that ChatGPT can provide adequate and, in some cases, expert-level responses to health-related queries and advice for patients. However, it is currently unknown how patients perceive these capabilities, whether they can derive benefit from them, and whether potential risks, such as harmful suggestions, are detected by patients. OBJECTIVE: This study aims to clarify whether patients can obtain useful and safe health care advice from an artificial intelligence chatbot assistant. METHODS: This cross-sectional study was conducted using 100 publicly available health-related questions from 5 medical specialties (trauma, general surgery, otolaryngology, pediatrics, and internal medicine) from a web-based platform for patients. Responses generated by ChatGPT-4.0 and by an expert panel (EP) of experienced physicians from the aforementioned web-based platform were packed into 10 sets consisting of 10 questions each. The blinded evaluation was carried out by patients regarding empathy and usefulness (assessed through the question "Would this answer have helped you?") on a scale from 1 to 5. As a control, the evaluation was also performed by 3 physicians in each respective medical specialty, who were additionally asked about the potential harm of the response and its correctness. RESULTS: In total, 200 sets of questions were submitted by 64 patients (mean 45.7, SD 15.9 years; 29/64, 45.3% male), resulting in 2000 evaluated answers each for ChatGPT and the EP. ChatGPT scored higher in terms of empathy (4.18 vs 2.7; P<.001) and usefulness (4.04 vs 2.98; P<.001). Subanalysis revealed a small difference in the empathy ratings given by women compared with men (4.46 vs 4.14; P=.049). Ratings of ChatGPT were high regardless of the participant's age. The same highly significant results were observed in the evaluation by the respective specialist physicians. ChatGPT outperformed significantly in correctness (4.51 vs 3.55; P<.001). Specialists rated the usefulness (3.93 vs 4.59) and correctness (4.62 vs 3.84) significantly lower for potentially harmful responses from ChatGPT (P<.001); this was not the case among patients. CONCLUSIONS: The results indicate that ChatGPT is capable of supporting patients in health-related queries better than physicians, at least in terms of written advice through a web-based platform. In this study, ChatGPT's responses had a lower percentage of potentially harmful advice than those of the web-based EP. However, it is crucial to note that this finding is based on a specific study design and may not generalize to all health care settings. Alarmingly, patients are not able to independently recognize these potential dangers.
Subjects
Physician-Patient Relations; Humans; Cross-Sectional Studies; Male; Female; Adult; Middle Aged; Artificial Intelligence; Physicians/psychology; Internet; Empathy; Surveys and Questionnaires
ABSTRACT
BACKGROUND: Alzheimer's disease (AD) is a progressive neurodegenerative disorder posing challenges to patients, caregivers, and society. Accessible and accurate information is crucial for effective AD management. OBJECTIVE: This study aimed to evaluate the accuracy, comprehensibility, clarity, and usefulness of the Generative Pretrained Transformer's (GPT) answers concerning the management and caregiving of patients with AD. METHODS: In total, 14 questions related to the prevention, treatment, and care of AD were identified and posed to GPT-3.5 and GPT-4 in both Chinese and English, and 4 respondent neurologists were asked to answer them. We generated 8 sets of responses (112 in total) and randomly coded them in answer sheets. Next, 5 evaluator neurologists and 5 family members of patients were asked to rate the 112 responses using separate 5-point Likert scales. We evaluated the quality of the responses using a set of 8 questions rated on a 5-point Likert scale. To gauge comprehensibility and participant satisfaction, we included 3 questions dedicated to each aspect within the same set of 8 questions. RESULTS: As of April 10, 2023, the 5 evaluator neurologists and 5 family members of patients with AD had rated the 112 responses (GPT-3.5: n=28, 25%; GPT-4: n=28, 25%; respondent neurologists: n=56, 50%). Among the top 5 (4.5%) responses rated by the evaluator neurologists, 4 (80%) were GPT (GPT-3.5 + GPT-4) responses and 1 (20%) was a respondent neurologist's response. Among the top 5 (4.5%) responses rated by patients' family members, all but the third were GPT responses. Based on the evaluation by neurologists, the neurologist-generated responses achieved a mean score of 3.9 (SD 0.7), while the GPT-generated responses scored significantly higher (mean 4.4, SD 0.6; P<.001). Language and model analyses revealed no significant differences in response quality between the GPT-3.5 and GPT-4 models (GPT-3.5: mean 4.3, SD 0.7; GPT-4: mean 4.4, SD 0.5; P=.51). However, English responses outperformed Chinese responses in terms of comprehensibility (Chinese responses: mean 4.1, SD 0.7; English responses: mean 4.6, SD 0.5; P=.005) and participant satisfaction (Chinese responses: mean 4.2, SD 0.8; English responses: mean 4.5, SD 0.5; P=.04). According to the evaluator neurologists' review, Chinese responses had a mean score of 4.4 (SD 0.6), whereas English responses had a mean score of 4.5 (SD 0.5; P=.002). As for the family members of patients with AD, no significant differences were observed between GPT and neurologists, GPT-3.5 and GPT-4, or Chinese and English responses. CONCLUSIONS: GPT can provide patient education materials on AD for patients, their families and caregivers, nurses, and neurologists. This capability can contribute to effective health care management of patients with AD, leading to enhanced patient outcomes.
Subjects
Alzheimer Disease; Artificial Intelligence; Neurologists; Humans; Alzheimer Disease/therapy; Alzheimer Disease/psychology; Male; Caregivers; Female; Surveys and Questionnaires
ABSTRACT
This study explores the potential of using large language models to assist content analysis by conducting a case study to identify adverse events (AEs) in social media posts. The case study compares ChatGPT's performance with human annotators' in detecting AEs associated with delta-8-tetrahydrocannabinol, a cannabis-derived product. Using the identical instructions given to human annotators, ChatGPT closely approximated human results, with a high degree of agreement noted: 94.4% (9436/10,000) for any AE detection (Fleiss κ=0.95) and 99.3% (9931/10,000) for serious AEs (κ=0.96). These findings suggest that ChatGPT has the potential to replicate human annotation accurately and efficiently. The study recognizes possible limitations, including concerns about the generalizability due to ChatGPT's training data, and prompts further research with different models, data sources, and content analysis tasks. The study highlights the promise of large language models for enhancing the efficiency of biomedical research.
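A short sketch of the agreement analysis described above: Fleiss' kappa across several raters (here, two human annotators and ChatGPT treated as a third), plus simple percent agreement. The 0/1 labels are placeholder data.

```python
# Inter-rater agreement for AE labels across human annotators and ChatGPT.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = posts, columns = raters; 1 = adverse event mentioned
ratings = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])

table, _ = aggregate_raters(ratings)          # counts of each category per post
print("Fleiss' kappa:", round(fleiss_kappa(table), 2))

chatgpt, human1 = ratings[:, 2], ratings[:, 0]
print("percent agreement (ChatGPT vs one annotator):", np.mean(chatgpt == human1))
```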
Subjects
Social Media; Humans; Social Media/statistics & numerical data; Dronabinol/adverse effects; Natural Language Processing
ABSTRACT
BACKGROUND: The release of ChatGPT (OpenAI) in November 2022 drastically reduced the barrier to using artificial intelligence by allowing a simple web-based text interface to a large language model (LLM). One use case where ChatGPT could be useful is in triaging patients at the site of a disaster using the Simple Triage and Rapid Treatment (START) protocol. However, LLMs experience several common errors including hallucinations (also called confabulations) and prompt dependency. OBJECTIVE: This study addresses the research problem: "Can ChatGPT adequately triage simulated disaster patients using the START protocol?" by measuring three outcomes: repeatability, reproducibility, and accuracy. METHODS: Nine prompts were developed by 5 disaster medicine physicians. A Python script queried ChatGPT Version 4 for each prompt combined with 391 validated simulated patient vignettes. Ten repetitions of each combination were performed for a total of 35,190 simulated triages. A reference standard START triage code for each simulated case was assigned by 2 disaster medicine specialists (JMF and MV), with a third specialist (LC) added if the first two did not agree. Results were evaluated using a gage repeatability and reproducibility study (gage R and R). Repeatability was defined as variation due to repeated use of the same prompt. Reproducibility was defined as variation due to the use of different prompts on the same patient vignette. Accuracy was defined as agreement with the reference standard. RESULTS: Although 35,102 (99.7%) queries returned a valid START score, there was considerable variability. Repeatability (use of the same prompt repeatedly) was 14% of the overall variation. Reproducibility (use of different prompts) was 4.1% of the overall variation. The accuracy of ChatGPT for START was 63.9% with a 32.9% overtriage rate and a 3.1% undertriage rate. Accuracy varied by prompt with a maximum of 71.8% and a minimum of 46.7%. CONCLUSIONS: This study indicates that ChatGPT version 4 is insufficient to triage simulated disaster patients via the START protocol. It demonstrated suboptimal repeatability and reproducibility. The overall accuracy of triage was only 63.9%. Health care professionals are advised to exercise caution while using commercial LLMs for vital medical determinations, given that these tools may commonly produce inaccurate data, colloquially referred to as hallucinations or confabulations. Artificial intelligence-guided tools should undergo rigorous statistical evaluation-using methods such as gage R and R-before implementation into clinical settings.
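A minimal sketch of how the outcome measures described above could be computed from a long table of simulated triages (one row per case, prompt, and repetition). The column names, triage-category ordering, and the repeatability proxy are assumptions; a full gage R and R analysis would decompose variance components more formally.

```python
# Accuracy, over-/undertriage, and a simple repeatability proxy from triage results.
import pandas as pd

ORDER = {"green": 0, "yellow": 1, "red": 2, "black": 3}   # assumed severity ordering

df = pd.DataFrame({
    "case":      [1, 1, 1, 2, 2, 2],
    "prompt":    ["p1", "p1", "p2", "p1", "p1", "p2"],
    "predicted": ["red", "red", "yellow", "green", "yellow", "green"],
    "reference": ["red", "red", "red", "green", "green", "green"],
})

df["correct"] = df["predicted"] == df["reference"]
df["over"]  = df.apply(lambda r: ORDER[r.predicted] > ORDER[r.reference], axis=1)
df["under"] = df.apply(lambda r: ORDER[r.predicted] < ORDER[r.reference], axis=1)

print("accuracy:", df["correct"].mean())
print("overtriage:", df["over"].mean(), "undertriage:", df["under"].mean())

# Repeatability proxy: within each case x prompt cell, do repeated queries agree?
repeatability = df.groupby(["case", "prompt"])["predicted"].nunique().eq(1).mean()
print("fraction of case/prompt cells with identical repeats:", repeatability)
```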