ABSTRACT
BACKGROUND & AIMS: Early identification and accurate characterization of overt gastrointestinal bleeding (GIB) enables opportunities to optimize patient management and ensures appropriately risk-adjusted coding for claims-based quality measures and reimbursement. Recent advancements in generative artificial intelligence, particularly large language models (LLMs), create opportunities to support accurate identification of clinical conditions. In this study, we present the first LLM-based pipeline for identification of overt GIB in the electronic health record (EHR). We demonstrate 2 clinically relevant applications: the automated detection of recurrent bleeding and appropriate reimbursement coding for patients with GIB. METHODS: Development of the LLM-based pipeline was performed on 17,712 nursing notes from 1108 patients who were hospitalized with acute GIB and underwent endoscopy in the hospital from 2014 to 2023. The pipeline was used to train an EHR-based machine learning model for detection of recurrent bleeding on 546 patients presenting to 2 hospitals and externally validated on 562 patients presenting to 4 different hospitals. The pipeline was used to develop an algorithm for appropriate reimbursement coding on 7956 patients who underwent endoscopy in the hospital from 2019 to 2023. RESULTS: The LLM-based pipeline accurately detected melena (positive predictive value, 0.972; sensitivity, 0.900), hematochezia (positive predictive value, 0.900; sensitivity, 0.908), and hematemesis (positive predictive value, 0.859; sensitivity, 0.932). The EHR-based machine learning model identified recurrent bleeding with area under the curve of 0.986, sensitivity of 98.4%, and specificity of 97.5%. The reimbursement coding algorithm resulted in an average per-patient reimbursement increase of $1299 to $3247 with a total difference of $697,460 to $1,743,649. CONCLUSIONS: An LLM-based pipeline can robustly detect overt GIB in the EHR with clinically relevant applications in detection of recurrent bleeding and appropriate reimbursement coding.
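A minimal sketch of the kind of note-level LLM query such a pipeline might use is shown below; the prompt wording, output schema, and model name (gpt-4o-mini via the OpenAI chat completions API) are illustrative assumptions and do not reproduce the study's pipeline.

```python
# Minimal sketch of prompting an LLM to flag overt GIB signs in a nursing note.
# The prompt wording, model name, and output schema are illustrative assumptions,
# not the pipeline described in the study.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are reviewing a nursing note. Report whether the note documents "
    "melena, hematochezia, or hematemesis observed during this encounter. "
    'Answer with JSON only: {"melena": true/false, "hematochezia": true/false, '
    '"hematemesis": true/false}.'
)

def flag_overt_gib(note_text: str) -> dict:
    """Return per-symptom boolean flags for a single nursing note."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        temperature=0,          # deterministic output for an extraction task
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    print(flag_overt_gib("Pt passed large melenic stool overnight; no emesis."))
```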
ABSTRACT
Neurodegenerative dementia syndromes, such as primary progressive aphasias (PPA), have traditionally been diagnosed based, in part, on verbal and non-verbal cognitive profiles. Debate continues about whether PPA is best divided into three variants and regarding the most distinctive linguistic features for classifying PPA variants. In this cross-sectional study, we initially harnessed the capabilities of artificial intelligence and natural language processing to perform unsupervised classification of short, connected speech samples from 78 patients with PPA. We then used natural language processing to identify linguistic features that best dissociate the three PPA variants. Large language models discerned three distinct PPA clusters, with 88.5% agreement with independent clinical diagnoses. Patterns of cortical atrophy of the three data-driven clusters corresponded to the localizations in the clinical diagnostic criteria. In the subsequent supervised classification, 17 distinctive features emerged, including the observation that separating verbs into high- and low-frequency types significantly improved classification accuracy. Using these linguistic features derived from the analysis of short, connected speech samples, we developed a classifier that achieved 97.9% accuracy in classifying the four groups (three PPA variants and healthy controls). The data-driven section of this study showcases the ability of large language models to find natural partitioning in the speech of patients with PPA consistent with conventional variants. In addition, the work identifies a robust set of language features indicative of each PPA variant, emphasizing the significance of dividing verbs into high- and low-frequency categories. Beyond improving diagnostic accuracy, these findings enhance our understanding of the neurobiology of language processing.
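As a rough illustration of the unsupervised step, the sketch below embeds connected-speech transcripts and clusters them into three groups; the embedding model (sentence-transformers all-MiniLM-L6-v2), k-means clustering, and the agreement metric are stand-ins, not the study's actual methods or data.

```python
# Sketch of unsupervised clustering of connected-speech transcripts.
# The embedding model and k=3 are illustrative; the study's actual
# feature pipeline and models are not reproduced here.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

transcripts = [
    "the boy is um taking the the cookie from the jar",
    "girl reaching up jar falling water sink overflow",
    "she is asking him to pass the plate while the water runs",
]
clinical_labels = ["nfvPPA", "svPPA", "lvPPA"]   # independent clinical diagnoses

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedding model
embeddings = encoder.encode(transcripts)

clusters = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(embeddings)

# Compare data-driven clusters with clinical diagnoses (invariant to label permutation).
print("Adjusted Rand index:", adjusted_rand_score(clinical_labels, clusters))
```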
Subject(s)
Aphasia, Primary Progressive; Artificial Intelligence; Speech; Humans; Aphasia, Primary Progressive/diagnosis; Aphasia, Primary Progressive/classification; Male; Aged; Female; Middle Aged; Speech/physiology; Cross-Sectional Studies; Atrophy/pathology; Natural Language Processing
ABSTRACT
Using simulations or experiments performed at some set of temperatures to learn about the physics or chemistry at some other arbitrary temperature is a problem of immense practical and theoretical relevance. Here we develop a framework based on statistical mechanics and generative artificial intelligence that allows this problem to be solved. Specifically, we work with denoising diffusion probabilistic models and show how these models, in combination with replica exchange molecular dynamics, achieve superior sampling of the biomolecular energy landscape at temperatures that were never simulated, without assuming any particular slow degrees of freedom. The key idea is to treat the temperature as a fluctuating random variable and not a control parameter, as is usually done. This allows us to directly sample from the joint probability distribution in configuration and temperature space. The results here are demonstrated for a chirally symmetric peptide and a single-strand RNA undergoing conformational transitions in all-atom water. We demonstrate how we can discover transition states and metastable states that were previously unseen at the temperature of interest and even bypass the need to perform further simulations for a wide range of temperatures. At the same time, any unphysical states are easily identifiable through very low Boltzmann weights. The procedure, while shown here for a class of molecular simulations, should be more generally applicable to mixing information across simulations and experiments with varying control parameters.
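The sketch below illustrates the core idea in toy form: temperature is appended to the configuration vector and a denoising diffusion model is trained on the joint (configuration, temperature) samples pooled from replica-exchange runs. The network, noise schedule, and data are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch: treat temperature as part of the state and train a DDPM on
# (configuration, temperature) pairs pooled from replica-exchange runs.
# Architecture and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

T_STEPS = 200
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(),
                                 nn.Linear(128, 128), nn.SiLU(),
                                 nn.Linear(128, dim))

    def forward(self, x_t, t):
        # condition on the diffusion step by appending a normalized time feature
        t_feat = (t.float() / T_STEPS).unsqueeze(-1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def diffusion_loss(model, x0):
    """x0: (batch, dim) where the last column is the simulation temperature."""
    t = torch.randint(0, T_STEPS, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward noising step
    return nn.functional.mse_loss(model(x_t, t), noise)

# Example: 64 configurations with 10 collective coordinates plus 1 temperature column.
x0 = torch.randn(64, 11)
model = NoisePredictor(dim=11)
print(diffusion_loss(model, x0).item())
```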
Subject(s)
Artificial Intelligence; Molecular Dynamics Simulation; Peptides; RNA; Temperature; Peptides/chemistry; Physics; RNA/chemistry
ABSTRACT
Digital twins, which are in silico replications of an individual and its environment, have advanced clinical decision-making and prognostication in cardiovascular medicine. The technology enables personalized simulations of clinical scenarios, prediction of disease risk, and strategies for clinical trial augmentation. Current applications of cardiovascular digital twins have integrated multi-modal data into mechanistic and statistical models to build physiologically accurate cardiac replicas to enhance disease phenotyping, enrich diagnostic workflows, and optimize procedural planning. Digital twin technology is rapidly evolving in the setting of newly available data modalities and advances in generative artificial intelligence, enabling dynamic and comprehensive simulations unique to an individual. These twins fuse physiologic, environmental, and healthcare data into machine learning and generative models to build real-time patient predictions that can model interactions with the clinical environment to accelerate personalized patient care. This review summarizes digital twins in cardiovascular medicine and their potential future applications by incorporating new personalized data modalities. It examines the technical advances in deep learning and generative artificial intelligence that broaden the scope and predictive power of digital twins. Finally, it highlights the individual and societal challenges as well as ethical considerations that are essential to realizing the future vision of incorporating cardiology digital twins into personalized cardiovascular care.
ABSTRACT
Despite recent advances, the adoption of computer vision methods into clinical and commercial applications has been hampered by the limited availability of accurate ground truth tissue annotations required to train robust supervised models. Generating such ground truth can be accelerated by annotating tissue molecularly using immunofluorescence (IF) staining and mapping these annotations to a post-IF hematoxylin and eosin (H&E) (terminal H&E) stain. Mapping the annotations between IF and terminal H&E increases both the scale and accuracy by which ground truth could be generated. However, discrepancies between terminal H&E and conventional H&E caused by IF tissue processing have limited this implementation. We sought to overcome this challenge and achieve compatibility between these parallel modalities using synthetic image generation, in which a cycle-consistent generative adversarial network was applied to transfer the appearance of conventional H&E such that it emulates terminal H&E. These synthetic emulations allowed us to train a deep learning model for the segmentation of epithelium in terminal H&E that could be validated against the IF staining of epithelial-based cytokeratins. The combination of this segmentation model with the cycle-consistent generative adversarial network stain transfer model enabled performative epithelium segmentation in conventional H&E images. The approach demonstrates that the training of accurate segmentation models for the breadth of conventional H&E data can be executed free of human expert annotations by leveraging molecular annotation strategies such as IF, so long as the tissue impacts of the molecular annotation protocol are captured by generative models that can be deployed prior to the segmentation process.
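A compact sketch of the cycle-consistency term that constrains such a stain-transfer model appears below; the generator modules, image sizes, and loss weight are stand-ins rather than the trained networks used in the study.

```python
# Sketch of the cycle-consistency objective used to map conventional H&E (domain A)
# toward terminal H&E (domain B). Generators here are stand-in modules; the study's
# trained networks and full training loop are not reproduced.
import torch
import torch.nn as nn

G_AB = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # conventional H&E -> terminal H&E (stand-in)
G_BA = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # terminal H&E -> conventional H&E (stand-in)
l1 = nn.L1Loss()

def cycle_consistency_loss(real_a, real_b, lam=10.0):
    """Penalize failure to recover the original image after A->B->A and B->A->B round trips."""
    rec_a = G_BA(G_AB(real_a))
    rec_b = G_AB(G_BA(real_b))
    return lam * (l1(rec_a, real_a) + l1(rec_b, real_b))

real_a = torch.rand(2, 3, 256, 256)   # conventional H&E patches
real_b = torch.rand(2, 3, 256, 256)   # terminal (post-IF) H&E patches
print(cycle_consistency_loss(real_a, real_b).item())
```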
ABSTRACT
PURPOSE: This cross-sectional study assessed a generative-AI platform to automate the creation of accurate, appropriate, and compelling social-media (SoMe) posts from urological journal articles. MATERIALS AND METHODS: One hundred SoMe posts from the X (Twitter) profiles of the top 3 urology journals were collected from Aug-2022 to Oct-2023. A freeware GPT tool was developed to auto-generate SoMe posts, which included title summarization, key findings, pertinent emojis, hashtags, and DOI links to the article. Three physicians independently evaluated GPT-generated posts for achieving a tetrafecta of accuracy and appropriateness criteria. Fifteen scenarios were created from 5 randomly selected posts from each journal. Each scenario contained both the original and the GPT-generated post for the same article. Five questions were formulated to investigate the posts' likability, shareability, engagement, understandability, and comprehensiveness. The paired posts were then randomized and presented to blinded academic authors and to the general public through Amazon Mechanical Turk (AMT) responders for preference evaluation. RESULTS: Median (IQR) time for post auto-generation was 10.2 seconds (8.5-12.5). Of the 150 rated GPT-generated posts, 115 (76.6%) met the correctness tetrafecta: 144 (96%) accurately summarized the title, 147 (98%) accurately presented the articles' main findings, 131 (87.3%) appropriately used emojis, and 138 (92%) appropriately used hashtags. A total of 258 academic urologists and 493 AMT responders answered the surveys, wherein the GPT-generated posts consistently outperformed the original journals' posts for both academicians and AMT responders (P < .05). CONCLUSIONS: Generative AI can automate the creation of SoMe posts from urology journal abstracts that are both accurate and preferred by the academic community and the general public.
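A hedged sketch of how such auto-generation could be scripted is shown below; the prompt wording and model name are illustrative assumptions and do not correspond to the freeware GPT tool described in the study.

```python
# Sketch of auto-generating a social-media post from an article's title, abstract,
# and DOI. Prompt wording and model name are illustrative, not the study's tool.
from openai import OpenAI

client = OpenAI()

def generate_some_post(title: str, abstract: str, doi: str) -> str:
    prompt = (
        "Write a single X (Twitter) post for a urology audience about this article. "
        "Summarize the title, state 1-2 key findings, add pertinent emojis and "
        f"hashtags, and end with the link https://doi.org/{doi}. Stay under 280 characters.\n\n"
        f"Title: {title}\n\nAbstract: {abstract}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```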
ABSTRACT
The internet is the primary source of infertility-related information for most people who are experiencing fertility issues. Although no longer shrouded in stigma, the privacy of interacting only with a computer provides a sense of safety when engaging with sensitive content and allows for diverse and geographically dispersed communities to connect and share their experiences. It also provides businesses with a virtual marketplace for their products. The introduction of ChatGPT, a conversational language model developed by OpenAI to understand and generate human-like text in response to user input, in November 2022, and other emerging generative artificial intelligence (AI) language models, has changed and will continue to change the way we interact with large volumes of digital information. When it comes to its application in health information seeking, specifically in relation to fertility in this case, is ChatGPT a friend or foe in helping people make well-informed decisions? Furthermore, if deemed useful, how can we ensure this technology supports fertility-related decision-making? After conducting a study into the quality of the information provided by ChatGPT to people seeking information on fertility, we explore the potential benefits and pitfalls of using generative AI as a tool to support decision-making.
Subject(s)
Artificial Intelligence; Infertility; Humans; Fertility; Infertility/therapy; Commerce; Communication
ABSTRACT
With the recent advances in artificial intelligence (AI), patients are increasingly exposed to misleading medical information. Generative AI models, including large language models such as ChatGPT, create and modify text, images, audio and video information based on training data. Commercial use of generative AI is expanding rapidly and the public will routinely receive messages created by generative AI. However, generative AI models may be unreliable, routinely make errors and widely spread misinformation. Misinformation created by generative AI about mental illness may include factual errors, nonsense, fabricated sources and dangerous advice. Psychiatrists need to recognise that patients may receive misinformation online, including about medicine and psychiatry.
Subject(s)
Mental Disorders; Psychiatry; Humans; Artificial Intelligence; Psychiatrists; Communication
ABSTRACT
Therapeutic resistance is a major challenge facing the design of effective cancer treatments. Adaptive cancer therapy is in principle the most viable approach to manage cancer's adaptive dynamics through drug combinations with dose timing and modulation. However, there are numerous open issues facing the clinical success of adaptive therapy. Chief among these issues is the feasibility of real-time predictions of treatment response which represent a bedrock requirement of adaptive therapy. Generative artificial intelligence has the potential to learn prediction models of treatment response from clinical, molecular, and radiomics data about patients and their treatments. The article explores this potential through a proposed integration model of Generative Pre-Trained Transformers (GPTs) in a closed loop with adaptive treatments to predict the trajectories of disease progression. The conceptual model and the challenges facing its realization are discussed in the broader context of artificial intelligence integration in oncology.
Subject(s)
Artificial Intelligence; Neoplasms; Humans; Neoplasms/drug therapy; Neoplasms/therapy
ABSTRACT
With its increasing popularity, healthcare professionals and patients may use ChatGPT to obtain medication-related information. This study was conducted to assess ChatGPT's ability to provide satisfactory responses (i.e., directly answers the question, accurate, complete and relevant) to medication-related questions posed to an academic drug information service. ChatGPT responses were compared to responses generated by the investigators through the use of traditional resources, and references were evaluated. Thirty-nine questions were entered into ChatGPT; the three most common categories were therapeutics (8; 21%), compounding/formulation (6; 15%) and dosage (5; 13%). Ten (26%) questions were answered satisfactorily by ChatGPT. Of the 29 (74%) questions that were not answered satisfactorily, deficiencies included lack of a direct response (11; 38%), lack of accuracy (11; 38%) and/or lack of completeness (12; 41%). References were included with eight (29%) responses; each included fabricated references. Presently, healthcare professionals and consumers should be cautioned against using ChatGPT for medication-related information.
Subject(s)
Drug Information Services; Humans; Patient Education as Topic; Surveys and Questionnaires; Internet
ABSTRACT
INTRODUCTION: Large language models like Chat Generative Pre-Trained Transformer (ChatGPT) are increasingly used in academic writing. Faculty may consider use of artificial intelligence (AI)-generated responses a form of cheating. We sought to determine whether general surgery residency faculty could detect AI versus human-written responses to a text prompt; hypothesizing that faculty would not be able to reliably differentiate AI versus human-written responses. METHODS: Ten essays were generated using a text prompt, "Tell us in 1-2 paragraphs why you are considering the University of Rochester for General Surgery residency" (Current trainees: n = 5, ChatGPT: n = 5). Ten blinded faculty reviewers rated essays (ten-point Likert scale) on the following criteria: desire to interview, relevance to the general surgery residency, overall impression, and AI- or human-generated; with scores and identification error rates compared between the groups. RESULTS: There were no differences between groups for %total points (ChatGPT 66.0 ± 13.5%, human 70.0 ± 23.0%, P = 0.508) or identification error rates (ChatGPT 40.0 ± 35.0%, human 20.0 ± 30.0%, P = 0.175). Except for one, all essays were identified incorrectly by at least two reviewers. Essays identified as human-generated received higher overall impression scores (area under the curve: 0.82 ± 0.04, P < 0.01). CONCLUSIONS: Whether use of AI tools for academic purposes should constitute academic dishonesty is controversial. We demonstrate that human and AI-generated essays are similar in quality, but there is bias against presumed AI-generated essays. Faculty are not able to reliably differentiate human from AI-generated essays, thus bias may be misdirected. AI-tools are becoming ubiquitous and their use is not easily detected. Faculty must expect these tools to play increasing roles in medical education.
Subject(s)
Artificial Intelligence; General Surgery; Internship and Residency; Internship and Residency/methods; Humans; General Surgery/education; Writing; Faculty, Medical/psychology
ABSTRACT
In the high-stakes realm of critical care, where daily decisions are crucial and clear communication is paramount, comprehending the rationale behind Artificial Intelligence (AI)-driven decisions appears essential. While AI has the potential to improve decision-making, its complexity can hinder comprehension of, and adherence to, its recommendations. "Explainable AI" (XAI) aims to bridge this gap, enhancing confidence among patients and doctors. It also helps to meet regulatory transparency requirements, offers actionable insights, and promotes fairness and safety. Yet defining explainability and standardising assessments remain ongoing challenges, and a balance between performance and explainability may be needed, even as XAI continues to grow as a field.
Subject(s)
Artificial Intelligence; Humans; Artificial Intelligence/trends; Artificial Intelligence/standards; Critical Care/methods; Critical Care/standards; Clinical Decision-Making/methods; Physicians/standards
ABSTRACT
BACKGROUND: Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD). METHODS: Nine patient cases were created based on key questions (KQs) addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16th, 2023, for recommendations regarding the surgical management of GERD. Accurate chatbot performance was defined as the number of responses aligning with SAGES guideline recommendations. Outcomes were reported with counts and percentages. RESULTS: Surgeons were given accurate recommendations for the surgical management of GERD in an adult patient for 5/7 (71.4%) KQs by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity according to the SAGES guidelines. Patients were given accurate recommendations for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity, respectively. In a pediatric patient, surgeons were given accurate recommendations for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity. Patients were given appropriate guidance for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity. CONCLUSIONS: Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and pitfalls of LLMs when utilized for advice on surgical management of GERD. Additional training of LLMs using evidence-based health information is needed.
Subject(s)
Artificial Intelligence; Gastroesophageal Reflux; Gastroesophageal Reflux/surgery; Humans; Clinical Decision-Making; Adult; Practice Guidelines as Topic; Male
ABSTRACT
Informed consent is a cornerstone of ethical medical practice, particularly in obstetrics where procedures like labor induction carry significant risks and require clear patient understanding. Despite legal mandates for patient materials to be accessible, many consent forms remain too complex, resulting in patient confusion and dissatisfaction. This study explores the use of Generative Artificial Intelligence (GAI) to simplify informed consent for labor induction with oxytocin, ensuring content is both medically accurate and comprehensible at an 8th-grade readability level. GAI-generated consent forms streamline the process, automatically tailoring content to meet readability standards while retaining essential details such as the procedure's nature, risks, benefits, and alternatives. Through iterative prompts and expert refinement, the AI produces clear, patient-friendly language that bridges the gap between medical jargon and patient comprehension. Flesch Reading Ease scores show improved readability, meeting recommended levels for health literacy. GAI has the potential to revolutionize healthcare communication by enhancing patient understanding, promoting shared decision-making, and improving satisfaction with the consent process. However, human oversight remains critical to ensure that AI-generated content adheres to legal and ethical standards. This case study demonstrates that GAI can be an effective tool in creating accessible, standardized, yet personalized consent documents, contributing to better-informed patients and potentially reducing malpractice claims.
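A small sketch of the kind of readability gate that could sit in such an iterative workflow is shown below; it uses the textstat package's Flesch metrics, and the draft text and grade threshold are illustrative.

```python
# Sketch of checking an AI-drafted consent paragraph against an ~8th-grade
# readability target before expert review. Thresholds and text are illustrative.
import textstat

def meets_readability_target(text: str, max_grade: float = 8.0) -> bool:
    grade = textstat.flesch_kincaid_grade(text)
    ease = textstat.flesch_reading_ease(text)
    print(f"Flesch-Kincaid grade: {grade:.1f}, Flesch Reading Ease: {ease:.1f}")
    return grade <= max_grade

draft = ("Oxytocin is a medicine given through an IV to start or strengthen labor "
         "contractions. Your care team will watch you and your baby closely and "
         "adjust the dose if needed.")
print("Meets target:", meets_readability_target(draft))
```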
ABSTRACT
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data-driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.
Subject(s)
Delivery of Health Care; Humans; Delivery of Health Care/trends; Natural Language Processing; Electronic Health Records
ABSTRACT
As advances in artificial intelligence (AI) continue to transform and revolutionize the field of medicine, understanding the potential uses of generative AI in health care becomes increasingly important. Generative AI, including models such as generative adversarial networks and large language models, shows promise in transforming medical diagnostics, research, treatment planning, and patient care. However, these data-intensive systems pose new threats to protected health information. This Viewpoint paper aims to explore various categories of generative AI in health care, including medical diagnostics, drug discovery, virtual health assistants, medical research, and clinical decision support, while identifying security and privacy threats within each phase of the life cycle of such systems (ie, data collection, model development, and implementation phases). The objectives of this study were to analyze the current state of generative AI in health care, identify opportunities and privacy and security challenges posed by integrating these technologies into existing health care infrastructure, and propose strategies for mitigating security and privacy risks. This study highlights the importance of addressing the security and privacy threats associated with generative AI in health care to ensure the safe and effective use of these systems. The findings of this study can inform the development of future generative AI systems in health care and help health care organizations better understand the potential benefits and risks associated with these systems. By examining the use cases and benefits of generative AI across diverse domains within health care, this paper contributes to theoretical discussions surrounding AI ethics, security vulnerabilities, and data privacy regulations. In addition, this study provides practical insights for stakeholders looking to adopt generative AI solutions within their organizations.
Subject(s)
Artificial Intelligence; Biomedical Research; Humans; Privacy; Data Collection; Language
ABSTRACT
BACKGROUND: Medical texts present significant domain-specific challenges, and manually curating these texts is a time-consuming and labor-intensive process. To address this, natural language processing (NLP) algorithms have been developed to automate text processing. In the biomedical field, various toolkits for text processing exist, which have greatly improved the efficiency of handling unstructured text. However, these existing toolkits tend to emphasize different perspectives, and none of them offer generation capabilities, leaving a significant gap in the current offerings. OBJECTIVE: This study aims to describe the development and preliminary evaluation of Ascle. Ascle is tailored for biomedical researchers and clinical staff with an easy-to-use, all-in-one solution that requires minimal programming expertise. For the first time, Ascle provides 4 advanced and challenging generative functions: question-answering, text summarization, text simplification, and machine translation. In addition, Ascle integrates 12 essential NLP functions, along with query and search capabilities for clinical databases. METHODS: We fine-tuned 32 domain-specific language models and evaluated them thoroughly on 27 established benchmarks. In addition, for the question-answering task, we developed a retrieval-augmented generation (RAG) framework for large language models that incorporated a medical knowledge graph with ranking techniques to enhance the reliability of generated answers. Additionally, we conducted a physician validation to assess the quality of generated content beyond automated metrics. RESULTS: The fine-tuned models and RAG framework consistently enhanced text generation tasks. For example, the fine-tuned models improved the machine translation task by 20.27 in terms of BLEU score. In the question-answering task, the RAG framework raised the ROUGE-L score by 18% over the vanilla models. Physician validation of generated answers showed high scores for readability (4.95/5) and relevancy (4.43/5), with a lower score for accuracy (3.90/5) and completeness (3.31/5). CONCLUSIONS: This study introduces the development and evaluation of Ascle, a user-friendly NLP toolkit designed for medical text generation. All code is publicly available through the Ascle GitHub repository. All fine-tuned language models can be accessed through Hugging Face.
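For orientation, the sketch below shows a minimal retrieval-augmented QA loop (dense retrieval over reference snippets followed by a generation call); the embedding model, snippets, prompt, and model name are assumptions and do not reproduce Ascle's knowledge-graph ranking.

```python
# Minimal retrieval-augmented QA sketch: retrieve the most relevant reference
# snippets by cosine similarity, then condition the generator on them.
# Models, snippets, and prompt are stand-ins, not Ascle's RAG framework.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()

snippets = [
    "Metformin is a first-line agent for type 2 diabetes.",
    "ACE inhibitors can cause a dry cough in some patients.",
    "Warfarin requires regular INR monitoring.",
]
snippet_vecs = encoder.encode(snippets, normalize_embeddings=True)

def answer(question: str, k: int = 2) -> str:
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(snippet_vecs @ q_vec)[::-1][:k]   # cosine similarity on unit vectors
    context = "\n".join(snippets[i] for i in top)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(answer("Why does my blood pressure pill make me cough?"))
```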
Subject(s)
Natural Language Processing; Humans; Algorithms; Software
ABSTRACT
BACKGROUND: Although patients have easy access to their electronic health records and laboratory test result data through patient portals, laboratory test results are often confusing and hard to understand. Many patients turn to web-based forums or question-and-answer (Q&A) sites to seek advice from their peers. The quality of answers from social Q&A sites on health-related questions varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to have their questions answered. OBJECTIVE: We aimed to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to laboratory test-related questions asked by patients and identify potential issues that can be mitigated using augmentation approaches. METHODS: We collected laboratory test result-related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from 5 LLMs: GPT-4, GPT-3.5, LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics, including Recall-Oriented Understudy for Gisting Evaluation, Bilingual Evaluation Understudy, Metric for Evaluation of Translation With Explicit Ordering, and Bidirectional Encoder Representations from Transformers Score. We used an LLM-based evaluator to judge whether a target model had higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. We performed a manual evaluation with medical experts for all the responses to 7 selected questions on the same 4 aspects. RESULTS: Regarding the similarity of the responses from the 4 LLMs, with the GPT-4 output used as the reference answer, the responses from GPT-3.5 were the most similar, followed by those from LLaMA 2, ORCA_mini, and MedAlpaca. Human answers from Yahoo data were scored the lowest and, thus, as the least similar to GPT-4-generated answers. The results of the win rate and medical expert evaluation both showed that GPT-4's responses achieved better scores than all the other LLM responses and human responses on all 4 aspects (relevance, correctness, helpfulness, and safety). LLM responses occasionally also suffered from lack of interpretation in one's medical context, incorrect statements, and lack of references. CONCLUSIONS: By evaluating LLMs in generating responses to patients' laboratory test result-related questions, we found that, compared to the other 4 LLMs and human answers from a Q&A website, GPT-4's responses were more accurate, helpful, relevant, and safer. There were cases in which GPT-4 responses were inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
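The sketch below shows how answer similarity against a reference response could be scored with the metrics named above, using the rouge-score, nltk, and bert-score packages; the texts are toy examples, not study data.

```python
# Sketch of scoring a candidate answer against a reference answer with the
# similarity metrics named in the study. Texts are toy examples.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

reference = "A mildly low hemoglobin can reflect iron deficiency; discuss repeat testing with your doctor."
candidate = "Slightly low hemoglobin may be due to low iron; ask your doctor about follow-up tests."

rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-L F1: {rouge_l:.3f}  BLEU: {bleu:.3f}  BERTScore F1: {f1.item():.3f}")
```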
Subject(s)
Artificial Intelligence; Electronic Health Records; Humans; Language
ABSTRACT
BACKGROUND: Medical documentation plays a crucial role in clinical practice, facilitating accurate patient management and communication among health care professionals. However, inaccuracies in medical notes can lead to miscommunication and diagnostic errors. Additionally, the demands of documentation contribute to physician burnout. Although intermediaries like medical scribes and speech recognition software have been used to ease this burden, they have limitations in terms of accuracy and addressing provider-specific metrics. The integration of ambient artificial intelligence (AI)-powered solutions offers a promising way to improve documentation while fitting seamlessly into existing workflows. OBJECTIVE: This study aims to assess the accuracy and quality of Subjective, Objective, Assessment, and Plan (SOAP) notes generated by ChatGPT-4, an AI model, using established transcripts of History and Physical Examination as the gold standard. We seek to identify potential errors and evaluate the model's performance across different categories. METHODS: We conducted simulated patient-provider encounters representing various ambulatory specialties and transcribed the audio files. Key reportable elements were identified, and ChatGPT-4 was used to generate SOAP notes based on these transcripts. Three versions of each note were created and compared to the gold standard via chart review; errors generated from the comparison were categorized as omissions, incorrect information, or additions. We compared the accuracy of data elements across versions, transcript length, and data categories. Additionally, we assessed note quality using the Physician Documentation Quality Instrument (PDQI) scoring system. RESULTS: Although ChatGPT-4 consistently generated SOAP-style notes, there were, on average, 23.6 errors per clinical case, with errors of omission (86%) being the most common, followed by addition errors (10.5%) and inclusion of incorrect facts (3.2%). There was significant variance between replicates of the same case, with only 52.9% of data elements reported correctly across all 3 replicates. The accuracy of data elements varied across cases, with the highest accuracy observed in the "Objective" section. Consequently, the measure of note quality, assessed by PDQI, demonstrated intra- and intercase variance. Finally, the accuracy of ChatGPT-4 was inversely correlated to both the transcript length (P=.05) and the number of scorable data elements (P=.05). CONCLUSIONS: Our study reveals substantial variability in errors, accuracy, and note quality generated by ChatGPT-4. Errors were not limited to specific sections, and the inconsistency in error types across replicates complicated predictability. Transcript length and data complexity were inversely correlated with note accuracy, raising concerns about the model's effectiveness in handling complex medical cases. The quality and reliability of clinical notes produced by ChatGPT-4 do not meet the standards required for clinical use. Although AI holds promise in health care, caution should be exercised before widespread adoption. Further research is needed to address accuracy, variability, and potential errors. ChatGPT-4, while valuable in various applications, should not be considered a safe alternative to human-generated clinical documentation at this time.
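A simplified sketch of the omission/addition/incorrect-fact tally used in such a chart-review comparison is shown below; representing key reportable elements as labeled fields is an illustrative assumption, not the study's instrument.

```python
# Sketch of tallying omission, addition, and incorrect-fact errors by comparing
# key reportable elements in a generated SOAP note against the gold standard.
# Representing elements as labeled (item, value) pairs is an illustrative simplification.
gold = {
    "chief_complaint": "chest pain x 2 days",
    "allergies": "penicillin",
    "blood_pressure": "142/88",
    "plan": "start lisinopril 10 mg daily",
}
generated = {
    "chief_complaint": "chest pain x 2 days",
    "blood_pressure": "138/88",               # incorrect value
    "plan": "start lisinopril 10 mg daily",
    "smoking_history": "denies tobacco use",  # not documented in the transcript
}

omissions = [k for k in gold if k not in generated]
additions = [k for k in generated if k not in gold]
incorrect = [k for k in gold if k in generated and generated[k] != gold[k]]

print(f"omissions={omissions}, additions={additions}, incorrect={incorrect}")
```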
Subject(s)
Physician-Patient Relations; Humans; Documentation/methods; Electronic Health Records; Artificial Intelligence
ABSTRACT
BACKGROUND: Artificial intelligence chatbots such as ChatGPT (OpenAI) have garnered excitement about their potential for delegating writing tasks ordinarily performed by humans. Many of these tasks (eg, writing recommendation letters) have social and professional ramifications, making the potential social biases in ChatGPT's underlying language model a serious concern. OBJECTIVE: Three preregistered studies used the text analysis program Linguistic Inquiry and Word Count to investigate gender bias in recommendation letters written by ChatGPT in human-use sessions (N=1400 total letters). METHODS: We conducted analyses using 22 existing Linguistic Inquiry and Word Count dictionaries, as well as 6 newly created dictionaries based on systematic reviews of gender bias in recommendation letters, to compare recommendation letters generated for the 200 most historically popular "male" and "female" names in the United States. Study 1 used 3 different letter-writing prompts intended to accentuate professional accomplishments associated with male stereotypes, female stereotypes, or neither. Study 2 examined whether lengthening each of the 3 prompts while holding the between-prompt word count constant modified the extent of bias. Study 3 examined the variability within letters generated for the same name and prompts. We hypothesized that when prompted with gender-stereotyped professional accomplishments, ChatGPT would evidence gender-based language differences replicating those found in systematic reviews of human-written recommendation letters (eg, more affiliative, social, and communal language for female names; more agentic and skill-based language for male names). RESULTS: Significant differences in language between letters generated for female versus male names were observed across all prompts, including the prompt hypothesized to be neutral, and across nearly all language categories tested. Historically female names received significantly more social referents (5/6, 83% of prompts), communal or doubt-raising language (4/6, 67% of prompts), personal pronouns (4/6, 67% of prompts), and clout language (5/6, 83% of prompts). Contradicting the study hypotheses, some gender differences (eg, achievement language and agentic language) were significant in both the hypothesized and nonhypothesized directions, depending on the prompt. Heteroscedasticity between male and female names was observed in multiple linguistic categories, with greater variance for historically female names than for historically male names. CONCLUSIONS: ChatGPT reproduces many gender-based language biases that have been reliably identified in investigations of human-written reference letters, although these differences vary across prompts and language categories. Caution should be taken when using ChatGPT for tasks that have social consequences, such as reference letter writing. The methods developed in this study may be useful for ongoing bias testing among progressive generations of chatbots across a range of real-world scenarios. TRIAL REGISTRATION: OSF Registries osf.io/ztv96; https://osf.io/ztv96.
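The sketch below illustrates a LIWC-style category count, comparing the share of words from a "communal" lexicon in letters generated for historically female versus male names; the word list and letters are toy stand-ins, not the LIWC dictionaries used in the study.

```python
# Sketch of a LIWC-style category count: the share of words in a letter that fall
# in a "communal" word list, compared across letters written for historically
# female vs. male names. Word lists and letters are toy stand-ins, not LIWC.
import re
from statistics import mean

COMMUNAL = {"warm", "caring", "helpful", "kind", "supportive", "team"}

def category_rate(text: str, lexicon: set) -> float:
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in lexicon for w in words) / max(len(words), 1)

letters_female = ["She is a warm and supportive colleague who is always helpful."]
letters_male = ["He is a driven and accomplished researcher with strong results."]

print("communal rate (female names):", mean(category_rate(t, COMMUNAL) for t in letters_female))
print("communal rate (male names):  ", mean(category_rate(t, COMMUNAL) for t in letters_male))
```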