1.
Nature ; 619(7969): 357-362, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37286606

ABSTRACT

Physicians make critical time-constrained decisions every day. Clinical predictive models can help physicians and administrators make decisions by forecasting clinical and operational events. Existing structured data-based clinical predictive models have limited use in everyday practice owing to complexity in data processing, as well as model development and deployment1-3. Here we show that unstructured clinical notes from the electronic health record can enable the training of clinical language models, which can be used as all-purpose clinical predictive engines with low-resistance development and deployment. Our approach leverages recent advances in natural language processing4,5 to train a large language model for medical language (NYUTron) and subsequently fine-tune it across a wide range of clinical and operational predictive tasks. We evaluated our approach within our health system for five such tasks: 30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length of stay prediction, and insurance denial prediction. We show that NYUTron has an area under the curve (AUC) of 78.7-94.9%, with an improvement of 5.36-14.7% in the AUC compared with traditional models. We additionally demonstrate the benefits of pretraining with clinical text, the potential for increasing generalizability to different sites through fine-tuning and the full deployment of our system in a prospective, single-arm trial. These results show the potential for using clinical language models in medicine to read alongside physicians and provide guidance at the point of care.


Subject(s)
Clinical Decision-Making , Electronic Health Records , Natural Language Processing , Physicians , Humans , Clinical Decision-Making/methods , Patient Readmission , Hospital Mortality , Comorbidity , Length of Stay , Insurance Coverage , Area Under Curve , Point-of-Care Systems/trends , Clinical Trials as Topic
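
As a rough illustration of the approach this abstract describes, the sketch below fine-tunes a pretrained clinical language model for 30-day readmission prediction with Hugging Face Transformers. The checkpoint name, file layout, and column names are placeholders, not artifacts from the NYUTron study.

```python
# Minimal fine-tuning sketch, assuming a pretrained clinical LM checkpoint
# ("clinical-bert-base" is a hypothetical name) and a CSV of notes with labels.
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

CHECKPOINT = "clinical-bert-base"  # placeholder pretrained clinical LM
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

df = pd.read_csv("notes.csv")  # assumed columns: text, readmitted_30d (0/1)
ds = Dataset.from_pandas(df).map(
    lambda b: tokenizer(b["text"], truncation=True,
                        padding="max_length", max_length=512),
    batched=True,
).rename_column("readmitted_30d", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="readmit-finetune", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds,
)
trainer.train()  # one fine-tuned head per downstream predictive task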
2.
Clin Transplant ; 38(10): e15466, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39329220

ABSTRACT

INTRODUCTION: ChatGPT has shown the ability to answer clinical questions in general medicine but may be constrained by the specialized nature of kidney transplantation. Thus, it is important to explore how ChatGPT can be used in kidney transplantation and how its knowledge compares to human respondents. METHODS: We prompted ChatGPT versions 3.5, 4, and 4 Visual (4 V) with 12 multiple-choice questions related to six kidney transplant cases from 2013 to 2015 American Society of Nephrology (ASN) fellowship program quizzes. We compared the performance of ChatGPT with US nephrology fellowship program directors, nephrology fellows, and the audience of the ASN's annual Kidney Week meeting. RESULTS: Overall, ChatGPT 4 V correctly answered 10 out of 12 questions, showing a performance level comparable to nephrology fellows (group majority correctly answered 9 of 12 questions) and training program directors (11 of 12). This surpassed ChatGPT 4 (7 of 12 correct) and 3.5 (5 of 12). All three ChatGPT versions failed to correctly answer questions where the consensus among human respondents was low. CONCLUSION: Each iterative version of ChatGPT performed better than the prior version, with version 4 V achieving performance on par with nephrology fellows and training program directors. While it shows promise in understanding and answering kidney transplantation questions, ChatGPT should be seen as a complementary tool to human expertise rather than a replacement.


Subject(s)
Kidney Transplantation , Humans , Surveys and Questionnaires , Nephrology/education , Fellowships and Scholarships , Prognosis , Kidney Failure, Chronic/surgery , Female
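
For readers who want to replicate the prompting setup in spirit, the sketch below poses one multiple-choice question to a chat model through the OpenAI Python client. The question text is a placeholder, not an ASN quiz item, and the study's exact prompts are not reproduced here.

```python
# Illustrative sketch: one MCQ sent to a chat model via the OpenAI client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A kidney transplant recipient develops [clinical scenario]. "
    "Which is the next best step?\n"
    "A) ...\nB) ...\nC) ...\nD) ..."
)
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer with a single letter A-D."},
        {"role": "user", "content": question},
    ],
)
print(resp.choices[0].message.content)
```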
3.
J Gen Intern Med ; 37(9): 2230-2238, 2022 07.
Article in English | MEDLINE | ID: mdl-35710676

ABSTRACT

BACKGROUND: Residents receive infrequent feedback on their clinical reasoning (CR) documentation. While machine learning (ML) and natural language processing (NLP) have been used to assess CR documentation in standardized cases, no studies have described similar use in the clinical environment. OBJECTIVE: Using Kane's framework, the authors developed and validated an ML model for automated assessment of CR documentation quality in residents' admission notes. DESIGN, PARTICIPANTS, MAIN MEASURES: Internal medicine residents' and subspecialty fellows' admission notes at one medical center from July 2014 to March 2020 were extracted from the electronic health record. Using a validated CR documentation rubric, the authors rated 414 notes for the ML development dataset. Notes were truncated to isolate the relevant portion; NLP software (cTAKES) extracted disease/disorder named entities, and human review generated CR terms. The final model had three input variables and classified notes as demonstrating low- or high-quality CR documentation. The ML model was applied to a retrospective dataset (9591 notes) for human validation and data analysis. Reliability between human and ML ratings was assessed on 205 of these notes with Cohen's kappa. CR documentation quality by post-graduate year (PGY) was evaluated by the Mantel-Haenszel test of trend. KEY RESULTS: The top-performing logistic regression model had an area under the receiver operating characteristic curve of 0.88, a positive predictive value of 0.68, and an accuracy of 0.79. Cohen's kappa was 0.67. Of the 9591 notes, 31.1% demonstrated high-quality CR documentation; quality increased from 27.0% (PGY1) to 31.0% (PGY2) to 39.0% (PGY3) (p < .001 for trend). Validity evidence was collected in each domain of Kane's framework (scoring, generalization, extrapolation, and implications). CONCLUSIONS: The authors developed and validated a high-performing ML model that classifies CR documentation quality in resident admission notes in the clinical environment, a novel application of ML and NLP with many potential use cases.


Subject(s)
Clinical Reasoning , Documentation , Electronic Health Records , Humans , Machine Learning , Natural Language Processing , Reproducibility of Results , Retrospective Studies
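
A minimal sketch of the modeling step this abstract reports: a three-feature logistic regression classifying notes as low- or high-quality CR documentation, evaluated with AUROC and Cohen's kappa. The feature and file names are assumptions, not the study's actual variables.

```python
# Three-feature logistic regression sketch with the reported metric types.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("rated_notes.csv")  # hypothetical development set
X = df[["n_disease_entities", "n_cr_terms", "note_length"]]  # assumed features
y = df["high_quality_cr"]  # 1 = high-quality CR documentation per rubric

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("kappa vs human ratings:", cohen_kappa_score(y_te, clf.predict(X_te)))
```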
4.
J Gen Intern Med ; 37(3): 507-512, 2022 02.
Article in English | MEDLINE | ID: mdl-33945113

ABSTRACT

BACKGROUND: Residents and fellows receive little feedback on their clinical reasoning documentation. Barriers include lack of a shared mental model and variability in the reliability and validity of existing assessment tools. Of the existing tools, the IDEA assessment tool includes a robust assessment of clinical reasoning documentation focusing on four elements (interpretive summary, differential diagnosis, explanation of reasoning for lead and alternative diagnoses) but lacks descriptive anchors, threatening its reliability. OBJECTIVE: Our goal was to develop a valid and reliable assessment tool for clinical reasoning documentation building on the IDEA assessment tool. DESIGN, PARTICIPANTS, AND MAIN MEASURES: The Revised-IDEA assessment tool was developed by four clinician educators through iterative review of admission notes written by medicine residents and fellows and subsequently piloted with additional faculty to ensure response process validity. A random sample of 252 notes from July 2014 to June 2017 written by 30 trainees across several chief complaints was rated. Three raters rated 20% of the notes to demonstrate internal structure validity. A quality cut-off score was determined using Hofstee standard setting. KEY RESULTS: The Revised-IDEA assessment tool includes the same four domains as the IDEA assessment tool with more detailed descriptive prompts, new Likert scale anchors, and a score range of 0-10. Intraclass correlation was high for the notes rated by three raters, 0.84 (95% CI 0.74-0.90). Scores ≥6 were determined to demonstrate high-quality clinical reasoning documentation. Only 53% of notes (134/252) were high-quality. CONCLUSIONS: The Revised-IDEA assessment tool is reliable and easy to use for feedback on clinical reasoning documentation in resident and fellow admission notes, with descriptive anchors that facilitate a shared mental model for feedback.


Subject(s)
Clinical Competence , Clinical Reasoning , Documentation , Feedback , Humans , Models, Psychological , Reproducibility of Results
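
The reliability analysis could be reproduced along these lines: the sketch below computes an intraclass correlation on the doubly rated subset and applies the ≥6 quality cut-off. Column names are assumed, and pingouin is one convenient ICC package, not necessarily what the authors used.

```python
# ICC on multiply rated notes, plus the >=6 high-quality cut-off.
import pandas as pd
import pingouin as pg

ratings = pd.read_csv("revised_idea_ratings.csv")  # note_id, rater, score (0-10)
icc = pg.intraclass_corr(data=ratings, targets="note_id",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

mean_scores = ratings.groupby("note_id")["score"].mean()
print("proportion high-quality (score >= 6):", (mean_scores >= 6).mean())
```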
5.
Arterioscler Thromb Vasc Biol ; 40(10): 2539-2547, 2020 10.
Article in English | MEDLINE | ID: mdl-32840379

ABSTRACT

OBJECTIVE: To determine the prevalence of D-dimer elevation in coronavirus disease 2019 (COVID-19) hospitalization, trajectory of D-dimer levels during hospitalization, and its association with clinical outcomes. Approach and Results: Consecutive adults admitted to a large New York City hospital system with a positive polymerase chain reaction test for SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) between March 1, 2020 and April 8, 2020 were identified. Elevated D-dimer was defined by the laboratory-specific upper limit of normal (>230 ng/mL). Outcomes included critical illness (intensive care, mechanical ventilation, discharge to hospice, or death), thrombotic events, acute kidney injury, and death during admission. Among 2377 adults hospitalized with COVID-19 and ≥1 D-dimer measurement, 1823 (76%) had elevated D-dimer at presentation. Patients with elevated presenting baseline D-dimer were more likely than those with normal D-dimer to have critical illness (43.9% versus 18.5%; adjusted odds ratio, 2.4 [95% CI, 1.9-3.1]; P<0.001), any thrombotic event (19.4% versus 10.2%; adjusted odds ratio, 1.9 [95% CI, 1.4-2.6]; P<0.001), acute kidney injury (42.4% versus 19.0%; adjusted odds ratio, 2.4 [95% CI, 1.9-3.1]; P<0.001), and death (29.9% versus 10.8%; adjusted odds ratio, 2.1 [95% CI, 1.6-2.9]; P<0.001). Rates of adverse events increased with the magnitude of D-dimer elevation; individuals with presenting D-dimer >2000 ng/mL had the highest risk of critical illness (66%), thrombotic event (37.8%), acute kidney injury (58.3%), and death (47%). CONCLUSIONS: Abnormal D-dimer was frequently observed at admission with COVID-19 and was associated with higher incidence of critical illness, thrombotic events, acute kidney injury, and death. The optimal management of patients with elevated D-dimer in COVID-19 requires further study.


Subject(s)
Coronavirus Infections/blood , Coronavirus Infections/mortality , Critical Illness/epidemiology , Disease Progression , Fibrin Fibrinogen Degradation Products/metabolism , Hospital Mortality/trends , Pneumonia, Viral/blood , Pneumonia, Viral/mortality , Adult , Aged , Biomarkers/blood , COVID-19 , Cause of Death , Cohort Studies , Coronavirus Infections/physiopathology , Databases, Factual , Female , Hospitals, Urban , Humans , Male , Middle Aged , New York City/epidemiology , Pandemics , Pneumonia, Viral/physiopathology , Prevalence , Retrospective Studies , Risk Assessment , Severe Acute Respiratory Syndrome/blood , Severe Acute Respiratory Syndrome/mortality , Severe Acute Respiratory Syndrome/physiopathology , Severity of Illness Index
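
The adjusted odds ratios reported above come from multivariable logistic regression; a sketch of that estimation pattern with statsmodels follows. The covariate set is illustrative, not the study's exact adjustment variables.

```python
# Adjusted odds ratio sketch: logistic regression with an illustrative
# covariate set; the exponentiated coefficient is the adjusted OR.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("covid_admissions.csv")  # hypothetical analytic dataset
model = smf.logit(
    "critical_illness ~ d_dimer_elevated + age + sex + ckd + diabetes", data=df
).fit()

print("adjusted OR:", np.exp(model.params["d_dimer_elevated"]))
print("95% CI:", np.exp(model.conf_int().loc["d_dimer_elevated"]).values)
```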
6.
BMC Med Inform Decis Mak ; 20(1): 214, 2020 09 07.
Article in English | MEDLINE | ID: mdl-32894128

ABSTRACT

BACKGROUND: Automated systems that use machine learning to estimate a patient's risk of death are being developed to influence care. There remains sparse transparent reporting of model generalizability in different subpopulations, especially for implemented systems. METHODS: A prognostic study included adult admissions at a multi-site, academic medical center between 2015 and 2017. A predictive model for all-cause mortality (including initiation of hospice care) within 60 days of admission was developed. Model generalizability was assessed in temporal validation in the context of potential demographic bias. A subsequent prospective cohort study was conducted at the same sites between October 2018 and June 2019. Model performance during prospective validation was quantified with areas under the receiver operating characteristic and precision recall curves stratified by site. Prospective results include timeliness, positive predictive value, and the number of actionable predictions. RESULTS: Three years of development data included 128,941 inpatient admissions (94,733 unique patients) across sites where patients are mostly white (61%) and female (60%) and 4.2% led to death within 60 days. A random forest model incorporating 9614 predictors produced areas under the receiver operating characteristic and precision recall curves of 87.2 (95% CI, 86.1-88.2) and 28.0 (95% CI, 25.0-31.0) in temporal validation. Performance marginally diverged within sites as the patient mix shifted from development to validation (the proportion of patients from one site increased from 10% to 38%). Applied prospectively for nine months, 41,728 predictions were generated in real-time (median [IQR], 1.3 [0.9, 32] minutes). An operating criterion of 75% positive predictive value identified 104 predictions at very high risk (0.25%), where 65% (50 of 77 well-timed predictions) led to death within 60 days. CONCLUSION: Temporal validation demonstrates good model discrimination for 60-day mortality. Slight performance variations are observed across demographic subpopulations. The model was implemented prospectively and successfully produced meaningful estimates of risk within minutes of admission.


Subject(s)
Electronic Health Records , Hospitalization , Machine Learning , Patient Admission , Adolescent , Adult , Aged , Aged, 80 and over , Female , Humans , Male , Middle Aged , Mortality , Prognosis , Prospective Studies , Young Adult
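
A compact sketch of the develop-then-validate workflow this abstract describes: fit a random forest on the development window, score the temporal validation window with AUROC/AUPRC, and pick an operating threshold for a target positive predictive value. The data layout and threshold value are assumptions.

```python
# Temporal validation sketch for a 60-day mortality random forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_score)

train = pd.read_csv("admissions_2015_2017.csv")  # development window
valid = pd.read_csv("admissions_2018_2019.csv")  # temporal validation window

features = [c for c in train.columns if c != "death_60d"]  # all-numeric assumed
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(train[features], train["death_60d"])

scores = rf.predict_proba(valid[features])[:, 1]
print("AUROC:", roc_auc_score(valid["death_60d"], scores))
print("AUPRC:", average_precision_score(valid["death_60d"], scores))

# Operating point targeting ~75% PPV, as in the prospective deployment;
# in practice the threshold is tuned on development data.
threshold = 0.9
flagged = scores >= threshold
print("PPV at threshold:", precision_score(valid["death_60d"], flagged))
```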
7.
Gynecol Oncol ; 149(1): 22-27, 2018 04.
Article in English | MEDLINE | ID: mdl-29605045

ABSTRACT

OBJECTIVES: Black race has been associated with increased 30-day morbidity and mortality following surgery for endometrial cancer. Black women are also less likely to undergo laparoscopy when compared to white women. With the development of improved laparoscopic techniques and equipment, including the robotic platform, we sought to evaluate whether there has been a change in surgical approach for black women, and in turn, improvement in perioperative outcomes. METHODS: Using the American College of Surgeons' National Surgical Quality Improvement Project's database, patients who underwent hysterectomy for endometrial cancer from 2010 to 2015 were identified. Comparative analyses stratified by race and hysterectomy approach were performed to assess the relationship between race and perioperative outcomes. RESULTS: A total of 17,692 patients were identified: of these, 13,720 (77.5%) were white and 1553 (8.8%) were black. Black women were less likely to undergo laparoscopic hysterectomy compared to white women (49.3% vs 71.3%, p<0.0001). Rates of laparoscopy in both races increased over the 6-year period; however these consistently remained lower in black women each year. Black women had higher 30-day postoperative complication rates compared to white women (22.5% vs 13.6%, p<0.0001). When laparoscopic hysterectomies were isolated, there was no difference in postoperative complication rates between black and white women (9.2% vs 7.5%, p=0.1). CONCLUSIONS: Overall black women incur more postoperative complications compared to white women undergoing hysterectomy for endometrial cancer. However, laparoscopy may mitigate this disparity. Efforts should be made to maximize the utilization of minimally invasive surgery for the surgical management of endometrial cancer.


Subject(s)
Black People/statistics & numerical data , Endometrial Neoplasms/ethnology , Endometrial Neoplasms/surgery , Hysterectomy/statistics & numerical data , Laparoscopy/statistics & numerical data , White People/statistics & numerical data , Female , Healthcare Disparities/statistics & numerical data , Humans , Hysterectomy/adverse effects , Hysterectomy/methods , Laparoscopy/adverse effects , Laparoscopy/methods , Middle Aged , Postoperative Complications/epidemiology , Postoperative Complications/ethnology , Postoperative Complications/etiology , United States/epidemiology
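
As a worked example of the comparative analyses reported, the sketch below runs a chi-square test on the published 30-day complication proportions (22.5% of 1553 black patients vs 13.6% of 13,720 white patients). Counts are reconstructed from those percentages, so they are approximate.

```python
# Chi-square test on a 2x2 table rebuilt from the published proportions.
from scipy.stats import chi2_contingency

black_complications = round(0.225 * 1553)   # ~349
white_complications = round(0.136 * 13720)  # ~1866
table = [
    [black_complications, 1553 - black_complications],
    [white_complications, 13720 - white_complications],
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2g}")  # consistent with the reported p<0.0001
```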
9.
Semin Musculoskelet Radiol ; 21(1): 32-36, 2017 Feb.
Article in English | MEDLINE | ID: mdl-28253531

ABSTRACT

This article reviews examples of big data analyses in health care with a focus on radiology. We review the defining characteristics of big data, the use of natural language processing, traditional and novel data sources, and large clinical data repositories available for research. This article aims to invoke novel research ideas through a combination of examples of analyses and domain knowledge.


Subject(s)
Data Interpretation, Statistical , Radiology/statistics & numerical data , Humans
10.
Am J Addict ; 26(6): 581-586, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28799677

ABSTRACT

BACKGROUND AND OBJECTIVES: Missed visits are common in office-based buprenorphine treatment (OBOT). The feasibility of text message (TM) appointment reminders among OBOT patients is unknown. METHODS: This 6-month prospective cohort study provided TM reminders to OBOT program patients (N = 93). A feasibility survey was completed following delivery of TM reminders and at 6 months. RESULTS: Respondents reported that the reminders should be provided to all OBOT patients (100%) and helped them to adhere to their scheduled appointment (97%). At 6 months, there were no reports of intrusion to their privacy or disruption of daily activities due to the TM reminders. Most participants reported that the TM reminders were helpful in adhering to scheduled appointments (95%), that the reminders should be offered to all clinic patients (95%), and favored receiving only TM reminders rather than telephone reminders (95%). Barriers to adhering to scheduled appointment times included transportation difficulties (34%), not being able to take time off from school or work (31%), long clinic wait-times (9%), being hospitalized or sick (8%), feeling sad or depressed (6%), and child care (6%). CONCLUSIONS: This study demonstrated the acceptability and feasibility of TM appointment reminders in OBOT. Older age and longer duration in buprenorphine treatment did not diminish interest in receiving the TM intervention. Although OBOT patients expressed concern regarding the privacy of TM content sent from their providers, privacy issues were uncommon among this cohort. SCIENTIFIC SIGNIFICANCE: Findings from this study highlighted patient barriers to adherence to scheduled appointments. These barriers included transportation difficulties (34%), not being able to take time off from school or work (31%), long clinic wait-times (9%), and other factors that may confound the effect of future TM appointment reminder interventions. Further research is also required to assess 1) the level of system changes required to integrate TM appointment reminder tools with already existing electronic medical records and appointment records software; 2) acceptability among clinicians and administrators; and 3) financial and resource constraints to healthcare systems. (Am J Addict 2017;26:581-586).


Subject(s)
Buprenorphine/therapeutic use , Opiate Substitution Treatment , Opioid-Related Disorders/drug therapy , Reminder Systems , Text Messaging , Adult , Appointments and Schedules , Feasibility Studies , Female , Humans , Male , Narcotic Antagonists/therapeutic use , Opiate Substitution Treatment/methods , Opiate Substitution Treatment/psychology , Opiate Substitution Treatment/statistics & numerical data , Opioid-Related Disorders/epidemiology , Opioid-Related Disorders/psychology , Patient Acceptance of Health Care , Patient Compliance/psychology , Patient Compliance/statistics & numerical data , Prospective Studies , Reminder Systems/instrumentation , Reminder Systems/statistics & numerical data , United States
12.
J Transl Med ; 14(1): 235, 2016 08 05.
Article in English | MEDLINE | ID: mdl-27492440

ABSTRACT

BACKGROUND: Translational research is a key area of focus of the National Institutes of Health (NIH), as demonstrated by the substantial investment in the Clinical and Translational Science Award (CTSA) program. The goal of the CTSA program is to accelerate the translation of discoveries from the bench to the bedside and into communities. Different classification systems have been used to capture the spectrum of basic to clinical to population health research, with substantial differences in the number of categories and their definitions. Evaluation of the effectiveness of the CTSA program and of translational research in general is hampered by the lack of rigor in these definitions and their application. This study adds rigor to the classification process by creating a checklist to evaluate publications across the translational spectrum and operationalizes these classifications by building machine learning-based text classifiers to categorize these publications. METHODS: Based on collaboratively developed definitions, we created a detailed checklist for categories along the translational spectrum from T0 to T4. We applied the checklist to CTSA-linked publications to construct a set of coded publications for use in training machine learning-based text classifiers to classify publications within these categories. The training sets combined T1/T2 and T3/T4 categories due to low frequency of these publication types compared to the frequency of T0 publications. We then compared classifier performance across different algorithms and feature sets and applied the classifiers to all publications in PubMed indexed to CTSA grants. To validate the algorithm, we manually classified the articles with the top 100 scores from each classifier. RESULTS: The definitions and checklist facilitated classification and resulted in good inter-rater reliability for coding publications for the training set. Very good performance was achieved for the classifiers as represented by the area under the receiver operating characteristic curves (AUC), with an AUC of 0.94 for the T0 classifier, 0.84 for T1/T2, and 0.92 for T3/T4. CONCLUSIONS: The combination of definitions agreed upon by five CTSA hubs, a checklist that facilitates more uniform definition interpretation, and algorithms that perform well in classifying publications along the translational spectrum provide a basis for establishing and applying uniform definitions of translational research categories. The classification algorithms allow publication analyses that would not be feasible with manual classification, such as assessing the distribution and trends of publications across the CTSA network and comparing the categories of publications and their citations to assess knowledge transfer across the translational research spectrum.


Subject(s)
Machine Learning , Publications/classification , Translational Research, Biomedical , Algorithms , Area Under Curve , Documentation
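
One plausible realization of the text classifiers described above is a TF-IDF plus logistic regression pipeline, sketched below for the T0 category. The paper compared multiple algorithms and feature sets; this is only an illustrative choice, and the file and column names are assumed.

```python
# TF-IDF + logistic regression sketch for one translational category (T0),
# evaluated with cross-validated AUC as in the paper's reporting.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

coded = pd.read_csv("coded_publications.csv")  # assumed: abstract, is_T0
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=3),
    LogisticRegression(max_iter=1000),
)
auc = cross_val_score(pipe, coded["abstract"], coded["is_T0"],
                      scoring="roc_auc", cv=5)
print("T0 classifier AUC:", auc.mean())
```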
13.
Gynecol Oncol ; 142(3): 508-13, 2016 09.
Article in English | MEDLINE | ID: mdl-27288543

ABSTRACT

OBJECTIVE: To determine factors influencing discharge patterns after laparoscopic hysterectomy for endometrial cancer and to evaluate the safety of same-day discharge during the 30-day postoperative period. METHODS: Using the American College of Surgeons' National Surgical Quality Improvement Project's database, patients who underwent hysterectomy for endometrial cancer from 2010 to 2014 were identified and categorized by their hospital length of stay. Statistical analyses were performed to assess the relationship between hospital stay and demographics, medical comorbidities, intraoperative surgical factors and postoperative outcomes. RESULTS: A total of 9020 patients had laparoscopic hysterectomies for endometrial cancer and of these, 729 patients (8.1%) were successfully discharged on the day of surgery. These patients were younger and had lower body mass indexes and fewer medical comorbidities than patients who were admitted after their procedure. The same-day discharge group underwent surgical procedures of less complexity than the hospital admission group based on shorter operative times and fewer relative value units (RVUs). There was a lower rate of surgical site infections in the same-day discharge group, and no difference in rates of other postoperative complications including hospital readmissions and reoperations. CONCLUSIONS: Rates of laparoscopic hysterectomy for endometrial cancer are gradually increasing but the rates of same-day discharge have increased at a much slower rate. Same-day discharge has been successful despite differences in preoperative demographics, medical comorbidities and intraoperative surgical complexity. Overall postoperative complication rates were equivalent despite length of hospital stay, demonstrating the safety and feasibility of same-day discharge after laparoscopic hysterectomy for endometrial cancer.


Subject(s)
Ambulatory Surgical Procedures/methods , Endometrial Neoplasms/surgery , Hysterectomy/methods , Ambulatory Surgical Procedures/adverse effects , Ambulatory Surgical Procedures/statistics & numerical data , Endometrial Neoplasms/epidemiology , Female , Humans , Hysterectomy/adverse effects , Hysterectomy/statistics & numerical data , Laparoscopy/adverse effects , Laparoscopy/methods , Laparoscopy/statistics & numerical data , Middle Aged , United States/epidemiology
14.
JACC Clin Electrophysiol ; 10(5): 956-966, 2024 May.
Article in English | MEDLINE | ID: mdl-38703162

ABSTRACT

BACKGROUND: Prediction of drug-induced long QT syndrome (diLQTS) is of critical importance given its association with torsades de pointes. There is no reliable method for the outpatient prediction of diLQTS. OBJECTIVES: This study sought to evaluate the use of a convolutional neural network (CNN) applied to electrocardiograms (ECGs) to predict diLQTS in an outpatient population. METHODS: We identified all adult outpatients newly prescribed a QT-prolonging medication between January 1, 2003, and March 31, 2022, who had a 12-lead sinus ECG in the preceding 6 months. Using risk factor data and the ECG signal as inputs, the CNN QTNet was implemented in TensorFlow to predict diLQTS. RESULTS: Models were evaluated in a held-out test dataset of 44,386 patients (57% female) with a median age of 62 years. Compared with 3 other models relying on risk factors or ECG signal or baseline QTc alone, QTNet achieved the best (P < 0.001) performance with a mean area under the curve of 0.802 (95% CI: 0.786-0.818). In a survival analysis, QTNet also had the highest inverse probability of censorship-weighted area under the receiver-operating characteristic curve at day 2 (0.875; 95% CI: 0.848-0.904) and up to 6 months. In a subgroup analysis, QTNet performed best among males and patients ≤50 years or with baseline QTc <450 ms. In an external validation cohort of solely suburban outpatient practices, QTNet similarly maintained the highest predictive performance. CONCLUSIONS: An ECG-based CNN can accurately predict diLQTS in the outpatient setting while maintaining its predictive performance over time. In the outpatient setting, our model could identify higher-risk individuals who would benefit from closer monitoring.


Subject(s)
Artificial Intelligence , Electrocardiography , Long QT Syndrome , Neural Networks, Computer , Humans , Female , Male , Long QT Syndrome/chemically induced , Long QT Syndrome/diagnosis , Middle Aged , Aged , Adult , Risk Factors
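
The sketch below shows a two-input architecture in the spirit of QTNet: a 1-D convolutional branch over the ECG signal fused with a dense branch over tabular risk factors. The layer sizes, the 2500-sample signal length, and the risk-factor count are assumptions, not the published architecture.

```python
# Two-input Keras sketch: Conv1D over a 12-lead ECG plus dense risk factors.
import tensorflow as tf
from tensorflow.keras import layers

ecg_in = tf.keras.Input(shape=(2500, 12), name="ecg")   # samples x leads (assumed)
x = layers.Conv1D(32, 15, activation="relu")(ecg_in)
x = layers.MaxPooling1D(4)(x)
x = layers.Conv1D(64, 15, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)

rf_in = tf.keras.Input(shape=(8,), name="risk_factors")  # e.g. age, sex, QTc
r = layers.Dense(16, activation="relu")(rf_in)

merged = layers.concatenate([x, r])
out = layers.Dense(1, activation="sigmoid", name="diLQTS")(merged)

model = tf.keras.Model([ecg_in, rf_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])
model.summary()
```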
15.
JAMA Netw Open ; 7(3): e240357, 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38466307

ABSTRACT

Importance: By law, patients have immediate access to discharge notes in their medical records. Technical language and abbreviations make notes difficult to read and understand for a typical patient. Large language models (LLMs [eg, GPT-4]) have the potential to transform these notes into patient-friendly language and format. Objective: To determine whether an LLM can transform discharge summaries into a format that is more readable and understandable. Design, Setting, and Participants: This cross-sectional study evaluated a sample of the discharge summaries of adult patients discharged from the General Internal Medicine service at NYU (New York University) Langone Health from June 1 to 30, 2023. Patients discharged as deceased were excluded. All discharge summaries were processed by the LLM between July 26 and August 5, 2023. Interventions: A secure Health Insurance Portability and Accountability Act-compliant platform, Microsoft Azure OpenAI, was used to transform these discharge summaries into a patient-friendly format between July 26 and August 5, 2023. Main Outcomes and Measures: Outcomes included readability as measured by Flesch-Kincaid Grade Level and understandability using Patient Education Materials Assessment Tool (PEMAT) scores. Readability and understandability of the original discharge summaries were compared with the transformed, patient-friendly discharge summaries created through the LLM. As balancing metrics, accuracy and completeness of the patient-friendly version were measured. Results: Discharge summaries of 50 patients (31 female [62.0%] and 19 male [38.0%]) were included. The median patient age was 65.5 (IQR, 59.0-77.5) years. Mean (SD) Flesch-Kincaid Grade Level was significantly lower in the patient-friendly discharge summaries (6.2 [0.5] vs 11.0 [1.5]; P < .001). PEMAT understandability scores were significantly higher for patient-friendly discharge summaries (81% vs 13%; P < .001). Two physicians reviewed each patient-friendly discharge summary for accuracy on a 6-point scale, with 54 of 100 reviews (54.0%) giving the best possible rating of 6. Summaries were rated entirely complete in 56 reviews (56.0%). Eighteen reviews noted safety concerns, mostly involving omissions, but also several inaccurate statements (termed hallucinations). Conclusions and Relevance: The findings of this cross-sectional study of 50 discharge summaries suggest that LLMs can be used to translate discharge summaries into patient-friendly language and formats that are significantly more readable and understandable than discharge summaries as they appear in electronic health records. However, implementation will require improvements in accuracy, completeness, and safety. Given the safety concerns, initial implementation will require physician review.


Subject(s)
Artificial Intelligence , Inpatients , United States , Adult , Humans , Female , Male , Middle Aged , Aged , Cross-Sectional Studies , Patient Discharge , Electronic Health Records , Language
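
The readability outcome is straightforward to compute; the sketch below compares Flesch-Kincaid Grade Level for an original and a patient-friendly summary using the textstat package (PEMAT scoring, by contrast, is a manual instrument). File names are placeholders.

```python
# Flesch-Kincaid Grade Level comparison, original vs transformed summary.
import textstat

original = open("discharge_summary.txt").read()
friendly = open("discharge_summary_patient_friendly.txt").read()

print("original FKGL:", textstat.flesch_kincaid_grade(original))
print("patient-friendly FKGL:", textstat.flesch_kincaid_grade(friendly))
```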
16.
JAMIA Open ; 7(3): ooae078, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39156046

ABSTRACT

Objectives: Accelerating demand for patient messaging has impacted the practice of many providers. Messages are not recommended for urgent medical issues, but some do require rapid attention. This presents an opportunity for artificial intelligence (AI) methods to prioritize review of messages. Our study aimed to highlight some patient portal messages for prioritized review using a custom AI system integrated into the electronic health record (EHR). Materials and Methods: We developed a Bidirectional Encoder Representations from Transformers (BERT)-based large language model using 40 132 patient-sent messages to identify patterns involving high-acuity topics that warrant an immediate callback. The model was then implemented in 2 shared pools of patient messages managed by dozens of registered nurses. The primary outcome, the time before messages were read, was evaluated with a difference-in-differences methodology. Results: Model validation on an expert-reviewed dataset (n = 7260) yielded very promising performance (C-statistic = 97%, average-precision = 72%). A binarized output (precision = 67%, sensitivity = 63%) was integrated into the EHR for 2 years. In a pre-post analysis (n = 396 466), an improvement exceeding the background trend was observed in the time high-scoring messages sat unread (a 21-minute reduction, from 63 to 42 minutes, for messages sent outside business hours). Discussion: Our work shows great promise in improving care when AI is aligned with human workflow. Future work involves audience expansion, aiding users with suggested actions, and drafting responses. Conclusion: Many patients utilize patient portal messages, and while most messages are routine, a small fraction describe alarming symptoms. Our AI-based workflow shortens the turnaround time to get a trained clinician to review these messages to provide safer, higher-quality care.
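
A sketch of the inference side of such a triage system follows: score each incoming message with a fine-tuned BERT classifier and flag high-acuity ones for prioritized review. The checkpoint path, two-label head, and threshold are placeholders rather than the deployed system's values.

```python
# Inference sketch: acuity-score portal messages and flag high scorers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "portal-message-triage"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR).eval()

def acuity_score(message: str) -> float:
    inputs = tokenizer(message, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(high acuity)

msg = "I've had crushing chest pain for an hour and feel short of breath."
if acuity_score(msg) >= 0.8:  # operating point chosen for target precision
    print("flag for immediate nurse callback")
```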

17.
JAMA Netw Open ; 7(7): e2422399, 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39012633

ABSTRACT

Importance: Virtual patient-physician communications have increased since 2020 and negatively impacted primary care physician (PCP) well-being. Generative artificial intelligence (GenAI) drafts of patient messages could potentially reduce health care professional (HCP) workload and improve communication quality, but only if the drafts are considered useful. Objectives: To assess PCPs' perceptions of GenAI drafts and to examine linguistic characteristics associated with equity and perceived empathy. Design, Setting, and Participants: This cross-sectional quality improvement study tested the hypothesis that PCPs' ratings of GenAI drafts (created using the electronic health record [EHR] standard prompts) would be equivalent to HCP-generated responses on 3 dimensions. The study was conducted at NYU Langone Health using private patient-HCP communications at 3 internal medicine practices piloting GenAI. Exposures: Randomly assigned patient messages coupled with either an HCP message or the draft GenAI response. Main Outcomes and Measures: PCPs rated responses' information content quality (eg, relevance), using a Likert scale, communication quality (eg, verbosity), using a Likert scale, and whether they would use the draft or start anew (usable vs unusable). Branching logic further probed for empathy, personalization, and professionalism of responses. Computational linguistics methods assessed content differences in HCP vs GenAI responses, focusing on equity and empathy. Results: A total of 16 PCPs (8 [50.0%] female) reviewed 344 messages (175 GenAI drafted; 169 HCP drafted). Both GenAI and HCP responses were rated favorably. GenAI responses were rated higher for communication style than HCP responses (mean [SD], 3.70 [1.15] vs 3.38 [1.20]; P = .01, U = 12 568.5) but were similar to HCPs on information content (mean [SD], 3.53 [1.26] vs 3.41 [1.27]; P = .37; U = 13 981.0) and usable draft proportion (mean [SD], 0.69 [0.48] vs 0.65 [0.47], P = .49, t = -0.6842). Usable GenAI responses were considered more empathetic than usable HCP responses (32 of 86 [37.2%] vs 13 of 79 [16.5%]; difference, 125.5%), possibly attributable to more subjective (mean [SD], 0.54 [0.16] vs 0.31 [0.23]; P < .001; difference, 74.2%) and positive (mean [SD] polarity, 0.21 [0.14] vs 0.13 [0.25]; P = .02; difference, 61.5%) language; they were also numerically longer (mean [SD] word count, 90.5 [32.0] vs 65.4 [62.6]; difference, 38.4%), but the difference was not statistically significant (P = .07) and more linguistically complex (mean [SD] score, 125.2 [47.8] vs 95.4 [58.8]; P = .002; difference, 31.2%). Conclusions: In this cross-sectional study of PCP perceptions of an EHR-integrated GenAI chatbot, GenAI was found to communicate information better and with more empathy than HCPs, highlighting its potential to enhance patient-HCP communication. However, GenAI drafts were less readable than HCPs', a significant concern for patients with low health or English literacy.


Subject(s)
Physician-Patient Relations , Humans , Cross-Sectional Studies , Female , Male , Adult , Middle Aged , Communication , Quality Improvement , Artificial Intelligence , Physicians, Primary Care/psychology , Electronic Health Records , Language , Empathy , Attitude of Health Personnel
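
The polarity and subjectivity measures cited above can be computed with standard sentiment tooling; the sketch below uses TextBlob as one common option, though the study does not name its exact toolkit, and the draft text is invented.

```python
# Polarity and subjectivity of a (fabricated) draft reply via TextBlob.
from textblob import TextBlob

draft = ("I'm so sorry you're dealing with this pain. Please call us today "
         "so we can arrange to see you quickly.")
blob = TextBlob(draft)
print("polarity:", blob.sentiment.polarity)          # -1 (negative) .. +1 (positive)
print("subjectivity:", blob.sentiment.subjectivity)  # 0 (objective) .. 1 (subjective)
print("word count:", len(blob.words))
```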
18.
J Am Med Inform Assoc ; 31(9): 1983-1993, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-38778578

ABSTRACT

OBJECTIVES: To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients. To assess appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings. MATERIALS AND METHODS: Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with high likelihood of requiring follow-up, further sub-stratified as "definitely actionable" (DA) or "possibly actionable-clinical correlation" (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was primarily graded on accuracy identifying either DA or PA-CC findings, then secondarily for DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness via Likert scale. RESULTS: For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and 84.5% F-1. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and 85.3% F-1. No findings were "hallucinated" outright. However, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of True Positive AI-generated summaries required no or minor revision. CONCLUSION: GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but rarely included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via "human-in-the-loop" workflows remains critical for clinical implementation.


Subject(s)
Artificial Intelligence , Electronic Health Records , Incidental Findings , Humans , Emergency Service, Hospital , Radiology Information Systems
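
Grading the model's outputs against manual review reduces to standard classification metrics; a minimal sketch follows, with the file and column names assumed.

```python
# Recall/precision/F1 of GPT-4 flags vs manual review on the test set.
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

graded = pd.read_csv("graded_reports.csv")  # assumed: gold_label, gpt4_label (0/1)
y_true, y_pred = graded["gold_label"], graded["gpt4_label"]

print("recall:", recall_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```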
19.
PLOS Digit Health ; 3(7): e0000394, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39042600

ABSTRACT

BACKGROUND: Healthcare crowdsourcing events (e.g. hackathons) facilitate interdisciplinary collaboration and encourage innovation. Peer-reviewed research has not yet considered a healthcare crowdsourcing event focusing on generative artificial intelligence (GenAI), which generates text in response to detailed prompts and has vast potential for improving the efficiency of healthcare organizations. Our event, the New York University Langone Health (NYULH) Prompt-a-thon, primarily sought to inspire and build AI fluency within our diverse NYULH community, and foster collaboration and innovation. Secondarily, we sought to analyze how participants' experience was influenced by their prior GenAI exposure and whether they received sample prompts during the workshop. METHODS: Executing the event required the assembly of an expert planning committee, who recruited diverse participants, anticipated technological challenges, and prepared the event. The event was composed of didactics and workshop sessions, which educated and allowed participants to experiment with using GenAI on real healthcare data. Participants were given novel "project cards" associated with each dataset that illuminated the tasks GenAI could perform and, for a random set of teams, sample prompts to help them achieve each task (the public repository of project cards can be found at https://github.com/smallw03/NYULH-Generative-AI-Prompt-a-thon-Project-Cards). Afterwards, participants were asked to fill out a survey with 7-point Likert-style questions. RESULTS: Our event was successful in educating and inspiring hundreds of enthusiastic in-person and virtual participants across our organization on the responsible use of GenAI in a low-cost and technologically feasible manner. All participants responded positively, on average, to each of the survey questions (e.g., confidence in their ability to use and trust GenAI). Critically, participants reported a self-perceived increase in their likelihood of using and promoting colleagues' use of GenAI for their daily work. No significant differences were seen in the surveys of those who received sample prompts with their project task descriptions. CONCLUSION: The first healthcare Prompt-a-thon was an overwhelming success, with minimal technological failures, positive responses from diverse participants and staff, and evidence of post-event engagement. These findings will be integral to planning future events at our institution, and to others looking to engage their workforce in utilizing GenAI.

20.
Eur Heart J Acute Cardiovasc Care ; 13(6): 472-480, 2024 Jun 30.
Article in English | MEDLINE | ID: mdl-38518758

ABSTRACT

AIMS: Myocardial infarction and heart failure are major cardiovascular diseases that affect millions of people in the USA, with morbidity and mortality being highest among patients who develop cardiogenic shock. Early recognition of cardiogenic shock allows prompt implementation of treatment measures. Our objective is to develop a new dynamic risk score, called CShock, to improve early detection of cardiogenic shock in the cardiac intensive care unit (ICU). METHODS AND RESULTS: We developed and externally validated a deep learning-based risk stratification tool, called CShock, for patients admitted into the cardiac ICU with acute decompensated heart failure and/or myocardial infarction to predict the onset of cardiogenic shock. We prepared a cardiac ICU dataset using the Medical Information Mart for Intensive Care-III database by annotating with physician-adjudicated outcomes. This dataset, which consisted of 1500 patients (204 with cardiogenic/mixed shock), was then used to train CShock. The features used to train the model for CShock included patient demographics, cardiac ICU admission diagnoses, routinely measured laboratory values and vital signs, and relevant features manually extracted from echocardiogram and left heart catheterization reports. We externally validated the risk model on the New York University (NYU) Langone Health cardiac ICU database, which was also annotated with physician-adjudicated outcomes. The external validation cohort consisted of 131 patients, 25 of whom experienced cardiogenic/mixed shock. CShock achieved an area under the receiver operating characteristic curve (AUROC) of 0.821 (95% CI 0.792-0.850). CShock was externally validated in the more contemporary NYU cohort and achieved an AUROC of 0.800 (95% CI 0.717-0.884), demonstrating its generalizability in other cardiac ICUs. An elevated heart rate was most predictive of cardiogenic shock development based on Shapley values. The remaining top-10 predictors were an admission diagnosis of myocardial infarction with ST-segment elevation, an admission diagnosis of acute decompensated heart failure, Braden Scale, Glasgow Coma Scale, blood urea nitrogen, systolic blood pressure, serum chloride, serum sodium, and arterial blood pH. CONCLUSION: The novel CShock score has the potential to provide automated detection and early warning for cardiogenic shock and improve the outcomes for millions of patients who suffer from myocardial infarction and heart failure.


Subject(s)
Machine Learning , Shock, Cardiogenic , Humans , Shock, Cardiogenic/diagnosis , Male , Female , Risk Assessment/methods , Aged , Middle Aged , Coronary Care Units , Early Diagnosis , Retrospective Studies , Risk Factors , ROC Curve , Hospital Mortality/trends , Myocardial Infarction/diagnosis , Myocardial Infarction/complications , Intensive Care Units
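
The Shapley-value ranking of predictors reported above follows the usual SHAP workflow; the sketch below fits a gradient-boosted stand-in on synthetic data purely so the example runs end to end, then ranks features by mean absolute SHAP value. Nothing here reflects the actual CShock model or data.

```python
# SHAP feature-ranking sketch on a synthetic stand-in for a risk model.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "heart_rate": rng.normal(90, 20, 500),
    "systolic_bp": rng.normal(110, 25, 500),
    "bun": rng.normal(25, 10, 500),
})
y = (X["heart_rate"] + rng.normal(0, 10, 500) > 100).astype(int)  # toy label

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.Explainer(model, X)  # dispatches to a tree explainer here
sv = explainer(X)
shap.plots.bar(sv)  # global ranking of predictors by mean |SHAP|
```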