Results 1 - 12 of 12
1.
Article in English | MEDLINE | ID: mdl-38778578

ABSTRACT

OBJECTIVES: To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients. To assess appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings. MATERIALS AND METHODS: Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with high likelihood of requiring follow-up, further sub-stratified as "definitely actionable" (DA) or "possibly actionable-clinical correlation" (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was primarily graded on accuracy identifying either DA or PA-CC findings, then secondarily for DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness via Likert scale. RESULTS: For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and 84.5% F-1. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and 85.3% F-1. No findings were "hallucinated" outright. However, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of True Positive AI-generated summaries required no or minor revision. CONCLUSION: GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but rarely included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via "human-in-the-loop" workflows remains critical for clinical implementation.
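As a quick consistency check on the reported metrics, the F-1 values above follow directly from the stated precision and recall, since F1 is their harmonic mean; a minimal sketch in Python:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported metrics: primary outcome (DA or PA-CC) and secondary outcome (DA only)
print(round(f1_score(0.736, 0.993), 3))  # → 0.845, matching the reported 84.5%
print(round(f1_score(0.773, 0.952), 3))  # → 0.853, matching the reported 85.3%
```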

2.
medRxiv ; 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38405784

ABSTRACT

Importance: Large language models (LLMs) are crucial for medical tasks. Ensuring their reliability is vital to avoid false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Objective: To evaluate ChatGPT and LlaMA-2 performance in extracting MMSE and CDR scores, including their associated dates. Methods: Our data consisted of 135,307 clinical notes (Jan 12, 2010 to May 24, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed with ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. Results: For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rates of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rates of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date.
Conclusions: In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy, with better performance compared to LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
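Fleiss' kappa, the inter-rater agreement statistic used in this study, can be computed from a subjects-by-categories matrix of rating counts with a fixed number of raters per subject; a minimal sketch (the example matrices below are illustrative, not the study's data):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories matrix of rating counts,
    assuming the same number of raters rated every subject."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    total = n_subjects * n_raters
    # Overall proportion of ratings falling in each category
    p = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Per-subject observed agreement among raters
    P = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
         for row in counts]
    P_bar = sum(P) / n_subjects          # mean observed agreement
    P_e = sum(pj * pj for pj in p)       # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Two raters, two categories: perfect agreement yields kappa = 1
perfect = fleiss_kappa([[2, 0], [0, 2], [2, 0]])
```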

3.
Infect Control Hosp Epidemiol ; 45(6): 717-725, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38288606

ABSTRACT

BACKGROUND: There is a paucity of data guiding treatment duration of oral vancomycin for Clostridioides difficile infection (CDI) in patients requiring concomitant systemic antibiotics. OBJECTIVES: To evaluate prescribing practices of vancomycin for CDI in patients who required concurrent systemic antibiotics and to determine whether a prolonged duration of vancomycin (>14 days), compared to a standard duration (10-14 days), decreased CDI recurrence. METHODS: In this retrospective cohort study, we evaluated adult hospitalized patients with an initial episode of CDI who were treated with vancomycin and who received overlapping systemic antibiotics for >72 hours. Outcomes of interest included CDI recurrence and isolation of vancomycin-resistant Enterococcus (VRE). RESULTS: Among the 218 patients included, 36% received a standard duration and 64% received a prolonged duration of treatment, for a median of 13 days (11-14) and 20 days (16-26), respectively. Patients who received a prolonged duration had a longer median duration of systemic antibiotic overlap with vancomycin (11 vs 8 days; P < .001) and significantly more carbapenem use and infectious disease consultation. Recurrence at 8 weeks (12% standard duration vs 8% prolonged duration; P = .367), recurrence at 6 months (15% standard duration vs 10% prolonged duration; P = .240), and VRE isolation (3% standard duration vs 9% prolonged duration; P = .083) were not significantly different between groups. Discontinuation of vancomycin prior to completion of antibiotics was an independent predictor of 8-week recurrence on multivariable logistic regression (OR, 4.8; 95% CI, 1.3-18.1). CONCLUSIONS: Oral vancomycin prescribing relative to the systemic antibiotic end date may affect CDI recurrence to a greater extent than total vancomycin duration alone. Further studies are needed to confirm these findings.
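The multivariable result above (OR, 4.8; 95% CI, 1.3-18.1) is obtained by exponentiating a logistic-regression coefficient and its Wald confidence bounds; a sketch of that conversion, where the coefficient and standard error below are hypothetical values chosen only to illustrate the arithmetic, not taken from the study's model:

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logistic-regression coefficient and its standard error
    into an odds ratio with a Wald 95% confidence interval."""
    return math.exp(beta), (math.exp(beta - z * se), math.exp(beta + z * se))

# Hypothetical coefficient/SE producing an OR near the reported 4.8
or_point, (ci_low, ci_high) = odds_ratio_with_ci(math.log(4.8), 0.67)
```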


Asunto(s)
Antibacterianos , Clostridioides difficile , Infecciones por Clostridium , Recurrencia , Vancomicina , Humanos , Vancomicina/administración & dosificación , Vancomicina/uso terapéutico , Estudios Retrospectivos , Masculino , Femenino , Antibacterianos/uso terapéutico , Antibacterianos/administración & dosificación , Persona de Mediana Edad , Infecciones por Clostridium/tratamiento farmacológico , Anciano , Administración Oral , Anciano de 80 o más Años , Esquema de Medicación , Enterococos Resistentes a la Vancomicina , Adulto
4.
Res Sq ; 2023 Jul 03.
Article in English | MEDLINE | ID: mdl-37461545

ABSTRACT

Pathology reports are considered the gold standard in medical research due to their comprehensive and accurate diagnostic information. Natural language processing (NLP) techniques have been developed to automate information extraction from pathology reports. However, existing studies suffer from two significant limitations. First, they typically frame their tasks as report classification, which restricts the granularity of extracted information. Second, they often fail to generalize to unseen reports due to variations in language, negation, and human error. To overcome these challenges, we propose a BERT (bidirectional encoder representations from transformers) named entity recognition (NER) system to extract key diagnostic elements from pathology reports. We also introduce four data augmentation methods to improve the robustness of our model. Trained and evaluated on 1438 annotated breast pathology reports, acquired from a large medical center in the United States, our BERT model trained with data augmentation achieves an entity F1-score of 0.916 on an internal test set, surpassing the BERT baseline (0.843). We further assessed the model's generalizability using an external validation dataset from the United Arab Emirates, where our model maintained satisfactory performance (F1-score 0.860). Our findings demonstrate that our NER systems can effectively extract fine-grained information from widely diverse medical reports, offering the potential for large-scale information extraction in a wide range of medical and AI research. We publish our code at https://github.com/nyukat/pathology_extraction.
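The abstract does not detail the four augmentation methods; one common augmentation for NER training data is mention replacement, which swaps an entity span for another same-type surface form while keeping BIO labels aligned. A sketch under that assumption (the entity types, tokens, and lexicon below are hypothetical):

```python
import random

def substitute_entities(tokens, labels, lexicon, rng):
    """Mention-replacement augmentation: replace each B-/I- entity span with
    a same-type surface form drawn from a lexicon, re-emitting BIO labels so
    they stay aligned with the new tokens."""
    out_toks, out_labs, i = [], [], 0
    while i < len(tokens):
        lab = labels[i]
        if lab.startswith("B-"):
            etype = lab[2:]
            j = i + 1
            while j < len(tokens) and labels[j] == f"I-{etype}":
                j += 1
            repl = rng.choice(lexicon[etype]).split()
            out_toks.extend(repl)
            out_labs.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(repl) - 1))
            i = j
        else:
            out_toks.append(tokens[i])
            out_labs.append(lab)
            i += 1
    return out_toks, out_labs

# Hypothetical example: swap a diagnosis span and a grade span
tokens = ["invasive", "ductal", "carcinoma", ",", "grade", "2"]
labels = ["B-DIAG", "I-DIAG", "I-DIAG", "O", "B-GRADE", "I-GRADE"]
lexicon = {"DIAG": ["lobular carcinoma in situ"], "GRADE": ["grade 3"]}
aug_tokens, aug_labels = substitute_entities(tokens, labels, lexicon, random.Random(0))
```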

5.
Appl Clin Inform ; 13(3): 632-640, 2022 05.
Article in English | MEDLINE | ID: mdl-35896506

ABSTRACT

BACKGROUND: We previously developed and validated a predictive model to help clinicians identify hospitalized adults with coronavirus disease 2019 (COVID-19) who may be ready for discharge given their low risk of adverse events. Whether this algorithm can prompt more timely discharge for stable patients in practice is unknown. OBJECTIVES: The aim of the study is to estimate the effect of displaying risk scores on length of stay (LOS). METHODS: We integrated model output into the electronic health record (EHR) at four hospitals in one health system by displaying a green/orange/red score indicating low/moderate/high-risk in a patient list column and a larger COVID-19 summary report visible for each patient. Display of the score was pseudo-randomized 1:1 into intervention and control arms using a patient identifier passed to the model execution code. Intervention effect was assessed by comparing LOS between intervention and control groups. Adverse safety outcomes of death, hospice, and re-presentation were tested separately and as a composite indicator. We tracked adoption and sustained use through daily counts of score displays. RESULTS: Enrolling 1,010 patients from May 15, 2020 to December 7, 2020, the trial found no detectable difference in LOS. The intervention had no impact on safety indicators of death, hospice or re-presentation after discharge. The scores were displayed consistently throughout the study period but the study lacks a causally linked process measure of provider actions based on the score. Secondary analysis revealed complex dynamics in LOS temporally, by primary symptom, and hospital location. CONCLUSION: An AI-based COVID-19 risk score displayed passively to clinicians during routine care of hospitalized adults with COVID-19 was safe but had no detectable impact on LOS. 
Health technology challenges such as insufficient adoption, nonuniform use, and limited provider trust, compounded with temporal factors of the COVID-19 pandemic, may have contributed to the null result. TRIAL REGISTRATION: ClinicalTrials.gov identifier: NCT04570488.
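The trial's 1:1 pseudo-randomization was driven by a patient identifier passed to the model execution code; the exact rule is not given in the abstract, but one common deterministic scheme hashes the identifier so that a patient's arm assignment is stable across encounters. A sketch, assuming such a hash-parity rule:

```python
import hashlib

def assign_arm(patient_id: str) -> str:
    """Deterministic 1:1 arm assignment derived from a patient identifier:
    hash the ID and use the parity of the first digest byte, so the same
    patient always lands in the same arm."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    return "intervention" if digest[0] % 2 == 0 else "control"
```

Because the assignment is a pure function of the identifier, the display logic needs no stored randomization table, at the cost of the allocation being only pseudo-random.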


Subject(s)
COVID-19; Adult; COVID-19/epidemiology; Hospitalization; Humans; Pandemics; Patient Discharge; SARS-CoV-2; Treatment Outcome
8.
NPJ Digit Med ; 3: 130, 2020.
Article in English | MEDLINE | ID: mdl-33083565

ABSTRACT

The COVID-19 pandemic has challenged front-line clinical decision-making, leading to numerous published prognostic tools. However, few models have been prospectively validated and none report implementation in practice. Here, we use 3345 retrospective and 474 prospective hospitalizations to develop and validate a parsimonious model to identify patients with favorable outcomes within 96 h of a prediction, based on real-time lab values, vital signs, and oxygen support variables. In retrospective and prospective validation, the model achieves high average precision (88.6% [95% CI: 88.4-88.7] and 90.8% [90.8-90.8]) and discrimination (95.1% [95.1-95.2] and 86.8% [86.8-86.9]), respectively. We implemented and integrated the model into the EHR, achieving a positive predictive value of 93.3% with 41% sensitivity. Preliminary results suggest clinicians are adopting these scores into their clinical workflows.

9.
BMC Med Inform Decis Mak ; 20(1): 214, 2020 09 07.
Article in English | MEDLINE | ID: mdl-32894128

ABSTRACT

BACKGROUND: Automated systems that use machine learning to estimate a patient's risk of death are being developed to influence care. There remains sparse transparent reporting of model generalizability in different subpopulations, especially for implemented systems. METHODS: A prognostic study included adult admissions at a multi-site, academic medical center between 2015 and 2017. A predictive model for all-cause mortality (including initiation of hospice care) within 60 days of admission was developed. Model generalizability is assessed in temporal validation in the context of potential demographic bias. A subsequent prospective cohort study was conducted at the same sites between October 2018 and June 2019. Model performance during prospective validation was quantified with areas under the receiver operating characteristic and precision recall curves stratified by site. Prospective results include timeliness, positive predictive value, and the number of actionable predictions. RESULTS: Three years of development data included 128,941 inpatient admissions (94,733 unique patients) across sites where patients are mostly white (61%) and female (60%) and 4.2% led to death within 60 days. A random forest model incorporating 9614 predictors produced areas under the receiver operating characteristic and precision recall curves of 87.2 (95% CI, 86.1-88.2) and 28.0 (95% CI, 25.0-31.0) in temporal validation. Performance marginally diverges within sites as the patient mix shifts from development to validation (the share of patients from one site increases from 10% to 38%). Applied prospectively for nine months, 41,728 predictions were generated in real-time (median [IQR], 1.3 [0.9, 32] minutes). An operating criterion of 75% positive predictive value identified 104 predictions at very high risk (0.25%), where 65% (50 from 77 well-timed predictions) led to death within 60 days. CONCLUSION: Temporal validation demonstrates good model discrimination for 60-day mortality.
Slight performance variations are observed across demographic subpopulations. The model was implemented prospectively and successfully produced meaningful estimates of risk within minutes of admission.
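The operating criterion of 75% positive predictive value implies choosing a score threshold on labeled validation data; one way to do this is to scan candidate thresholds from highest score to lowest and keep the lowest threshold whose cumulative PPV still meets the target (a sketch of that idea, not the study's actual procedure):

```python
def threshold_for_ppv(scores, labels, target_ppv):
    """Return the lowest score threshold whose positive predictive value,
    computed over all predictions at or above it, meets target_ppv.
    Returns None if no threshold achieves the target."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest scores first
    tp = fp = 0
    best = None
    for score, label in pairs:
        tp += label
        fp += 1 - label
        if tp / (tp + fp) >= target_ppv:
            best = score  # cumulative PPV still meets target at this cutoff
    return best
```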


Subject(s)
Electronic Health Records; Hospitalization; Machine Learning; Patient Admission; Adolescent; Adult; Aged; Aged, 80 and over; Female; Humans; Male; Middle Aged; Mortality; Prognosis; Prospective Studies; Young Adult
10.
JAMIA Open ; 3(2): 243-251, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32734165

ABSTRACT

OBJECTIVE: One primary consideration when developing predictive models is downstream effects on future model performance. We conduct experiments to quantify the effects of experimental design choices, namely cohort selection and internal validation methods, on (estimated) real-world model performance. MATERIALS AND METHODS: Four years of hospitalizations are used to develop a 1-year mortality prediction model (composite of death or initiation of hospice care). Two common methods to select appropriate patient visits from their encounter history (backwards-from-outcome and forwards-from-admission) are combined with 2 testing cohorts (random and temporal validation). Two models are trained under otherwise identical conditions, and their performances compared. Operating thresholds are selected in each test set and applied to a "real-world" cohort of labeled admissions from another, unused year. RESULTS: Backwards-from-outcome cohort selection retains 25% of candidate admissions (n = 23 579), whereas forwards-from-admission selection includes many more (n = 92 148). Both selection methods produce similar performances when applied to a random test set. However, when applied to the temporally defined "real-world" set, forwards-from-admission yields higher areas under the ROC and precision recall curves (88.3% and 56.5% vs. 83.2% and 41.6%). DISCUSSION: A backwards-from-outcome experiment manipulates raw training data, simplifying the experiment. This manipulated data no longer resembles real-world data, resulting in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance. CONCLUSION: Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance. 
LAY SUMMARY: The routine care of patients stands to benefit greatly from assistive technologies, including data-driven risk assessment. Already, many different machine learning and artificial intelligence applications are being developed from complex electronic health record data. To overcome challenges that arise from such data, researchers often start with simple experimental approaches to test their work. One key component is how patients (and their healthcare visits) are selected for the study from the pool of all patients seen. Another is how the group of patients used to create the risk estimator differs from the group used to evaluate how well it works. These choices complicate how the experimental setting compares to the real-world application to patients. For example, different selection approaches that depend on each patient's future outcome can simplify the experiment but are impractical upon implementation as these data are unavailable. We show that this kind of "backwards" experiment optimistically estimates how well the model performs. Instead, our results advocate for experiments that select patients in a "forwards" manner and "temporal" validation that approximates training on past data and implementing on future data. More robust results help gauge the clinical utility of recent works and aid decision-making before implementation into practice.
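The forwards-from-admission design with temporal validation described above can be sketched as a date-based split that keeps every admission (no filtering on future outcomes), so the test set imitates deploying on future patients; the record structure and field names below are hypothetical:

```python
from datetime import date

def temporal_split(admissions, cutoff):
    """Forwards-from-admission cohorts with temporal validation: every
    admission is eligible regardless of its eventual outcome, and the
    train/test split is by admission date relative to a cutoff."""
    train = [a for a in admissions if a["admit_date"] < cutoff]
    test = [a for a in admissions if a["admit_date"] >= cutoff]
    return train, test

# Hypothetical admissions: train on earlier years, test on the later year
admissions = [
    {"id": 1, "admit_date": date(2015, 3, 1)},
    {"id": 2, "admit_date": date(2016, 7, 9)},
    {"id": 3, "admit_date": date(2017, 1, 15)},
]
train, test = temporal_split(admissions, date(2017, 1, 1))
```

A backwards-from-outcome experiment would instead select visits based on each patient's eventual outcome, which is exactly the information unavailable at deployment time.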

12.
Biomed Eng Online ; 17(1): 169, 2018 Nov 12.
Article in English | MEDLINE | ID: mdl-30419903

ABSTRACT

BACKGROUND: Mechanical ventilation is an essential therapy to support critically ill patients in respiratory failure. Current standards of care consist of generalised approaches, such as the use of positive end expiratory pressure to inspired oxygen fraction (PEEP-FiO2) tables, which fail to account for inter- and intra-patient variability. The benefits of higher or lower tidal volume, PEEP, and other settings are highly debated and no consensus has been reached. Moreover, clinicians implicitly account for patient-specific factors such as disease condition and progression as they manually titrate ventilator settings. Hence, care is highly variable and potentially often non-optimal. These conditions create a situation that could benefit greatly from an engineered approach. The overall goal is a review of ventilation that is accessible to both clinicians and engineers, to bridge the divide between the two fields and enable collaboration to improve patient care and outcomes. This review does not take the form of a typical systematic review. Instead, it defines the standard terminology and introduces key clinical and biomedical measurements, before introducing the key clinical studies and their influence on clinical practice, which in turn flows into the needs and requirements around how biomedical engineering research can play a role in improving care. Given the significant clinical research to date and its impact on this complex area of care, this review thus provides a tutorial introduction to the state of the art from a biomedical engineering perspective. DISCUSSION: This review presents the significant clinical aspects and variables of ventilation management, the potential risks associated with suboptimal ventilation management, and a review of the major recent attempts to improve ventilation in the context of these variables.
The unique aspect of this review is a focus on these key elements relevant to engineering new approaches. In particular, it demonstrates the need for ventilation strategies which consider, and directly account for, the significant differences in patient condition, disease etiology, and progression between and within patients, with the subsequent requirement for optimal ventilation strategies to titrate to patient- and time-specific conditions. CONCLUSION: Engineered, protective lung strategies that can directly account for and manage inter- and intra-patient variability thus offer great potential to improve both individual care and cohort clinical outcomes.
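A PEEP-FiO2 table of the kind the review critiques is essentially a stepwise lookup from inspired oxygen fraction to a ventilator pressure setting, applied identically to every patient; a sketch with illustrative values only (not a clinical reference), to show the generalized structure that ignores patient-specific condition:

```python
# Illustrative FiO2 → PEEP steps; real clinical tables differ and must not
# be taken from this sketch.
PEEP_FIO2_STEPS = [
    (0.3, 5), (0.4, 8), (0.5, 10), (0.6, 10),
    (0.7, 14), (0.8, 14), (0.9, 18), (1.0, 24),
]

def peep_for_fio2(fio2):
    """Look up the tabulated PEEP (cmH2O) for a given FiO2 (fraction),
    using the first step whose FiO2 bound is not exceeded."""
    for step_fio2, peep in PEEP_FIO2_STEPS:
        if fio2 <= step_fio2:
            return peep
    return PEEP_FIO2_STEPS[-1][1]
```

The review's point is that any such fixed mapping, whatever its values, cannot titrate to patient- and time-specific conditions; that is the gap engineered, model-based approaches aim to fill.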


Subject(s)
Biomedical Engineering; Critical Care; Positive-Pressure Respiration/instrumentation; Respiration, Artificial/instrumentation; Animals; Critical Illness; Humans; Lung; Lung Injury/etiology; Oscillometry; Oxygen/blood; Oxygen/chemistry; Positive-Pressure Respiration/methods; Pressure; Respiration, Artificial/methods; Respiratory Distress Syndrome/therapy; Risk; Tidal Volume; Ventilators, Mechanical