ABSTRACT
Pressure injury (PI) detection is challenging, especially in dark skin tones, due to the unreliability of visual inspection. Thermography may serve as a viable alternative, as temperature differences in the skin can indicate impending tissue damage. Although deep learning models hold considerable promise for reliably detecting PI, existing work fails to evaluate performance on diverse skin tones and varying data collection protocols. We collected a new dataset of 35 participants, focused on darker skin tones, in which temperature differences are induced through cooling and cupping protocols. The dataset includes different cameras, lighting, patient poses, and camera distances. We compare the performance of three convolutional neural network (CNN) models trained on either the thermal or the optical images across all skin tones. Our results suggest that the thermography-based CNN is robust to data collection protocols. Moreover, the visual explanation often captures the region of interest without requiring explicit bounding box labels.
ABSTRACT
Musculoskeletal (MSK) pain leads to significant healthcare utilization, decreased productivity, and disability globally. Due to its complex etiology, MSK pain is often chronic and challenging to manage effectively. Disparities in pain management, influenced by provider implicit biases and by patient race, gender, age, and socioeconomic status, contribute to inconsistent outcomes. Interventional radiology (IR) provides innovative solutions for MSK pain through minimally invasive procedures, which can alleviate symptoms and reduce reliance on opioids. However, IR services may be underutilized because of current treatment paradigms and referral patterns, and in areas with limited access to care. Artificial intelligence (AI) presents a promising avenue to address these inequities by analyzing large datasets to identify disparities in pain management, recognizing implicit biases, improving cultural competence, and enhancing pain assessment through multimodal data analysis. Additionally, once AI sifting through the electronic medical record identifies them as candidates, patients who may benefit from an IR pain procedure for their MSK pain can receive more information through their providers. By leveraging AI, healthcare providers can potentially mitigate their biases while ensuring more equitable pain management and better overall outcomes for patients.
Subject(s)
Artificial Intelligence; Healthcare Disparities; Pain Management; Humans; Radiography, Interventional; Musculoskeletal Diseases/therapy; Musculoskeletal Diseases/diagnostic imaging; Attitude of Health Personnel; Risk Factors; Health Status Disparities; Health Knowledge, Attitudes, Practice; Treatment Outcome; Health Services Accessibility
ABSTRACT
Reports of Large Language Models (LLMs) passing board examinations have spurred medical enthusiasm for their clinical integration. Through a narrative review, we reflect upon the skill shifts necessary for clinicians to succeed in an LLM-enabled world, achieving benefits while minimizing risks. We suggest how medical education must evolve to prepare clinicians capable of navigating human-AI systems.
Subject(s)
Artificial Intelligence; Humans; Education, Medical/methods; Clinical Competence; Language; Delivery of Health Care
ABSTRACT
OBJECTIVE: Pulse oximetry, a ubiquitous vital sign in modern medicine, has inequitable accuracy that disproportionately affects minority Black and Hispanic patients, with associated increases in mortality, organ dysfunction, and oxygen therapy. Previous retrospective studies used self-reported race or ethnicity as a surrogate for skin tone, which is believed to be the root cause of the disparity. Our objective was to determine the utility of skin tone in explaining pulse oximetry discrepancies. DESIGN: Prospective cohort study. SETTING: Patients were eligible if they had pulse oximetry recorded up to 5 minutes before arterial blood gas (ABG) measurements. Skin tone was measured using administered visual scales, reflectance colorimetry, and reflectance spectrophotometry. PARTICIPANTS: Admitted hospital patients at Duke University Hospital. INTERVENTIONS: None. MEASUREMENTS AND MAIN RESULTS: Sao2-Spo2 bias, variation of bias, and accuracy root mean square, comparing pulse oximetry and ABG measurements. Linear mixed-effects models were fitted to estimate Sao2-Spo2 bias while accounting for clinical confounders. One hundred twenty-eight patients (57 Black, 56 White) with 521 ABG-pulse oximetry pairs were recruited. Skin tone data were prospectively collected using six measurement methods, generating eight measurements. The collected skin tone measurements differed from one another and overlapped across self-reported racial groups, suggesting that skin tone could provide information beyond self-reported race. Among the eight skin tone measurements in this study, and compared with self-reported race, the Monk Scale had the best relationship with differences in pulse oximetry bias (point estimate: -2.40%; 95% CI, -4.32% to -0.48%; p = 0.01) when comparing patients with lighter and darker skin tones. CONCLUSIONS: We found clinical performance differences in pulse oximetry, especially in darker skin tones.
Additional studies are needed to determine the relative contributions of skin tone measures and other potential factors on pulse oximetry discrepancies.
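The accuracy metrics reported above, Sao2-Spo2 bias and accuracy root mean square (ARMS), have standard definitions. The following minimal Python sketch, using hypothetical paired readings rather than study data, shows how they are computed from time-aligned SpO2/SaO2 pairs.

```python
import math

def bias_and_arms(pairs):
    """Summarize paired arterial (SaO2) and pulse oximeter (SpO2) readings.

    pairs: list of (sao2, spo2) percentages measured close in time.
    Returns (mean Sao2-Spo2 bias, accuracy root mean square).
    """
    diffs = [sao2 - spo2 for sao2, spo2 in pairs]
    bias = sum(diffs) / len(diffs)  # mean directional error
    arms = math.sqrt(sum(d * d for d in diffs) / len(diffs))  # ARMS
    return bias, arms

# Hypothetical (SaO2, SpO2) pairs, in %; a negative bias means the
# oximeter tends to read higher than the arterial gold standard.
pairs = [(92.0, 95.0), (88.0, 91.0), (96.0, 96.0), (90.0, 93.0)]
bias, arms = bias_and_arms(pairs)  # bias = -2.25
```

A negative bias of this kind corresponds to occult hypoxemia risk: the oximeter overestimates true saturation.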
Subject(s)
Critical Illness; Oximetry; Skin Pigmentation; Humans; Oximetry/methods; Prospective Studies; Female; Male; Middle Aged; Aged; Cohort Studies; Adult; Blood Gas Analysis/methods; White People
ABSTRACT
This narrative review focuses on the role of clinical prediction models in supporting informed decision-making in critical care, emphasizing their 2 forms: traditional scores and artificial intelligence (AI)-based models. Acknowledging the potential for both types to embed biases, the authors underscore the importance of critical appraisal to increase our trust in models. The authors outline recommendations and critical care examples to manage risk of bias in AI models. The authors advocate for enhanced interdisciplinary training for clinicians, who are encouraged to explore various resources (books, journals, news Web sites, and social media) and events (Datathons) to deepen their understanding of risk of bias.
Subject(s)
Artificial Intelligence; Critical Care; Humans; Critical Care/standards; Bias; Clinical Decision-Making
ABSTRACT
RATIONALE AND OBJECTIVES: Radiology residents often receive limited feedback on preliminary reports issued during independent call. This study aimed to determine if Large Language Models (LLMs) can supplement traditional feedback by identifying missed diagnoses in radiology residents' preliminary reports. MATERIALS & METHODS: A randomly selected subset of 500 (250 train/250 validation) paired preliminary and final reports between 12/17/2022 and 5/22/2023 were extracted and de-identified from our institutional database. The prompts and report text were input into the GPT-4 language model via the GPT-4 API (gpt-4-0314 model version). Iterative prompt tuning was used on a subset of the training/validation sets to direct the model to identify important findings in the final report that were absent in preliminary reports. For testing, a subset of 10 reports with confirmed diagnostic errors were randomly selected. Fourteen residents with on-call experience assessed the LLM-generated discrepancies and completed a survey on their experience using a 5-point Likert scale. RESULTS: The model identified 24 unique missed diagnoses across 10 test reports with i% model prediction accuracy as rated by 14 residents. Five additional diagnoses were identified by users, resulting in a model sensitivity of 79.2 %. Post-evaluation surveys showed a mean satisfaction rating of 3.50 and perceived accuracy rating of 3.64 out of 5 for LLM-generated feedback. Most respondents (71.4 %) favored a combination of LLM-generated and traditional feedback. CONCLUSION: This pilot study on the use of LLM-generated feedback for radiology resident preliminary reports demonstrated notable accuracy in identifying missed diagnoses and was positively received, highlighting LLMs' potential role in supplementing conventional feedback methods.
ABSTRACT
Background: Although skin tone is hypothesized to be the root cause of pulse oximetry disparities, skin tone and its use for improving medical therapies have yet to be extensively studied. Previous studies used self-reported race as a proxy variable for skin tone. However, this approach cannot account for skin tone variability within race groups and risks confounding by other non-biological factors when modeling data. Therefore, to better evaluate health disparities associated with pulse oximetry, this study aimed to create a unique baseline dataset that includes skin tone and electronic health record (EHR) data. Methods: Patients admitted to Duke University Hospital were eligible if they had at least one pulse oximetry value recorded within 5 minutes before an arterial blood gas (ABG) value. We collected skin tone data at 16 different body locations using multiple devices, including administered visual scales, colorimetry, spectrophotometry, and photography via mobile phone cameras. All patients' data were linked in Duke's Protected Analytics Computational Environment (PACE), converted into a common data model, and then de-identified before publication in PhysioNet. Results: Skin tone data were collected from 128 patients. We assessed 167 features per skin location on each patient. We also collected over 2000 images from mobile phones measured in the same controlled environment. Skin tone data are linked with patients' EHR data, such as laboratory data, vital sign recordings, and demographic information. Conclusions: Measuring different aspects of skin tone for each of the sixteen body locations and linking them with patients' EHR data could assist in the development of a more equitable AI model to combat disparities in healthcare associated with skin tone. A common data model format enables easy data federation with similar data from other sources, facilitating multicenter research on skin tone in healthcare.
Description: A prospectively collected EHR-linked skin tone measurements database in a common data model with emphasis on pulse oximetry disparities.
ABSTRACT
Large Language Models (LLMs) are rapidly being adopted in healthcare, necessitating standardized reporting guidelines. We present TRIPOD-LLM, an extension of the TRIPOD+AI statement, addressing the unique challenges of LLMs in biomedical applications. TRIPOD-LLM provides a comprehensive checklist of 19 main items and 50 subitems, covering key aspects from title to discussion. The guidelines introduce a modular format accommodating various LLM research designs and tasks, with 14 main items and 32 subitems applicable across all categories. Developed through an expedited Delphi process and expert consensus, TRIPOD-LLM emphasizes transparency, human oversight, and task-specific performance reporting. We also introduce an interactive website ( https://tripod-llm.vercel.app/ ) facilitating easy guideline completion and PDF generation for submission. As a living document, TRIPOD-LLM will evolve with the field, aiming to enhance the quality, reproducibility, and clinical applicability of LLM research in healthcare through comprehensive reporting. COI: DSB: Editorial, unrelated to this work: Associate Editor of Radiation Oncology, HemOnc.org (no financial compensation); Research funding, unrelated to this work: American Association for Cancer Research; Advisory and consulting, unrelated to this work: MercurialAI. DDF: Editorial, unrelated to this work: Associate Editor of JAMIA, Editorial Board of Scientific Data, Nature; Funding, unrelated to this work: the intramural research program at the U.S. National Library of Medicine, National Institutes of Health. JWG: Editorial, unrelated to this work: Editorial Board of Radiology: Artificial Intelligence, British Journal of Radiology AI journal and NEJM AI. All other authors declare no conflicts of interest.
ABSTRACT
In the United States, there is a proposal to link hospital Medicare payments with health equity measures, signaling a need to precisely measure equity in healthcare delivery. Despite significant research demonstrating disparities in health care outcomes and access, there is a noticeable gap in tools available to assess health equity across various health conditions and treatments. The available tools often focus on a single area of patient care, such as medication delivery, but fail to examine the entire health care process. The objective of this study is to propose a process mining framework that provides a comprehensive view of health equity. Using event logs, which track all actions during patient care, this method allows us to examine disparities not only in single and multiple treatment steps but also in the broader strategy of treatment delivery. We applied this framework to the management of patients with sepsis in the Intensive Care Unit (ICU), focusing on sex and English language proficiency. We found no significant differences between the treatment of male and female patients. However, patients who do not speak English experienced a notable delay in starting treatment, even though their illness was just as severe and subsequent treatments were similar. This framework subsumes existing individual approaches to measuring health inequities and offers a comprehensive way to pinpoint and investigate healthcare disparities, providing a valuable tool for research and policy-making aimed at more equitable healthcare.
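The delay finding above rests on a simple event-log computation: time from admission to first treatment, averaged per patient group. A minimal sketch follows; this is not the authors' code, and the activity names and example log are hypothetical.

```python
from collections import defaultdict

def time_to_first_treatment(events, treatment="antibiotics"):
    """Mean delay from admission to first treatment, per patient group.

    events: iterable of (patient_id, group, timestamp_hours, activity).
    The log is sorted by timestamp so 'first' treatment is well-defined.
    """
    admit, first_tx, group_of = {}, {}, {}
    for pid, group, t, activity in sorted(events, key=lambda e: e[2]):
        group_of[pid] = group
        if activity == "admission":
            admit[pid] = t
        elif activity == treatment and pid not in first_tx:
            first_tx[pid] = t
    delays = defaultdict(list)
    for pid, t in first_tx.items():
        delays[group_of[pid]].append(t - admit[pid])
    return {g: sum(v) / len(v) for g, v in delays.items()}

# Hypothetical event log (hours since an arbitrary origin)
log = [
    ("A", "English", 0.0, "admission"),
    ("A", "English", 2.0, "antibiotics"),
    ("B", "Non-English", 1.0, "admission"),
    ("B", "Non-English", 6.0, "antibiotics"),
]
mean_delays = time_to_first_treatment(log)
```

A full process-mining analysis would also compare the ordering and frequency of activities, but per-step timing differences of this kind are the building block.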
ABSTRACT
OBJECTIVES: To develop and validate machine learning (ML) models to predict high-flow nasal cannula (HFNC) failure in COVID-19, compare their performance to the respiratory rate-oxygenation (ROX) index, and evaluate model accuracy by self-reported race. DESIGN: Retrospective cohort study. SETTING: Four Emory University Hospitals in Atlanta, GA. PATIENTS: Adult patients hospitalized with COVID-19 between March 2020 and April 2022 who received HFNC therapy within 24 hours of ICU admission were included. INTERVENTIONS: None. MEASUREMENTS AND MAIN RESULTS: Four types of supervised ML models were developed for predicting HFNC failure (defined as intubation or death within 7 d of HFNC initiation), using routine clinical variables from the first 24 hours of ICU admission. Models were trained on the first 60% (n = 594) of admissions and validated on the latter 40% (n = 390) of admissions to simulate prospective implementation. Among 984 patients included, 317 patients (32.2%) developed HFNC failure. The eXtreme Gradient Boosting (XGB) model had the highest area under the receiver operating characteristic curve (AUROC) for predicting HFNC failure (0.707) and was the only model with significantly better performance than the ROX index (AUROC 0.616). The XGB model had significantly worse performance in Black patients compared with White patients (AUROC 0.663 vs. 0.808, p = 0.02). Racial differences in the XGB model were reduced and no longer statistically significant when restricted to patients with nonmissing arterial blood gas data, and when the XGB model was developed to predict mortality (rather than the composite outcome of failure, which could be influenced by biased clinical decisions for intubation). CONCLUSIONS: Our XGB model had better discrimination for predicting HFNC failure in COVID-19 than the ROX index, but had racial differences in accuracy of predictions.
Further studies are needed to understand and mitigate potential sources of biases in clinical ML models and to improve their equitability.
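The subgroup comparison above relies on computing AUROC separately within each self-reported race group. A minimal pure-Python sketch of that check, using the rank-based definition of AUROC and hypothetical data rather than the study's:

```python
def auroc(scores, labels):
    """Rank-based AUROC: P(score of a positive > score of a negative),
    counting ties as 0.5. Assumes both classes are present."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subgroup_auroc(scores, labels, groups):
    """AUROC computed separately within each subgroup, e.g. by
    self-reported race, to surface performance gaps."""
    out = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        out[g] = auroc([scores[i] for i in idx], [labels[i] for i in idx])
    return out

overall = auroc([0.9, 0.2, 0.8, 0.3], [1, 0, 0, 1])  # 0.75
by_group = subgroup_auroc(
    [0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0], ["A", "A", "B", "B"]
)
```

In practice one would also bootstrap confidence intervals per group before declaring a gap significant, as the study did with its p = 0.02 comparison.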
Subject(s)
COVID-19; Cannula; Humans; COVID-19/therapy; COVID-19/ethnology; Male; Retrospective Studies; Female; Middle Aged; Aged; Oxygen Inhalation Therapy/methods; Treatment Failure; Machine Learning; SARS-CoV-2; Intensive Care Units; Noninvasive Ventilation/methods
ABSTRACT
Increasing evidence supports reduced accuracy of noninvasive assessment tools, such as pulse oximetry, temperature probes, and AI skin diagnosis benchmarks, in patients with darker skin tones. The FDA is exploring potential strategies for device regulation to improve performance across diverse skin tones by including skin tone criteria. However, there is no consensus about how prospective studies should perform skin tone assessment in order to take this bias into account. There are several tools available to conduct skin tone assessments, including administered visual scales (e.g., Fitzpatrick Skin Type, Pantone, Monk Skin Tone) and color measurement tools (e.g., reflectance colorimeters, reflectance spectrophotometers, cameras), although none are consistently used or validated across multiple medical domains. Accurate and consistent skin tone measurement depends on many factors, including standardized environments, lighting, body parts assessed, patient conditions, and choice of skin tone assessment tool(s). As race and ethnicity are inadequate proxies for skin tone, these considerations can help standardize how skin tone is accounted for in studies such as AI dermatology diagnosis, pulse oximetry, and temporal thermometry. Skin tone bias in medical devices is likely due to systemic factors that lead to inadequate validation across diverse skin tones. There is an opportunity for researchers to use skin tone assessment methods with standardized considerations in prospective studies of noninvasive tools that may be affected by skin tone. We propose considerations that researchers should adopt to improve device robustness against skin tone bias.
ABSTRACT
In the following narrative review, we discuss the potential role of large language models (LLMs) in medical device innovation, with specific examples using generative pretrained transformer-4 (GPT-4). Throughout the biodesign process, LLMs can offer prompt-driven insights, aiding problem identification, knowledge assimilation, and decision-making. Intellectual property analysis, regulatory assessment, and market analysis emerge as key LLM applications. Through case examples, we underscore LLMs' transformative ability to democratise access to information and expertise, facilitating inclusive innovation in medical devices, as well as their effectiveness in providing real-time, individualised feedback to innovators of all experience levels. By mitigating entry barriers, LLMs accelerate transformative advancements, fostering collaboration among established and emerging stakeholders.
Subject(s)
Intellectual Property; Humans; Equipment and Supplies; Inventions
ABSTRACT
De-identification of medical images intended for research is a core requirement for data sharing initiatives, particularly as the demand for data for artificial intelligence (AI) applications grows. The Center for Biomedical Informatics and Information Technology (CBIIT) of the United States National Cancer Institute (NCI) convened a two half-day virtual workshop with the intent of summarizing the state of the art in de-identification technology and processes and exploring interesting aspects of the subject. This paper summarizes the highlights of the second day of the workshop, the recordings and presentations of which are publicly available for review. The topics covered included pathology whole slide image de-identification, de-facing, the role of AI in image de-identification, and the NCI Medical Image De-Identification Initiative (MIDI) datasets and pipeline.
ABSTRACT
As artificial intelligence (AI) rapidly approaches human-level performance in medical imaging, it is crucial that it does not exacerbate or propagate healthcare disparities. Previous research established AI's capacity to infer demographic data from chest X-rays, leading to a key concern: do models using demographic shortcuts make unfair predictions across subpopulations? In this study, we conducted a thorough investigation into the extent to which medical AI uses demographic encodings, focusing on potential fairness discrepancies within both in-distribution training sets and external test sets. Our analysis covers three key medical imaging disciplines, radiology, dermatology, and ophthalmology, and incorporates data from six global chest X-ray datasets. We confirm that medical imaging AI leverages demographic shortcuts in disease classification. Although algorithmically correcting shortcuts effectively addresses fairness gaps, creating 'locally optimal' models within the original data distribution, this optimality does not hold in new test settings. Surprisingly, we found that models with less encoding of demographic attributes are often most 'globally optimal', exhibiting better fairness during model evaluation in new test environments. Our work establishes best practices for medical imaging models that maintain their performance and fairness in deployments beyond their initial training contexts, underscoring critical considerations for AI clinical deployments across populations and sites.
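A "fairness gap" of the kind discussed above is commonly operationalized as the largest between-group difference in an error rate such as the true-positive rate. A minimal sketch with hypothetical thresholded predictions (not the paper's metric code):

```python
def tpr(preds, labels):
    """True-positive rate of already-thresholded predictions; assumes
    the group contains at least one positive label."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return tp / (tp + fn)

def tpr_gap(preds, labels, groups):
    """Largest pairwise TPR difference across demographic groups -
    one common fairness-gap definition."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = tpr([preds[i] for i in idx], [labels[i] for i in idx])
    return max(rates.values()) - min(rates.values()), rates

# Hypothetical data: all five patients are true positives
gap, rates = tpr_gap(
    [1, 1, 0, 1, 0], [1, 1, 1, 1, 1], ["A", "A", "A", "B", "B"]
)
```

The paper's point is that minimizing such a gap on the training distribution does not guarantee a small gap on external test sets, so the metric should be recomputed in every deployment environment.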
Subject(s)
Artificial Intelligence; Diagnostic Imaging; Humans; Diagnostic Imaging/methods; Algorithms; Male; Radiography, Thoracic; Female; Dermatology
ABSTRACT
The potential of artificial intelligence (AI) in medicine lies in its ability to enhance clinicians' capacity to analyse medical images, thereby improving diagnostic precision and accuracy and thus enhancing current tests. However, the integration of AI within health care is fraught with difficulties. Heterogeneity among health care system applications, reliance on proprietary closed-source software, and rising cybersecurity threats pose significant challenges. Moreover, before their deployment in clinical settings, AI models must demonstrate their effectiveness across a wide range of scenarios and must be validated by prospective studies, but doing so requires testing in an environment mirroring the clinical workflow, which is difficult to achieve without dedicated software. Finally, the use of AI techniques in health care raises significant legal and ethical issues, such as the protection of patient privacy, the prevention of bias, and the monitoring of the device's safety and effectiveness for regulatory compliance. This review describes challenges to AI integration in health care and provides guidelines on how to move forward. We describe an open-source solution that we developed, called PACS-AI, which integrates AI models into the Picture Archiving and Communication System (PACS). This approach aims to expand the evaluation of AI models by facilitating their integration and validation with existing medical imaging databases. PACS-AI may overcome many current barriers to AI deployment and offer a pathway toward responsible, fair, and effective deployment of AI models in health care. In addition, we propose a list of criteria and guidelines that AI researchers should adopt when publishing a medical AI model to enhance standardisation and reproducibility.
Subject(s)
Artificial Intelligence; Radiology Information Systems; Humans; Artificial Intelligence/ethics; Point-of-Care Systems
ABSTRACT
BACKGROUND: Chest X-rays (CXR) are essential for diagnosing a variety of conditions, but model generalizability issues limit their efficacy when models are applied to new populations. Generative AI, particularly denoising diffusion probabilistic models (DDPMs), offers a promising approach to generating synthetic images, enhancing dataset diversity. This study investigates the impact of synthetic data supplementation on the performance and generalizability of models in medical imaging research. METHODS: The study employed DDPMs to create synthetic CXRs conditioned on demographic and pathological characteristics from the CheXpert dataset. These synthetic images were used to supplement training datasets for pathology classifiers, with the aim of improving their performance. The evaluation involved three datasets (CheXpert, MIMIC-CXR, and Emory Chest X-ray) and various experiments, including supplementing real data with synthetic data, training with purely synthetic data, and mixing synthetic data with external datasets. Performance was assessed using the area under the receiver operating characteristic curve (AUROC). FINDINGS: Adding synthetic data to real datasets resulted in a notable increase in AUROC values (up to 0.02 in internal and external test sets with 1000% supplementation, p-value <0.01 in all instances). When classifiers were trained exclusively on synthetic data, they achieved performance levels comparable to those trained on real data with 200%-300% data supplementation. The combination of real and synthetic data from different sources demonstrated enhanced model generalizability, increasing model AUROC from 0.76 to 0.80 on the internal test set (p-value <0.01). INTERPRETATION: Synthetic data supplementation significantly improves the performance and generalizability of pathology classifiers in medical imaging. FUNDING: Dr. Gichoya is a 2022 Robert Wood Johnson Foundation Harold Amos Medical Faculty Development Program awardee and declares support from RSNA Health Disparities grant (#EIHD2204), Lacuna Fund (#67), Gordon and Betty Moore Foundation, NIH (NIBIB) MIDRC grant under contracts 75N92020C00008 and 75N92020C00021, and NHLBI Award Number R01HL167811.
Subject(s)
Diagnostic Imaging; ROC Curve; Humans; Diagnostic Imaging/methods; Algorithms; Radiography, Thoracic/methods; Image Processing, Computer-Assisted/methods; Databases, Factual; Area Under Curve; Models, Statistical
ABSTRACT
Pulse oximeters measure peripheral arterial oxygen saturation (SpO2) noninvasively, while the gold standard (SaO2) involves arterial blood gas measurement. There are known racial and ethnic disparities in their performance. BOLD is a dataset that aims to underscore the importance of addressing biases in pulse oximetry accuracy, which disproportionately affect darker-skinned patients. The dataset was created by harmonizing three Electronic Health Record databases (MIMIC-III, MIMIC-IV, and eICU-CRD) comprising Intensive Care Unit stays of US patients. Paired SpO2 and SaO2 measurements were time-aligned and combined with various other sociodemographic characteristics and clinical parameters to provide a detailed representation of each patient. BOLD includes 49,099 paired measurements within a 5-minute window and with oxygen saturation levels between 70% and 100%. Minority racial and ethnic groups account for ~25% of the data - a proportion seldom achieved in previous studies. The codebase is publicly available. Given the prevalent use of pulse oximeters in the hospital and at home, we hope that BOLD will be leveraged to develop debiasing algorithms that can result in more equitable healthcare solutions.
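The time-alignment step described above, pairing each arterial measurement with a preceding SpO2 reading within a 5-minute window and restricting saturations to 70-100%, can be sketched as follows. This is an illustrative reimplementation, not the BOLD codebase.

```python
def pair_measurements(spo2, sao2, window_min=5, lo=70, hi=100):
    """Pair each arterial SaO2 value with the latest SpO2 reading no
    more than `window_min` minutes earlier, keeping only saturations
    within [lo, hi]. Inputs are non-empty lists of
    (timestamp_minutes, value), each sorted by timestamp."""
    pairs, j = [], 0
    for t_a, v_a in sao2:
        # advance j to the last SpO2 reading at or before t_a
        while j + 1 < len(spo2) and spo2[j + 1][0] <= t_a:
            j += 1
        t_s, v_s = spo2[j]
        if 0 <= t_a - t_s <= window_min and lo <= v_s <= hi and lo <= v_a <= hi:
            pairs.append((v_s, v_a))
    return pairs

# Hypothetical readings: (minutes since admission, saturation %)
spo2 = [(0, 96), (10, 92)]
sao2 = [(3, 94), (40, 90)]
paired = pair_measurements(spo2, sao2)
```

The second arterial value (at 40 minutes) is dropped because no SpO2 reading falls within its 5-minute window; the single surviving pair is (96, 94).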
Subject(s)
Blood Gas Analysis; Oximetry; Humans; Oxygen Saturation; Intensive Care Units; Ethnicity; Oxygen/blood
ABSTRACT
Large language models (LLMs) hold immense potential to revolutionize radiology. However, their integration into practice requires careful consideration. Artificial intelligence (AI) chatbots and general-purpose LLMs have potential pitfalls related to privacy, transparency, and accuracy, limiting their current clinical readiness. Thus, LLM-based tools must be optimized for radiology practice to overcome these limitations. Although research and validation for radiology applications remain in their infancy, commercial products incorporating LLMs are becoming available alongside promises of transforming practice. To help radiologists navigate this landscape, this AJR Expert Panel Narrative Review provides a multidimensional perspective on LLMs, encompassing considerations from bench (development and optimization) to bedside (use in practice). At present, LLMs are not autonomous entities that can replace expert decision-making, and radiologists remain responsible for the content of their reports. Patient-facing tools, particularly medical AI chatbots, require additional guardrails to ensure safety and prevent misuse. Still, if responsibly implemented, LLMs are well-positioned to transform efficiency and quality in radiology. Radiologists must be well-informed and proactively involved in guiding the implementation of LLMs in practice to mitigate risks and maximize benefits to patient care.