ABSTRACT
Significant progress has been made in augmenting clinical decision-making using artificial intelligence (AI) in the context of secondary and tertiary care at large academic medical centers. For such innovations to have an impact across the spectrum of care, additional challenges must be addressed, including inconsistent use of preventative care and gaps in chronic care management. The integration of additional data, including genomics and data from wearables, could prove critical in addressing these gaps, but technical, legal, and ethical challenges arise. On the technical side, approaches for integrating complex and messy data are needed. Data and design imperfections like selection bias, missing data, and confounding must be addressed. In terms of legal and ethical challenges, while AI has the potential to aid in leveraging patient data to make clinical care decisions, we also risk exacerbating existing disparities. Organizations implementing AI solutions must carefully consider how they can improve care for all and reduce inequities.
Subjects
Artificial Intelligence, Precision Medicine, Humans, Clinical Decision-Making, Genomics/methods
ABSTRACT
Importance: Artificial intelligence (AI) could support clinicians when diagnosing hospitalized patients; however, systematic bias in AI models could worsen clinician diagnostic accuracy. Recent regulatory guidance has called for AI models to include explanations to mitigate errors made by models, but the effectiveness of this strategy has not been established. Objectives: To evaluate the impact of systematically biased AI on clinician diagnostic accuracy and to determine if image-based AI model explanations can mitigate model errors. Design, Setting, and Participants: Randomized clinical vignette survey study administered between April 2022 and January 2023 across 13 US states involving hospitalist physicians, nurse practitioners, and physician assistants. Interventions: Clinicians were shown 9 clinical vignettes of patients hospitalized with acute respiratory failure, including their presenting symptoms, physical examination, laboratory results, and chest radiographs. Clinicians were then asked to determine the likelihood of pneumonia, heart failure, or chronic obstructive pulmonary disease as the underlying cause(s) of each patient's acute respiratory failure. To establish baseline diagnostic accuracy, clinicians were shown 2 vignettes without AI model input. Clinicians were then randomized to see 6 vignettes with AI model input with or without AI model explanations. Among these 6 vignettes, 3 vignettes included standard-model predictions, and 3 vignettes included systematically biased model predictions. Main Outcomes and Measures: Clinician diagnostic accuracy for pneumonia, heart failure, and chronic obstructive pulmonary disease. Results: Median participant age was 34 years (IQR, 31-39) and 241 (57.7%) were female. Four hundred fifty-seven clinicians were randomized and completed at least 1 vignette, with 231 randomized to AI model predictions without explanations, and 226 randomized to AI model predictions with explanations. Clinicians' baseline diagnostic accuracy was 73.0% (95% CI, 68.3% to 77.8%) for the 3 diagnoses. When shown a standard AI model without explanations, clinician accuracy increased over baseline by 2.9 percentage points (95% CI, 0.5 to 5.2) and by 4.4 percentage points (95% CI, 2.0 to 6.9) when clinicians were also shown AI model explanations. Systematically biased AI model predictions decreased clinician accuracy by 11.3 percentage points (95% CI, 7.2 to 15.5) compared with baseline and providing biased AI model predictions with explanations decreased clinician accuracy by 9.1 percentage points (95% CI, 4.9 to 13.2) compared with baseline, representing a nonsignificant improvement of 2.3 percentage points (95% CI, -2.7 to 7.2) compared with the systematically biased AI model. Conclusions and Relevance: Although standard AI models improve diagnostic accuracy, systematically biased AI models reduced diagnostic accuracy, and commonly used image-based AI model explanations did not mitigate this harmful effect. Trial Registration: ClinicalTrials.gov Identifier: NCT06098950.
Subjects
Artificial Intelligence, Clinical Competence, Respiratory Insufficiency, Adult, Female, Humans, Male, Heart Failure/complications, Heart Failure/diagnosis, Pneumonia/complications, Pneumonia/diagnosis, Chronic Obstructive Pulmonary Disease/complications, Chronic Obstructive Pulmonary Disease/diagnosis, Respiratory Insufficiency/diagnosis, Respiratory Insufficiency/etiology, Diagnosis, Reproducibility of Results, Bias, Acute Disease, Hospitalists, Nurse Practitioners, Physician Assistants, United States
ABSTRACT
BACKGROUND: Vestibular deficits can impair an individual's ability to maintain postural and/or gaze stability. Characterizing gait abnormalities among individuals affected by vestibular deficits could help identify patients at high risk of falling and inform rehabilitation programs. Commonly used gait assessment tools rely on simple measures such as timing and visual observations of path deviations by clinicians. These simple measures may not capture subtle changes in gait kinematics. Therefore, we investigated the use of wearable inertial measurement units (IMUs) and machine learning (ML) approaches to automatically discriminate between gait patterns of individuals with vestibular deficits and age-matched controls. The goal of this study was to examine the effects of IMU placement and gait task selection on the performance of automatic vestibular gait classifiers. METHODS: Thirty participants (15 with vestibular deficits and 15 age-matched controls) completed a single-session gait study during which they performed seven gait tasks while donning a full-body set of IMUs. Classification performance was reported in terms of area under the receiver operating characteristic curve (AUROC) scores for Random Forest models trained on data from each IMU placement for each gait task. RESULTS: Several models were able to classify vestibular gait better than random (AUROC > 0.5), but their performance varied according to IMU placement and gait task selection. Results indicated that a single IMU placed on the left arm when walking with eyes closed resulted in the highest AUROC score for a single IMU (AUROC = 0.88 [0.84, 0.89]). Feature permutation results indicated that participants with vestibular deficits reduced their arm swing compared to age-matched controls while they walked with eyes closed. CONCLUSIONS: These findings highlighted differences in upper extremity kinematics during walking with eyes closed that were characteristic of vestibular deficits and showed evidence of the discriminative ability of IMU-based automated screening for vestibular deficits. Further research should explore the mechanisms driving arm swing differences in the vestibular population.
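To make the modeling setup concrete, here is a minimal sketch of the per-placement classification scheme described above: one Random Forest per IMU placement, scored by cross-validated AUROC. The placement names, feature dimensionality, and synthetic data are illustrative assumptions, not the study's actual sensors or gait features.

```python
# Hedged sketch: per-IMU-placement Random Forest classifiers scored by AUROC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_participants, n_features = 30, 20
placements = ["left_arm", "right_arm", "pelvis", "left_ankle"]  # assumed names
y = np.array([0] * 15 + [1] * 15)  # 0 = control, 1 = vestibular deficit

results = {}
for placement in placements:
    X = rng.normal(size=(n_participants, n_features))  # stand-in for gait features
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    probs = cross_val_predict(
        RandomForestClassifier(n_estimators=200, random_state=0),
        X, y, cv=cv, method="predict_proba")[:, 1]
    results[placement] = roc_auc_score(y, probs)

for placement, auroc in results.items():
    print(f"{placement}: AUROC = {auroc:.2f}")
```

Repeating this loop over the seven gait tasks would yield the placement-by-task comparison the abstract refers to.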
Subjects
Gait, Vestibule of the Labyrinth, Humans, Walking, Machine Learning, Accidental Falls/prevention & control
ABSTRACT
INTRODUCTION: Studies investigating the relationship between blood pressure (BP) measurements from electronic health records (EHRs) and Alzheimer's disease (AD) rely on summary statistics, like BP variability, and have only been validated at a single institution. We hypothesize that leveraging BP trajectories can accurately estimate AD risk across different populations. METHODS: In a retrospective cohort study, EHR data from Veterans Affairs (VA) patients were used to train and internally validate a machine learning model to predict AD onset within 5 years. External validation was conducted on patients from Michigan Medicine (MM). RESULTS: The VA and MM cohorts included 6860 and 1201 patients, respectively. Model performance using BP trajectories was modest but comparable (area under the receiver operating characteristic curve [AUROC] = 0.64 [95% confidence interval (CI) = 0.54-0.73] for VA vs. AUROC = 0.66 [95% CI = 0.55-0.76] for MM). CONCLUSION: Approaches that directly leverage BP trajectories from EHR data could aid in AD risk stratification across institutions.
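As an illustration of what directly leveraging BP trajectories (rather than summary statistics) might look like, the sketch below resamples each patient's irregular systolic BP series onto a fixed-length grid and feeds it to a classifier. The interpolation scheme, trajectory length, and synthetic cohort are assumptions for illustration, not the study's model.

```python
# Hedged sketch: classifying 5-year AD onset from resampled BP trajectories.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def resample_trajectory(values, n_points=20):
    """Linearly interpolate an irregular BP series onto a fixed-length grid."""
    t = np.linspace(0, 1, len(values))
    return np.interp(np.linspace(0, 1, n_points), t, values)

# Synthetic cohort: each patient has an irregular systolic BP series and a label.
X = np.stack([resample_trajectory(rng.normal(130, 15, size=rng.integers(5, 40)))
              for _ in range(1000)])
y = rng.integers(0, 2, size=1000)  # 1 = AD onset within 5 years (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```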
Subjects
Alzheimer Disease, Electronic Health Records, Humans, Retrospective Studies, Alzheimer Disease/diagnosis, Blood Pressure, Risk Assessment
ABSTRACT
BACKGROUND: Recently, machine learning techniques have been applied to data collected from inertial measurement units to automatically assess balance, but these approaches rely on hand-engineered features. We explore the utility of machine learning to automatically extract important features from inertial measurement unit data for balance assessment. FINDINGS: Ten participants with balance concerns performed multiple balance exercises in a laboratory setting while wearing an inertial measurement unit on their lower back. Physical therapists watched video recordings of participants performing the exercises and rated balance on a 5-point scale. We trained machine learning models using different representations of the unprocessed inertial measurement unit data to estimate physical therapist ratings. On a held-out test set, we compared these learned models to one another, to participants' self-assessments of balance, and to models trained using hand-engineered features. Utilizing the unprocessed kinematic data from the inertial measurement unit provided significant improvements over both self-assessments and models using hand-engineered features (AUROC of 0.806 vs. 0.768 and 0.665). CONCLUSIONS: Unprocessed data from an inertial measurement unit used as input to a machine learning model produced accurate estimates of balance performance. The ability to learn from unprocessed data presents a potentially generalizable approach for assessing balance without the need for labor-intensive feature engineering, while maintaining comparable model performance.
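A minimal sketch of the comparison described above: the same classifier trained once on hand-engineered summary features and once on the unprocessed kinematic windows. The window length, the choice of summary statistics, the binarized balance label, and the synthetic signals are all illustrative assumptions.

```python
# Hedged sketch: unprocessed kinematic windows vs. hand-engineered summaries.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_trials, window_len = 200, 250          # e.g., 5 s of 50 Hz lower-back acceleration
signals = rng.normal(size=(n_trials, window_len))
y = rng.integers(0, 2, n_trials)         # simplified: good vs. poor balance rating

X_raw = signals                                               # unprocessed input
X_eng = np.column_stack([signals.mean(axis=1),                # hand-engineered summaries
                         signals.std(axis=1),
                         np.ptp(signals, axis=1)])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, X in [("unprocessed", X_raw), ("hand-engineered", X_eng)]:
    auroc = cross_val_score(LogisticRegression(max_iter=2000), X, y,
                            cv=cv, scoring="roc_auc").mean()
    print(f"{name}: AUROC = {auroc:.2f}")
```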
Subjects
Wearable Electronic Devices, Biomechanical Phenomena, Exercise, Exercise Therapy, Humans, Machine Learning
ABSTRACT
Dehydration beyond 2% bodyweight loss should be monitored to reduce the risk of heat-related injuries during exercise. However, assessments of hydration in athletic settings can be limited in their accuracy and accessibility. In this study, we sought to develop a data-driven noninvasive approach to measure hydration status, leveraging wearable sensors and normal orthostatic movements. Twenty participants (10 males, 25.0 ± 6.6 years; 10 females, 27.8 ± 4.3 years) completed two exercise sessions in a heated environment: one session was completed without fluid replacement. Before and after exercise, participants performed 12 postural movements that varied in length (up to 2 min). Logistic regression models were trained to estimate dehydration status given participants' heart rate responses to these postural movements. The area under the receiver operating characteristic curve (AUROC) was used to characterize the models' discriminative ability. Models achieved an AUROC of 0.79 (IQR: 0.75, 0.91) when discriminating 2% bodyweight loss. AUROCs for the longer supine-to-stand postural movements and the shorter toe-touches were similar (0.89, IQR: 0.89, 1.00). Shorter orthostatic tests achieved accuracy similar to that of clinical tests. The findings suggest that data from wearable sensors can be used to accurately estimate mild dehydration in athletes. In practice, this method may provide an additional measurement for early intervention of severe dehydration.
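A minimal sketch of the classification setup: a logistic regression that maps heart-rate responses to a postural movement onto a dehydrated/euhydrated label, evaluated by cross-validated AUROC. The specific heart-rate features and the synthetic data are assumptions for illustration.

```python
# Hedged sketch: logistic regression discriminating ~2% bodyweight loss from
# heart-rate responses to an orthostatic (postural) movement.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_trials = 80  # pre/post-exercise postural movements pooled across participants

# Assumed features summarizing the heart-rate response to a supine-to-stand test.
X = np.column_stack([
    rng.normal(70, 8, n_trials),      # resting heart rate (bpm)
    rng.normal(95, 10, n_trials),     # peak heart rate after standing (bpm)
    rng.normal(-0.5, 0.2, n_trials),  # recovery slope (bpm/s)
])
y = rng.integers(0, 2, n_trials)      # 1 = >=2% bodyweight loss, 0 = euhydrated

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aurocs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUROC: {aurocs.mean():.2f} (+/- {aurocs.std():.2f})")
```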
Subjects
Sports, Wearable Electronic Devices, Athletes, Dehydration, Exercise, Female, Humans, Male
ABSTRACT
Loss-of-balance (LOB) events, such as trips and slips, are frequent among community-dwelling older adults and are an indicator of increased fall risk. In a preliminary study, eight community-dwelling older adults with a history of falls were asked to perform everyday tasks in the real world while donning a set of three inertial measurement units (IMUs) and to report LOB events via a voice-recording device. Over 290 h of real-world kinematic data were collected and used to build and evaluate classification models to detect the occurrence of LOB events. Spatiotemporal gait metrics were calculated, and time stamps for when LOB events occurred were identified. Using these data and machine learning approaches, we built classifiers to detect LOB events. Through a leave-one-participant-out validation scheme, performance was assessed in terms of the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR). The best model achieved an AUROC ≥0.87 for every held-out participant and an AUPR 4-20 times the incidence rate of LOB events. Such models could be used to filter large datasets prior to manual classification by a trained healthcare provider. In this context, the models filtered out at least 65.7% of the data while detecting ≥87.0% of events on average. Based on the demonstrated discriminative ability to separate LOBs from normal walking segments, such models could be applied retrospectively to track the occurrence of LOBs over an extended period of time.
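The leave-one-participant-out evaluation can be sketched as follows, with AUROC and AUPR computed on each held-out participant. The window-level gait features, the event rate, and the synthetic data are illustrative assumptions.

```python
# Hedged sketch: leave-one-participant-out evaluation of an LOB-event classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_windows, n_features = 2000, 12
X = rng.normal(size=(n_windows, n_features))      # per-window spatiotemporal gait features
y = (rng.random(n_windows) < 0.02).astype(int)    # rare loss-of-balance events
groups = rng.integers(0, 8, n_windows)            # 8 participants

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    if y[test_idx].sum() == 0:                    # skip held-out folds with no events
        continue
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    p = clf.predict_proba(X[test_idx])[:, 1]
    print(f"participant {groups[test_idx][0]}: "
          f"AUROC={roc_auc_score(y[test_idx], p):.2f}, "
          f"AUPR={average_precision_score(y[test_idx], p):.2f}")
```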
Subjects
Accidental Falls, Wearable Electronic Devices, Accidental Falls/prevention & control, Aged, Gait, Humans, Retrospective Studies, Walking
ABSTRACT
The increasing availability of electronic health data presents a major opportunity for both discovery and practical applications to improve healthcare. However, for healthcare epidemiologists to best use these data, computational techniques that can handle large, complex datasets are required. Machine learning (ML), the study of tools and methods for identifying patterns in data, can help. The appropriate application of ML to these data promises to transform patient risk stratification broadly in the field of medicine and especially in infectious diseases. This, in turn, could lead to targeted interventions that reduce the spread of healthcare-associated pathogens. In this review, we begin with an introduction to the basics of ML. We then discuss how ML can transform healthcare epidemiology, providing examples of successful applications. Finally, we present special considerations for those healthcare epidemiologists who want to use and apply ML.
Subjects
Communicable Diseases/epidemiology, Epidemiologic Methods, Machine Learning, Electronic Data Processing, Humans
ABSTRACT
Type 1 diabetes (T1D) presents a significant health challenge, requiring patients to actively manage their blood glucose (BG) levels through regular bolus insulin administration. Automated control solutions based on machine learning (ML) models could reduce the need for manual patient intervention. However, the accuracy of current models falls short of what is needed. This is due in part to the fact that these models are often trained on data collected using a basal bolus (BB) strategy, which results in substantial entanglement between bolus insulin and carbohydrate intake. Under standard training approaches, this entanglement can lead to inaccurate forecasts in a control setting, ultimately resulting in poor BG management. To address this, we propose a novel algorithm for training BG forecasters that disentangles the effects of insulin and carbohydrates. By exploiting correction bolus values and leveraging the monotonic effect of insulin on BG, our method accurately captures the independent effects of insulin and carbohydrates on BG. Using an FDA-approved simulator, we evaluated our approach on 10 individuals across 30 days of data. Our approach achieved a higher average time in range than standard approaches (81.1% [95% confidence interval (CI) 80.3, 81.9] vs 53.6% [95% CI 52.7, 54.6], p<0.001), indicating that our approach reliably maintains healthy BG levels in simulated individuals, while baseline approaches do not. Utilizing proxy metrics, our approach also demonstrates potential for improved control on three real-world datasets, paving the way for advancements in ML-based BG management.
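Time in range, the headline metric above, is simply the fraction of glucose readings falling between 70 and 180 mg/dL. A minimal sketch follows; the simulated trace is an illustrative assumption, not simulator output.

```python
# Hedged sketch: percent time-in-range (70-180 mg/dL) for a glucose trace.
import numpy as np

def time_in_range(bg_mg_dl, low=70.0, high=180.0):
    """Fraction of glucose readings inside the target range, as a percentage."""
    bg = np.asarray(bg_mg_dl, dtype=float)
    return 100.0 * np.mean((bg >= low) & (bg <= high))

rng = np.random.default_rng(0)
# Synthetic 30-day trace at 5-minute samples (8640 readings).
bg_trace = 140 + 40 * np.sin(np.linspace(0, 20, 8640)) + rng.normal(0, 15, 8640)
print(f"Time in range: {time_in_range(bg_trace):.1f}%")
```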
Subjects
Blood Glucose, Type 1 Diabetes Mellitus, Insulin, Machine Learning, Humans, Blood Glucose/analysis, Type 1 Diabetes Mellitus/drug therapy, Type 1 Diabetes Mellitus/blood, Insulin/therapeutic use, Insulin/administration & dosage, Algorithms, Biological Models, Male
ABSTRACT
OBJECTIVE: Recent years have seen an increase in machine learning (ML)-based blood glucose (BG) forecasting models, with a growing emphasis on potential application to hybrid or closed-loop predictive glucose controllers. However, current approaches focus on evaluating the accuracy of these models using benchmark data generated under the behavior policy, which may differ significantly from the data the model may encounter in a control setting. This study challenges the efficacy of such evaluation approaches, demonstrating that they can fail to accurately capture an ML-based model's true performance in closed-loop control settings. METHODS: Forecast error measured using current evaluation approaches was compared to the control performance of two forecasters, a machine learning-based model (LSTM) and a rule-based model (Loop), evaluated in silico when the forecasters were utilized with a model-based controller in a hybrid closed-loop setting. RESULTS: Under current evaluation standards, LSTM achieves a significantly lower (better) forecast error than Loop, with a root mean squared error (RMSE) of 11.57 ±0.05 mg/dL vs. 18.46 ±0.07 mg/dL at the 30-minute prediction horizon. Yet in a control setting, LSTM led to significantly worse control performance, with only 77.14% (IQR 66.57-84.03) time in range compared to 86.20% (IQR 78.28-91.21) for Loop. CONCLUSION: Prevailing evaluation methods can fail to accurately capture the forecaster's performance when utilized in closed-loop settings. SIGNIFICANCE: Our findings underscore the limitations of current evaluation standards and the need for alternative evaluation metrics and training strategies when developing BG forecasters for closed-loop control systems.
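The "current evaluation standard" criticized above amounts to computing forecast error, such as RMSE at a fixed horizon, on data logged under the behavior policy. A minimal sketch follows, assuming 5-minute sampling so a 30-minute horizon is six steps; the arrays are synthetic placeholders.

```python
# Hedged sketch: RMSE at a 30-minute horizon on behavior-policy data.
import numpy as np

def rmse_at_horizon(forecasts, observed, horizon_steps=6):
    """RMSE between predictions issued horizon_steps ahead and what actually occurred."""
    preds = np.asarray(forecasts)[:-horizon_steps]   # forecast made at time t for t+h
    truth = np.asarray(observed)[horizon_steps:]     # value actually observed at t+h
    return float(np.sqrt(np.mean((preds - truth) ** 2)))

rng = np.random.default_rng(0)
observed = 140 + np.cumsum(rng.normal(0, 2, 500))    # synthetic BG trace (mg/dL)
forecasts = observed + rng.normal(0, 12, 500)        # a forecaster's 30-min-ahead guesses
print(f"RMSE at 30 min: {rmse_at_horizon(forecasts, observed):.2f} mg/dL")
```

As the abstract argues, a low value of this metric does not guarantee good closed-loop control performance, which must instead be measured in simulation or deployment (e.g., via time in range).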
ABSTRACT
INTRODUCTION: Identifying mild cognitive impairment (MCI) patients at risk for dementia could facilitate early interventions. Using electronic health records (EHRs), we developed a model to predict MCI to all-cause dementia (ACD) conversion at 5 years. METHODS: A Cox proportional hazards model was used to identify predictors of ACD conversion from EHR data in veterans with MCI. Model performance (area under the receiver operating characteristic curve [AUC] and Brier score) was evaluated on a held-out data subset. RESULTS: Of 59,782 MCI patients, 15,420 (25.8%) converted to ACD. The model had good discriminative performance (AUC 0.73 [95% confidence interval (CI) 0.72-0.74]) and calibration (Brier score 0.18 [95% CI 0.17-0.18]). Age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors, while body mass index, alcohol abuse, and sleep apnea were protective factors. DISCUSSION: The EHR-based prediction model had good performance in identifying 5-year MCI to ACD conversion and has potential to assist triaging of at-risk patients. Highlights: Of 59,782 veterans with mild cognitive impairment (MCI), 15,420 (25.8%) converted to all-cause dementia within 5 years. Electronic health record prediction models demonstrated good performance (area under the receiver operating characteristic curve 0.73; Brier score 0.18). Age and vascular-related morbidities were predictors of dementia conversion. Synthetic data were comparable to real data in modeling MCI to dementia conversion. Key Points: An electronic health record-based model using demographic and comorbidity data had good performance in identifying veterans who convert from mild cognitive impairment (MCI) to all-cause dementia (ACD) within 5 years. Increased age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors for 5-year conversion from MCI to ACD. High body mass index, alcohol abuse, and sleep apnea were protective factors for 5-year conversion from MCI to ACD. Models using synthetic data, analogs of real patient data that retain the distribution, density, and covariance between variables of real patient data but are not attributable to any specific patient, performed just as well as models using real patient data. This could have significant implications in facilitating widely distributed computing of health-care data with minimized patient privacy concerns, which could accelerate scientific discoveries.
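A minimal sketch of the modeling step, using the lifelines implementation of a Cox proportional hazards model on a synthetic cohort. The predictor set, follow-up handling, and in-sample evaluation are illustrative assumptions; the study additionally reports a Brier score for calibration, which is omitted here.

```python
# Hedged sketch: Cox proportional hazards model for 5-year MCI-to-dementia conversion.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "age": rng.normal(74, 7, n),
    "stroke": rng.integers(0, 2, n),
    "hypertension": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
    "bmi": rng.normal(28, 5, n),
    "years_followup": rng.uniform(0.1, 5.0, n),  # time to conversion or censoring (capped at 5 y)
    "converted": rng.integers(0, 2, n),          # 1 = converted to all-cause dementia
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years_followup", event_col="converted")
print(f"Concordance index: {cph.concordance_index_:.2f}")

# Crude in-sample discrimination check using the partial hazard as a risk score.
covariates = df.drop(columns=["years_followup", "converted"])
risk_score = cph.predict_partial_hazard(covariates)
print(f"AUC: {roc_auc_score(df['converted'], risk_score):.2f}")
```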
ABSTRACT
Artificial intelligence (AI) has the potential to transform every facet of cardiovascular practice and research. The exponential rise in technology powered by AI is defining new frontiers in cardiovascular care, with innovations that span novel diagnostic modalities, new digital native biomarkers of disease, and high-performing tools for evaluating care quality and prognosticating clinical outcomes. These digital innovations promise expanded access to cardiovascular screening and monitoring, especially among those who have historically lacked access to high-quality, specialized care. Moreover, AI is propelling biological and clinical discoveries that will make future cardiovascular care more personalized, precise, and effective. This review brings together these diverse AI innovations, highlighting developments in multimodal cardiovascular AI across clinical practice and biomedical discovery, and envisioning this new future backed by contemporary science and emerging discoveries. Finally, we define the critical path and the safeguards essential to realizing this AI-enabled future that helps achieve optimal cardiovascular health and outcomes for all.
Subjects
Artificial Intelligence, Cardiovascular Diseases, Humans, Cardiovascular Diseases/therapy, Cardiovascular Diseases/diagnosis, Cardiology/methods, Cardiology/trends
ABSTRACT
Current methods of emergency-room-based syndromic surveillance were insufficient to detect early community spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in the United States, which slowed the infection prevention and control response to the novel pathogen. Emerging technologies and automated infection surveillance have the potential to improve upon current practice standards and to revolutionize the practice of infection detection, prevention, and control both inside and outside of healthcare settings. Genomics, natural language processing, and machine learning can be leveraged to improve the identification of transmission events and to aid and evaluate outbreak response. In the near future, automated infection detection strategies can be used to advance a true "Learning Healthcare System" that will support near-real-time quality improvement efforts and advance the scientific basis for the practice of infection control.
ABSTRACT
The rich and complex electronic health record presents promise for expanding infection detection beyond currently covered settings of care. Here, we review the "how to" of leveraging electronic data sources to expand surveillance to settings of care and infections that have not been the traditional purview of the National Healthcare Safety Network (NHSN), including a discussion of the creation of objective and reproducible infection surveillance definitions. In pursuit of a "fully automated" system, we also examine the promises and pitfalls of leveraging unstructured, free-text data to support infection prevention activities, as well as emerging technological advances that will likely affect the practice of automated infection surveillance. Finally, barriers to achieving a completely "automated" infection detection system and challenges with intra- and interfacility reliability and missing data are discussed.
ABSTRACT
The ability to use 16S rRNA gene sequence data to train machine learning classification models offers the opportunity to diagnose patients based on the composition of their microbiome. In some applications, the taxonomic resolution that provides the best models may require the use of de novo operational taxonomic units (OTUs) whose composition changes when new data are added. We previously developed a new reference-based approach, OptiFit, that fits new sequence data to existing de novo OTUs without changing the composition of the original OTUs. While OptiFit produces OTUs that are as high quality as de novo OTUs, it is unclear whether this method for fitting new sequence data into existing OTUs will impact the performance of classification models relative to models trained and tested only using de novo OTUs. We used OptiFit to cluster sequences into existing OTUs and evaluated model performance in classifying a dataset containing samples from patients with and without colonic screen relevant neoplasia (SRN). We compared the performance of this model to standard methods, including de novo and database-reference-based clustering. We found that models using OptiFit performed as well as or better than standard methods in classifying SRNs. OptiFit can streamline the process of classifying new samples by avoiding the need to retrain models using reclustered sequences. IMPORTANCE: There is great potential for using microbiome data to aid in diagnosis. A challenge with de novo operational taxonomic unit (OTU)-based classification models is that 16S rRNA gene sequences are often assigned to OTUs based on similarity to other sequences in the dataset. If data are generated from new patients, the old and new sequences must be reclustered to OTUs and the classification model retrained. Yet there is a desire to have a single, validated model that can be widely deployed. To overcome this obstacle, we applied the OptiFit clustering algorithm to fit new sequence data to existing OTUs, allowing for reuse of the model. A random forest model implemented using OptiFit performed as well as the traditional reassign-and-retrain approach. This result shows that it is possible to train and apply machine learning models based on OTU relative abundance data that do not require retraining or the use of a reference database.
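A minimal sketch of the downstream classifier: a random forest trained on OTU relative abundances. The OTU table and labels are synthetic placeholders; the point is that OptiFit keeps the OTU columns stable, so a model trained once can score newly fitted samples without reclustering or retraining.

```python
# Hedged sketch: random forest on OTU relative abundances for SRN classification.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_samples, n_otus = 300, 500
counts = rng.poisson(5, size=(n_samples, n_otus)).astype(float)
X = counts / counts.sum(axis=1, keepdims=True)   # convert counts to relative abundance
y = rng.integers(0, 2, n_samples)                # 1 = screen-relevant neoplasia (synthetic)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aurocs = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=0),
                         X, y, cv=cv, scoring="roc_auc")
print(f"Mean AUROC: {aurocs.mean():.2f}")

# New samples fitted to the existing OTUs with OptiFit share the same columns,
# so a model trained once can score them directly without retraining.
```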
Subjects
Metagenomics, Microbiota, Humans, DNA Sequence Analysis/methods, 16S Ribosomal RNA/genetics, Metagenomics/methods, Algorithms, Microbiota/genetics
ABSTRACT
OBJECTIVE: Screening individuals admitted to the hospital for Clostridioides difficile presents opportunities to limit transmission and hospital-onset C. difficile infection (HO-CDI). However, detection from rectal swabs is resource-intensive. In contrast, machine learning (ML) models may accurately assess patient risk without significant resource usage. In this study, we compared the effectiveness of swab surveillance to daily risk estimates produced by an ML model to identify patients who will likely develop HO-CDI in the intensive care unit (ICU) setting. DESIGN: A prospective cohort study was conducted in which patient carriage of toxigenic C. difficile was identified by rectal swabs analyzed by anaerobic culture and polymerase chain reaction (PCR). A previously validated ML model using electronic health record data generated daily risk of HO-CDI for every patient. Swab results and risk predictions were compared to the eventual HO-CDI status. PATIENTS: Adult inpatient admissions taking place in University of Michigan Hospitals' medical and surgical intensive care units and oncology wards between June 6th and October 8th, 2020. RESULTS: In total, 2,979 admissions, representing 2,044 patients, were observed over the course of the study period, with 39 admissions developing HO-CDIs. Swab surveillance identified 9 true-positive and 87 false-positive HO-CDIs. The ML model identified 9 true-positive and 226 false-positive HO-CDIs; 8 of the true-positives identified by the model differed from those identified by swab surveillance. CONCLUSION: With limited resources, an ML model identified the same number of HO-CDI admissions as swab-based surveillance, though it generated more false positives. The patients identified by the ML model were not yet colonized with C. difficile. Additionally, the ML model identifies at-risk admissions before disease onset, providing opportunities for prevention.
Subjects
Clostridioides difficile, Clostridium Infections, Cross Infection, Adult, Humans, Prospective Studies, Hospitals, Clostridium Infections/diagnosis, Clostridium Infections/epidemiology, Clostridium Infections/prevention & control, Cross Infection/epidemiology, Cross Infection/prevention & control, Intensive Care Units
ABSTRACT
Many data-driven patient risk stratification models have not been evaluated prospectively. We performed and compared the prospective and retrospective evaluations of 2 Clostridioides difficile infection (CDI) risk-prediction models at 2 large academic health centers, and we discuss the models' robustness to data-set shifts.
Subjects
Clostridium Infections, Humans, Retrospective Studies, Clostridium Infections/epidemiology
ABSTRACT
As machine learning (ML) models gain traction in clinical applications, understanding the impact of clinician and societal biases on ML models is increasingly important. While biases can arise in the labels used for model training, the many sources from which these biases arise are not yet well studied. In this paper, we highlight disparate censorship (i.e., differences in testing rates across patient groups) as a source of label bias that clinical ML models may amplify, potentially causing harm. Many patient risk-stratification models are trained using the results of clinician-ordered diagnostic and laboratory tests as labels. Patients without test results are often assigned a negative label, which assumes that untested patients do not experience the outcome. Since orders are affected by clinical and resource considerations, testing may not be uniform across patient populations, giving rise to disparate censorship. Disparate censorship in patients of equivalent risk leads to undertesting in certain groups and, in turn, more biased labels for such groups. Using such biased labels in standard ML pipelines could contribute to gaps in model performance across patient groups. Here, we theoretically and empirically characterize conditions in which disparate censorship or undertesting affects model performance across subgroups. Our findings call attention to disparate censorship as a source of label bias in clinical ML models.
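A toy simulation makes the mechanism concrete: two groups with identical true outcome rates but different testing rates end up with different observed label rates once untested patients default to negative. All rates below are illustrative assumptions, not estimates from any dataset.

```python
# Hedged sketch: how disparate censorship produces biased observed labels.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
group = rng.integers(0, 2, n)                    # groups 0 and 1 have equal true risk
y_true = rng.random(n) < 0.10                    # 10% true outcome rate in both groups
test_rate = np.where(group == 0, 0.80, 0.40)     # group 1 is tested half as often
tested = rng.random(n) < test_rate
y_observed = y_true & tested                     # untested patients default to negative

for g in (0, 1):
    mask = group == g
    print(f"group {g}: true rate {y_true[mask].mean():.3f}, "
          f"observed label rate {y_observed[mask].mean():.3f}")
```

Training a model on `y_observed` would systematically understate risk in the under-tested group, which is the performance gap the abstract warns about.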
ABSTRACT
OBJECTIVE: When patients develop acute respiratory failure (ARF), accurately identifying the underlying etiology is essential for determining the best treatment. However, differentiating between common medical diagnoses can be challenging in clinical practice. Machine learning models could improve medical diagnosis by aiding in the diagnostic evaluation of these patients. MATERIALS AND METHODS: Machine learning models were trained to predict the common causes of ARF (pneumonia, heart failure, and/or chronic obstructive pulmonary disease [COPD]). Models were trained using chest radiographs and clinical data from the electronic health record (EHR) and applied to an internal and external cohort. RESULTS: The internal cohort of 1618 patients included 508 (31%) with pneumonia, 363 (22%) with heart failure, and 137 (8%) with COPD based on physician chart review. A model combining chest radiographs and EHR data outperformed models based on each modality alone. Models had similar or better performance compared to a randomly selected physician reviewer. For pneumonia, the combined model area under the receiver operating characteristic curve (AUROC) was 0.79 (0.77-0.79), image model AUROC was 0.74 (0.72-0.75), and EHR model AUROC was 0.74 (0.70-0.76). For heart failure, combined: 0.83 (0.77-0.84), image: 0.80 (0.71-0.81), and EHR: 0.79 (0.75-0.82). For COPD, combined: AUROC = 0.88 (0.83-0.91), image: 0.83 (0.77-0.89), and EHR: 0.80 (0.76-0.84). In the external cohort, performance was consistent for heart failure and increased for COPD, but declined slightly for pneumonia. CONCLUSIONS: Machine learning models combining chest radiographs and EHR data can accurately differentiate between common causes of ARF. Further work is needed to determine how these models could act as a diagnostic aid to clinicians in clinical settings.
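One common way to combine the two modalities is late fusion: train a model per modality and learn a second-stage combiner over their predicted probabilities. The sketch below uses that pattern with synthetic stand-ins for the radiograph and EHR features; it illustrates the general approach, not the study's actual architecture.

```python
# Hedged sketch: late fusion of image-based and EHR-based scores for one diagnosis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1618
image_embedding = rng.normal(size=(n, 64))   # stand-in for chest-radiograph features
ehr_features = rng.normal(size=(n, 30))      # stand-in for labs/vitals/demographics
y = rng.integers(0, 2, n)                    # 1 = heart failure contributed to ARF (synthetic)

X_img_tr, X_img_te, X_ehr_tr, X_ehr_te, y_tr, y_te = train_test_split(
    image_embedding, ehr_features, y, test_size=0.3, random_state=0)

img_model = LogisticRegression(max_iter=1000).fit(X_img_tr, y_tr)
ehr_model = LogisticRegression(max_iter=1000).fit(X_ehr_tr, y_tr)

# Fuse the two modality-specific probabilities with a second-stage model.
# (In practice, held-out or cross-validated probabilities would be used here
# to avoid optimistic fusion weights.)
fuse_tr = np.column_stack([img_model.predict_proba(X_img_tr)[:, 1],
                           ehr_model.predict_proba(X_ehr_tr)[:, 1]])
fuse_te = np.column_stack([img_model.predict_proba(X_img_te)[:, 1],
                           ehr_model.predict_proba(X_ehr_te)[:, 1]])
fusion = LogisticRegression().fit(fuse_tr, y_tr)

for name, score in [("image only", img_model.predict_proba(X_img_te)[:, 1]),
                    ("EHR only", ehr_model.predict_proba(X_ehr_te)[:, 1]),
                    ("combined", fusion.predict_proba(fuse_te)[:, 1])]:
    print(f"{name}: AUROC = {roc_auc_score(y_te, score):.2f}")
```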