ABSTRACT
This paper examines a kind of explainable AI centered on what we term pro-hoc explanations: a form of support that offers alternative explanations (one for each possible outcome) instead of a single post-hoc explanation attached to specific advice. Specifically, our support mechanism uses explanations by examples, featuring analogous cases for each category in a binary setting. Pro-hoc explanations are an instance of what we call frictional AI, a general class of decision support aimed at achieving a useful compromise between increasing decision effectiveness and mitigating cognitive risks such as over-reliance, automation bias and deskilling. To illustrate an instance of frictional AI, we conducted an empirical user study to investigate its impact on the radiological detection of vertebral fractures in X-rays. Our study engaged 16 orthopedists in a 'human-first, second-opinion' interaction protocol, in which clinicians first assessed the X-rays without AI assistance and then gave their final diagnosis after considering the pro-hoc explanations. Our findings indicate that physicians, particularly those with less experience, perceived pro-hoc XAI support as significantly beneficial, even though it did not notably enhance their diagnostic accuracy. However, their increased confidence in final diagnoses suggests a positive overall impact. Given the promisingly high effect size observed, our results advocate for further research into pro-hoc explanations specifically, and into the broader concept of frictional AI.
Subject(s)
Physicians, Radiology, Humans, Clinical Decision-Making, Automation
ABSTRACT
This paper presents a user study evaluating the impact of Class Activation Maps (CAMs) as an eXplainable AI (XAI) method in a radiological diagnostic task: the detection of thoracolumbar (TL) fractures from vertebral X-rays. In particular, we focus on two oft-neglected features of CAMs, granularity and coloring: which features (lower-level vs. higher-level) the maps should highlight, and which coloring scheme they should adopt, to best support the decision-making process. We assess this support in terms of both diagnostic accuracy (effectiveness) and user-centered dimensions such as perceived confidence and utility (satisfaction), depending on case complexity, AI accuracy, and user expertise. Our findings show that lower-level feature CAMs, which highlight more focused anatomical landmarks, are associated with higher diagnostic accuracy than higher-level feature CAMs, particularly among experienced physicians. Moreover, despite the intuitive appeal of semantic CAMs, traditionally colored CAMs consistently yielded higher diagnostic accuracy across all groups. Our results challenge some prevalent assumptions in the XAI field and emphasize the importance of adopting an evidence-based and human-centered approach to designing and evaluating AI- and XAI-assisted diagnostic tools. To this end, the paper also proposes a hierarchy-of-evidence framework. The framework is meant to help designers and practitioners choose the XAI solutions that optimize performance and satisfaction on the basis of the strongest evidence available, or to focus on the gaps in the literature that need to be filled to move from opinionated, eminence-based research to research grounded in empirical evidence and in end users' work and preferences.
Subject(s)
Mental Processes, Radiology, Humans, Semantics, Spine
ABSTRACT
One of the next frontiers in medical research, particularly in orthopaedic surgery, is personalized treatment outcome prediction. In personalized medicine, treatment choices are adjusted for the patient based on the distinct features of the individual and of their disease. A high-value and patient-centered health care system requires evaluating results that integrate the patient's viewpoint. Patient-reported outcome measures (PROMs) are validated questionnaires widely used to shed light on patients' perceptions of their health status after an intervention. The aim of this study is to examine whether meteorological or light (night vs. day) conditions affect PROM scores and hence indirectly affect health-related outcomes. We collected PROM scores from questionnaires completed by patients (N = 2326) who had undergone hip and knee interventions between June 2017 and May 2020 at the IRCCS Orthopaedic Institute Galeazzi (IOG), Milan, Italy. Nearest neighbour propensity score (PS) matching was applied to ensure the similarity of the groups tested under the different weather-related conditions. The exposure PS was derived through logistic regression. The data were analysed using statistical tests (Student's t-test and Mann-Whitney U test). According to Cohen's effect size, weather conditions, including relative humidity, may affect PROM scores and, indirectly, health-related outcomes. The findings suggest avoiding PROM collection in certain conditions if the odds of outcome-based underperformance are to be minimized. This would ensure a balance between the costs of PROM collection and data availability.
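As an illustrative aside, the matching and effect-size steps described in this abstract can be sketched as follows. This is a minimal sketch, not the authors' code: it assumes the propensity scores have already been estimated (for example by logistic regression), uses a greedy 1:1 matching without replacement, and introduces a hypothetical `caliper` parameter that the abstract does not mention.

```python
from math import sqrt
from statistics import mean, variance


def nearest_neighbour_match(exposed_ps, control_ps, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    Returns (exposed_index, control_index) pairs. Matching is done without
    replacement; the caliper value is an illustrative assumption.
    """
    available = list(range(len(control_ps)))
    pairs = []
    for i, ps in enumerate(exposed_ps):
        if not available:
            break
        # Closest still-unmatched control unit for this exposed unit.
        j = min(available, key=lambda k: abs(control_ps[k] - ps))
        if abs(control_ps[j] - ps) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Made-up propensity scores and PROM scores, for illustration only.
    print(nearest_neighbour_match([0.30, 0.62], [0.10, 0.31, 0.60, 0.90]))
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

Greedy matching depends on the order in which exposed units are processed; optimal matching or matching with replacement are alternatives when that matters.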
ABSTRACT
The rise of personalized medicine and its remarkable advancements have revealed new requirements for the availability of appropriate medical decision-making models. Computer science plays an essential role in personalized medicine, where one of the goals is to provide algorithms and tools to extract knowledge and improve the decision-support process. The minimum clinically important difference (MCID) is the smallest change in PROM scores that patients perceive as meaningful. Treatment that does not achieve this minimum level of improvement is considered inappropriate as well as a potential waste of resources. Using the MCID threshold to identify patients who fail to achieve the minimum change in PROM scores that results in a meaningful outcome may aid in pre-surgical shared decision-making. The decision tree algorithm is a method for extracting valuable information and presenting it to the domain expert in a form that supports decision-making. In the present study, different tools based on machine learning were developed. On the one hand, we compared three XGBoost models to predict the non-achievement of the MCID at six months post-operation in the SF-12 physical score. The prediction score threshold was set to 0.75 to provide three decision-making areas on the basis of the high confidence (HC) intervals. The minority class was re-balanced in three ways: by weighting the positive class to penalize the loss function (XGBoost cost-sensitive), by oversampling the minority class (XGBoost with SMOTE), and by undersampling the negative class (XGBoost with undersampling). On the other hand, we modeled the data through a decision tree (assessment tree), based on different complexity levels, to identify hidden patterns and to provide a new way to understand possible relationships between the gathered features and the several outcomes.
The results showed that all the proposed models were effective as binary classifiers, with moderate predictive performance on both the minority or positive class (i.e., our targeted patients, those who will not benefit from surgery) and the negative class. The decision tree visualization can be used during patient assessment to better understand whether patients will benefit from the medical intervention. Both tools can help increase knowledge about the patient's psychophysical state and create an increasingly specialized assessment of the individual patient.
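As an illustrative aside, the labelling, class weighting and three-area decision rule described in this abstract might look like the sketch below. It is not the authors' implementation. The function names are hypothetical, the MCID value passed in is a placeholder, the negatives-to-positives ratio is a common heuristic for cost-sensitive boosting rather than a value reported here, and the symmetric lower band at `1 - threshold` is an assumption about how the three areas are defined.

```python
def mcid_not_achieved(pre_score, post_score, mcid):
    """Return 1 (positive class) when the improvement falls short of the MCID."""
    return int((post_score - pre_score) < mcid)


def positive_class_weight(labels):
    """Ratio of negative to positive examples, usable as a positive-class weight."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples to weight")
    return negatives / positives


def decision_area(score, threshold=0.75):
    """Map a predicted probability of non-achievement to one of three areas.

    Assumption: scores >= threshold and scores <= 1 - threshold form the two
    high-confidence (HC) areas; everything in between is treated as uncertain.
    """
    if score >= threshold:
        return "HC: unlikely to reach MCID"
    if score <= 1 - threshold:
        return "HC: likely to reach MCID"
    return "uncertain"


if __name__ == "__main__":
    # Made-up pre/post scores and a placeholder MCID of 5 points.
    labels = [mcid_not_achieved(30, 33, 5), mcid_not_achieved(30, 40, 5),
              mcid_not_achieved(25, 35, 5), mcid_not_achieved(40, 47, 5)]
    print(labels, positive_class_weight(labels))
    for s in (0.8, 0.5, 0.2):
        print(s, decision_area(s))
```

Keeping an explicit "uncertain" area lets low-confidence predictions be routed to the clinician rather than forced into a yes/no answer.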
ABSTRACT
We propose a re-calibration method for Machine Learning models, based on computing confidence intervals for the predicted confidence scores. We show the effectiveness of the proposed method on a COVID-19 diagnosis benchmark.
Subject(s)
COVID-19, COVID-19 Testing, Calibration, Confidence Intervals, Humans, Machine Learning
ABSTRACT
The current gold standard for COVID-19 diagnosis, the rRT-PCR test, is hampered by long turnaround times, probable reagent shortages, high false-negative rates and high prices. As a result, machine learning (ML) methods have recently piqued interest, particularly when applied to digital imagery (X-rays and CT scans). In this review, the literature on ML-based diagnostic and prognostic studies grounded on hematochemical parameters has been considered. By doing so, a gap in the current literature was addressed concerning the application of machine learning to laboratory medicine. Sixty-eight articles have been included that were extracted from the Scopus and PubMed indexes. These studies were marked by a great deal of heterogeneity in terms of the examined laboratory test and clinical parameters, sample size, reference populations, ML algorithms, and validation approaches. The majority of research was found to be hampered by reporting and replicability issues: only four of the surveyed studies provided complete information on analytic procedures (units of measure, analyzing equipment), while 29 provided no information at all. Only 16 studies included independent external validation. In light of these findings, we discuss the importance of closer collaboration between data scientists and medical laboratory professionals in order to correctly characterise the relevant population, select the most appropriate statistical and analytical methods, ensure reproducibility, enable the proper interpretation of the results, and gain actual utility by using machine learning methods in clinical practice.
Subject(s)
COVID-19, Humans, COVID-19/diagnosis, SARS-CoV-2, COVID-19 Testing, Prognosis, Reproducibility of Results, Machine Learning
ABSTRACT
In this article, we discuss the development of prognostic machine learning (ML) models for COVID-19 progression, focusing on the task of predicting intensive care unit (ICU) admission within (any of) the next 5 days. On the basis of 6,625 complete blood count (CBC) tests from 1,004 patients, of whom 18% were admitted to the ICU, we created four ML models, adopting a robust development procedure designed to minimize risks of bias and over-fitting, according to reference guidelines. The best model, a support vector machine, had an AUC of .85, a Brier score of .14, and a standardized net benefit of .69: these scores indicate that the model performed well over a variety of prediction criteria. We also conducted an interpretability study to back up our findings, showing that the patterns on which the developed model relies are consistent with the current medical literature. This also demonstrates that CBC data and ML methods can be used to predict COVID-19 patients' ICU admission at a relatively low cost: in particular, since CBC data can be quickly obtained by means of routine blood exams, our models could be used in resource-constrained settings and provide health practitioners with rapid and reliable indications.
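As an illustrative aside, the three criteria reported in this abstract (AUC, Brier score, standardized net benefit) can be computed from labels and predicted probabilities as sketched below. This is a minimal sketch using the usual textbook definitions, not the authors' evaluation code. The decision threshold used for the net benefit and the toy data are assumptions.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def standardized_net_benefit(y_true, y_prob, threshold):
    """Net benefit at a decision threshold, divided by the outcome prevalence."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    net_benefit = tp / n - (fp / n) * threshold / (1 - threshold)
    prevalence = sum(y_true) / n
    return net_benefit / prevalence


if __name__ == "__main__":
    # Made-up outcomes (1 = ICU admission) and predicted probabilities.
    y = [1, 0, 1, 0]
    p = [0.9, 0.2, 0.6, 0.7]
    print(auc(y, p), brier_score(y, p), standardized_net_benefit(y, p, 0.5))
```

The three measures answer different questions: AUC captures discrimination, the Brier score also reflects calibration, and net benefit weighs true against false positives at a clinically chosen threshold.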
ABSTRACT
BACKGROUND: The fight against the COVID-19 pandemic seems to encompass a social media debate, possibly resulting in emotional contagion and the need for novel surveillance approaches. In the current study, we aimed to examine the flow and content of tweets, exploring the role of COVID-19 key events on the popular Twitter platform. METHODS: Using representative, freely available data, we performed a focused, social media-based analysis to capture COVID-19 discussions on Twitter, considering sentiment and longitudinal trends between January 19 and March 3, 2020. Different populations of users were considered. Core discussions were explored by measuring tweets' sentiment, both by computing a polarity compound score with a 95% confidence interval and by using a transformer-based model pretrained on a large corpus of COVID-19-related tweets. Context-dependent meaning and emotion-specific features were considered. RESULTS: We gathered 3,308,476 tweets written in English. After the first World Health Organization report (January 21), the proportion of tweets with negative sentiment gradually increased as expected, with amplifications following key events. Sentiment scores were increasingly negative among the most active users. Tweet content and flow revealed an ongoing scenario in which the global emergency seems difficult to manage emotionally, as shown by sentiment trajectories. CONCLUSIONS: Integrating social media like Twitter as essential surveillance tools in the management of the pandemic and its waves might represent a novel preventive approach to hinder emotional contagion, disseminating reliable information and nurturing trust. There is a need to monitor and sustain healthy behaviors as well as community support, also via social media-based preventive interventions.
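As an illustrative aside, a mean polarity compound score with a 95% confidence interval, as mentioned in the METHODS, could be computed as sketched below. The abstract does not say how the interval was obtained; this sketch assumes a normal approximation, and the function name and per-tweet scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval.

    z = 1.96 corresponds to a 95% interval under the normal approximation.
    """
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m, m - half_width, m + half_width


if __name__ == "__main__":
    # Made-up compound scores in [-1, 1] for tweets posted on a single day.
    daily_scores = [-0.5, 0.0, 0.5, -0.2, 0.2]
    print(mean_with_ci(daily_scores))
```

With millions of tweets the interval around each daily mean becomes very narrow, so trends over time are usually more informative than any single interval.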