Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 26
Filtrar
Mais filtros











Base de dados
Intervalo de ano de publicação
1.
BMC Med Inform Decis Mak ; 24(Suppl 4): 203, 2024 Jul 23.
Artigo em Inglês | MEDLINE | ID: mdl-39044277

RESUMO

BACKGROUND: The frequency of hip and knee arthroplasty surgeries has been rising steadily in recent decades. This trend is attributed to an aging population, leading to increased demands on healthcare systems. Fast Track (FT) surgical protocols, perioperative procedures designed to expedite patient recovery and early mobilization, have demonstrated efficacy in reducing hospital stays, convalescence periods, and associated costs. However, the criteria for selecting patients for FT procedures have not fully capitalized on the available patient data, including patient-reported outcome measures (PROMs). METHODS: Our study focused on developing machine learning (ML) models to support decision making in assigning patients to FT procedures, utilizing data from patients' self-reported health status. These models are specifically designed to predict the potential health status improvement in patients initially selected for FT. Our approach focused on techniques inspired by the concept of controllable AI. This includes eXplainable AI (XAI), which aims to make the model's recommendations comprehensible to clinicians, and cautious prediction, a method used to alert clinicians about potential control losses, thereby enhancing the models' trustworthiness and reliability. RESULTS: Our models were trained and tested using a dataset comprising 899 records from individual patients admitted to the FT program at IRCCS Ospedale Galeazzi-Sant'Ambrogio. After training and selecting hyper-parameters, the models were assessed using a separate internal test set. The interpretable models demonstrated performance on par or even better than the most effective 'black-box' model (Random Forest). These models achieved sensitivity, specificity, and positive predictive value (PPV) exceeding 70%, with an area under the curve (AUC) greater than 80%. The cautious prediction models exhibited enhanced performance while maintaining satisfactory coverage (over 50%). Further, when externally validated on a separate cohort from the same hospital-comprising patients from a subsequent time period-the models showed no pragmatically notable decline in performance. CONCLUSIONS: Our results demonstrate the effectiveness of utilizing PROMs as basis to develop ML models for planning assignments to FT procedures. Notably, the application of controllable AI techniques, particularly those based on XAI and cautious prediction, emerges as a promising approach. These techniques provide reliable and interpretable support, essential for informed decision-making in clinical processes.


Assuntos
Artroplastia de Quadril , Artroplastia do Joelho , Aprendizado de Máquina , Medidas de Resultados Relatados pelo Paciente , Humanos , Feminino , Idoso , Masculino , Pessoa de Meia-Idade , Procedimentos Clínicos
2.
Artigo em Inglês | MEDLINE | ID: mdl-38607715

RESUMO

In this article we propose a conceptual framework to study ensembles of conformal predictors (CP), that we call Ensemble Predictors (EP). Our approach is inspired by the application of imprecise probabilities in information fusion. Based on the proposed framework, we study, for the first time in the literature, the theoretical properties of CP ensembles in a general setting, by focusing on simple and commonly used possibilistic combination rules. We also illustrate the applicability of the proposed methods in the setting of multivariate time-series classification, showing that these methods provide better performance (in terms of both robustness, conservativeness, accuracy and running time) than both standard classification algorithms and other combination rules proposed in the literature, on a large set of benchmarks from the UCR time series archive.

3.
Artif Intell Med ; 150: 102819, 2024 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-38553159

RESUMO

This paper examines a kind of explainable AI, centered around what we term pro-hoc explanations, that is a form of support that consists of offering alternative explanations (one for each possible outcome) instead of a specific post-hoc explanation following specific advice. Specifically, our support mechanism utilizes explanations by examples, featuring analogous cases for each category in a binary setting. Pro-hoc explanations are an instance of what we called frictional AI, a general class of decision support aimed at achieving a useful compromise between the increase of decision effectiveness and the mitigation of cognitive risks, such as over-reliance, automation bias and deskilling. To illustrate an instance of frictional AI, we conducted an empirical user study to investigate its impact on the task of radiological detection of vertebral fractures in x-rays. Our study engaged 16 orthopedists in a 'human-first, second-opinion' interaction protocol. In this protocol, clinicians first made initial assessments of the x-rays without AI assistance and then provided their final diagnosis after considering the pro-hoc explanations. Our findings indicate that physicians, particularly those with less experience, perceived pro-hoc XAI support as significantly beneficial, even though it did not notably enhance their diagnostic accuracy. However, their increased confidence in final diagnoses suggests a positive overall impact. Given the promisingly high effect size observed, our results advocate for further research into pro-hoc explanations specifically, and into the broader concept of frictional AI.


Assuntos
Médicos , Radiologia , Humanos , Tomada de Decisão Clínica , Automação
4.
Comput Biol Med ; 170: 108042, 2024 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-38308866

RESUMO

This paper proposes a user study aimed at evaluating the impact of Class Activation Maps (CAMs) as an eXplainable AI (XAI) method in a radiological diagnostic task, the detection of thoracolumbar (TL) fractures from vertebral X-rays. In particular, we focus on two oft-neglected features of CAMs, that is granularity and coloring, in terms of what features, lower-level vs higher-level, should the maps highlight and adopting which coloring scheme, to bring better impact to the decision-making process, both in terms of diagnostic accuracy (that is effectiveness) and of user-centered dimensions, such as perceived confidence and utility (that is satisfaction), depending on case complexity, AI accuracy, and user expertise. Our findings show that lower-level features CAMs, which highlight more focused anatomical landmarks, are associated with higher diagnostic accuracy than higher-level features CAMs, particularly among experienced physicians. Moreover, despite the intuitive appeal of semantic CAMs, traditionally colored CAMs consistently yielded higher diagnostic accuracy across all groups. Our results challenge some prevalent assumptions in the XAI field and emphasize the importance of adopting an evidence-based and human-centered approach to design and evaluate AI- and XAI-assisted diagnostic tools. To this aim, the paper also proposes a hierarchy of evidence framework to help designers and practitioners choose the XAI solutions that optimize performance and satisfaction on the basis of the strongest evidence available or to focus on the gaps in the literature that need to be filled to move from opinionated and eminence-based research to one more based on empirical evidence and end-user work and preferences.


Assuntos
Processos Mentais , Radiologia , Humanos , Semântica , Coluna Vertebral
5.
Diagnostics (Basel) ; 13(6)2023 Mar 21.
Artigo em Inglês | MEDLINE | ID: mdl-36980497

RESUMO

Total hip (THA) and total knee (TKA) arthroplasty procedures have steadily increased over the past few decades, and their use is expected to grow further, mainly due to an increasing number of elderly patients. Cost-containment strategies, supporting a rapid recovery with a positive functional outcomes, high patient satisfaction, and enhanced patient reported outcomes, are needed. A Fast Track surgical procedure (FT) is a coordinated perioperative approach aimed at expediting early mobilization and recovery following surgery and, accordingly, shortening the length of hospital stay (LOS), convalescence and costs. In this view, rapid rehabilitation surgery optimizes traditional rehabilitation methods by integrating evidence-based practices into the procedure. The aim of the present study was to compare the effectiveness of Fast Track versus Care-as-Usual surgical procedures and pathways (including rehabilitation) on a mid-term patient-reported outcome (PROs), the SF12 (with regard both to Physical and Mental Scores), 3 months after hip or knee replacement surgery, with the use of Propensity score-matching (PSM) analysis to address the issue of the comparability of the groups in a non-randomized study. We were interested in the evaluation of the entire pathways, including the postoperative rehabilitation stage, therefore, we only used early home discharge as a surrogate to differentiate between the Fast Track and Care-as-Usual rehabilitation pathways. Our study shows that the entire Fast Track pathway, which includes the post-operative rehabilitation stage, has a significantly positive impact on physical health-related status (SF12 Physical Scores), as perceived by patients 3 months after hip or knee replacement surgery, as opposed to the standardized program, both in terms of the PROs score and the relative improvements observed, as compared with the minimum clinically important difference. This result encourages additional research into the effects of Fast Track rehabilitation on the entire process of care for patients undergoing hip or knee arthroplasty, focusing only on patient-reported outcomes.

6.
Artif Intell Med ; 138: 102506, 2023 04.
Artigo em Inglês | MEDLINE | ID: mdl-36990586

RESUMO

In this paper, we study human-AI collaboration protocols, a design-oriented construct aimed at establishing and evaluating how humans and AI can collaborate in cognitive tasks. We applied this construct in two user studies involving 12 specialist radiologists (the knee MRI study) and 44 ECG readers of varying expertise (the ECG study), who evaluated 240 and 20 cases, respectively, in different collaboration configurations. We confirm the utility of AI support but find that XAI can be associated with a "white-box paradox", producing a null or detrimental effect. We also find that the order of presentation matters: AI-first protocols are associated with higher diagnostic accuracy than human-first protocols, and with higher accuracy than both humans and AI alone. Our findings identify the best conditions for AI to augment human diagnostic skills, rather than trigger dysfunctional responses and cognitive biases that can undermine decision effectiveness.


Assuntos
Inteligência Artificial , Humanos
7.
Sensors (Basel) ; 22(19)2022 Sep 27.
Artigo em Inglês | MEDLINE | ID: mdl-36236427

RESUMO

Human Activity Recognition (HAR) has been studied extensively, yet current approaches are not capable of generalizing across different domains (i.e., subjects, devices, or datasets) with acceptable performance. This lack of generalization hinders the applicability of these models in real-world environments. As deep neural networks are becoming increasingly popular in recent work, there is a need for an explicit comparison between handcrafted and deep representations in Out-of-Distribution (OOD) settings. This paper compares both approaches in multiple domains using homogenized public datasets. First, we compare several metrics to validate three different OOD settings. In our main experiments, we then verify that even though deep learning initially outperforms models with handcrafted features, the situation is reversed as the distance from the training distribution increases. These findings support the hypothesis that handcrafted features may generalize better across specific domains.


Assuntos
Atividades Humanas , Redes Neurais de Computação , Humanos , Reconhecimento Psicológico
8.
Comput Methods Programs Biomed ; 221: 106930, 2022 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-35690505

RESUMO

Background and Objective Evaluation of AI-based decision support systems (AI-DSS) is of critical importance in practical applications, nonetheless common evaluation metrics fail to properly consider relevant and contextual information. In this article we discuss a novel utility metric, the weighted Utility (wU), for the evaluation of AI-DSS, which is based on the raters' perceptions of their annotation hesitation and of the relevance of the training cases. Methods We discuss the relationship between the proposed metric and other previous proposals; and we describe the application of the proposed metric for both model evaluation and optimization, through three realistic case studies. Results We show that our metric generalizes the well-known Net Benefit, as well as other common error-based and utility-based metrics. Through the empirical studies, we show that our metric can provide a more flexible tool for the evaluation of AI models. We also show that, compared to other optimization metrics, model optimization based on the wU can provide significantly better performance (AUC 0.862 vs 0.895, p-value <0.05), especially on cases judged to be more complex by the human annotators (AUC 0.85 vs 0.92, p-value <0.05). Conclusions We make the point for having utility as a primary concern in the evaluation and optimization of machine learning models in critical domains, like the medical one; and for the importance of a human-centred approach to assess the potential impact of AI models on human decision making also on the basis of further information that can be collected during the ground-truthing process.


Assuntos
Benchmarking , Aprendizado de Máquina , Humanos
9.
Stud Health Technol Inform ; 294: 127-128, 2022 May 25.
Artigo em Inglês | MEDLINE | ID: mdl-35612033

RESUMO

We propose a re-calibration method for Machine Learning models, based on computing confidence intervals for the predicted confidence scores. We show the effectiveness of the proposed method on a COVID-19 diagnosis benchmark.


Assuntos
COVID-19 , Teste para COVID-19 , Calibragem , Intervalos de Confiança , Humanos , Aprendizado de Máquina
10.
Med Biol Eng Comput ; 2022 Mar 30.
Artigo em Inglês | MEDLINE | ID: mdl-35353302

RESUMO

In this article, we discuss the development of prognostic machine learning (ML) models for COVID-19 progression, by focusing on the task of predicting ICU admission within (any of) the next 5 days. On the basis of 6,625 complete blood count (CBC) tests from 1,004 patients, of which 18% were admitted to intensive care unit (ICU), we created four ML models, by adopting a robust development procedure which was designed to minimize risks of bias and over-fitting, according to reference guidelines. The best model, a support vector machine, had an AUC of .85, a Brier score of .14, and a standardized net benefit of .69: these scores indicate that the model performed well over a variety of prediction criteria. We also conducted an interpretability study to back up our findings, showing that the data on which the developed model is based is consistent with the current medical literature. This also demonstrates that CBC data and ML methods can be used to predict COVID-19 patients' ICU admission at a relatively low cost: in particular, since CBC data can be quickly obtained by means of routine blood exams, our models could be used in resource-constrained settings and provide health practitioners with rapid and reliable indications.

11.
Clin Chem Lab Med ; 60(4): 556-568, 2022 03 28.
Artigo em Inglês | MEDLINE | ID: mdl-34333884

RESUMO

OBJECTIVES: The European Biological Variation Study (EuBIVAS), which includes 91 healthy volunteers from five European countries, estimated high-quality biological variation (BV) data for several measurands. Previous EuBIVAS papers reported no significant differences among laboratories/population; however, they were focused on specific set of measurands, without a comprehensive general look. The aim of this paper is to evaluate the homogeneity of EuBIVAS data considering multivariate information applying the Principal Component Analysis (PCA), a machine learning unsupervised algorithm. METHODS: The EuBIVAS data for 13 basic metabolic panel linked measurands (glucose, albumin, total protein, electrolytes, urea, total bilirubin, creatinine, phosphatase alkaline, aminotransferases), age, sex, menopause, body mass index (BMI), country, alcohol, smoking habits, and physical activity, have been used to generate three databases developed using the traditional univariate and the multivariate Elliptic Envelope approaches to detect outliers, and different missing-value imputations. Two matrix of data for each database, reporting both mean values, and "within-person BV" (CVP) values for any measurand/subject, were analyzed using PCA. RESULTS: A clear clustering between males and females mean values has been identified, where the menopausal females are closer to the males. Data interpretations for the three databases are similar. No significant differences for both mean and CVPs values, for countries, alcohol, smoking habits, BMI and physical activity, have been found. CONCLUSIONS: The absence of meaningful differences among countries confirms the EuBIVAS sample homogeneity and that the obtained data are widely applicable to deliver APS. Our data suggest that the use of PCA and the multivariate approach may be used to detect outliers, although further studies are required.


Assuntos
Algoritmos , Bilirrubina , Creatinina , Feminino , Humanos , Aprendizado de Máquina , Masculino , Análise de Componente Principal
12.
Health Inf Sci Syst ; 9(1): 37, 2021 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-34721844

RESUMO

PURPOSE: The rRT-PCR for COVID-19 diagnosis is affected by long turnaround time, potential shortage of reagents, high false-negative rates and high costs. Routine hematochemical tests are a faster and less expensive alternative for diagnosis. Thus, Machine Learning (ML) has been applied to hematological parameters to develop diagnostic tools and help clinicians in promptly managing positive patients. However, few ML models have been externally validated, making their real-world applicability unclear. METHODS: We externally validate 6 state-of-the-art diagnostic ML models, based on Complete Blood Count (CBC) and trained on a dataset encompassing 816 COVID-19 positive cases. The external validation was performed based on two datasets, collected at two different hospitals in northern Italy and encompassing 163 and 104 COVID-19 positive cases, in terms of both error rate and calibration. RESULTS AND CONCLUSION: We report an average AUC of 95% and average Brier score of 0.11, out-performing existing ML methods, and showing good cross-site transportability. The best performing model (SVM) reported an average AUC of 97.5% (Sensitivity: 87.5%, Specificity: 94%), comparable with the performance of RT-PCR, and was also the best calibrated. The validated models can be useful in the early identification of potential COVID-19 patients, due to the rapid availability of CBC exams, and in multiple test settings.

13.
Comput Methods Programs Biomed ; 208: 106288, 2021 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-34352688

RESUMO

Background and Objective Medical machine learning (ML) models tend to perform better on data from the same cohort than on new data, often due to overfitting, or co-variate shifts. For these reasons, external validation (EV) is a necessary practice in the evaluation of medical ML. However, there is still a gap in the literature on how to interpret EV results and hence assess the robustness of ML models. METHODS: We fill this gap by proposing a meta-validation method, to assess the soundness of EV procedures. In doing so, we complement the usual way to assess EV by considering both dataset cardinality, and the similarity of the EV dataset with respect to the training set. We then investigate how the notions of cardinality and similarity can be used to inform on the reliability of a validation procedure, by integrating them into two summative data visualizations. RESULTS: We illustrate our methodology by applying it to the validation of a state-of-the-art COVID-19 diagnostic model on 8 EV sets, collected across 3 different continents. The model performance was moderately impacted by data similarity (Pearson ρ = 0.38, p< 0.001). In the EV, the validated model reported good AUC (average: 0.84), acceptable calibration (average: 0.17) and utility (average: 0.50). The validation datasets were adequate in terms of dataset cardinality and similarity, thus suggesting the soundness of the results. We also provide a qualitative guideline to evaluate the reliability of validation procedures, and we discuss the importance of proper external validation in light of the obtained results. CONCLUSIONS: In this paper, we propose a novel, lean methodology to: 1) study how the similarity between training and validation sets impacts the generalizability of a ML model; 2) assess the soundness of EV evaluations along three complementary performance dimensions: discrimination, utility and calibration; 3) draw conclusions on the robustness of the model under validation. We applied this methodology to a state-of-the-art model for the diagnosis of COVID-19 from routine blood tests, and showed how to interpret the results in light of the presented framework.


Assuntos
COVID-19 , Estudos de Coortes , Humanos , Aprendizado de Máquina , Reprodutibilidade dos Testes , SARS-CoV-2
14.
Int J Med Inform ; 153: 104510, 2021 09.
Artigo em Inglês | MEDLINE | ID: mdl-34108105

RESUMO

This editorial aims to contribute to the current debate about the quality of studies that apply machine learning (ML) methodologies to medical data to extract value from them and provide clinicians with viable and useful tools supporting everyday care practices. We propose a practical checklist to help authors to self assess the quality of their contribution and to help reviewers to recognize and appreciate high-quality medical ML studies by distinguishing them from the mere application of ML techniques to medical data.


Assuntos
Inteligência Artificial , Informática Médica , Lista de Checagem , Humanos , Aprendizado de Máquina
15.
Comput Biol Med ; 133: 104393, 2021 06.
Artigo em Inglês | MEDLINE | ID: mdl-33915362

RESUMO

Treatment and prevention of cardiovascular diseases often rely on Electrocardiogram (ECG) interpretation. Dependent on the physician's variability, ECG interpretation is subjective and prone to errors. Machine learning models are often developed and used to support doctors; however, their lack of interpretability stands as one of the main drawbacks of their widespread operation. This paper focuses on an Explainable Artificial Intelligence (XAI) solution to make heartbeat classification more explainable using several state-of-the-art model-agnostic methods. We introduce a high-level conceptual framework for explainable time series and propose an original method that adds temporal dependency between time samples using the time series' derivative. The results were validated in the MIT-BIH arrhythmia dataset: we performed a performance's analysis to evaluate whether the explanations fit the model's behaviour; and employed the 1-D Jaccard's index to compare the subsequences extracted from an interpretable model and the XAI methods used. Our results show that the use of the raw signal and its derivative includes temporal dependency between samples to promote classification explanation. A small but informative user study concludes this study to evaluate the potential of the visual explanations produced by our original method for being adopted in real-world clinical settings, either as diagnostic aids or training resource.


Assuntos
Inteligência Artificial , Eletrocardiografia , Arritmias Cardíacas/diagnóstico , Frequência Cardíaca , Humanos , Aprendizado de Máquina
16.
J Intell ; 9(2)2021 Apr 01.
Artigo em Inglês | MEDLINE | ID: mdl-33915991

RESUMO

Medical errors have a huge impact on clinical practice in terms of economic and human costs. As a result, technology-based solutions, such as those grounded in artificial intelligence (AI) or collective intelligence (CI), have attracted increasing interest as a means of reducing error rates and their impacts. Previous studies have shown that a combination of individual opinions based on rules, weighting mechanisms, or other CI solutions could improve diagnostic accuracy with respect to individual doctors. We conducted a study to investigate the potential of this approach in cardiology and, more precisely, in electrocardiogram (ECG) reading. To achieve this aim, we designed and conducted an experiment involving medical students, recent graduates, and residents, who were asked to annotate a collection of 10 ECGs of various complexity and difficulty. For each ECG, we considered groups of increasing size (from three to 30 members) and applied three different CI protocols. In all cases, the results showed a statistically significant improvement (ranging from 9% to 88%) in terms of diagnostic accuracy when compared to the performance of individual readers; this difference held for not only large groups, but also smaller ones. In light of these results, we conclude that CI approaches can support the tasks mentioned above, and possibly other similar ones as well. We discuss the implications of applying CI solutions to clinical settings, such as cases of augmented 'second opinions' and decision-making.

17.
Health Inf Sci Syst ; 9(1): 8, 2021 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-33585029

RESUMO

PURPOSE: The integration of Artificial Intelligence into medical practices has recently been advocated for the promise to bring increased efficiency and effectiveness to these practices. Nonetheless, little research has so far been aimed at understanding the best human-AI interaction protocols in collaborative tasks, even in currently more viable settings, like independent double-reading screening tasks. METHODS: To this aim, we report about a retrospective case-control study, involving 12 board-certified radiologists, in the detection of knee lesions by means of Magnetic Resonance Imaging, in which we simulated the serial combination of two Deep Learning models with humans in eight double-reading protocols. Inspired by the so-called Kasparov's Laws, we investigate whether the combination of humans and AI models could achieve better performance than AI models alone, and whether weak reader, when supported by fit-for-use interaction protocols, could out-perform stronger readers. RESULTS: We discuss two main findings: groups of humans who perform significantly worse than a state-of-the-art AI can significantly outperform it if their judgements are aggregated by majority voting (in concordance with the first part of the Kasparov's law); small ensembles of significantly weaker readers can significantly outperform teams of stronger readers, supported by the same computational tool, when the judgments of the former ones are combined within "fit-for-use" protocols (in concordance with the second part of the Kasparov's law). CONCLUSION: Our study shows that good interaction protocols can guarantee improved decision performance that easily surpasses the performance of individual agents, even of realistic super-human AI systems. This finding highlights the importance of focusing on how to guarantee better co-operation within human-AI teams, so to enable safer and more human sustainable care practices.

18.
Clin Chem Lab Med ; 59(2): 421-431, 2020 10 21.
Artigo em Inglês | MEDLINE | ID: mdl-33079698

RESUMO

Objectives: The rRT-PCR test, the current gold standard for the detection of coronavirus disease (COVID-19), presents with known shortcomings, such as long turnaround time, potential shortage of reagents, false-negative rates around 15-20%, and expensive equipment. The hematochemical values of routine blood exams could represent a faster and less expensive alternative. Methods: Three different training data set of hematochemical values from 1,624 patients (52% COVID-19 positive), admitted at San Raphael Hospital (OSR) from February to May 2020, were used for developing machine learning (ML) models: the complete OSR dataset (72 features: complete blood count (CBC), biochemical, coagulation, hemogasanalysis and CO-Oxymetry values, age, sex and specific symptoms at triage) and two sub-datasets (COVID-specific and CBC dataset, 32 and 21 features respectively). 58 cases (50% COVID-19 positive) from another hospital, and 54 negative patients collected in 2018 at OSR, were used for internal-external and external validation. Results: We developed five ML models: for the complete OSR dataset, the area under the receiver operating characteristic curve (AUC) for the algorithms ranged from 0.83 to 0.90; for the COVID-specific dataset from 0.83 to 0.87; and for the CBC dataset from 0.74 to 0.86. The validations also achieved good results: respectively, AUC from 0.75 to 0.78; and specificity from 0.92 to 0.96. Conclusions: ML can be applied to blood tests as both an adjunct and alternative method to rRT-PCR for the fast and cost-effective identification of COVID-19-positive patients. This is especially useful in developing countries, or in countries facing an increase in contagions.


Assuntos
Análise Química do Sangue/métodos , Teste para COVID-19/métodos , COVID-19/sangue , Testes Hematológicos/métodos , Aprendizado de Máquina , Algoritmos , Área Sob a Curva , Contagem de Células Sanguíneas , Conjuntos de Dados como Assunto , Humanos , SARS-CoV-2 , Sensibilidade e Especificidade
19.
BMC Med Inform Decis Mak ; 20(1): 219, 2020 09 11.
Artigo em Inglês | MEDLINE | ID: mdl-32917183

RESUMO

BACKGROUND: We focus on the importance of interpreting the quality of the labeling used as the input of predictive models to understand the reliability of their output in support of human decision-making, especially in critical domains, such as medicine. METHODS: Accordingly, we propose a framework distinguishing the reference labeling (or Gold Standard) from the set of annotations from which it is usually derived (the Diamond Standard). We define a set of quality dimensions and related metrics: representativeness (are the available data representative of its reference population?); reliability (do the raters agree with each other in their ratings?); and accuracy (are the raters' annotations a true representation?). The metrics for these dimensions are, respectively, the degree of correspondence, Ψ, the degree of weighted concordance ϱ, and the degree of fineness, Φ. We apply and evaluate these metrics in a diagnostic user study involving 13 radiologists. RESULTS: We evaluate Ψ against hypothesis-testing techniques, highlighting that our metrics can better evaluate distribution similarity in high-dimensional spaces. We discuss how Ψ could be used to assess the reliability of new predictions or for train-test selection. We report the value of ϱ for our case study and compare it with traditional reliability metrics, highlighting both their theoretical properties and the reasons that they differ. Then, we report the degree of fineness as an estimate of the accuracy of the collected annotations and discuss the relationship between this latter degree and the degree of weighted concordance, which we find to be moderately but significantly correlated. Finally, we discuss the implications of the proposed dimensions and metrics with respect to the context of Explainable Artificial Intelligence (XAI). CONCLUSION: We propose different dimensions and related metrics to assess the quality of the datasets used to build predictive models and Medical Artificial Intelligence (MAI). We argue that the proposed metrics are feasible for application in real-world settings for the continuous development of trustable and interpretable MAI systems.


Assuntos
Inteligência Artificial , Algoritmos , Benchmarking , Humanos , Reprodutibilidade dos Testes
20.
BMC Med Inform Decis Mak ; 20(Suppl 5): 142, 2020 08 20.
Artigo em Inglês | MEDLINE | ID: mdl-32819345

RESUMO

BACKGROUND: Despite the vagueness and uncertainty that is intrinsic in any medical act, interpretation and decision (including acts of data reporting and representation of relevant medical conditions), still little research has focused on how to explicitly take this uncertainty into account. In this paper, we focus on the representation of a general and wide-spread medical terminology, which is grounded on a traditional and well-established convention, to represent severity of health conditions (for instance, pain, visible signs), ranging from Absent to Extreme. Specifically, we will study how both potential patients and doctors perceive the different levels of the terminology in both quantitative and qualitative terms, and if the embedded user knowledge could improve the representation of ordinal values in the construction of machine learning models. METHODS: To this aim, we conducted a questionnaire-based research study involving a relatively large sample of 1,152 potential patients and 31 clinicians to represent numerically the perceived meaning of standard and widely-applied labels to describe health conditions. Using these collected values, we then present and discuss different possible fuzzy-set based representations that address the vagueness of medical interpretation by taking into account the perceptions of domain experts. We also apply the findings of this user study to evaluate the impact of different encodings on the predictive performance of common machine learning models in regard to a real-world medical prognostic task. RESULTS: We found significant differences in the perception of pain levels between the two user groups. We also show that the proposed encodings can improve the performances of specific classes of models, and discuss when this is the case. CONCLUSIONS: In perspective, our hope is that the proposed techniques for ordinal scale representation and ordinal encoding may be useful to the research community, and also that our methodology will be applied to other widely used ordinal scales for improving validity of datasets and bettering the results of machine learning tasks.


Assuntos
Lógica Fuzzy , Conhecimento , Aprendizado de Máquina , Médicos , Tomada de Decisões , Humanos , Reprodutibilidade dos Testes , Projetos de Pesquisa , Inquéritos e Questionários
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA