ABSTRACT
OBJECTIVES: The performance of recently developed deep learning models for image classification has been reported to surpass that of radiologists. However, questions remain about the consistency of model performance and its generalization to unseen external data. The purpose of this study is to determine whether the high performance of deep learning on mammograms can be transferred to external data with a different data distribution. MATERIALS AND METHODS: Six deep learning models (three published models with high performance and three models designed by us) were evaluated on four mammogram data sets: three public (Digital Database for Screening Mammography, INbreast, and Mammographic Image Analysis Society) and one private (UKy). The models were trained and validated on either the Digital Database for Screening Mammography alone or a combined data set that included it, and were then tested on the three external data sets. The area under the receiver operating characteristic curve (auROC) was used to evaluate model performance. RESULTS: The three published models reported auROC scores between 0.88 and 0.95 on the validation data set. Our models achieved auROC scores between 0.71 (95% confidence interval [CI]: 0.70-0.72) and 0.79 (95% CI: 0.78-0.80) on the same validation data set. However, the performance of all six models decreased significantly on the three external test data sets, with auROC scores ranging only from 0.44 (95% CI: 0.43-0.45) to 0.65 (95% CI: 0.64-0.66). CONCLUSION: Our results demonstrate performance inconsistency across data sets and models, indicating that the high performance of deep learning models on one data set cannot be readily transferred to unseen external data sets; these models need further assessment and validation before they are applied in clinical practice.
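The abstract reports auROC point estimates with 95% confidence intervals but does not state how the intervals were obtained. A minimal sketch of one common approach, a percentile bootstrap over test cases, is shown below; the function and variable names are illustrative and not taken from the study.

```python
# Sketch: auROC with a 95% percentile-bootstrap confidence interval.
# The study's actual CI method is not stated; bootstrap resampling of
# test cases is assumed here for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_with_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Point-estimate auROC plus a percentile bootstrap CI."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    point = roc_auc_score(y_true, y_score)
    stats = []
    n = len(y_true)
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)          # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:  # auROC needs both classes present
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)

# Usage: y_true holds ground-truth labels (0 = benign, 1 = malignant) and
# y_score holds model-predicted malignancy probabilities for a test set:
# auc, (lo, hi) = auroc_with_ci(y_true, y_score)
```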
Subject(s)
Breast Neoplasms , Deep Learning , Breast Neoplasms/diagnostic imaging , Early Detection of Cancer , Female , Humans , Image Processing, Computer-Assisted , Mammography
ABSTRACT
OBJECTIVE: Communication performance inconsistency between consultations is usually regarded as measurement error that jeopardizes the reliability of assessments. However, inconsistency is an important phenomenon in its own right, since it indicates that physicians' communication may be below standard in some consultations. METHODS: Fifty residents performed two challenging consultations. Residents' communication competency was assessed with the CELI instrument, and their background in communication skills training (CST) was also established. We used multilevel analysis to explore communication performance inconsistency between the two consultations, and we examined the relationships between inconsistency and average performance quality, the type of consultation, and CST background. RESULTS: Inconsistency accounted for 45.5% of the variance in residents' communication performance. Inconsistency depended on the type of consultation, and the effect of CST background on performance quality was case specific. Inconsistency and average performance quality were related for consultation combinations that were dissimilar in goals, structure, and required skills. CST background had no effect on inconsistency. CONCLUSION: Physician communication performance should not only be of high quality but also consistent, regardless of the type and complexity of the consultation. PRACTICE IMPLICATIONS: To improve performance quality and reduce performance inconsistency, communication education should offer ample opportunities to practice a wide variety of challenging consultations.
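The 45.5% figure is the share of score variance lying within residents (between their two consultations) rather than between residents. The abstract names multilevel analysis but not the software or model specification; a minimal random-intercept sketch of that variance decomposition follows, with the data file and column names all hypothetical.

```python
# Sketch: a variance-components (random-intercept) model of the kind the
# abstract's multilevel analysis implies. Consultation scores are nested
# within residents; the residual (within-resident) variance share is the
# "inconsistency" figure. File and column names below are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# One row per consultation, with columns:
#   resident (id), case (consultation type), score (CELI communication score)
df = pd.read_csv("celi_scores.csv")

# Random intercept per resident; 'case' as a fixed effect absorbs
# consultation-type differences in difficulty.
model = smf.mixedlm("score ~ case", data=df, groups=df["resident"])
fit = model.fit(reml=True)

between = float(fit.cov_re.iloc[0, 0])  # variance between residents
within = fit.scale                      # residual variance (inconsistency)
print(fit.summary())
print(f"Inconsistency share: {within / (between + within):.1%}")
```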