ABSTRACT
BACKGROUND: Computer vision has promise in image-based cutaneous melanoma diagnosis but clinical utility is uncertain. OBJECTIVE: To determine if computer algorithms from an international melanoma detection challenge can improve dermatologists' accuracy in diagnosing melanoma. METHODS: In this cross-sectional study, we used 150 dermoscopy images (50 melanomas, 50 nevi, 50 seborrheic keratoses) from the test dataset of a melanoma detection challenge, along with algorithm results from 23 teams. Eight dermatologists and 9 dermatology residents classified dermoscopic lesion images in an online reader study and provided their confidence level. RESULTS: The top-ranked computer algorithm had an area under the receiver operating characteristic curve of 0.87, which was higher than that of the dermatologists (0.74) and residents (0.66) (P < .001 for all comparisons). At the dermatologists' overall sensitivity in classification of 76.0%, the algorithm had a superior specificity (85.0% vs. 72.6%, P = .001). Imputation of computer algorithm classifications into dermatologist evaluations with low confidence ratings (26.6% of evaluations) increased dermatologist sensitivity from 76.0% to 80.8% and specificity from 72.6% to 72.8%. LIMITATIONS: Artificial study setting lacking the full spectrum of skin lesions as well as clinical metadata. CONCLUSION: Accumulating evidence suggests that deep neural networks can classify skin images of melanoma and its benign mimickers with high accuracy and potentially improve human performance.
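The imputation step reported in the results can be illustrated with a minimal Python sketch. It assumes each evaluation records a binary reader call, a low-confidence flag, and a binary algorithm call; the `Reading` class, the function names, and the six toy evaluations are hypothetical and are not the study's data or code.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    truth: bool           # True if the reference diagnosis is melanoma
    reader_call: bool     # reader's malignant (True) / benign (False) call
    confident: bool       # False marks a low-confidence rating
    algorithm_call: bool  # algorithm's call for the same image


def sensitivity_specificity(truths, calls):
    """Return (sensitivity, specificity) for paired binary labels and calls."""
    tp = sum(t and c for t, c in zip(truths, calls))
    fn = sum(t and not c for t, c in zip(truths, calls))
    tn = sum(not t and not c for t, c in zip(truths, calls))
    fp = sum(not t and c for t, c in zip(truths, calls))
    return tp / (tp + fn), tn / (tn + fp)


def impute_low_confidence(readings):
    """Keep confident reader calls; substitute the algorithm's call otherwise."""
    return [r.reader_call if r.confident else r.algorithm_call for r in readings]


# Hypothetical toy evaluations (assumption, not study data).
readings = [
    Reading(True, True, True, True),
    Reading(True, False, False, True),    # low-confidence miss, replaced
    Reading(True, False, True, True),     # confident miss, kept
    Reading(False, False, True, False),
    Reading(False, True, False, False),   # low-confidence false alarm, replaced
    Reading(False, True, True, False),    # confident false alarm, kept
]
truths = [r.truth for r in readings]
before = sensitivity_specificity(truths, [r.reader_call for r in readings])
after = sensitivity_specificity(truths, impute_low_confidence(readings))
print(before, after)
```

On this toy set, replacing the two low-confidence calls raises both measures from 1/3 to 2/3. The study's own figures come from its reader data, not from this sketch.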
Subject(s)
Deep Learning , Dermoscopy/methods , Computer-Assisted Image Interpretation/methods , Melanoma/diagnosis , Skin Neoplasms/diagnosis , Colombia , Cross-Sectional Studies , Dermatologists/statistics & numerical data , Dermoscopy/statistics & numerical data , Differential Diagnosis , Humans , International Cooperation , Internship and Residency/statistics & numerical data , Israel , Seborrheic Keratosis/diagnosis , Melanoma/pathology , Nevus/diagnosis , ROC Curve , Skin/diagnostic imaging , Skin/pathology , Skin Neoplasms/pathology , Spain , United States
ABSTRACT
BACKGROUND: Whether machine-learning algorithms can diagnose all pigmented skin lesions as accurately as human experts is unclear. The aim of this study was to compare the diagnostic accuracy of state-of-the-art machine-learning algorithms with human readers for all clinically relevant types of benign and malignant pigmented skin lesions. METHODS: For this open, web-based, international, diagnostic study, human readers were asked to diagnose dermatoscopic images selected randomly in 30-image batches from a test set of 1511 images. The diagnoses from human readers were compared with those of 139 algorithms created by 77 machine-learning labs, who participated in the International Skin Imaging Collaboration 2018 challenge and received a training set of 10 015 images in advance. The ground truth of each lesion fell into one of seven predefined disease categories: intraepithelial carcinoma including actinic keratoses and Bowen's disease; basal cell carcinoma; benign keratinocytic lesions including solar lentigo, seborrheic keratosis and lichen planus-like keratosis; dermatofibroma; melanoma; melanocytic nevus; and vascular lesions. The two main outcomes were the differences in the number of correct specific diagnoses per batch between all human readers and the top three algorithms, and between human experts and the top three algorithms. FINDINGS: Between Aug 4, 2018, and Sept 30, 2018, 511 human readers from 63 countries had at least one attempt in the reader study. 283 (55·4%) of 511 human readers were board-certified dermatologists, 118 (23·1%) were dermatology residents, and 83 (16·2%) were general practitioners. When comparing all human readers with all machine-learning algorithms, the algorithms achieved a mean of 2·01 (95% CI 1·97 to 2·04; p<0·0001) more correct diagnoses (17·91 [SD 3·42] vs 19·92 [4·27]).
27 human experts with more than 10 years of experience achieved a mean of 18·78 (SD 3·15) correct answers, compared with 25·43 (1·95) correct answers for the top three machine algorithms (mean difference 6·65, 95% CI 6·06-7·25; p<0·0001). The difference between human experts and the top three algorithms was significantly lower for images in the test set that were collected from sources not included in the training set (human underperformance of 11·4%, 95% CI 9·9-12·9 vs 3·6%, 0·8-6·3; p<0·0001). INTERPRETATION: State-of-the-art machine-learning classifiers outperformed human experts in the diagnosis of pigmented skin lesions and should have a more important role in clinical practice. However, a possible limitation of these algorithms is their decreased performance for out-of-distribution images, which should be addressed in future research. FUNDING: None.
Subject(s)
Algorithms , Dermoscopy , Internet , Machine Learning , Pigmentation Disorders/pathology , Skin Neoplasms/pathology , Adult , Female , Humans , Male , Reproducibility of Results , Retrospective Studies
ABSTRACT
BACKGROUND: Computer vision may aid in melanoma detection. OBJECTIVE: We sought to compare melanoma diagnostic accuracy of computer algorithms to dermatologists using dermoscopic images. METHODS: We conducted a cross-sectional study using 100 randomly selected dermoscopic images (50 melanomas, 44 nevi, and 6 lentigines) from an international computer vision melanoma challenge dataset (n = 379), along with individual algorithm results from 25 teams. We used 5 methods (nonlearned and machine learning) to combine individual automated predictions into "fusion" algorithms. In a companion study, 8 dermatologists classified the lesions in the 100 images as either benign or malignant. RESULTS: The average sensitivity and specificity of dermatologists in classification was 82% and 59%. At 82% sensitivity, dermatologist specificity was similar to the top challenge algorithm (59% vs. 62%, P = .68) but lower than the best-performing fusion algorithm (59% vs. 76%, P = .02). Receiver operating characteristic area of the top fusion algorithm was greater than the mean receiver operating characteristic area of dermatologists (0.86 vs. 0.71, P = .001). LIMITATIONS: The dataset lacked the full spectrum of skin lesions encountered in clinical practice, particularly banal lesions. Readers and algorithms were not provided clinical data (eg, age or lesion history/symptoms). Results obtained using our study design cannot be extrapolated to clinical practice. CONCLUSION: Deep learning computer vision systems classified melanoma dermoscopy images with accuracy that exceeded some but not all dermatologists.
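One simple non-learned way to combine per-image predictions, together with the matched-sensitivity comparison used in the results, might look like the following sketch. Score averaging is an assumed stand-in, since the abstract does not name the five fusion methods; all function names and toy scores are hypothetical.

```python
def average_fusion(score_lists):
    """Non-learned fusion: mean of each image's scores across algorithms."""
    n = len(score_lists)
    return [sum(column) / n for column in zip(*score_lists)]


def specificity_at_sensitivity(truths, scores, target_sensitivity):
    """Specificity at the highest threshold whose sensitivity meets the target."""
    positives = sum(truths)
    negatives = len(truths) - positives
    for threshold in sorted(set(scores), reverse=True):
        calls = [s >= threshold for s in scores]
        tp = sum(t and c for t, c in zip(truths, calls))
        if tp / positives >= target_sensitivity:
            tn = sum(not t and not c for t, c in zip(truths, calls))
            return tn / negatives
    return 0.0  # only reached if the target exceeds 1.0


# Hypothetical malignancy scores for four images (assumption, not study data).
truths = [True, True, False, False]
algo_a = [0.9, 0.4, 0.6, 0.1]
algo_b = [0.7, 0.8, 0.2, 0.3]
fused = average_fusion([algo_a, algo_b])

spec_a = specificity_at_sensitivity(truths, algo_a, 1.0)
spec_fused = specificity_at_sensitivity(truths, fused, 1.0)
print(spec_a, spec_fused)
```

In this toy case the averaged scores separate the classes better than `algo_a` alone at the same sensitivity. In the study the target sensitivity was the dermatologists' average of 82%.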
Subject(s)
Algorithms , Dermatologists , Dermoscopy , Lentigo/diagnostic imaging , Melanoma/diagnosis , Nevus/diagnostic imaging , Skin Neoplasms/diagnostic imaging , Congresses as Topic , Cross-Sectional Studies , Computer-Assisted Diagnosis , Humans , Machine Learning , Melanoma/pathology , ROC Curve , Skin Neoplasms/pathology
ABSTRACT
Advancements in dermatological artificial intelligence research require high-quality and comprehensive datasets that mirror real-world clinical scenarios. We introduce a collection of 18,946 dermoscopic images spanning from 2010 to 2016, collated at the Hospital Clínic in Barcelona, Spain. The BCN20000 dataset aims to address the problem of unconstrained classification of dermoscopic images of skin cancer, including lesions in hard-to-diagnose locations such as those found in nails and mucosa, large lesions which do not fit in the aperture of the dermoscopy device, and hypo-pigmented lesions. Our dataset covers eight key diagnostic categories in dermoscopy, providing a diverse range of lesions for artificial intelligence model training. Furthermore, a ninth out-of-distribution (OOD) class is also present on the test set, comprised of lesions which could not be distinctively classified as any of the others. By providing a comprehensive collection of varied images, BCN20000 helps bridge the gap between the training data for machine learning models and the day-to-day practice of medical practitioners. Additionally, we present a set of baseline classifiers based on state-of-the-art neural networks, which can be extended by other researchers for further experimentation.
Subject(s)
Dermoscopy , Skin Neoplasms , Humans , Skin Neoplasms/diagnostic imaging , Spain , Computer Neural Networks , Artificial Intelligence , Machine Learning
ABSTRACT
Dermoscopy aids in melanoma detection; however, agreement on dermoscopic features, including those of high clinical relevance, remains poor. In this study, we attempted to evaluate agreement among experts on exemplar images not only for the presence of melanocytic-specific features but also for spatial localization. This was a cross-sectional, multicenter, observational study. Dermoscopy images exhibiting at least 1 of 31 melanocytic-specific features were submitted by 25 world experts as exemplars. Using a web-based platform that allows for image markup of specific contrast-defined regions (superpixels), 20 expert readers annotated 248 dermoscopic images in collections of 62 images. Each collection was reviewed by five independent readers. A total of 4,507 feature observations were performed. Good-to-excellent agreement was found for 14 of 31 features (45.2%), with eight achieving excellent agreement (Gwet's AC >0.75) and seven of them being melanoma-specific features. These features were peppering/granularity (0.91), shiny white streaks (0.89), typical pigment network (0.83), blotch irregular (0.82), negative network (0.81), irregular globules (0.78), dotted vessels (0.77), and blue-whitish veil (0.76). By utilizing an exemplar dataset, a good-to-excellent agreement was found for 14 features that have previously been shown useful in discriminating nevi from melanoma. All images are public (www.isic-archive.com) and can be used for education, scientific communication, and machine learning experiments.
Subject(s)
Melanoma , Skin Neoplasms , Humans , Melanoma/diagnostic imaging , Skin Neoplasms/diagnostic imaging , Dermoscopy/methods , Cross-Sectional Studies , Melanocytes
ABSTRACT
Iron is essential to the virulence of Aspergillus species, and restricting iron availability is a critical mechanism of antimicrobial host defense. Macrophages recruited to the site of infection are at the crux of this process, employing multiple intersecting mechanisms to orchestrate iron sequestration from pathogens. To gain an integrated understanding of how this is achieved in aspergillosis, we generated a transcriptomic time series of the response of human monocyte-derived macrophages to Aspergillus and used this and the available literature to construct a mechanistic computational model of iron handling of macrophages during this infection. We found an overwhelming macrophage response beginning 2 to 4 h after exposure to the fungus, which included upregulated transcription of iron import proteins transferrin receptor-1, divalent metal transporter-1, and ZIP family transporters, and downregulated transcription of the iron exporter ferroportin. The computational model, based on a discrete dynamical systems framework, consisted of 21 3-state nodes, and was validated with additional experimental data that were not used in model generation. The model accurately captures the steady state and the trajectories of most of the quantitatively measured nodes. In the experimental data, we surprisingly found that transferrin receptor-1 upregulation preceded the induction of inflammatory cytokines, a feature that deviated from model predictions. Model simulations suggested that direct induction of transferrin receptor-1 (TfR1) after fungal recognition, independent of the iron regulatory protein-labile iron pool (IRP-LIP) system, explains this finding. We anticipate that this model will contribute to a quantitative understanding of iron regulation as a fundamental host defense mechanism during aspergillosis. IMPORTANCE Invasive pulmonary aspergillosis is a major cause of death among immunosuppressed individuals despite the best available therapy. 
Depriving the pathogen of iron is an essential component of host defense in this infection, but the mechanisms by which the host achieves this are complex. To understand how recruited macrophages mediate iron deprivation during the infection, we developed and validated a mechanistic computational model that integrates the available information in the field. The insights provided by this approach can help in designing iron modulation therapies as anti-fungal treatments.
Subject(s)
Aspergillosis , Iron , Aspergillus/genetics , Aspergillus/metabolism , Computer Simulation , Humans , Iron/metabolism , Macrophages/microbiology , Transferrin Receptors/genetics , Transferrin Receptors/metabolism
ABSTRACT
IMPORTANCE: The use of artificial intelligence (AI) is accelerating in all aspects of medicine and has the potential to transform clinical care and dermatology workflows. However, to develop image-based algorithms for dermatology applications, comprehensive criteria establishing development and performance evaluation standards are required to ensure product fairness, reliability, and safety. OBJECTIVE: To consolidate limited existing literature with expert opinion to guide developers and reviewers of dermatology AI. EVIDENCE REVIEW: In this consensus statement, the 19 members of the International Skin Imaging Collaboration AI working group volunteered to provide a consensus statement. A systematic PubMed search was performed of English-language articles published between December 1, 2008, and August 24, 2021, for "artificial intelligence" and "reporting guidelines," as well as other pertinent studies identified by the expert panel. Factors that were viewed as critical to AI development and performance evaluation were included and underwent 2 rounds of electronic discussion to achieve consensus. FINDINGS: A checklist of items was developed that outlines best practices of image-based AI development and assessment in dermatology. CONCLUSIONS AND RELEVANCE: Clinically effective AI needs to be fair, reliable, and safe; this checklist of best practices will help both developers and reviewers achieve this goal.
Subject(s)
Artificial Intelligence , Dermatology , Checklist , Consensus , Humans , Reproducibility of Results
ABSTRACT
BACKGROUND: Previous studies of artificial intelligence (AI) applied to dermatology have shown AI to have higher diagnostic classification accuracy than expert dermatologists; however, these studies did not adequately assess clinically realistic scenarios, such as how AI systems behave when presented with images of disease categories that are not included in the training dataset or images drawn from statistical distributions with significant shifts from training distributions. We aimed to simulate these real-world scenarios and evaluate the effects of image source institution, diagnoses outside of the training set, and other image artifacts on classification accuracy, with the goal of informing clinicians and regulatory agencies about safety and real-world accuracy. METHODS: We designed a large dermoscopic image classification challenge to quantify the performance of machine learning algorithms for the task of skin cancer classification from dermoscopic images, and how this performance is affected by shifts in statistical distributions of data, disease categories not represented in training datasets, and imaging or lesion artifacts. Factors that might be beneficial to performance, such as clinical metadata and external training data collected by challenge participants, were also evaluated. 25 331 training images collected from two datasets (in Vienna [HAM10000] and Barcelona [BCN20000]) between Jan 1, 2000, and Dec 31, 2018, across eight skin diseases, were provided to challenge participants to design appropriate algorithms. The trained algorithms were then tested for balanced accuracy against the HAM10000 and BCN20000 test datasets and data from countries not included in the training dataset (Turkey, New Zealand, Sweden, and Argentina). Test datasets contained images of all diagnostic categories available in training plus other diagnoses not included in training data (not trained category). 
We compared the performance of the algorithms against that of 18 dermatologists in a simulated setting that reflected intended clinical use. FINDINGS: 64 teams submitted 129 state-of-the-art algorithm predictions on a test set of 8238 images. The best performing algorithm achieved 58·8% balanced accuracy on the BCN20000 data, which was designed to better reflect realistic clinical scenarios, compared with 82·0% balanced accuracy on HAM10000, which was used in a previously published benchmark. Shifted statistical distributions and disease categories not included in training data contributed to decreases in accuracy. Image artifacts, including hair, pen markings, ulceration, and imaging source institution, decreased accuracy in a complex manner that varied based on the underlying diagnosis. When comparing algorithms to expert dermatologists (2460 ratings on 1269 images), algorithms performed better than experts in most categories, except for actinic keratoses (similar accuracy on average) and images from categories not included in training data (26% correct for experts vs 6% correct for algorithms, p<0·0001). For the top 25 submitted algorithms, 47·1% of the images from categories not included in training data were misclassified as malignant diagnoses, which would lead to a substantial number of unnecessary biopsies if current state-of-the-art AI technologies were clinically deployed. INTERPRETATION: We have identified specific deficiencies and safety issues in AI diagnostic systems for skin cancer that should be addressed in future diagnostic evaluation protocols to improve safety and reliability in clinical practice. FUNDING: Melanoma Research Alliance and La Marató de TV3.
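Balanced accuracy, the metric reported above, is conventionally the mean of per-class recall, so rare classes count as much as common ones. The sketch below uses that standard definition; the challenge's exact scoring code is not described in the abstract, and the function names and toy labels are hypothetical.

```python
from collections import Counter


def balanced_accuracy(truths, predictions):
    """Mean of per-class recall over the classes present in the ground truth."""
    totals = Counter(truths)
    correct = Counter(t for t, p in zip(truths, predictions) if t == p)
    return sum(correct[label] / totals[label] for label in totals) / len(totals)


def plain_accuracy(truths, predictions):
    """Fraction of all images classified correctly."""
    return sum(t == p for t, p in zip(truths, predictions)) / len(truths)


# Hypothetical imbalanced toy set: a classifier that always answers "nevus".
truths = ["nevus"] * 8 + ["melanoma"] * 2
predictions = ["nevus"] * 10

print(plain_accuracy(truths, predictions))     # 0.8
print(balanced_accuracy(truths, predictions))  # 0.5
```

The always-"nevus" classifier looks acceptable on plain accuracy but scores 0.5 on balanced accuracy, because it misses every melanoma.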
Subject(s)
Melanoma , Skin Neoplasms , Artificial Intelligence , Dermoscopy/methods , Humans , Melanoma/diagnostic imaging , Melanoma/pathology , Reproducibility of Results , Skin Neoplasms/diagnostic imaging , Skin Neoplasms/pathology
ABSTRACT
Prior skin image datasets have not addressed patient-level information obtained from multiple skin lesions from the same patient. Though artificial intelligence classification algorithms have achieved expert-level performance in controlled studies examining single images, in practice dermatologists base their judgment holistically from multiple lesions on the same patient. The 2020 SIIM-ISIC Melanoma Classification challenge dataset described herein was constructed to address this discrepancy between prior challenges and clinical practice, providing for each image in the dataset an identifier allowing lesions from the same patient to be mapped to one another. This patient-level contextual information is frequently used by clinicians to diagnose melanoma and is especially useful in ruling out false positives in patients with many atypical nevi. The dataset represents 2,056 patients (20.8% with at least one melanoma, 79.2% with zero melanomas) from three continents with an average of 16 lesions per patient, consisting of 33,126 dermoscopic images and 584 (1.8%) histopathologically confirmed melanomas compared with benign melanoma mimickers.
Subject(s)
Melanoma , Skin Neoplasms , Artificial Intelligence , Humans , Melanoma/diagnostic imaging , Melanoma/pathology , Melanoma/physiopathology , Metadata , Skin/pathology , Skin Neoplasms/diagnostic imaging , Skin Neoplasms/pathology , Skin Neoplasms/physiopathology
ABSTRACT
A challenge in multicenter trials that use quantitative positron emission tomography (PET) imaging is the often unknown variability in PET image values, typically measured as standardized uptake values, introduced by intersite differences in global and resolution-dependent biases. We present a method for the simultaneous monitoring of scanner calibration and reconstructed image resolution on a per-scan basis using a PET/computed tomography (CT) "pocket" phantom. We use simulation and phantom studies to optimize the design and construction of the PET/CT pocket phantom (120 × 30 × 30 mm). We then evaluate the performance of the PET/CT pocket phantom and accompanying software used alongside an anthropomorphic phantom when known variations in global bias (±20%, ±40%) and resolution (3-, 6-, and 12-mm postreconstruction filters) are introduced. The resulting prototype PET/CT pocket phantom design uses 3 long-lived sources (15-mm diameter) containing germanium-68 and a CT contrast agent in an epoxy matrix. Activity concentrations varied from 30 to 190 kBq/mL. The pocket phantom software can accurately estimate global bias and can detect changes in resolution in measured phantom images. The pocket phantom is small enough to be scanned with patients and can potentially be used on a per-scan basis for quality assurance for clinical trials and quantitative PET imaging in general. Further studies are being performed to evaluate its performance under variations in clinical conditions that occur in practice.
ABSTRACT
This paper presents a robust segmentation method based on multi-scale classification to identify the lesion boundary in dermoscopic images. The proposed method uses a collection of classifiers, trained at various resolutions, to categorize each pixel as "lesion" or "surrounding skin". In the detection phase, the trained classifiers are applied to new images. The classifier outputs are fused at the pixel level to build probability maps that serve as lesion saliency maps. Otsu thresholding then converts the saliency maps to binary masks, which determine the lesion borders. We compared the proposed method with existing lesion segmentation methods from the literature on two dermoscopy data sets (International Skin Imaging Collaboration and Pedro Hispano Hospital); the proposed method performed best, with a Dice coefficient of 0.91 and an accuracy of 94%.
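The last two steps, Otsu thresholding of a saliency map and scoring the resulting mask with the Dice coefficient, can be sketched in plain Python. This is a generic textbook implementation under stated assumptions (a flattened map with values in [0, 1], 256 histogram bins), not the authors' code; the function names and the eight toy pixels are hypothetical.

```python
def otsu_threshold(values, bins=256):
    """Threshold in [0, 1] that maximizes between-class variance (Otsu)."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0 = 0    # pixel count at or below the candidate bin
    sum0 = 0  # sum of bin indices at or below the candidate bin
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += i * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, i
    return (best_bin + 1) / bins


def dice_coefficient(mask_a, mask_b):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks."""
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    size = sum(mask_a) + sum(mask_b)
    return 2 * intersection / size if size else 1.0


# Hypothetical flattened saliency map and reference mask (assumption).
saliency = [0.05, 0.10, 0.15, 0.80, 0.85, 0.90, 0.20, 0.95]
reference = [False, False, False, True, True, True, True, True]

threshold = otsu_threshold(saliency)
mask = [v >= threshold for v in saliency]
print(threshold, dice_coefficient(mask, reference))
```

Here the threshold falls between the two clusters of saliency values, and the one lesion pixel with low saliency (0.20) is missed, giving a Dice coefficient of 8/9.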
Subject(s)
Computer-Assisted Image Processing/methods , Skin Neoplasms/diagnostic imaging , Skin/diagnostic imaging , Algorithms , Factual Databases , Dermoscopy/methods , Humans , Machine Learning , Nevus/diagnostic imaging , Nevus/pathology , Skin/pathology , Skin Neoplasms/pathology
ABSTRACT
To address the error introduced by computed tomography (CT) scanners when assessing volume and unidimensional measurement of solid tumors, we scanned a precision manufactured pocket phantom simultaneously with patients enrolled in a lung cancer clinical trial. Dedicated software quantified bias and random error in the [Formula: see text], and [Formula: see text] dimensions of a Teflon sphere and also quantified response evaluation criteria in solid tumors and volume measurements using both constant and adaptive thresholding. We found that underestimation bias was essentially the same for [Formula: see text], and [Formula: see text] dimensions using constant thresholding and had similar values for adaptive thresholding. The random error of these length measurements as measured by the standard deviation and coefficient of variation was 0.10 mm (0.65), 0.11 mm (0.71), and 0.59 mm (3.75) for constant thresholding and 0.08 mm (0.51), 0.09 mm (0.56), and 0.58 mm (3.68) for adaptive thresholding, respectively. For random error, however, [Formula: see text] lengths had at least a fivefold higher standard deviation and coefficient of variation than [Formula: see text] and [Formula: see text]. Observed [Formula: see text]-dimension error was especially high for some 8 and 16 slice CT models. Error in CT image formation, in particular, for models with low numbers of detector rows, may be large enough to be misinterpreted as representing either treatment response or disease progression.