ABSTRACT
The major vascular cause of dementia is cerebral small vessel disease (SVD). Its diagnosis relies on imaging hallmarks, such as white matter hyperintensities (WMH). WMH present a heterogeneous pathology, including myelin and axonal loss, yet these may be only the "tip of the iceberg." Imaging studies suggest that microstructural alterations are already present in normal-appearing white matter (NAWM), preceding its conversion to WMH. Unfortunately, a direct pathological characterization of these microstructural alterations affecting myelinated axonal fibers in WMH, and especially in NAWM, is still missing. Given that there are no treatments that significantly reduce WMH progression, it is important to extend our knowledge of the pathological processes that may already be occurring within NAWM. Staining of myelin with Luxol Fast Blue, while valuable, fails to capture subtle alterations in white matter microstructure. Therefore, we aimed to quantify in detail the myelin surrounding axonal fibers, as well as axonal and microstructural damage, by combining (immuno)histochemistry with polarized light imaging (PLI). To study the extent of (early) microstructural damage from periventricular NAWM to the center of WMH, we refined current analysis techniques by using deep learning to define smaller segments of white matter, capturing increasing fluid-attenuated inversion recovery signal. Integrating (immuno)histochemistry and PLI with post-mortem imaging of the brains of individuals with hypertension and normotensive controls enables voxel-wise assessment of the pathology throughout periventricular WMH and NAWM. Myelin loss, reduced axonal integrity, and white matter microstructural damage are not limited to WMH but already occur within NAWM. Notably, we found that axonal damage is greater in individuals with hypertension, particularly in NAWM. These findings highlight the added value of advanced segmentation techniques for visualizing subtle changes already occurring in NAWM preceding WMH.
By using quantitative MRI and advanced diffusion MRI, future studies may elucidate these very early mechanisms leading to neurodegeneration, which ultimately contribute to the conversion of NAWM to WMH.
ABSTRACT
Image analysis can play an important role in supporting histopathological diagnoses of lung cancer, with deep learning methods already achieving remarkable results. However, due to the large scale of whole-slide images (WSIs), obtaining manual pixel-wise annotations from expert pathologists is expensive and time-consuming. In addition, the heterogeneity of tumors and similarities in the morphological phenotype of tumor subtypes have caused inter-observer variability in annotations, which limits optimal performance. Effective use of weak labels could potentially alleviate these issues. In this paper, we propose a two-stage transformer-based weakly supervised learning framework called Simple Shuffle-Remix Vision Transformer (SSRViT). First, we introduce a Shuffle-Remix Vision Transformer (SRViT) to retrieve discriminative local tokens and extract effective representative features. Then, the token features are selected and aggregated to generate sparse representations of WSIs, which are fed into a simple transformer-based classifier (SViT) for slide-level prediction. Experimental results demonstrate that the performance of our proposed SSRViT is significantly improved compared with other state-of-the-art methods in discriminating between adenocarcinoma, pulmonary sclerosing pneumocytoma and normal lung tissue (accuracy of 96.9% and AUC of 99.6%).
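As a rough illustration of the shuffle idea behind SRViT, the toy function below permutes local tokens across a batch, position by position. The function name and the exact shuffle/remix scheme are hypothetical; this only sketches the general mechanism of mixing local context between slides, not the published architecture.

```python
import random

def shuffle_remix(batch_tokens, seed=0):
    """Permute local tokens across a batch of equal-length token sequences.

    At each sequence position, the tokens are shuffled across the batch,
    mixing local context between slides while preserving position.
    """
    rng = random.Random(seed)
    n_seqs, seq_len = len(batch_tokens), len(batch_tokens[0])
    remixed = [list(seq) for seq in batch_tokens]
    for pos in range(seq_len):
        order = list(range(n_seqs))
        rng.shuffle(order)
        for i, j in enumerate(order):
            remixed[i][pos] = batch_tokens[j][pos]
    return remixed
```

Note that no tokens are created or lost: each position in the output holds exactly the same multiset of tokens as the input, just reassigned across sequences.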
ABSTRACT
The biopsy Gleason score is an important prognostic marker for prostate cancer patients. It is, however, subject to substantial variability among pathologists. Artificial intelligence (AI)-based algorithms employing deep learning have shown their ability to match pathologists' performance in assigning Gleason scores, with the potential to enhance pathologists' grading accuracy. The performance of Gleason AI algorithms in research is mostly reported on common benchmark data sets or within public challenges. In contrast, many commercial algorithms are evaluated in clinical studies, for which data are not publicly released. As commercial AI vendors typically do not publish performance on public benchmarks, comparison between research and commercial AI is difficult. The aims of this study were to evaluate and compare the performance of top-ranked public and commercial algorithms using real-world data. We curated a diverse data set of whole-slide prostate biopsy images through crowdsourcing containing images with a range of Gleason scores and from diverse sources. Predictions were obtained from 5 top-ranked public algorithms from the Prostate cANcer graDe Assessment (PANDA) challenge and 2 commercial Gleason grading algorithms. Additionally, 10 pathologists (A.C., C.R., J.v.I., K.R.M.L., P.R., P.G.S., R.G., S.F.K.J., T.v.d.K., X.F.) evaluated the data set in a reader study. Overall, the pairwise quadratic weighted kappa among pathologists ranged from 0.777 to 0.916. Both public and commercial algorithms showed high agreement with pathologists, with quadratic weighted kappa ranging from 0.617 to 0.900. Commercial algorithms performed on par with or outperformed top public algorithms.
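The agreement statistic used throughout this study, the quadratic weighted kappa, can be computed from two raters' ordinal scores as follows. This is a plain-Python sketch of the standard formula (observed vs. chance-expected disagreement with squared-distance weights), not code from the study.

```python
def quadratic_weighted_kappa(r1, r2, n_classes):
    """Quadratic weighted kappa between two raters' ordinal scores."""
    # Observed confusion matrix
    O = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(r1, r2):
        O[a][b] += 1
    n = len(r1)
    # Expected matrix from the raters' marginal distributions
    row = [sum(O[i]) for i in range(n_classes)]
    col = [sum(O[i][j] for i in range(n_classes)) for j in range(n_classes)]
    E = [[row[i] * col[j] / n for j in range(n_classes)] for i in range(n_classes)]
    # Quadratic disagreement weights: larger penalty for distant grades
    w = [[(i - j) ** 2 / (n_classes - 1) ** 2 for j in range(n_classes)]
         for i in range(n_classes)]
    num = sum(w[i][j] * O[i][j] for i in range(n_classes) for j in range(n_classes))
    den = sum(w[i][j] * E[i][j] for i in range(n_classes) for j in range(n_classes))
    return 1.0 - num / den
```

Perfect agreement yields 1.0; chance-level agreement yields 0. The quadratic weights make this kappa well suited to ordinal grading scales like Gleason grade groups, where near-misses should be penalized less than distant errors.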
ABSTRACT
PURPOSE: This study aims to introduce an innovative multi-step pipeline for automatic tumor-stroma ratio (TSR) quantification as a potential prognostic marker for pancreatic cancer, addressing the limitations of existing staging systems and the lack of commonly used prognostic biomarkers. METHODS: The proposed approach involves a deep-learning-based method for the automatic segmentation of tumor epithelial cells, tumor bulk, and stroma from whole-slide images (WSIs). Models were trained using five-fold cross-validation and evaluated on an independent external test set. TSR was computed based on the segmented components. Additionally, TSR's predictive value for six-month survival on the independent external dataset was assessed. RESULTS: Median Dice scores (interquartile range, IQR) were 0.751 (0.15) and 0.726 (0.25) for tumor epithelium segmentation on the internal and external test sets, respectively, and 0.76 (0.11) and 0.863 (0.17) for tumor bulk segmentation on the internal and external test sets, respectively. TSR was evaluated as an independent prognostic marker, demonstrating a cross-validation AUC of 0.61±0.12 for predicting six-month survival on the external dataset. CONCLUSION: Our pipeline for automatic TSR quantification offers promising potential as a prognostic marker for pancreatic cancer. The results underscore the feasibility of computational biomarker discovery in enhancing patient outcome prediction, thus contributing to personalized patient management.
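Given segmentation masks like those produced by the pipeline, the TSR itself reduces to an area ratio. The sketch below assumes one common definition (the stromal fraction of the tumor bulk, with stroma approximated as bulk tissue that is not tumor epithelium); the study's exact formula may differ.

```python
def tumor_stroma_ratio(bulk_mask, epithelium_mask):
    """Tumor-stroma ratio within the tumor bulk.

    Masks are same-shape 2D lists of 0/1 values. Stroma is approximated
    as bulk tissue that is not tumor epithelium; TSR is then the stromal
    fraction of the bulk area.
    """
    bulk = sum(v for row in bulk_mask for v in row)
    epi = sum(b and e
              for b_row, e_row in zip(bulk_mask, epithelium_mask)
              for b, e in zip(b_row, e_row))
    if bulk == 0:
        raise ValueError("empty tumor bulk mask")
    return (bulk - epi) / bulk
```

In practice the masks would come from the segmentation model's output at a fixed magnification, so pixel counts are directly proportional to tissue area.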
Subject(s)
Biomarkers, Tumor; Pancreatic Neoplasms; Humans; Pancreatic Neoplasms/pathology; Pancreatic Neoplasms/diagnosis; Pancreatic Neoplasms/mortality; Prognosis; Female; Stromal Cells/pathology; Male; Deep Learning; Aged; Middle Aged; Image Processing, Computer-Assisted/methods
ABSTRACT
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint-a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.
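A minimal sketch of the fingerprint idea follows, with an invented task-to-metric table and a single pitfall rule; the actual Metrics Reloaded decision logic is far more extensive and is driven by many more fingerprint properties.

```python
# Hypothetical candidate metrics per task category; illustrative only.
CANDIDATE_METRICS = {
    "image_classification": ["balanced_accuracy", "AUROC", "MCC"],
    "semantic_segmentation": ["Dice", "NSD"],
    "object_detection": ["AP", "FROC_score"],
    "instance_segmentation": ["panoptic_quality", "Dice"],
}

def select_metrics(fingerprint):
    """Pick candidate metrics from a minimal problem-fingerprint dict.

    The fingerprint captures properties relevant to metric selection;
    here only the task category and one dataset property are modeled.
    """
    metrics = list(CANDIDATE_METRICS[fingerprint["task"]])
    # Example pitfall rule: AUROC can be misleading under heavy class
    # imbalance, so swap in a precision-recall-based alternative.
    if fingerprint.get("high_class_imbalance") and "AUROC" in metrics:
        metrics[metrics.index("AUROC")] = "AUPRC"
    return metrics
```

The point of the fingerprint is that metric choice becomes a deterministic function of explicitly stated problem properties, rather than habit or convention.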
Subject(s)
Algorithms; Image Processing, Computer-Assisted; Machine Learning; Semantics
ABSTRACT
Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.
Subject(s)
Artificial Intelligence
ABSTRACT
White matter hyperintensities (WMH) are the most prevalent markers of cerebral small vessel disease (SVD), which is the major vascular risk factor for dementia. Microvascular pathology and neuroinflammation are suggested to drive the transition from normal-appearing white matter (NAWM) to WMH, particularly in individuals with hypertension. However, current imaging techniques cannot capture ongoing NAWM changes. The transition from NAWM into WMH is a continuous process, yet white matter lesions are often examined dichotomously, which may obscure their underlying heterogeneity. Therefore, we examined microvascular and neurovascular inflammation pathology in NAWM and severe WMH three-dimensionally, along with gradual magnetic resonance imaging (MRI) fluid-attenuated inversion recovery (FLAIR) signal (sub-)segmentation. In WMH, the vascular network exhibited reduced length and complexity compared to NAWM. Neuroinflammation was more severe in WMH. Vascular inflammation was more pronounced in NAWM, suggesting its potential significance in converting NAWM into WMH. Moreover, the (sub-)segmentation of FLAIR signal displayed varying degrees of vascular pathology, particularly within WMH regions. These findings highlight the intricate interplay between microvascular pathology and neuroinflammation in the transition from NAWM to WMH. Further examination of neurovascular inflammation across MRI-visible alterations could help deepen our understanding of WMH conversion and thereby improve the prognosis of SVD.
Subject(s)
White Matter; Humans; White Matter/pathology; Neuroinflammatory Diseases; Magnetic Resonance Imaging/methods; Inflammation/diagnostic imaging; Inflammation/pathology; Risk Factors
ABSTRACT
The growing number of basal cell carcinoma (BCC) cases is putting an increasing strain on dermatopathologists. BCC is the most common type of skin cancer, and its incidence is increasing rapidly worldwide. AI can play a significant role in reducing the time and effort required for BCC diagnostics and thus improve the overall efficiency of the process. Training such an AI system in a fully supervised fashion, however, would require a large amount of pixel-level annotation by already strained dermatopathologists. Therefore, in this study, our primary objective was to develop a weakly supervised model for the identification of BCC and its stratification into low-risk and high-risk categories within histopathology whole-slide images (WSI). We compared Clustering-constrained Attention Multiple instance learning (CLAM) with StreamingCLAM and hypothesized that the latter would be the superior approach. A total of 5147 images were used to train and validate the models, which were subsequently tested on an internal set of 949 images and an external set of 183 images. The labels for training were automatically extracted from free-text pathology reports using a rule-based approach. All data have been made available through the COBRA dataset. The results showed that both the CLAM and StreamingCLAM models achieved high performance for the detection of BCC, with an area under the ROC curve (AUC) of 0.994 and 0.997, respectively, on the internal test set and 0.983 and 0.993 on the external dataset. Furthermore, the models performed well on risk stratification, with AUC values of 0.912 and 0.931, respectively, on the internal set, and 0.851 and 0.883 on the external set. On every metric, the StreamingCLAM model matched or outperformed the CLAM model. The performance of both models was comparable to that of two pathologists who scored 240 BCC-positive slides.
Additionally, on the public test set, StreamingCLAM demonstrated an AUC of 0.958, markedly superior to CLAM's 0.803. This difference was statistically significant and underscores the strength and better adaptability of the StreamingCLAM approach.
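The attention-gated pooling at the heart of CLAM-style multiple instance learning can be sketched as a softmax over per-patch scores followed by a weighted mean of patch features. The snippet below is a generic illustration of that pooling step, not the CLAM implementation; in the real model the attention scores themselves are produced by a learned network.

```python
import math

def attention_pool(patch_features, attn_scores):
    """Attention-gated MIL pooling.

    Softmax the per-patch attention scores and return the
    attention-weighted mean feature vector plus the weights, which
    also serve as a per-patch interpretability map.
    """
    m = max(attn_scores)  # stabilize the softmax
    exp = [math.exp(s - m) for s in attn_scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(patch_features[0])
    slide_feature = [sum(w * f[d] for w, f in zip(weights, patch_features))
                     for d in range(dim)]
    return slide_feature, weights
```

The returned weights are what make such models partially interpretable: high-attention patches indicate the tissue regions driving the slide-level prediction.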
Subject(s)
Carcinoma, Basal Cell; Skin Neoplasms; Humans; Carcinoma, Basal Cell/diagnostic imaging; Area Under Curve; Skin Neoplasms/diagnostic imaging; Supervised Machine Learning
ABSTRACT
The ability to detect anomalies, i.e. anything not seen during training or out-of-distribution (OOD), in medical imaging applications is essential for successfully deploying machine learning systems. Filtering out OOD data using unsupervised learning is especially promising because it does not require costly annotations. A new class of models called AnoDDPMs, based on denoising diffusion probabilistic models (DDPMs), has recently achieved significant progress in unsupervised OOD detection. This work provides a benchmark for unsupervised OOD detection methods in digital pathology. By leveraging fast sampling techniques, we apply AnoDDPM on a large enough scale for whole-slide image analysis on the complete test set of the Camelyon16 challenge. Based on ROC analysis, we show that AnoDDPMs can detect OOD data with an AUC of up to 94.13 and 86.93 on two patch-level OOD detection tasks, outperforming the other unsupervised methods. We observe that AnoDDPMs alter the semantic properties of inputs, replacing anomalous data with more benign-looking tissue. Furthermore, we highlight the flexibility of AnoDDPM towards different information bottlenecks by evaluating reconstruction errors for inputs with different signal-to-noise ratios. While there is still a significant performance gap with fully supervised learning, AnoDDPMs show considerable promise in the field of OOD detection in digital pathology.
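Reconstruction-based OOD scoring, as used by AnoDDPM-style methods, boils down to comparing an input with its model reconstruction: a model trained only on in-distribution tissue reconstructs normal patches well, so anomalous patches accumulate large errors. A minimal per-patch score, assuming images as 2D lists of intensities, could look like:

```python
def ood_score(patch, reconstruction):
    """Per-patch OOD score as mean squared reconstruction error.

    A generative model trained on in-distribution tissue should
    reconstruct normal patches faithfully, so anomalous (OOD) patches
    receive higher scores.
    """
    flat_x = [v for row in patch for v in row]
    flat_r = [v for row in reconstruction for v in row]
    return sum((x - r) ** 2 for x, r in zip(flat_x, flat_r)) / len(flat_x)
```

Scores from all patches of a slide can then be thresholded (or ranked via ROC analysis, as in the benchmark) to flag OOD regions.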
Subject(s)
Benchmarking; Image Processing, Computer-Assisted; Humans; Diffusion; Machine Learning; Models, Statistical
ABSTRACT
In histopathology practice, scanners, tissue processing, staining, and image acquisition protocols vary from center to center, resulting in subtle variations in images. Vanilla convolutional neural networks are sensitive to such domain shifts. Data augmentation is a popular way to improve domain generalization. Currently, state-of-the-art domain generalization in computational pathology is achieved using a manually curated set of augmentation transforms. However, manual tuning of augmentation parameters is time-consuming and can lead to sub-optimal generalization performance. Meta-learning frameworks can provide efficient ways to find optimal training hyper-parameters, including data augmentation. In this study, we hypothesize that an automated search of augmentation hyper-parameters can provide superior generalization performance and reduce experimental optimization time. We select four state-of-the-art automatic augmentation methods from general computer vision and investigate their capacity to improve domain generalization in histopathology. We analyze their performance on data from 25 centers across two different tasks: tumor metastasis detection in lymph nodes and breast cancer tissue type classification. On tumor metastasis detection, most automatic augmentation methods achieve comparable performance to state-of-the-art manual augmentation. On breast cancer tissue type classification, the leading automatic augmentation method significantly outperforms state-of-the-art manual data augmentation.
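As a baseline for automatic augmentation search, even a plain random search over an augmentation space removes the manual tuning step. The methods compared in the study are more sophisticated, but they share this interface: a configuration space plus a validation-score objective. The function below is an illustrative sketch, not one of the evaluated methods.

```python
import random

def search_augmentation_params(evaluate, space, n_trials=20, seed=0):
    """Random search over augmentation hyper-parameters.

    evaluate: callable mapping a config dict to a validation score
    (higher is better). space: dict of parameter name -> list of
    candidate values. Returns the best config and its score.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In a real pipeline, `evaluate` would train (or fine-tune) a model with the sampled augmentation settings and report validation performance on held-out centers, which is exactly the costly step that smarter search strategies try to amortize.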
Subject(s)
Breast Neoplasms; Deep Learning; Humans; Female; Image Processing, Computer-Assisted/methods; Neural Networks, Computer; Breast
ABSTRACT
Whole-mount sectioning is a technique in histopathology where a full slice of tissue, such as a transversal cross-section of a prostate specimen, is prepared on a large microscope slide without further sectioning into smaller fragments. Although this technique can offer improved correlation with pre-operative imaging and is paramount for multimodal research, it is not commonly employed due to its technical difficulty, associated cost and cumbersome integration in (digital) pathology workflows. In this work, we present a computational tool named PythoStitcher which reconstructs artificial whole-mount sections from digitized tissue fragments, thereby bringing the benefits of whole-mount sections to pathology labs currently unable to employ this technique. Our proposed algorithm consists of a multi-step approach where it (i) automatically determines how fragments need to be reassembled, (ii) iteratively optimizes the stitch using a genetic algorithm and (iii) efficiently reconstructs the final artificial whole-mount section at full resolution (0.25 µm/pixel). PythoStitcher was validated on a total of 198 cases spanning five datasets with a varying number of tissue fragments originating from different organs from multiple centers. PythoStitcher successfully reconstructed the whole-mount section in 86-100% of cases for a given dataset with a residual registration mismatch of 0.65-2.76 mm on automatically selected landmarks. It is expected that our algorithm can aid pathology labs unable to employ whole-mount sectioning through faster clinical case evaluation and improved radiology-pathology correlation workflows.
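Step (ii), the genetic optimization of the stitch, can be sketched as a generic select-and-mutate loop over candidate fragment placements: candidates are scored by a mismatch cost, the best half survives each generation, and survivors spawn mutated copies. This is a simplified illustration of the idea, not PythoStitcher's actual operators.

```python
import random

def genetic_search(cost, init, mutate, pop_size=30, generations=50, seed=0):
    """Minimal genetic-algorithm loop for stitch-style optimization.

    cost: maps a candidate (e.g. a tuple of fragment offsets/rotations)
    to a mismatch value (lower is better). mutate(candidate, rng)
    returns a perturbed copy. Keeps the best half of the population
    each generation and refills it with mutated survivors.
    """
    rng = random.Random(seed)
    pop = [mutate(init, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=cost)
```

In the stitching setting, `cost` would measure intensity or landmark mismatch along the fragment boundaries at a coarse resolution, with the final reconstruction rendered at full resolution only once.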
Subject(s)
Algorithms; Diagnostic Imaging; Image Processing, Computer-Assisted; Humans
ABSTRACT
The expanding digitalization of routine diagnostic histological slides holds potential for applying artificial intelligence (AI) to pathology, including bone marrow (BM) histology. In this perspective, we describe potential tasks in diagnostics that can be supported, investigations that can be guided, and questions that can be answered by the future application of AI on whole-slide images of BM biopsies. These range from characterization of cell lineages and quantification of cells and stromal structures to disease prediction. First glimpses show an exciting potential to detect subtle phenotypic changes with AI that are due to specific genotypes. The discussion is illustrated by examples of current AI research using BM biopsy slides. In addition, we briefly discuss current challenges for implementation of AI-supported diagnostics.
Subject(s)
Artificial Intelligence; Bone Marrow; Humans; Biopsy; Cell Lineage; Genotype
ABSTRACT
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential for transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
ABSTRACT
Many inherently ambiguous tasks in medical imaging suffer from inter-observer variability, resulting in a reference standard defined by a distribution of labels with high variance. Training only on a consensus or majority vote label, as is common in medical imaging, discards valuable information on uncertainty amongst a panel of experts. In this work, we propose to train on the full label distribution to predict the uncertainty within a panel of experts and the most likely ground-truth label. To do so, we propose a new stochastic classification framework based on the conditional variational auto-encoder, which we refer to as the Latent Doctor Model (LDM). In an extensive comparative analysis, we compare the LDM with a model trained on the majority vote label and other methods capable of learning a distribution of labels. We show that the LDM is able to reproduce the reference-standard distribution significantly better than the majority vote baseline. Compared to the other baseline methods, we demonstrate that the LDM performs best at modeling the label distribution and its corresponding uncertainty in two prostate tumor grading tasks. Furthermore, we show competitive performance of the LDM with the more computationally demanding deep ensembles on a tumor budding classification task.
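Training on the full label distribution starts from the panel's empirical distribution rather than its majority vote. Below is a minimal sketch of the target construction and a matching cross-entropy loss; the LDM itself is a conditional variational auto-encoder, which this sketch does not attempt to reproduce.

```python
import math

def label_distribution(votes, n_classes):
    """Empirical label distribution over classes from a panel of
    expert votes, instead of collapsing the panel to a single
    majority-vote label."""
    counts = [0] * n_classes
    for v in votes:
        counts[v] += 1
    return [c / len(votes) for c in counts]

def cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy of a predicted class distribution against the
    full panel distribution -- the kind of soft target that preserves
    inter-observer uncertainty."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))
```

Cross-entropy against the soft target is minimized when the model's predicted distribution matches the panel distribution exactly, so the model learns to reproduce expert disagreement instead of discarding it.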
ABSTRACT
Current hardware limitations make it impossible to train convolutional neural networks on gigapixel image inputs directly. Recent developments in weakly supervised learning, such as attention-gated multiple instance learning, have shown promising results, but often use multi-stage or patch-wise training strategies risking suboptimal feature extraction, which can negatively impact performance. In this paper, we propose to train a ResNet-34 encoder with an attention-gated classification head in an end-to-end fashion, which we call StreamingCLAM, using a streaming implementation of convolutional layers. This allows us to train end-to-end on 4-gigapixel microscopic images using only slide-level labels. We achieve a mean area under the receiver operating characteristic curve of 0.9757 for metastatic breast cancer detection (CAMELYON16), close to fully supervised approaches using pixel-level annotations. Our model can also detect MYC-gene translocation in histologic slides of diffuse large B-cell lymphoma, achieving a mean area under the ROC curve of 0.8259. Furthermore, we show that our model offers a degree of interpretability through the attention mechanism.
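The streaming idea can be illustrated with a statistic far simpler than a convolution: a global quantity computed tile by tile that matches the whole-image result exactly, while only one tile is ever "in memory". The actual StreamingCLAM implementation streams CNN activations and gradients to achieve exact end-to-end training, which is considerably more involved than this sketch.

```python
def global_mean_streaming(image, tile_size):
    """Global mean intensity of a large image, computed tile by tile.

    Only one tile's worth of pixels is touched at a time; the running
    sums make the result identical to processing the full image at
    once, which is the key property streaming training relies on.
    """
    total, count = 0.0, 0
    h, w = len(image), len(image[0])
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = [row[x:x + tile_size] for row in image[y:y + tile_size]]
            total += sum(sum(r) for r in tile)
            count += sum(len(r) for r in tile)
    return total / count
```

For separable or locally computable operations like convolutions, the same exactness can be preserved with appropriate tile overlap, which is what makes gigapixel end-to-end training feasible.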
Subject(s)
Breast Neoplasms; Neural Networks, Computer; Humans; Female; Breast Neoplasms/diagnostic imaging; Breast Neoplasms/pathology; ROC Curve
ABSTRACT
INTRODUCTION: Prostate-specific membrane antigen (PSMA)-directed radioligand therapy (RLT) is a novel therapy for patients with metastatic castration-resistant prostate cancer (mCRPC). However, it is still poorly understood why approximately 40% of patients do not respond to PSMA-RLT. The aims of this study were to evaluate pretreatment PSMA expression on immunohistochemistry (IHC) and PSMA uptake on PET/CT imaging in mCRPC patients who underwent PSMA-RLT. We correlated these parameters and a cell proliferation marker (Ki67) with the therapeutic efficacy of PSMA-RLT. PATIENTS AND METHODS: In this retrospective study, mCRPC patients who underwent PSMA-RLT were analyzed. Patients' biopsies were scored for immunohistochemical Ki67 expression, PSMA staining intensity, and the percentage of cells with PSMA expression. Moreover, PSMA tracer uptake of the tumor lesion(s) and healthy organs on PET/CT imaging was assessed. The primary outcome was the association between histological PSMA protein expression of the tumor in pre-PSMA-RLT biopsies and PSMA uptake of the biopsied lesion on PSMA PET/CT imaging. Secondary outcomes were the relationships of PSMA and Ki67 expression on IHC with progression-free survival (PFS) and overall survival (OS) following PSMA-RLT. RESULTS: In total, 22 mCRPC patients were included in this study. Nineteen (86%) patients showed high and homogeneous PSMA expression (>80%) on IHC, whereas 3 (14%) patients had low PSMA expression on IHC. These 3 patients also showed lower PSMA uptake on PET/CT imaging than the patients with high PSMA expression on IHC. Nevertheless, no correlation was found between PSMA uptake on PET/CT imaging and PSMA expression on IHC (SUVmax: R2 = 0.046; SUVavg: R2 = 0.036). The 3 patients with low PSMA expression had a shorter PFS compared to the patients with high PSMA expression on IHC (HR: 4.76, 95% CI: 1.14-19.99; P = .033).
Patients with low Ki67 expression had a longer PFS and OS compared to patients with high Ki67 expression (HR: 0.40, 95% CI: 0.15-1.06; P = .013). CONCLUSION: PSMA uptake on PSMA PET/CT generally followed PSMA expression on IHC; however, heterogeneity may be missed on PSMA PET/CT. Immunohistochemical PSMA and Ki67 expression in fresh tumor biopsies may help predict the treatment efficacy of PSMA-RLT in mCRPC patients. This needs to be further explored in prospective cohorts.
Subject(s)
Positron Emission Tomography Computed Tomography; Prostatic Neoplasms, Castration-Resistant; Male; Humans; Ki-67 Antigen; Positron Emission Tomography Computed Tomography/methods; Prostatic Neoplasms, Castration-Resistant/diagnostic imaging; Prostatic Neoplasms, Castration-Resistant/radiotherapy; Prostatic Neoplasms, Castration-Resistant/metabolism; Retrospective Studies; Prospective Studies; Prostate-Specific Antigen; Dipeptides/therapeutic use; Treatment Outcome; Biopsy
ABSTRACT
Recently, large, high-quality public datasets have led to the development of convolutional neural networks that can detect lymph node metastases of breast cancer at the level of expert pathologists. Many cancers, regardless of the site of origin, can metastasize to lymph nodes. However, collecting and annotating high-volume, high-quality datasets for every cancer type is challenging. In this paper, we investigate how to leverage existing high-quality datasets most efficiently in multi-task settings for closely related tasks. Specifically, we explore different training and domain adaptation strategies, including prevention of catastrophic forgetting, for breast, colon and head-and-neck cancer metastasis detection in lymph nodes. Our results show state-of-the-art performance on colon and head-and-neck cancer metastasis detection tasks. We show the effectiveness of adapting networks from one cancer type to another to obtain multi-task metastasis detection networks. Furthermore, we show that leveraging existing high-quality datasets can significantly boost performance on new target tasks and that catastrophic forgetting can be effectively mitigated. Last, we compare different mitigation strategies.
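One well-known way to prevent catastrophic forgetting is an elastic-weight-consolidation (EWC)-style penalty that anchors parameters important to the old task to their previous values. Whether or not this particular strategy was among those compared in the study, the penalty below illustrates the general mechanism.

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """EWC-style regularizer against catastrophic forgetting.

    Parameters with a high (diagonal) Fisher information value were
    important for the old task, so deviating from their old values is
    penalized quadratically; lam trades off stability vs. plasticity.
    All arguments are flat lists of per-parameter scalars.
    """
    return lam / 2 * sum(f * (p - p0) ** 2
                         for f, p, p0 in zip(fisher, params, old_params))
```

During adaptation to a new cancer type, this term would simply be added to the new task's loss, letting unimportant weights move freely while protecting the ones the source task depends on.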
Subject(s)
Breast Neoplasms; Head and Neck Neoplasms; Humans; Female; Lymphatic Metastasis/pathology; Neural Networks, Computer; Lymph Nodes/pathology; Breast Neoplasms/pathology
ABSTRACT
Machine learning model deployment in clinical practice demands real-time risk assessment to identify situations in which the model is uncertain. Once deployed, models should be accurate for classes seen during training while providing informative estimates of uncertainty to flag abnormalities and unseen classes for further analysis. Although recent developments in uncertainty estimation have resulted in an increasing number of methods, a rigorous empirical evaluation of their performance on large-scale digital pathology datasets is lacking. This work provides a benchmark for evaluating prevalent methods on multiple datasets by comparing the uncertainty estimates on both in-distribution and realistic near and far out-of-distribution (OOD) data on a whole-slide level. To this end, we aggregate uncertainty values from patch-based classifiers to whole-slide level uncertainty scores. We show that results found in classical computer vision benchmarks do not always translate to the medical imaging setting. Specifically, we demonstrate that deep ensembles perform best at detecting far-OOD data but can be outperformed on a more challenging near-OOD detection task by multi-head ensembles trained for optimal ensemble diversity. Furthermore, we demonstrate the harmful impact OOD data can have on the performance of deployed machine learning models. Overall, we show that uncertainty estimates can be used to discriminate in-distribution from OOD data with high AUC scores. Still, model deployment might require careful tuning based on prior knowledge of prospective OOD data.
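Aggregating patch-level uncertainties to a slide-level score can be as simple as averaging the top-k most uncertain patches, so that a few highly uncertain regions dominate the slide's score. This is one plausible aggregation shown for illustration, not necessarily the scheme evaluated in the study.

```python
def slide_uncertainty(patch_uncertainties, top_k=10):
    """Slide-level uncertainty as the mean of the top-k most
    uncertain patches. Averaging all patches would dilute small
    anomalous regions; taking only the top-k keeps them visible."""
    top = sorted(patch_uncertainties, reverse=True)[:top_k]
    return sum(top) / len(top)
```

Slide scores computed this way can then be fed into ROC analysis to measure how well uncertainty separates in-distribution from OOD slides.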
Subject(s)
Machine Learning; Pathology; Humans; Prospective Studies
ABSTRACT
Poor generalizability is a major barrier to clinical implementation of artificial intelligence in digital pathology. The aim of this study was to test the generalizability of a pretrained deep learning model to a new diagnostic setting and to a small change in surgical indication. A deep learning model for breast cancer metastases detection in sentinel lymph nodes, trained on CAMELYON multicenter data, was used as a base model, and achieved an AUC of 0.969 (95% CI 0.926-0.998) and FROC of 0.838 (95% CI 0.757-0.913) on CAMELYON16 test data. On local sentinel node data, the base model performance dropped to AUC 0.929 (95% CI 0.800-0.998) and FROC 0.744 (95% CI 0.566-0.912). On data with a change in surgical indication (axillary dissections) the base model performance indicated an even larger drop with a FROC of 0.503 (95% CI 0.201-0.911). The model was retrained with addition of local data, resulting in about a 4% increase for both AUC and FROC for sentinel nodes, and an increase of 11% in AUC and 49% in FROC for axillary nodes. Pathologist qualitative evaluation of the retrained model's output showed no missed positive slides. False positives, false negatives and one previously undetected micro-metastasis were observed. The study highlights the generalization challenge even when using a multicenter trained model, and that a small change in indication can considerably impact the model's performance.
ABSTRACT
In the present work, we present a publicly available, expert-segmented representative dataset of 158 3.0 Tesla biparametric MRIs [1]. An increasing number of studies investigate prostate and prostate carcinoma segmentation using deep learning (DL) with 3D architectures [2], [3], [4], [5], [6], [7]. The development of robust and data-driven DL models for prostate segmentation and assessment is currently limited by the availability of openly accessible expert-annotated datasets [8], [9], [10]. The dataset contains 3.0 Tesla MRI images of the prostate of patients with suspected prostate cancer. Patients over 50 years of age who had a 3.0 Tesla MRI scan of the prostate that met PI-RADS version 2.1 technical standards were included. All patients subsequently underwent biopsy or surgery, so that the MRI diagnosis could be verified against the histopathologic diagnosis. For patients who had undergone multiple MRIs, the most recent MRI acquired less than six months before biopsy/surgery was included. All patients were examined at a German university hospital (Charité Universitätsmedizin Berlin) between 02/2016 and 01/2020. All MRIs were acquired with two 3.0 Tesla MRI scanners (Siemens VIDA and Skyra, Siemens Healthineers, Erlangen, Germany). Axial T2-weighted (T2W) sequences and axial diffusion-weighted sequences (DWI) with apparent diffusion coefficient (ADC) maps were included in the dataset. T2W sequences and ADC maps were annotated by two board-certified radiologists with 6 and 8 years of experience, respectively. For T2W sequences, the central gland (central zone and transitional zone) and peripheral zone were segmented. If areas of suspected prostate cancer (PI-RADS score ≥ 4) were identified on examination, they were segmented in both the T2W sequences and the ADC maps. Because restricted diffusion is best seen in DWI images with high b-values, only these images were selected and all images with low b-values were discarded.
Data were then anonymized and converted to NIfTI (Neuroimaging Informatics Technology Initiative) format.