ABSTRACT
Numerous prognostic factors are currently assessed histologically and immunohistochemically in canine mast cell tumors (MCTs) to evaluate clinical behavior. In addition, polymerase chain reaction (PCR) is often performed to detect internal tandem duplication (ITD) mutations in exon 11 of the c-KIT gene (c-KIT-11-ITD) to predict the therapeutic response to tyrosine kinase inhibitors. This project aimed to train deep learning models (DLMs) to identify MCTs with c-KIT-11-ITD solely based on morphology. Hematoxylin and eosin (HE)-stained slides of 368 cutaneous, subcutaneous, and mucocutaneous MCTs (195 with ITD and 173 without) were stained consecutively in 2 different laboratories and scanned with 3 different slide scanners. This resulted in 6 data sets (stain-scanner variations representing diagnostic institutions) of whole-slide images. DLMs were trained with single and mixed data sets, and their performance was assessed under stain-scanner variations (domain shifts). The DLM correctly classified HE slides according to their c-KIT-11-ITD status in up to 87% of cases, with a sensitivity of 0.90 and a specificity of 0.83. A relevant performance drop was observed when the stain-scanner combination of the training and test data sets differed. Multi-institutional data sets improved the average accuracy but did not reach the maximum accuracy of algorithms trained and tested on the same stain-scanner variant (ie, intra-institutional). In summary, DLM-based morphological examination can predict c-KIT-11-ITD in canine MCTs in HE slides with high accuracy. However, staining protocol and scanner type influence accuracy. Larger data sets of scans from different laboratories and scanners may lead to more robust DLMs for identifying c-KIT mutations in HE slides.
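As an illustration of the slide-level evaluation reported above, a minimal Python sketch of how accuracy, sensitivity, and specificity are derived from binary c-KIT-11-ITD predictions; the labels and predictions below are invented for illustration and are not study data.

```python
# Hedged sketch: slide-level metrics for a binary ITD classifier.
# 1 = ITD present, 0 = ITD absent; example vectors are made up.

def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on ITD slides
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on non-ITD slides
    return accuracy, sensitivity, specificity

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
acc, sens, spec = binary_metrics(y_true, y_pred)
```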
ABSTRACT
PURPOSE: Confocal laser endomicroscopy (CLE) is an imaging tool that has demonstrated potential for intraoperative, real-time, non-invasive, microscopic assessment of surgical margins of oropharyngeal squamous cell carcinoma (OPSCC). However, interpreting CLE images remains challenging. This study investigates the application of OpenAI's Generative Pretrained Transformer (GPT) 4.0 with Vision capabilities for automated classification of CLE images in OPSCC. METHODS: CLE images of histologically confirmed SCC or healthy mucosa were retrieved and anonymized from a database of 12 809 CLE images from 5 patients with OPSCC. Using a training data set of 16 images, a validation set of 139 images, comprising SCC (83 images, 59.7%) and healthy normal mucosa (56 images, 40.3%), was classified using the application programming interface (API) of GPT-4.0. The same set of images was also classified by CLE experts (two surgeons and one pathologist), who were blinded to the histology. Diagnostic metrics, the reliability of GPT, and inter-rater reliability were assessed. RESULTS: The overall accuracy of the GPT model was 71.2%; the intra-rater agreement was κ = 0.837, indicating almost perfect agreement across the three runs of GPT-generated results. Human experts achieved an accuracy of 88.5% with a substantial level of agreement (κ = 0.773). CONCLUSIONS: Though limited to a specific clinical framework, patient cohort, and image set, this study sheds light on some previously unexplored diagnostic capabilities of large language models using few-shot prompting. It suggests the model's ability to extrapolate information and classify CLE images with minimal example data. Whether future versions of the model can achieve clinically relevant diagnostic accuracy, especially on uncurated data sets, remains to be investigated.
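The agreement statistics above (intra-rater κ across GPT runs, inter-rater κ among experts) can be illustrated with a minimal pure-Python Cohen's kappa for two rating vectors; the labels below are hypothetical, and the study's exact computation (e.g., multi-rater variants) may differ.

```python
# Hedged sketch: Cohen's kappa for two raters over the same items.
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # chance agreement from each rater's marginal label frequencies
    p_expected = sum(
        (rater_a.count(lbl) / n) * (rater_b.count(lbl) / n) for lbl in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)

run_1 = ["scc", "scc", "healthy", "healthy"]
run_2 = ["scc", "scc", "healthy", "scc"]
kappa = cohens_kappa(run_1, run_2)
```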
Subjects
Head and Neck Neoplasms, Humans, Reproducibility of Results, Confocal Microscopy/methods, Head and Neck Squamous Cell Carcinoma, Lasers
ABSTRACT
OBJECTIVES: Confocal laser endomicroscopy (CLE) is an optical method that enables microscopic visualization of the oral mucosa. Previous studies have shown that it is possible to differentiate between physiological and malignant oral mucosa. However, differences in mucosal architecture were not taken into account. The objective was to map the different oral mucosal morphologies and to establish a "CLE map" of physiological mucosa as a baseline for further application of this powerful technology. MATERIALS AND METHODS: The CLE database consisted of 27 patients. The following spots were examined: (1) upper lip (intraoral), (2) alveolar ridge, (3) lateral tongue, (4) floor of the mouth, (5) hard palate, and (6) intercalary line. All sequences were examined by two CLE experts for morphological differences and video quality. RESULTS: Analysis revealed clear differences in image quality and in the ability to depict tissue morphology across the various localizations of the oral mucosa: imaging of the alveolar ridge and hard palate showed the most visually discriminative tissue morphology. Labial mucosa was also visualized well using CLE. Here, typical morphological features such as uniform cells with regular intercellular gaps and vessels could be clearly depicted. Image generation and evaluation were particularly difficult in the area of the buccal mucosa, the lateral tongue, and the floor of the mouth. CONCLUSION: A physiological "CLE map" for the entire oral cavity could be created for the first time. CLINICAL RELEVANCE: This will make it possible to take into account the existing physiological morphological features when differentiating between normal mucosa and oral squamous cell carcinoma in future work.
Subjects
Confocal Microscopy, Oral Mucosa, Humans, Confocal Microscopy/methods, Oral Mucosa/diagnostic imaging, Oral Mucosa/cytology, Male, Female, Middle Aged, Mouth Neoplasms/pathology, Mouth Neoplasms/diagnostic imaging
ABSTRACT
Microscopic evaluation of hematoxylin and eosin-stained slides is still the diagnostic gold standard for a variety of diseases, including neoplasms. Nevertheless, intra- and interrater variability are well documented among pathologists. So far, computer assistance via automated image analysis has shown potential to support pathologists in improving the accuracy and reproducibility of quantitative tasks. In this proof-of-principle study, we describe a machine-learning-based algorithm for the automated diagnosis of 7 of the most common canine skin tumors: trichoblastoma, squamous cell carcinoma, peripheral nerve sheath tumor, melanoma, histiocytoma, mast cell tumor, and plasmacytoma. We selected, digitized, and annotated 350 hematoxylin and eosin-stained slides (50 per tumor type) to create a database divided into a training set (n = 245 whole-slide images [WSIs]), a validation set (n = 35 WSIs), and a test set (n = 70 WSIs). Full annotations included the 7 tumor classes and 6 normal skin structures. The data set was used to train a convolutional neural network (CNN) for the automatic segmentation of tumor and nontumor classes. Subsequently, the detected tumor regions were classified patch-wise into 1 of the 7 tumor classes. A majority-of-patches approach led to a slide-level tumor classification accuracy of 95% (133/140 WSIs), with a patch-level precision of 85%. The same 140 WSIs were provided to 6 experienced pathologists for diagnosis, who achieved a similar slide-level accuracy of 98% (137/140 correct majority votes). Our results highlight the feasibility of artificial intelligence-based methods as a support tool in diagnostic oncologic pathology, with future applications in other species and tumor types.
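A minimal sketch of the majority-of-patches idea described above: each tumor patch receives one of the 7 class labels from the CNN, and the slide-level diagnosis is the most frequent patch label. The patch predictions below are invented for illustration.

```python
from collections import Counter

# Hedged sketch: slide-level diagnosis as the majority vote over
# patch-level CNN predictions (illustrative labels, not study data).

def slide_diagnosis(patch_labels):
    """Return the most frequent class among patch-level predictions."""
    (label, _count), = Counter(patch_labels).most_common(1)
    return label

patches = ["melanoma", "mast cell tumor", "melanoma", "melanoma", "histiocytoma"]
diagnosis = slide_diagnosis(patches)
```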
Subjects
Deep Learning, Dog Diseases, Skin Neoplasms, Animals, Dogs, Artificial Intelligence, Eosine Yellowish-(YS), Hematoxylin, Reproducibility of Results, Skin Neoplasms/diagnosis, Skin Neoplasms/veterinary, Machine Learning, Dog Diseases/diagnosis
ABSTRACT
Exercise-induced pulmonary hemorrhage (EIPH) is a relevant respiratory disease in sport horses, which can be diagnosed by examination of bronchoalveolar lavage fluid (BALF) cells using the total hemosiderin score (THS). The aim of this study was to evaluate the diagnostic accuracy and reproducibility of annotators and to validate a deep learning-based algorithm for the THS. Digitized cytological specimens stained for iron were prepared from 52 equine BALF samples. Ten annotators produced a THS for each slide according to published methods. The reference methods for comparing annotators' and algorithmic performance included a ground truth dataset, the mean annotators' THSs, and chemical iron measurements. Results of the study showed that annotators had marked interobserver variability of the THS, which was mostly due to a systematic error between annotators in grading the intracytoplasmic hemosiderin content of individual macrophages. Regarding the overall measurement error between annotators, 87.7% of the variance could be reduced by using standardized grades based on the ground truth. The algorithm was highly consistent with the ground truth in assigning hemosiderin grades. Compared with the ground truth THS, annotators had an accuracy of diagnosing EIPH (THS of < or ≥ 75) of 75.7%, whereas the algorithm had an accuracy of 92.3%, with no relevant differences in correlation with chemical iron measurements. The results show that deep learning-based algorithms are useful for improving reproducibility and routine applicability of the THS. For THSs determined by experts, a diagnostic uncertainty interval of 40 to 110 is proposed; THSs within this interval have insufficient reproducibility regarding the EIPH diagnosis.
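As a hedged illustration of the THS logic: one common formulation grades macrophages 0-4 for intracytoplasmic hemosiderin and normalizes the grade sum to 100 cells (a 0-400 scale); the exact published protocol may differ. The decision cutoffs follow the abstract (EIPH if THS ≥ 75, uncertainty interval 40-110).

```python
# Hedged sketch of a total hemosiderin score (THS).
# Assumption: per-cell grades 0-4, grade sum normalized to 100 macrophages.

def total_hemosiderin_score(grades):
    """Normalize the sum of per-macrophage grades (0-4) to 100 cells."""
    return 100 * sum(grades) / len(grades)

def interpret(ths):
    """Apply the abstract's cutoff (>= 75) and uncertainty interval (40-110)."""
    if 40 <= ths <= 110:
        return "uncertain"       # insufficient reproducibility in this range
    if ths >= 75:
        return "EIPH"
    return "no EIPH"

grades = [0] * 50 + [2] * 50     # illustrative grades for 100 macrophages
ths = total_hemosiderin_score(grades)
```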
Subjects
Deep Learning, Horse Diseases, Lung Diseases, Animals, Bronchoalveolar Lavage Fluid, Hemorrhage/diagnosis, Hemorrhage/veterinary, Hemosiderin, Horse Diseases/diagnosis, Horses, Iron, Lung Diseases/diagnosis, Lung Diseases/veterinary, Reproducibility of Results
ABSTRACT
OBJECTIVES: To evaluate whether neural networks can distinguish between seropositive RA, seronegative RA, and PsA based on inflammatory patterns from hand MRIs and to test how psoriasis patients with subclinical inflammation fit into such patterns. METHODS: ResNet neural networks were utilized to compare seropositive RA vs PsA, seronegative RA vs PsA, and seropositive vs seronegative RA with respect to hand MRI data. Results from T1 coronal, T2 coronal, T1 coronal and axial fat-suppressed contrast-enhanced (CE), and T2 fat-suppressed axial sequences were used. The performance of the trained networks was analysed by the area under the receiver operating characteristic curve (AUROC), with and without presentation of demographic and clinical parameters. Additionally, the trained networks were applied to psoriasis patients without clinical arthritis. RESULTS: MRI scans from 649 patients (135 seronegative RA, 190 seropositive RA, 177 PsA, 147 psoriasis) were fed into ResNet neural networks. The AUROC was 75% for seropositive RA vs PsA, 74% for seronegative RA vs PsA, and 67% for seropositive vs seronegative RA. All MRI sequences were relevant for classification; however, when contrast agent-based sequences were excluded, the loss of performance was only marginal. The addition of demographic and clinical data to the networks did not provide significant improvements for classification. Psoriasis patients were mostly assigned to PsA by the neural networks, suggesting that a PsA-like MRI pattern may be present early in the course of psoriatic disease. CONCLUSION: Neural networks can be successfully trained to distinguish MRI inflammation related to seropositive RA, seronegative RA, and PsA.
Subjects
Psoriatic Arthritis, Rheumatoid Arthritis, Psoriasis, Humans, Psoriatic Arthritis/diagnostic imaging, Rheumatoid Arthritis/diagnostic imaging, Psoriasis/diagnostic imaging, Inflammation, Magnetic Resonance Imaging, Neural Networks (Computer)
ABSTRACT
The mitotic count (MC) is an important histological parameter for prognostication of malignant neoplasms. However, it has inter- and intraobserver discrepancies due to difficulties in selecting the region of interest (MC-ROI) and in identifying or classifying mitotic figures (MFs). Recent progress in the field of artificial intelligence has allowed the development of high-performance algorithms that may improve standardization of the MC. As algorithmic predictions are not flawless, computer-assisted review by pathologists may ensure reliability. In the present study, we compared partial (MC-ROI preselection) and full (additional visualization of MF candidates and display of algorithmic confidence values) computer-assisted MC analysis to routine (unaided) MC analysis by 23 pathologists for whole-slide images of 50 canine cutaneous mast cell tumors (ccMCTs). Algorithmic predictions aimed to assist pathologists in detecting mitotic hotspot locations, reducing omission of MFs, and improving classification against imposters. The interobserver consistency for the MC significantly increased with computer assistance (interobserver correlation coefficient, ICC = 0.92) compared to the unaided approach (ICC = 0.70). Classification into prognostic stratifications had a higher accuracy with computer assistance. The algorithmically preselected hotspot MC-ROIs had consistently higher MCs than the manually selected MC-ROIs. Compared to a ground truth (developed with immunohistochemistry for phosphohistone H3), pathologist performance in detecting individual MFs was augmented when using computer assistance (F1-score of 0.68 increased to 0.79), with a reduction in false negatives by 38%. The results of this study demonstrate that computer assistance may lead to more reproducible and accurate MCs in ccMCTs.
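A minimal sketch of the F1-score used above for mitotic figure detection, computed from matched true positives, false positives, and false negatives; the matching of detections to ground truth (e.g., by a distance threshold) is omitted, and the counts are illustrative.

```python
# Hedged sketch: detection F1 as the harmonic mean of precision and recall.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 8 matched mitotic figures, 2 spurious, 2 missed.
f1 = f1_score(tp=8, fp=2, fn=2)
```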
Subjects
Deep Learning, Algorithms, Animals, Artificial Intelligence, Dogs, Humans, Pathologists, Reproducibility of Results
ABSTRACT
In numerous studies, deep learning algorithms have proven their potential for the analysis of histopathology images, for example, for revealing the subtypes of tumors or the primary origin of metastases. These models require large datasets for training, which must be anonymized to prevent possible patient identity leaks. This study demonstrates that even relatively simple deep learning algorithms can re-identify patients in large histopathology datasets with substantial accuracy. In addition, we compared a comprehensive set of state-of-the-art whole slide image classifiers and feature extractors for the given task. We evaluated our algorithms on two TCIA datasets including lung squamous cell carcinoma (LSCC) and lung adenocarcinoma (LUAD). We also demonstrate the algorithm's performance on an in-house dataset of meningioma tissue. We predicted the source patient of a slide with F1 scores of up to 80.1% and 77.19% on the LSCC and LUAD datasets, respectively, and with 77.09% on our meningioma dataset. Based on our findings, we formulated a risk assessment scheme to estimate the risk to the patient's privacy prior to publication.
ABSTRACT
PURPOSE: This study investigates the application of Radiomic features within graph neural networks (GNNs) for the classification of multiple-epitope-ligand cartography (MELC) pathology samples. It aims to enhance the diagnosis of often misdiagnosed skin diseases such as eczema, lymphoma, and melanoma. The novel contribution lies in integrating Radiomic features with GNNs and comparing their efficacy against traditional multi-stain profiles. METHODS: We utilized GNNs to process multiple pathological slides as cell-level graphs, comparing their performance with XGBoost and Random Forest classifiers. The analysis included two feature types: multi-stain profiles and Radiomic features. Dimensionality reduction techniques such as UMAP and t-SNE were applied to optimize the feature space, and graph connectivity was based on spatial and feature closeness. RESULTS: Integrating Radiomic features into spatially connected graphs significantly improved classification accuracy over traditional models. The application of UMAP further enhanced the performance of GNNs, particularly in classifying diseases with similar pathological features. The GNN model outperformed baseline methods, demonstrating its robustness in handling complex histopathological data. CONCLUSION: Radiomic features processed through GNNs show significant promise for multi-disease classification, improving diagnostic accuracy. This study's findings suggest that integrating advanced imaging analysis with graph-based modeling can lead to better diagnostic tools. Future research should expand these methods to a wider range of diseases to validate their generalizability and effectiveness.
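A hedged sketch of the cell-graph construction the pipeline above relies on: connecting each cell to its spatially nearest neighbours. The study combines spatial and feature closeness; the k-nearest-neighbour rule and coordinates below are simplifying assumptions for illustration.

```python
# Hedged sketch: build undirected edges between each cell and its k spatially
# nearest neighbours (Euclidean distance on 2D cell centroids).

def knn_edges(positions, k=2):
    """Return sorted undirected edges (i, j) with i < j."""
    edges = set()
    for i, (xi, yi) in enumerate(positions):
        # squared distances to all other cells, paired with their indices
        dists = sorted(
            (((xj - xi) ** 2 + (yj - yi) ** 2), j)
            for j, (xj, yj) in enumerate(positions) if j != i
        )
        for _dist, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

cells = [(0, 0), (1, 0), (10, 0), (11, 0)]  # two spatial clusters
edges = knn_edges(cells, k=1)
```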
ABSTRACT
To develop and evaluate the performance of a deep learning model (DLM) that predicts eyes at high risk of surgical intervention for uncontrolled glaucoma based on multimodal data from an initial ophthalmology visit. Longitudinal, observational, retrospective study. 4898 unique eyes from 4038 adult glaucoma or glaucoma-suspect patients who underwent surgery for uncontrolled glaucoma (trabeculectomy, tube shunt, Xen, or diode surgery) between 2013 and 2021, or did not undergo glaucoma surgery but had 3 or more ophthalmology visits. We constructed a DLM to predict the occurrence of glaucoma surgery within various time horizons from a baseline visit. Model inputs included spatially oriented visual field (VF) and optical coherence tomography (OCT) data as well as clinical and demographic features. Separate DLMs with the same architecture were trained to predict the occurrence of surgery within 3 months, within 3-6 months, within 6 months-1 year, within 1-2 years, within 2-3 years, within 3-4 years, and within 4-5 years from the baseline visit. Included eyes were randomly split into 60%, 20%, and 20% for training, validation, and testing. DLM performance was measured using area under the receiver operating characteristic curve (AUC) and precision-recall curve (PRC). Shapley additive explanations (SHAP) were utilized to assess the importance of different features. Model prediction of surgery for uncontrolled glaucoma within 3 months had the best AUC of 0.92 (95% CI 0.88, 0.96). DLMs achieved clinically useful AUC values (> 0.8) for all models that predicted the occurrence of surgery within 3 years. According to SHAP analysis, all 7 models placed intraocular pressure (IOP) within the five most important features in predicting the occurrence of glaucoma surgery. Mean deviation (MD) and average retinal nerve fiber layer (RNFL) thickness were listed among the top 5 most important features by 6 of the 7 models.
DLMs can successfully identify eyes requiring surgery for uncontrolled glaucoma within specific time horizons. Predictive performance decreases as the time horizon for forecasting surgery increases. Implementing prediction models in a clinical setting may help identify patients that should be referred to a glaucoma specialist for surgical evaluation.
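A minimal sketch of the AUC reported above, using the pairwise (Mann-Whitney) formulation: the probability that a randomly chosen surgical eye receives a higher model risk score than a randomly chosen non-surgical eye. The scores below are invented for illustration.

```python
# Hedged sketch: ROC AUC as the fraction of (positive, negative) score pairs
# ranked correctly, counting ties as half.

def roc_auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

surgical = [0.9, 0.8, 0.6]      # model scores for eyes that had surgery
non_surgical = [0.1, 0.3, 0.7]  # model scores for eyes that did not
auc = roc_auc(surgical, non_surgical)
```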
Subjects
Deep Learning, Glaucoma, Ophthalmology, Trabeculectomy, Adult, Humans, Retrospective Studies, Glaucoma/surgery, Retina
ABSTRACT
PURPOSE: Develop and evaluate the performance of a deep learning model (DLM) that forecasts eyes with low future visual field (VF) variability, and study the impact of using this DLM on sample size requirements for neuroprotective trials. DESIGN: Retrospective cohort and simulation study. METHODS: We included 1 eye per patient with baseline reliable VFs, OCT, clinical measures (demographics, intraocular pressure, and visual acuity), and 5 subsequent reliable VFs to forecast VF variability using DLMs and perform sample size estimates. We estimated sample size for 3 groups of eyes: all eyes (AE), low variability eyes (LVE: the subset of AE with a standard deviation of mean deviation [MD] slope residuals in the bottom 25th percentile), and DLM-predicted low variability eyes (DLPE: the subset of AE predicted to be low variability by the DLM). Deep learning models using only baseline VF/OCT/clinical data as input (DLM1), or also using a second VF (DLM2), were constructed to predict low VF variability (DLPE1 and DLPE2, respectively). Data were split 60/10/30 into train/val/test. Clinical trial simulations were performed only on the test set. We estimated the sample size necessary to detect treatment effects of 20% to 50% in MD slope with 80% power. Power was defined as the percentage of simulated clinical trials where the MD slope was significantly worse than in the control arm. Clinical trials were simulated with visits every 3 months with a total of 10 visits. RESULTS: A total of 2817 eyes were included in the analysis. Deep learning models 1 and 2 achieved an area under the receiver operating characteristic curve of 0.73 (95% confidence interval [CI]: 0.68, 0.76) and 0.82 (95% CI: 0.78, 0.85) in forecasting low VF variability. When compared with including AE, using DLPE1 and DLPE2 reduced the sample size needed to achieve 80% power by 30% and 38% for a 30% treatment effect, and by 31% and 38% for a 50% treatment effect.
CONCLUSIONS: Deep learning models can forecast eyes with low VF variability using data from a single baseline clinical visit. This can reduce sample size requirements, and potentially reduce the burden of future glaucoma clinical trials. FINANCIAL DISCLOSURE(S): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
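A hedged sketch of the trial-simulation idea described above: simulate control and treated arms of MD slopes, test the group difference, and define power as the fraction of simulated trials reaching significance. The one-sided z-test and all parameters below are illustrative stand-ins for the study's trial model, not its actual specification.

```python
import random
import statistics

# Hedged sketch: estimate statistical power by Monte Carlo simulation of a
# two-arm trial on MD slopes (dB/year). All parameters are made up.

def simulate_power(n_per_arm, control_slope=-1.0, treatment_effect=0.3,
                   sd=1.0, n_trials=200, z_crit=1.645, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        control = [rng.gauss(control_slope, sd) for _ in range(n_per_arm)]
        # treatment slows progression by `treatment_effect` (e.g., 30%)
        treated = [rng.gauss(control_slope * (1 - treatment_effect), sd)
                   for _ in range(n_per_arm)]
        se = (statistics.pvariance(control) / n_per_arm
              + statistics.pvariance(treated) / n_per_arm) ** 0.5
        z = (statistics.mean(treated) - statistics.mean(control)) / se
        hits += z > z_crit  # one-sided test: treated arm progresses less
    return hits / n_trials

power_small = simulate_power(n_per_arm=10)
power_large = simulate_power(n_per_arm=200)
```

Sweeping `n_per_arm` until the returned power exceeds 0.80 mirrors the sample-size estimation step; restricting the simulation to low-variability eyes (smaller `sd`) is what reduces the required sample size.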
Subjects
Deep Learning, Intraocular Pressure, Visual Fields, Humans, Visual Fields/physiology, Retrospective Studies, Intraocular Pressure/physiology, Female, Male, Clinical Trials as Topic, Glaucoma/physiopathology, Glaucoma/diagnosis, Visual Acuity/physiology, Aged, Visual Field Tests/methods, Middle Aged, Optical Coherence Tomography/methods
ABSTRACT
OBJECTIVES: To train, test and validate the performance of a convolutional neural network (CNN)-based approach for the automated assessment of bone erosions, osteitis and synovitis in hand MRI of patients with inflammatory arthritis. METHODS: Hand MRIs (coronal T1-weighted, T2-weighted fat-suppressed, T1-weighted fat-suppressed contrast-enhanced) of rheumatoid arthritis (RA) and psoriatic arthritis (PsA) patients from the rheumatology department of the Erlangen University Hospital were assessed by two expert rheumatologists using the Outcome Measures in Rheumatology-validated RA MRI Scoring System and PsA MRI Scoring System scores and were used to train, validate and test CNNs to automatically score erosions, osteitis and synovitis. Scoring performance was compared with human annotations in terms of macro-area under the receiver operating characteristic curve (AUC) and balanced accuracy using fivefold cross-validation. Validation was performed on an independent dataset of MRIs from a second patient cohort. RESULTS: In total, 211 MRIs from 112 patients (14 906 regions of interest (ROIs)) were included for training/internal validation using cross-validation and 220 MRIs from 75 patients (11 040 ROIs) for external validation of the networks. The networks achieved high mean (SD) macro-AUC of 92%±1% for erosions, 91%±2% for osteitis and 85%±2% for synovitis. Compared with human annotation, CNNs achieved a high mean Spearman correlation for erosions (90±2%), osteitis (78±8%) and synovitis (69±7%), which remained consistent in the validation dataset. CONCLUSIONS: We developed a CNN-based automated scoring system that allowed a rapid grading of erosions, osteitis and synovitis with good diagnostic accuracy and using fewer MRI sequences compared with conventional scoring. This CNN-based approach may help develop standardised, cost-efficient and time-efficient assessments of hand MRIs for patients with arthritis.
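A minimal sketch of the Spearman rank correlation used above to compare CNN scores with human annotations, computed from rank differences; tie handling is omitted and the data are invented for illustration.

```python
# Hedged sketch: Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
# where d is the per-item difference between the two rank vectors.

def spearman_rho(x, y):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

cnn_scores = [0.1, 1.4, 2.2, 3.0]   # illustrative erosion scores
human_scores = [0, 1, 3, 4]
rho = spearman_rho(cnn_scores, human_scores)
```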
Subjects
Deep Learning, Magnetic Resonance Imaging, Osteitis, Synovitis, Humans, Osteitis/diagnostic imaging, Osteitis/etiology, Osteitis/diagnosis, Osteitis/pathology, Synovitis/diagnostic imaging, Synovitis/etiology, Synovitis/diagnosis, Magnetic Resonance Imaging/methods, Male, Female, Middle Aged, Rheumatoid Arthritis/diagnostic imaging, Rheumatoid Arthritis/complications, Hand/diagnostic imaging, Hand/pathology, Psoriatic Arthritis/diagnostic imaging, Psoriatic Arthritis/diagnosis, Adult, Aged, ROC Curve, Severity of Illness Index, Neural Networks (Computer)
ABSTRACT
Background: This research aims to improve glioblastoma survival prediction by integrating MR images, clinical, and molecular-pathologic data in a transformer-based deep learning model, addressing data heterogeneity and performance generalizability. Methods: We propose and evaluate a transformer-based nonlinear and nonproportional survival prediction model. The model employs self-supervised learning techniques to effectively encode the high-dimensional MRI input for integration with nonimaging data using cross-attention. To demonstrate model generalizability, the model is assessed with the time-dependent concordance index (Cdt) in 2 training setups using 3 independent public test sets: UPenn-GBM, UCSF-PDGM, and Rio Hortega University Hospital (RHUH)-GBM, comprising 378, 366, and 36 cases, respectively. Results: The proposed transformer model achieved a promising performance for imaging as well as nonimaging data, effectively integrating both modalities for enhanced performance (UCSF-PDGM test set: imaging Cdt 0.578, multimodal Cdt 0.672) while outperforming state-of-the-art late-fusion 3D-CNN-based models. Consistent performance was observed across the 3 independent multicenter test sets, with Cdt values of 0.707 (UPenn-GBM, internal test set), 0.672 (UCSF-PDGM, first external test set), and 0.618 (RHUH-GBM, second external test set). The model achieved significant discrimination between patients with favorable and unfavorable survival for all 3 datasets (log-rank P = 1.9 × 10⁻⁸, 9.7 × 10⁻³, and 1.2 × 10⁻²). Comparable results were obtained in the second setup using UCSF-PDGM for training/internal testing and UPenn-GBM and RHUH-GBM for external testing (Cdt 0.670, 0.638, and 0.621). Conclusions: The proposed transformer-based survival prediction model integrates complementary information from diverse input modalities, contributing to improved glioblastoma survival prediction compared to state-of-the-art methods.
Consistent performance was observed across institutions supporting model generalizability.
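A minimal sketch of Harrell's concordance index, a simpler relative of the time-dependent Cdt reported above: higher predicted risk should pair with shorter survival. Censoring handling is omitted and the data are invented for illustration.

```python
# Hedged sketch: Harrell's C-index over all comparable patient pairs,
# counting tied risk scores as half-concordant. Events only, no censoring.

def concordance_index(times, risks):
    """times: survival times; risks: model risk scores (higher = worse)."""
    concordant, permissible = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue  # tied times are not comparable in this sketch
            permissible += 1
            shorter, longer = (i, j) if times[i] < times[j] else (j, i)
            if risks[shorter] > risks[longer]:
                concordant += 1
            elif risks[shorter] == risks[longer]:
                concordant += 0.5
    return concordant / permissible

survival_months = [4, 11, 20]
risk_scores = [0.9, 0.6, 0.2]  # perfectly anti-ordered with survival
c_index = concordance_index(survival_months, risk_scores)
```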
ABSTRACT
Recognition of mitotic figures in histologic tumor specimens is highly relevant to patient outcome assessment. This task is challenging for algorithms and human experts alike, with deterioration of algorithmic performance under shifts in image representations. Considerable covariate shifts occur when assessment is performed on different tumor types, images are acquired using different digitization devices, or specimens are produced in different laboratories. This observation motivated the inception of the 2022 challenge on MItosis Domain Generalization (MIDOG 2022). The challenge provided annotated histologic tumor images from six different domains and evaluated the algorithmic approaches for mitotic figure detection provided by nine challenge participants on ten independent domains. Ground truth for mitotic figure detection was established in two ways: a three-expert majority vote and an independent, immunohistochemistry-assisted set of labels. This work represents an overview of the challenge tasks, the algorithmic strategies employed by the participants, and potential factors contributing to their success. With an F1 score of 0.764 for the top-performing team, we conclude that domain generalization across various tumor domains is possible with today's deep learning-based recognition pipelines. However, we also found that domain characteristics not present in the training set (feline as a new species, spindle cell shape as a new morphology, and a new scanner) led to small but significant decreases in performance. When assessed against the immunohistochemistry-assisted reference standard, all methods resulted in reduced recall scores, with only minor changes in the order of participants in the ranking.
Subjects
Laboratories, Mitosis, Humans, Animals, Cats, Algorithms, Computer-Assisted Image Processing/methods, Reference Standards
ABSTRACT
The success of immuno-oncology treatments promises long-term cancer remission for an increasing number of patients. The response to checkpoint inhibitor drugs has shown a correlation with the presence of immune cells in the tumor and tumor microenvironment. An in-depth understanding of the spatial localization of immune cells is therefore critical for understanding the tumor's immune landscape and predicting drug response. Computer-aided systems are well suited for efficiently quantifying immune cells in their spatial context. Conventional image analysis approaches are often based on color features and therefore require a high level of manual interaction. More robust image analysis methods based on deep learning are expected to decrease this reliance on human interaction and improve the reproducibility of immune cell scoring. However, these methods require sufficient training data and previous work has reported low robustness of these algorithms when they are tested on out-of-distribution data from different pathology labs or samples from different organs. In this work, we used a new image analysis pipeline to explicitly evaluate the robustness of marker-labeled lymphocyte quantification algorithms depending on the number of training samples before and after being transferred to a new tumor indication. For these experiments, we adapted the RetinaNet architecture for the task of T-lymphocyte detection and employed transfer learning to bridge the domain gap between tumor indications and reduce the annotation costs for unseen domains. On our test set, we achieved human-level performance for almost all tumor indications with an average precision of 0.74 in-domain and 0.72-0.74 cross-domain. From our results, we derive recommendations for model development regarding annotation extent, training sample selection, and label extraction for the development of robust algorithms for immune cell scoring. 
By extending the task of marker-labeled lymphocyte quantification to a multi-class detection task, the prerequisite for subsequent analyses, e.g., distinguishing lymphocytes in the tumor stroma from tumor-infiltrating lymphocytes, is met.
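A minimal sketch of the average precision used above to summarize T-lymphocyte detection: detections sorted by descending confidence are marked as true or false positives, and precision is accumulated at each recall step. The interpolation details of the study's AP are assumptions, and the hit list is invented for illustration.

```python
# Hedged sketch: average precision from a ranked list of detections.
# hits[i] is True if the i-th most confident detection matches a
# ground-truth lymphocyte (matching procedure omitted).

def average_precision(hits, n_ground_truth):
    tp, ap = 0, 0.0
    for rank, is_tp in enumerate(hits, start=1):
        if is_tp:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / n_ground_truth

ranked_hits = [True, True, False, True, False]
ap = average_precision(ranked_hits, n_ground_truth=4)
```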
ABSTRACT
Recently, algorithms capable of assessing the severity of Coronary Artery Disease (CAD) in the form of the Coronary Artery Disease-Reporting and Data System (CAD-RADS) grade from Coronary Computed Tomography Angiography (CCTA) scans using Deep Learning (DL) were proposed. Before considering applying these algorithms in clinical practice, their robustness regarding different commonly used Computed Tomography (CT)-specific image formation parameters (including denoising strength, slab combination, and reconstruction kernel) needs to be evaluated. For this study, we reconstructed a data set of 500 patient CCTA scans under seven image formation parameter configurations. We selected one default configuration and evaluated how varying individual parameters impacts the performance and stability of a typical algorithm for automated CAD assessment from CCTA. This algorithm consists of multiple preprocessing steps and a DL prediction step. We evaluated the influence of the parameter changes on the entire pipeline and additionally on only the DL step by propagating the centerline extraction results of the default configuration to all others. We considered the standard deviation of the CAD severity prediction grade difference between the default and variation configurations to assess the stability with respect to parameter changes. For the full pipeline we observed slight instability (± 0.226 CAD-RADS) for all variations. Predictions were more stable with centerlines propagated from the default to the variation configurations (± 0.122 CAD-RADS), especially for differing denoising strengths (± 0.046 CAD-RADS). However, stacking slabs with sharp boundaries instead of mixing slabs in overlapping regions (called true stack; ± 0.313 CAD-RADS) and increasing the sharpness of the reconstruction kernel (± 0.150 CAD-RADS) led to unstable predictions.
Regarding the clinically relevant tasks of excluding CAD (called rule-out; AUC default 0.957, minimum 0.937) and excluding obstructive CAD (called hold-out; AUC default 0.971, minimum 0.964), performance remained high for all variations. In conclusion, reconstruction parameters do influence the predictions. In particular, scans reconstructed with the true stack parameter need to be treated with caution when using a DL-based method, and reconstruction kernels that are underrepresented in the training data increase the prediction uncertainty.
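The stability metric described above, the standard deviation of the per-patient CAD-RADS grade difference between the default and a variation configuration, can be sketched as follows. This is a minimal illustration with hypothetical grade arrays, not the study's actual pipeline:

```python
import statistics

def stability(default_grades, variation_grades):
    """Standard deviation of per-patient grade differences (default minus variation).

    Values near 0 indicate predictions that are stable under the
    reconstruction-parameter change; larger values indicate instability.
    """
    diffs = [d - v for d, v in zip(default_grades, variation_grades)]
    return statistics.pstdev(diffs)

# Hypothetical CAD-RADS grades (0-5) for five patients under two configurations.
default = [0, 2, 3, 1, 4]
variation = [0, 2, 4, 1, 4]
print(round(stability(default, variation), 3))
```

A value such as ± 0.226 CAD-RADS then means that, under the parameter change, predicted grades scatter around the default prediction with that standard deviation.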
Subjects
Coronary Artery Disease, Deep Learning, Humans, Coronary Artery Disease/diagnostic imaging, Coronary Artery Disease/therapy, Coronary Angiography/methods, Tomography, X-Ray Computed, Heart, Predictive Value of Tests
ABSTRACT
The prognostic value of mitotic figures in tumor tissue is well-established for many tumor types, and automating this task is of high research interest. However, deep learning-based methods in particular face performance deterioration in the presence of domain shifts, which may arise from different tumor types, slide preparation, and digitization devices. We introduce the MIDOG++ dataset, an extension of the MIDOG 2021 and 2022 challenge datasets. We provide region of interest images from 503 histological specimens of seven tumor types with variable morphology, with labels for 11,937 mitotic figures in total: breast carcinoma, lung carcinoma, lymphosarcoma, neuroendocrine tumor, cutaneous mast cell tumor, cutaneous melanoma, and (sub)cutaneous soft tissue sarcoma. The specimens were processed in several laboratories utilizing diverse scanners. We evaluated the extent of the domain shift using state-of-the-art approaches, observing notable performance differences under single-domain training. In a leave-one-domain-out setting, generalizability improved considerably. This mitotic figure dataset is the first to incorporate a wide domain shift based on different tumor types, laboratories, whole slide image scanners, and species.
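The leave-one-domain-out setting mentioned above can be sketched generically: each domain (e.g., a tumor type) is held out once for testing while a model is trained on all others. A minimal sketch with hypothetical domain labels, not the MIDOG++ evaluation code:

```python
def leave_one_domain_out(samples):
    """Yield (held_out_domain, train, test) splits, holding out each domain once.

    `samples` is a list of (domain, item) pairs; train covers all other
    domains, test covers only the unseen one.
    """
    domains = sorted({d for d, _ in samples})
    for held_out in domains:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Hypothetical samples from three tumor-type domains.
data = [("breast", 1), ("lung", 2), ("melanoma", 3), ("breast", 4)]
for domain, train, test in leave_one_domain_out(data):
    print(domain, len(train), len(test))
```

Measuring performance on the held-out domain estimates how well a detector generalizes to tumor types (or laboratories, scanners) never seen during training.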
Subjects
Mitosis, Neoplasms, Humans, Algorithms, Prognosis, Neoplasms/pathology
ABSTRACT
The density of mitotic figures (MF) within tumor tissue is known to be highly correlated with tumor proliferation and is thus an important marker in tumor grading. Recognition of MF by pathologists is subject to strong inter-rater bias, limiting its prognostic value. State-of-the-art deep learning methods can support experts but have been observed to deteriorate strongly when applied in a different clinical environment. The variability caused by using different whole slide scanners has been identified as one decisive component of the underlying domain shift. The goal of the MICCAI MIDOG 2021 challenge was the creation of scanner-agnostic MF detection algorithms. The challenge used a training set of 200 cases, split across four scanning systems. As a test set, an additional 100 cases, split across four scanning systems including two previously unseen scanners, were provided. In this paper, we evaluate and compare the approaches that were submitted to the challenge and identify methodological factors contributing to better performance. The winning algorithm yielded an F1 score of 0.748 (CI95: 0.704-0.781), exceeding the performance of six experts on the same task.