RESUMO
BACKGROUND: Weakly supervised learning promises reduced annotation effort while maintaining performance. PURPOSE: To compare weakly supervised training with full slice-wise annotated training of a deep convolutional classification network (CNN) for prostate cancer (PC). STUDY TYPE: Retrospective. SUBJECTS: One thousand four hundred eighty-nine consecutive institutional prostate MRI examinations from men with suspicion for PC (65 ± 8 years) between January 2015 and November 2020 were split into training (N = 794, enriched with 204 PROSTATEx examinations) and test set (N = 695). FIELD STRENGTH/SEQUENCE: 1.5 and 3T, T2-weighted turbo-spin-echo and diffusion-weighted echo-planar imaging. ASSESSMENT: Histopathological ground truth was provided by targeted and extended systematic biopsy. Reference training was performed using slice-level annotation (SLA) and compared to iterative training utilizing patient-level annotations (PLAs) with supervised feedback of CNN estimates into the next training iteration at three incremental training set sizes (N = 200, 500, 998). Model performance was assessed by comparing specificity at fixed sensitivity of 0.97 [254/262] emulating PI-RADS ≥ 3, and 0.88-0.90 [231-236/262] emulating PI-RADS ≥ 4 decisions. STATISTICAL TESTS: Receiver operating characteristic (ROC) and area under the curve (AUC) was compared using DeLong and Obuchowski test. Sensitivity and specificity were compared using McNemar test. Statistical significance threshold was P = 0.05. RESULTS: Test set (N = 695) ROC-AUC performance of SLA (trained with 200/500/998 exams) was 0.75/0.80/0.83, respectively. PLA achieved lower ROC-AUC of 0.64/0.72/0.78. Both increased performance significantly with increasing training set size. ROC-AUC for SLA at 500 exams was comparable to PLA at 998 exams (P = 0.28). ROC-AUC was significantly different between SLA and PLA at same training set sizes, however the ROC-AUC difference decreased significantly from 200 to 998 training exams. Emulating PI-RADS ≥ 3 decisions, difference between PLA specificity of 0.12 [51/433] and SLA specificity of 0.13 [55/433] became undetectable (P = 1.0) at 998 exams. Emulating PI-RADS ≥ 4 decisions, at 998 exams, SLA specificity of 0.51 [221/433] remained higher than PLA specificity at 0.39 [170/433]. However, PLA specificity at 998 exams became comparable to SLA specificity of 0.37 [159/433] at 200 exams (P = 0.70). DATA CONCLUSION: Weakly supervised training of a classification CNN using patient-level-only annotation had lower performance compared to training with slice-wise annotations, but improved significantly faster with additional training data. EVIDENCE LEVEL: 3 TECHNICAL EFFICACY: Stage 2.
Assuntos
Aprendizado Profundo , Neoplasias da Próstata , Masculino , Humanos , Imageamento por Ressonância Magnética/métodos , Neoplasias da Próstata/diagnóstico por imagem , Neoplasias da Próstata/patologia , Estudos Retrospectivos , PoliésteresRESUMO
OBJECTIVES: Risk calculators (RCs) improve patient selection for prostate biopsy with clinical/demographic information, recently with prostate MRI using the prostate imaging reporting and data system (PI-RADS). Fully-automated deep learning (DL) analyzes MRI data independently, and has been shown to be on par with clinical radiologists, but has yet to be incorporated into RCs. The goal of this study is to re-assess the diagnostic quality of RCs, the impact of replacing PI-RADS with DL predictions, and potential performance gains by adding DL besides PI-RADS. MATERIAL AND METHODS: One thousand six hundred twenty-seven consecutive examinations from 2014 to 2021 were included in this retrospective single-center study, including 517 exams withheld for RC testing. Board-certified radiologists assessed PI-RADS during clinical routine, then systematic and MRI/Ultrasound-fusion biopsies provided histopathological ground truth for significant prostate cancer (sPC). nnUNet-based DL ensembles were trained on biparametric MRI predicting the presence of sPC lesions (UNet-probability) and a PI-RADS-analogous five-point scale (UNet-Likert). Previously published RCs were validated as is; with PI-RADS substituted by UNet-Likert (UNet-Likert-substituted RC); and with both UNet-probability and PI-RADS (UNet-probability-extended RC). Together with a newly fitted RC using clinical data, PI-RADS and UNet-probability, existing RCs were compared by receiver-operating characteristics, calibration, and decision-curve analysis. RESULTS: Diagnostic performance remained stable for UNet-Likert-substituted RCs. DL contained complementary diagnostic information to PI-RADS. The newly-fitted RC spared 49% [252/517] of biopsies while maintaining the negative predictive value (94%), compared to PI-RADS ≥ 4 cut-off which spared 37% [190/517] (p < 0.001). CONCLUSIONS: Incorporating DL as an independent diagnostic marker for RCs can improve patient stratification before biopsy, as there is complementary information in DL features and clinical PI-RADS assessment. CLINICAL RELEVANCE STATEMENT: For patients with positive prostate screening results, a comprehensive diagnostic workup, including prostate MRI, DL analysis, and individual classification using nomograms can identify patients with minimal prostate cancer risk, as they benefit less from the more invasive biopsy procedure. KEY POINTS: The current MRI-based nomograms result in many negative prostate biopsies. The addition of DL to nomograms with clinical data and PI-RADS improves patient stratification before biopsy. Fully automatic DL can be substituted for PI-RADS without sacrificing the quality of nomogram predictions. Prostate nomograms show cancer detection ability comparable to previous validation studies while being suitable for the addition of DL analysis.
RESUMO
OBJECTIVES: To evaluate a fully automatic deep learning system to detect and segment clinically significant prostate cancer (csPCa) on same-vendor prostate MRI from two different institutions not contributing to training of the system. MATERIALS AND METHODS: In this retrospective study, a previously bi-institutionally validated deep learning system (UNETM) was applied to bi-parametric prostate MRI data from one external institution (A), a PI-RADS distribution-matched internal cohort (B), and a csPCa stratified subset of single-institution external public challenge data (C). csPCa was defined as ISUP Grade Group ≥ 2 determined from combined targeted and extended systematic MRI/transrectal US-fusion biopsy. Performance of UNETM was evaluated by comparing ROC AUC and specificity at typical PI-RADS sensitivity levels. Lesion-level analysis between UNETM segmentations and radiologist-delineated segmentations was performed using Dice coefficient, free-response operating characteristic (FROC), and weighted alternative (waFROC). The influence of using different diffusion sequences was analyzed in cohort A. RESULTS: In 250/250/140 exams in cohorts A/B/C, differences in ROC AUC were insignificant with 0.80 (95% CI: 0.74-0.85)/0.87 (95% CI: 0.83-0.92)/0.82 (95% CI: 0.75-0.89). At sensitivities of 95% and 90%, UNETM achieved specificity of 30%/50% in A, 44%/71% in B, and 43%/49% in C, respectively. Dice coefficient of UNETM and radiologist-delineated lesions was 0.36 in A and 0.49 in B. The waFROC AUC was 0.67 (95% CI: 0.60-0.83) in A and 0.7 (95% CI: 0.64-0.78) in B. UNETM performed marginally better on readout-segmented than on single-shot echo-planar-imaging. CONCLUSION: For same-vendor examinations, deep learning provided comparable discrimination of csPCa and non-csPCa lesions and examinations between local and two independent external data sets, demonstrating the applicability of the system to institutions not participating in model training. CLINICAL RELEVANCE STATEMENT: A previously bi-institutionally validated fully automatic deep learning system maintained acceptable exam-level diagnostic performance in two independent external data sets, indicating the potential of deploying AI models without retraining or fine-tuning, and corroborating evidence that AI models extract a substantial amount of transferable domain knowledge about MRI-based prostate cancer assessment. KEY POINTS: ⢠A previously bi-institutionally validated fully automatic deep learning system maintained acceptable exam-level diagnostic performance in two independent external data sets. ⢠Lesion detection performance and segmentation congruence was similar on the institutional and an external data set, as measured by the weighted alternative FROC AUC and Dice coefficient. ⢠Although the system generalized to two external institutions without re-training, achieving expected sensitivity and specificity levels using the deep learning system requires probability thresholds to be adjusted, underlining the importance of institution-specific calibration and quality control.
Assuntos
Aprendizado Profundo , Neoplasias da Próstata , Masculino , Humanos , Imageamento por Ressonância Magnética , Próstata/diagnóstico por imagem , Próstata/patologia , Neoplasias da Próstata/diagnóstico por imagem , Neoplasias da Próstata/patologia , Estudos RetrospectivosRESUMO
Prostate cancer (PCa) diagnosis on multi-parametric magnetic resonance images (MRI) requires radiologists with a high level of expertise. Misalignments between the MRI sequences can be caused by patient movement, elastic soft-tissue deformations, and imaging artifacts. They further increase the complexity of the task prompting radiologists to interpret the images. Recently, computer-aided diagnosis (CAD) tools have demonstrated potential for PCa diagnosis typically relying on complex co-registration of the input modalities. However, there is no consensus among research groups on whether CAD systems profit from using registration. Furthermore, alternative strategies to handle multi-modal misalignments have not been explored so far. Our study introduces and compares different strategies to cope with image misalignments and evaluates them regarding to their direct effect on diagnostic accuracy of PCa. In addition to established registration algorithms, we propose 'misalignment augmentation' as a concept to increase CAD robustness. As the results demonstrate, misalignment augmentations can not only compensate for a complete lack of registration, but if used in conjunction with registration, also improve the overall performance on an independent test set.
Assuntos
Próstata , Neoplasias da Próstata , Masculino , Humanos , Próstata/diagnóstico por imagem , Próstata/patologia , Imageamento por Ressonância Magnética/métodos , Diagnóstico por Computador/métodos , Neoplasias da Próstata/diagnóstico por imagem , Neoplasias da Próstata/patologia , ComputadoresRESUMO
OBJECTIVES: The aim of this study was to estimate the prospective utility of a previously retrospectively validated convolutional neural network (CNN) for prostate cancer (PC) detection on prostate magnetic resonance imaging (MRI). MATERIALS AND METHODS: The biparametric (T2-weighted and diffusion-weighted) portion of clinical multiparametric prostate MRI from consecutive men included between November 2019 and September 2020 was fully automatically and individually analyzed by a CNN briefly after image acquisition (pseudoprospective design). Radiology residents performed 2 research Prostate Imaging Reporting and Data System (PI-RADS) assessments of the multiparametric dataset independent from clinical reporting (paraclinical design) before and after review of the CNN results and completed a survey. Presence of clinically significant PC was determined by the presence of an International Society of Urological Pathology grade 2 or higher PC on combined targeted and extended systematic transperineal MRI/transrectal ultrasound fusion biopsy. Sensitivities and specificities on a patient and prostate sextant basis were compared using the McNemar test and compared with the receiver operating characteristic (ROC) curve of CNN. Survey results were summarized as absolute counts and percentages. RESULTS: A total of 201 men were included. The CNN achieved an ROC area under the curve of 0.77 on a patient basis. Using PI-RADS ≥3-emulating probability threshold (c3), CNN had a patient-based sensitivity of 81.8% and specificity of 54.8%, not statistically different from the current clinical routine PI-RADS ≥4 assessment at 90.9% and 54.8%, respectively ( P = 0.30/ P = 1.0). In general, residents achieved similar sensitivity and specificity before and after CNN review. On a prostate sextant basis, clinical assessment possessed the highest ROC area under the curve of 0.82, higher than CNN (AUC = 0.76, P = 0.21) and significantly higher than resident performance before and after CNN review (AUC = 0.76 / 0.76, P ≤ 0.03). The resident survey indicated CNN to be helpful and clinically useful. CONCLUSIONS: Pseudoprospective paraclinical integration of fully automated CNN-based detection of suspicious lesions on prostate multiparametric MRI was demonstrated and showed good acceptance among residents, whereas no significant improvement in resident performance was found. General CNN performance was preserved despite an observed shift in CNN calibration, identifying the requirement for continuous quality control and recalibration.
Assuntos
Aprendizado Profundo , Neoplasias da Próstata , Radiologia , Humanos , Biópsia Guiada por Imagem/métodos , Imageamento por Ressonância Magnética/métodos , Masculino , Próstata/diagnóstico por imagem , Próstata/patologia , Neoplasias da Próstata/patologia , Estudos RetrospectivosRESUMO
BACKGROUND: The potential of deep learning to support radiologist prostate magnetic resonance imaging (MRI) interpretation has been demonstrated. PURPOSE: The aim of this study was to evaluate the effects of increased and diversified training data (TD) on deep learning performance for detection and segmentation of clinically significant prostate cancer-suspicious lesions. MATERIALS AND METHODS: In this retrospective study, biparametric (T2-weighted and diffusion-weighted) prostate MRI acquired with multiple 1.5-T and 3.0-T MRI scanners in consecutive men was used for training and testing of prostate segmentation and lesion detection networks. Ground truth was the combination of targeted and extended systematic MRI-transrectal ultrasound fusion biopsies, with significant prostate cancer defined as International Society of Urological Pathology grade group greater than or equal to 2. U-Nets were internally validated on full, reduced, and PROSTATEx-enhanced training sets and subsequently externally validated on the institutional test set and the PROSTATEx test set. U-Net segmentation was calibrated to clinically desired levels in cross-validation, and test performance was subsequently compared using sensitivities, specificities, predictive values, and Dice coefficient. RESULTS: One thousand four hundred eighty-eight institutional examinations (median age, 64 years; interquartile range, 58-70 years) were temporally split into training (2014-2017, 806 examinations, supplemented by 204 PROSTATEx examinations) and test (2018-2020, 682 examinations) sets. In the test set, Prostate Imaging-Reporting and Data System (PI-RADS) cutoffs greater than or equal to 3 and greater than or equal to 4 on a per-patient basis had sensitivity of 97% (241/249) and 90% (223/249) at specificity of 19% (82/433) and 56% (242/433), respectively. The full U-Net had corresponding sensitivity of 97% (241/249) and 88% (219/249) with specificity of 20% (86/433) and 59% (254/433), not statistically different from PI-RADS (P > 0.3 for all comparisons). U-Net trained using a reduced set of 171 consecutive examinations achieved inferior performance (P < 0.001). PROSTATEx training enhancement did not improve performance. Dice coefficients were 0.90 for prostate and 0.42/0.53 for MRI lesion segmentation at PI-RADS category 3/4 equivalents. CONCLUSIONS: In a large institutional test set, U-Net confirms similar performance to clinical PI-RADS assessment and benefits from more TD, with neither institutional nor PROSTATEx performance improved by adding multiscanner or bi-institutional TD.