RESUMO
BACKGROUND: Artificial intelligence (AI) systems can potentially aid the diagnostic pathway of prostate cancer by alleviating the increasing workload, preventing overdiagnosis, and reducing the dependence on experienced radiologists. We aimed to investigate the performance of AI systems at detecting clinically significant prostate cancer on MRI in comparison with radiologists using the Prostate Imaging-Reporting and Data System version 2.1 (PI-RADS 2.1) and the standard of care in multidisciplinary routine practice at scale. METHODS: In this international, paired, non-inferiority, confirmatory study, we trained and externally validated an AI system (developed within an international consortium) for detecting Gleason grade group 2 or greater cancers using a retrospective cohort of 10 207 MRI examinations from 9129 patients. Of these examinations, 9207 cases from three centres (11 sites) based in the Netherlands were used for training and tuning, and 1000 cases from four centres (12 sites) based in the Netherlands and Norway were used for testing. In parallel, we facilitated a multireader, multicase observer study with 62 radiologists (45 centres in 20 countries; median 7 [IQR 5-10] years of experience in reading prostate MRI) using PI-RADS (2.1) on 400 paired MRI examinations from the testing cohort. Primary endpoints were the sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC) of the AI system in comparison with that of all readers using PI-RADS (2.1) and in comparison with that of the historical radiology readings made during multidisciplinary routine practice (ie, the standard of care with the aid of patient history and peer consultation). Histopathology and at least 3 years (median 5 [IQR 4-6] years) of follow-up were used to establish the reference standard. The statistical analysis plan was prespecified with a primary hypothesis of non-inferiority (considering a margin of 0·05) and a secondary hypothesis of superiority towards the AI system, if non-inferiority was confirmed. This study was registered at ClinicalTrials.gov, NCT05489341. FINDINGS: Of the 10 207 examinations included from Jan 1, 2012, through Dec 31, 2021, 2440 cases had histologically confirmed Gleason grade group 2 or greater prostate cancer. In the subset of 400 testing cases in which the AI system was compared with the radiologists participating in the reader study, the AI system showed a statistically superior and non-inferior AUROC of 0·91 (95% CI 0·87-0·94; p<0·0001), in comparison to the pool of 62 radiologists with an AUROC of 0·86 (0·83-0·89), with a lower boundary of the two-sided 95% Wald CI for the difference in AUROC of 0·02. At the mean PI-RADS 3 or greater operating point of all readers, the AI system detected 6·8% more cases with Gleason grade group 2 or greater cancers at the same specificity (57·7%, 95% CI 51·6-63·3), or 50·4% fewer false-positive results and 20·0% fewer cases with Gleason grade group 1 cancers at the same sensitivity (89·4%, 95% CI 85·3-92·9). In all 1000 testing cases where the AI system was compared with the radiology readings made during multidisciplinary practice, non-inferiority was not confirmed, as the AI system showed lower specificity (68·9% [95% CI 65·3-72·4] vs 69·0% [65·5-72·5]) at the same sensitivity (96·1%, 94·0-98·2) as the PI-RADS 3 or greater operating point. The lower boundary of the two-sided 95% Wald CI for the difference in specificity (-0·04) was greater than the non-inferiority margin (-0·05) and a p value below the significance threshold was reached (p<0·001). INTERPRETATION: An AI system was superior to radiologists using PI-RADS (2.1), on average, at detecting clinically significant prostate cancer and comparable to the standard of care. Such a system shows the potential to be a supportive tool within a primary diagnostic setting, with several associated benefits for patients and radiologists. Prospective validation is needed to test clinical applicability of this system. FUNDING: Health~Holland and EU Horizon 2020.
Assuntos
Inteligência Artificial , Imageamento por Ressonância Magnética , Neoplasias da Próstata , Radiologistas , Humanos , Masculino , Neoplasias da Próstata/diagnóstico por imagem , Neoplasias da Próstata/patologia , Idoso , Estudos Retrospectivos , Pessoa de Meia-Idade , Gradação de Tumores , Países Baixos , Curva ROCRESUMO
Background A priori identification of patients at risk of artificial intelligence (AI) failure in diagnosing cancer would contribute to the safer clinical integration of diagnostic algorithms. Purpose To evaluate AI prediction variability as an uncertainty quantification (UQ) metric for identifying cases at risk of AI failure in diagnosing cancer at MRI and CT across different cancer types, data sets, and algorithms. Materials and Methods Multicenter data sets and publicly available AI algorithms from three previous studies that evaluated detection of pancreatic cancer on contrast-enhanced CT images, detection of prostate cancer on MRI scans, and prediction of pulmonary nodule malignancy on low-dose CT images were analyzed retrospectively. Each task's algorithm was extended to generate an uncertainty score based on ensemble prediction variability. AI accuracy percentage and partial area under the receiver operating characteristic curve (pAUC) were compared between certain and uncertain patient groups in a range of percentile thresholds (10%-90%) for the uncertainty score using permutation tests for statistical significance. The pulmonary nodule malignancy prediction algorithm was compared with 11 clinical readers for the certain group (CG) and uncertain group (UG). Results In total, 18 022 images were used for training and 838 images were used for testing. AI diagnostic accuracy was higher for the cases in the CG across all tasks (P < .001). At an 80% threshold of certain predictions, accuracy in the CG was 21%-29% higher than in the UG and 4%-6% higher than in the overall test data sets. The lesion-level pAUC in the CG was 0.25-0.39 higher than in the UG and 0.05-0.08 higher than in the overall test data sets (P < .001). For pulmonary nodule malignancy prediction, accuracy of AI was on par with clinicians for cases in the CG (AI results vs clinician results, 80% [95% CI: 76, 85] vs 78% [95% CI: 70, 87]; P = .07) but worse for cases in the UG (AI results vs clinician results, 50% [95% CI: 37, 64] vs 68% [95% CI: 60, 76]; P < .001). Conclusion An AI-prediction UQ metric consistently identified reduced performance of AI in cancer diagnosis. © RSNA, 2023 Supplemental material is available for this article. See also the editorial by Babyn in this issue.
Assuntos
Neoplasias Pulmonares , Transtornos Mentais , Masculino , Humanos , Inteligência Artificial , Estudos Retrospectivos , Imageamento por Ressonância Magnética , Neoplasias Pulmonares/diagnóstico por imagem , Tomografia Computadorizada por Raios XRESUMO
OBJECTIVES: Deep learning (DL) studies for the detection of clinically significant prostate cancer (csPCa) on magnetic resonance imaging (MRI) often overlook potentially relevant clinical parameters such as prostate-specific antigen, prostate volume, and age. This study explored the integration of clinical parameters and MRI-based DL to enhance diagnostic accuracy for csPCa on MRI. MATERIALS AND METHODS: We retrospectively analyzed 932 biparametric prostate MRI examinations performed for suspected csPCa (ISUP ≥2) at 2 institutions. Each MRI scan was automatically analyzed by a previously developed DL model to detect and segment csPCa lesions. Three sets of features were extracted: DL lesion suspicion levels, clinical parameters (prostate-specific antigen, prostate volume, age), and MRI-based lesion volumes for all DL-detected lesions. Six multimodal artificial intelligence (AI) classifiers were trained for each combination of feature sets, employing both early (feature-level) and late (decision-level) information fusion methods. The diagnostic performance of each model was tested internally on 20% of center 1 data and externally on center 2 data (n = 529). Receiver operating characteristic comparisons determined the optimal feature combination and information fusion method and assessed the benefit of multimodal versus unimodal analysis. The optimal model performance was compared with a radiologist using PI-RADS. RESULTS: Internally, the multimodal AI integrating DL suspicion levels with clinical features via early fusion achieved the highest performance. Externally, it surpassed baselines using clinical parameters (0.77 vs 0.67 area under the curve [AUC], P < 0.001) and DL suspicion levels alone (AUC: 0.77 vs 0.70, P = 0.006). Early fusion outperformed late fusion in external data (0.77 vs 0.73 AUC, P = 0.005). No significant performance gaps were observed between multimodal AI and radiologist assessments (internal: 0.87 vs 0.88 AUC; external: 0.77 vs 0.75 AUC, both P > 0.05). CONCLUSIONS: Multimodal AI (combining DL suspicion levels and clinical parameters) outperforms clinical and MRI-only AI for csPCa detection. Early information fusion enhanced AI robustness in our multicenter setting. Incorporating lesion volumes did not enhance diagnostic efficacy.
RESUMO
PURPOSE: To explore diagnostic deep learning for optimizing the prostate MRI protocol by assessing the diagnostic efficacy of MRI sequences. METHOD: This retrospective study included 840 patients with a biparametric prostate MRI scan. The MRI protocol included a T2-weighted image, three DWI sequences (b50, b400, and b800 s/mm2), a calculated ADC map, and a calculated b1400 sequence. Two accelerated MRI protocols were simulated, using only two acquired b-values to calculate the ADC and b1400. Deep learning models were trained to detect prostate cancer lesions on accelerated and full protocols. The diagnostic performances of the protocols were compared on the patient-level with the area under the receiver operating characteristic (AUROC), using DeLong's test, and on the lesion-level with the partial area under the free response operating characteristic (pAUFROC), using a permutation test. Validation of the results was performed among expert radiologists. RESULTS: No significant differences in diagnostic performance were found between the accelerated protocols and the full bpMRI baseline. Omitting b800 reduced 53% DWI scan time, with a performance difference of + 0.01 AUROC (p = 0.20) and -0.03 pAUFROC (p = 0.45). Omitting b400 reduced 32% DWI scan time, with a performance difference of -0.01 AUROC (p = 0.65) and + 0.01 pAUFROC (p = 0.73). Multiple expert radiologists underlined the findings. CONCLUSIONS: This study shows that deep learning can assess the diagnostic efficacy of MRI sequences by comparing prostate MRI protocols on diagnostic accuracy. Omitting either the b400 or the b800 DWI sequence can optimize the prostate MRI protocol by reducing scan time without compromising diagnostic quality.
Assuntos
Aprendizado Profundo , Imageamento por Ressonância Magnética , Neoplasias da Próstata , Humanos , Masculino , Neoplasias da Próstata/diagnóstico por imagem , Estudos Retrospectivos , Imageamento por Ressonância Magnética/métodos , Pessoa de Meia-Idade , Idoso , Interpretação de Imagem Assistida por Computador/métodos , Reprodutibilidade dos Testes , Sensibilidade e EspecificidadeRESUMO
BACKGROUND AND OBJECTIVE: Biparametric magnetic resonance imaging (bpMRI), excluding dynamic contrast-enhanced (DCE) magnetic resonance imaging (MRI), is a potential replacement for multiparametric MRI (mpMRI) in diagnosing clinically significant prostate cancer (csPCa). An extensive international multireader multicase observer study was conducted to assess the noninferiority of bpMRI to mpMRI in csPCa diagnosis. METHODS: An observer study was conducted with 400 mpMRI examinations from four European centers, excluding examinations with prior prostate treatment or csPCa (Gleason grade [GG] ≥2) findings. Readers assessed bpMRI and mpMRI sequentially, assigning lesion-specific Prostate Imaging Reporting and Data System (PI-RADS) scores (3-5) and a patient-level suspicion score (0-100). The noninferiority of patient-level bpMRI versus mpMRI csPCa diagnosis was evaluated using the area under the receiver operating curve (AUROC) alongside the sensitivity and specificity at PI-RADS ≥3 with a 5% margin. The secondary outcomes included insignificant prostate cancer (GG1) diagnosis, diagnostic evaluations at alternative risk thresholds, decision curve analyses (DCAs), and subgroup analyses considering reader expertise. Histopathology and ≥3 yr of follow-up were used for the reference standard. KEY FINDINGS AND LIMITATIONS: Sixty-two readers (45 centers and 20 countries) participated. The prevalence of csPCa was 33% (133/400); bpMRI and mpMRI showed similar AUROC values of 0.853 (95% confidence interval [CI], 0.819-0.887) and 0.859 (95% CI, 0.826-0.893), respectively, with a noninferior difference of -0.6% (95% CI, -1.2% to 0.1%, p < 0.001). At PI-RADS ≥3, bpMRI and mpMRI had sensitivities of 88.6% (95% CI, 84.8-92.3%) and 89.4% (95% CI, 85.8-93.1%), respectively, with a noninferior difference of -0.9% (95% CI, -1.7% to 0.0%, p < 0.001), and specificities of 58.6% (95% CI, 52.3-63.1%) and 57.7% (95% CI, 52.3-63.1%), respectively, with a noninferior difference of 0.9% (95% CI, 0.0-1.8%, p < 0.001). At alternative risk thresholds, mpMRI increased sensitivity at the expense of reduced specificity. DCA demonstrated the highest net benefit for an mpMRI pathway in cancer-averse scenarios, whereas a bpMRI pathway showed greater benefit for biopsy-averse scenarios. A subgroup analysis indicated limited additional benefit of DCE MRI for nonexperts. Limitations included that biopsies were conducted based on mpMRI imaging, and reading was performed in a sequential order. CONCLUSIONS AND CLINICAL IMPLICATIONS: It has been found that bpMRI is noninferior to mpMRI in csPCa diagnosis at AUROC, along with the sensitivity and specificity at PI-RADS ≥3, showing its value in individuals without prior csPCa findings and prostate treatment. Additional randomized prospective studies are required to investigate the generalizability of outcomes.
RESUMO
Purpose: To evaluate a novel method of semisupervised learning (SSL) guided by automated sparse information from diagnostic reports to leverage additional data for deep learning-based malignancy detection in patients with clinically significant prostate cancer. Materials and Methods: This retrospective study included 7756 prostate MRI examinations (6380 patients) performed between January 2014 and December 2020 for model development. An SSL method, report-guided SSL (RG-SSL), was developed for detection of clinically significant prostate cancer using biparametric MRI. RG-SSL, supervised learning (SL), and state-of-the-art SSL methods were trained using 100, 300, 1000, or 3050 manually annotated examinations. Performance on detection of clinically significant prostate cancer by RG-SSL, SL, and SSL was compared on 300 unseen examinations from an external center with a histopathologically confirmed reference standard. Performance was evaluated using receiver operating characteristic (ROC) and free-response ROC analysis. P values for performance differences were generated with a permutation test. Results: At 100 manually annotated examinations, mean examination-based diagnostic area under the ROC curve (AUC) values for RG-SSL, SL, and the best SSL were 0.86 ± 0.01 (SD), 0.78 ± 0.03, and 0.81 ± 0.02, respectively. Lesion-based detection partial AUCs were 0.62 ± 0.02, 0.44 ± 0.04, and 0.48 ± 0.09, respectively. Examination-based performance of SL with 3050 examinations was matched by RG-SSL with 169 manually annotated examinations, thus requiring 14 times fewer annotations. Lesion-based performance was matched with 431 manually annotated examinations, requiring six times fewer annotations. Conclusion: RG-SSL outperformed SSL in clinically significant prostate cancer detection and achieved performance similar to SL even at very low annotation budgets.Keywords: Annotation Efficiency, Computer-aided Detection and Diagnosis, MRI, Prostate Cancer, Semisupervised Deep Learning Supplemental material is available for this article. Published under a CC BY 4.0 license.
RESUMO
Early detection improves prognosis in pancreatic ductal adenocarcinoma (PDAC), but is challenging as lesions are often small and poorly defined on contrast-enhanced computed tomography scans (CE-CT). Deep learning can facilitate PDAC diagnosis; however, current models still fail to identify small (<2 cm) lesions. In this study, state-of-the-art deep learning models were used to develop an automatic framework for PDAC detection, focusing on small lesions. Additionally, the impact of integrating the surrounding anatomy was investigated. CE-CT scans from a cohort of 119 pathology-proven PDAC patients and a cohort of 123 patients without PDAC were used to train a nnUnet for automatic lesion detection and segmentation (nnUnet_T). Two additional nnUnets were trained to investigate the impact of anatomy integration: (1) segmenting the pancreas and tumor (nnUnet_TP), and (2) segmenting the pancreas, tumor, and multiple surrounding anatomical structures (nnUnet_MS). An external, publicly available test set was used to compare the performance of the three networks. The nnUnet_MS achieved the best performance, with an area under the receiver operating characteristic curve of 0.91 for the whole test set and 0.88 for tumors <2 cm, showing that state-of-the-art deep learning can detect small PDAC and benefits from anatomy information.