ABSTRACT
Accurate characterization of microcalcifications (MCs) in 2D digital mammography is a necessary step toward reducing the diagnostic uncertainty associated with the callback of indeterminate MCs. Quantitative analysis of MCs can better identify MCs with a higher likelihood of ductal carcinoma in situ or invasive cancer. However, automated identification and segmentation of MCs remain challenging, with high false positive rates. We present a two-stage multiscale approach to MC segmentation in 2D full-field digital mammograms (FFDMs) and diagnostic magnification views. Candidate objects are first delineated using blob detection and Hessian analysis. A regression convolutional network, trained to output a function with a higher response near MCs, then selects the objects that constitute actual MCs. The method was trained and validated on 435 screening and diagnostic FFDMs from two separate datasets. We then used our approach to segment MCs on magnification views of 248 cases with amorphous MCs. We modeled the extracted features using gradient tree boosting to classify each case as benign or malignant. Compared with state-of-the-art methods, our approach achieved a superior mean intersection over union per image (0.670 ± 0.121 versus 0.524 ± 0.034), intersection over union per MC object (0.607 ± 0.250 versus 0.363 ± 0.278), and true positive rate (0.744 versus 0.581) at 0.4 false positive detections per square centimeter. Features generated using our approach also outperformed those of the comparison method (0.763 versus 0.710 AUC) in distinguishing amorphous calcifications as benign or malignant.
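The intersection-over-union metric reported above is straightforward to compute from binary segmentation masks. The sketch below is a minimal illustration with toy masks (the candidate-delineation and network stages of the abstract's pipeline are not reproduced here); names and mask sizes are hypothetical.

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two binary segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

# Toy 2D masks standing in for a predicted and a reference MC segmentation.
pred = np.zeros((8, 8), dtype=bool)
ref = np.zeros((8, 8), dtype=bool)
pred[2:6, 2:6] = True   # 16 px
ref[3:7, 3:7] = True    # 16 px, overlapping pred in 9 px
print(iou(pred, ref))   # 9 / 23 ≈ 0.391
```

Per-image IoU averages this quantity over whole images, while per-object IoU applies it to each connected MC object separately.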
Subjects
Breast Diseases , Breast Neoplasms , Calcinosis , Humans , Female , Radiographic Image Enhancement/methods , Breast Diseases/diagnostic imaging , Mammography/methods , Calcinosis/diagnostic imaging , Probability , Breast Neoplasms/diagnostic imaging

ABSTRACT
Importance: An accurate and robust artificial intelligence (AI) algorithm for detecting cancer in digital breast tomosynthesis (DBT) could significantly improve detection accuracy and reduce health care costs worldwide. Objectives: To make training and evaluation data for the development of AI algorithms for DBT analysis available, to develop well-defined benchmarks, and to create publicly available code for existing methods. Design, Setting, and Participants: This diagnostic study is based on a multi-institutional international grand challenge in which research teams developed algorithms to detect lesions in DBT. A data set of 22,032 reconstructed DBT volumes was made available to research teams. Phase 1, in which teams were provided 700 scans from the training set, 120 from the validation set, and 180 from the test set, took place from December 2020 to January 2021, and phase 2, in which teams were given the full data set, took place from May to July 2021. Main Outcomes and Measures: The overall performance was evaluated by mean sensitivity for biopsied lesions using only DBT volumes with biopsied lesions; ties were broken by including all DBT volumes. Results: A total of 8 teams participated in the challenge. The team with the highest mean sensitivity for biopsied lesions was the NYU B-Team, with 0.957 (95% CI, 0.924-0.984), and the second-place team, ZeDuS, had a mean sensitivity of 0.926 (95% CI, 0.881-0.964). When the results were aggregated, the mean sensitivity for all submitted algorithms was 0.879; for only those that participated in phase 2, it was 0.926. Conclusions and Relevance: In this diagnostic study, an international competition produced AI algorithms that detect lesions on DBT images with high sensitivity.
A standardized performance benchmark for the detection task using publicly available clinical imaging data was released, with detailed descriptions and analyses of submitted algorithms accompanied by a public release of their predictions and code for selected methods. These resources will serve as a foundation for future research on computer-assisted diagnosis methods for DBT, significantly lowering the barrier of entry for new researchers.
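A sensitivity-for-lesions metric of the kind used above can be sketched as follows, assuming a simplified center-distance matching criterion (the challenge's official matching rule is more involved; the function and threshold names here are illustrative only).

```python
import numpy as np

def mean_sensitivity(gt_lesions, predictions, dist_thresh=50.0):
    """Fraction of ground-truth lesion centers matched by at least one
    predicted center within dist_thresh pixels (a simplified stand-in
    for the challenge's official matching rule)."""
    hits = 0
    for gx, gy in gt_lesions:
        if any(np.hypot(px - gx, py - gy) <= dist_thresh
               for px, py in predictions):
            hits += 1
    return hits / len(gt_lesions) if gt_lesions else 0.0

# One of two toy lesions is matched by a nearby prediction.
gt = [(100, 100), (300, 200)]
preds = [(110, 95), (500, 500)]
print(mean_sensitivity(gt, preds))  # 0.5
```

Averaging this per-volume quantity over the evaluation set yields a mean sensitivity comparable in spirit to the leaderboard metric.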
Subjects
Artificial Intelligence , Breast Neoplasms , Humans , Female , Benchmarking , Mammography/methods , Algorithms , Radiographic Image Interpretation, Computer-Assisted/methods , Breast Neoplasms/diagnostic imaging

ABSTRACT
BACKGROUND: Amorphous calcifications noted on mammograms (i.e., small and indistinct calcifications that are difficult to characterize) are associated with high diagnostic uncertainty, often leading to biopsies. Yet, only 20% of biopsied amorphous calcifications are cancer. We present a quantitative approach for distinguishing between benign and actionable (high-risk and malignant) amorphous calcifications using a combination of local textures, global spatial relationships, and interpretable handcrafted expert features. METHOD: Our approach was trained and validated on a set of 168 2D full-field digital mammography exams (248 images) from 168 patients. Within these 248 images, we identified 276 image regions with segmented amorphous calcifications and a biopsy-confirmed diagnosis. A set of local (radiomic and region measurements) and global features (distribution and expert-defined) were extracted from each image. Local features were grouped using an unsupervised k-means clustering algorithm. All global features were concatenated with clustered local features and used to train a LightGBM classifier to distinguish benign from actionable cases. RESULTS: On the held-out test set of 60 images, our approach achieved a sensitivity of 100%, specificity of 35%, and a positive predictive value of 38% when the decision threshold was set to 0.4. Given that all of the images in our test set resulted in a recommendation of a biopsy, the use of our algorithm would have identified 15 images (25%) that were benign, potentially reducing the number of breast biopsies. CONCLUSIONS: Quantitative analysis of full-field digital mammograms can extract subtle shape, texture, and distribution features that may help to distinguish between benign and actionable amorphous calcifications.
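The clustering-then-boosting pipeline described in METHOD can be sketched on synthetic data as below. This is a minimal illustration, not the authors' implementation: the feature values are random stand-ins, and scikit-learn's GradientBoostingClassifier is substituted for LightGBM so the sketch stays self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: per-image "local" (radiomic/region) features and
# "global" (distribution/expert-defined) features, with toy labels.
n_images, n_local, n_global, k = 40, 6, 4, 3
local_feats = rng.normal(size=(n_images, n_local))
global_feats = rng.normal(size=(n_images, n_global))
labels = (local_feats[:, 0] + global_feats[:, 0] > 0).astype(int)

# Group local features with unsupervised k-means, then concatenate the
# one-hot cluster assignment with the global features, as in the pipeline.
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(local_feats)
one_hot = np.eye(k)[km.labels_]
X = np.concatenate([one_hot, global_feats], axis=1)

# Gradient-boosted trees distinguish benign from actionable cases.
clf = GradientBoostingClassifier(random_state=0).fit(X, labels)
scores = clf.predict_proba(X)[:, 1]  # per-image actionability scores
```

Thresholding `scores` (the abstract uses a decision threshold of 0.4) trades sensitivity against specificity.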
Subjects
Breast Diseases , Breast Neoplasms , Breast/diagnostic imaging , Breast/pathology , Breast Diseases/diagnostic imaging , Breast Neoplasms/diagnostic imaging , Breast Neoplasms/pathology , Female , Humans , Mammography/methods , Radiographic Image Interpretation, Computer-Assisted/methods , Risk Assessment

ABSTRACT
Importance: With a shortfall in fellowship-trained breast radiologists, mammography screening programs are looking toward artificial intelligence (AI) to increase efficiency and diagnostic accuracy. External validation studies provide an initial assessment of how promising AI algorithms perform in different practice settings. Objective: To externally validate an ensemble deep-learning model using data from a high-volume, distributed screening program of an academic health system with a diverse patient population. Design, Setting, and Participants: In this diagnostic study, an ensemble learning method, which reweights outputs of the 11 highest-performing individual AI models from the Digital Mammography Dialogue on Reverse Engineering Assessment and Methods (DREAM) Mammography Challenge, was used to predict the cancer status of an individual using a standard set of screening mammography images. This study was conducted using retrospective patient data collected between 2010 and 2020 from women aged 40 years and older who underwent a routine breast screening examination and participated in the Athena Breast Health Network at the University of California, Los Angeles (UCLA). Main Outcomes and Measures: Performance of the challenge ensemble method (CEM) and the CEM combined with radiologist assessment (CEM+R) were compared with diagnosed ductal carcinoma in situ and invasive cancers within a year of the screening examination using performance metrics, such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Results: Evaluated on 37,317 examinations from 26,817 women (mean [SD] age, 58.4 [11.5] years), individual model AUROC estimates ranged from 0.77 (95% CI, 0.75-0.79) to 0.83 (95% CI, 0.81-0.85). The CEM model achieved an AUROC of 0.85 (95% CI, 0.84-0.87) in the UCLA cohort, lower than the performance achieved in the Kaiser Permanente Washington (AUROC, 0.90) and Karolinska Institute (AUROC, 0.92) cohorts.
The CEM+R model achieved sensitivity (0.813 [95% CI, 0.781-0.843] vs 0.826 [95% CI, 0.795-0.856]; P = .20) and specificity (0.925 [95% CI, 0.916-0.934] vs 0.930 [95% CI, 0.929-0.932]; P = .18) similar to radiologist performance overall. However, the CEM+R model had significantly lower sensitivity (0.596 [95% CI, 0.466-0.717] vs 0.850 [95% CI, 0.766-0.923]; P < .001) and specificity (0.803 [95% CI, 0.734-0.861] vs 0.945 [95% CI, 0.936-0.954]; P < .001) than radiologists in women with a prior history of breast cancer, as well as lower specificity in Hispanic women (0.894 [95% CI, 0.873-0.910] vs 0.926 [95% CI, 0.919-0.933]; P = .004). Conclusions and Relevance: This study found that the high performance of an ensemble deep-learning model for automated screening mammography interpretation did not generalize to a more diverse screening cohort, suggesting that the model experienced underspecification. This study suggests the need for model transparency and fine-tuning of AI models for specific target populations prior to their clinical adoption.
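The reweighting-and-evaluation step behind the CEM can be sketched as follows, on synthetic scores. The weights here are uniform purely for illustration (the CEM's actual weights were learned on challenge data), and all names and sizes are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic labels and scores from 11 hypothetical member models
# for 200 screening exams (rows: models, columns: exams).
y = rng.integers(0, 2, size=200)
member_scores = np.clip(
    y[None, :] * 0.3 + rng.normal(0.5, 0.25, size=(11, 200)), 0, 1)

# A simple reweighting: a weighted average of member outputs gives
# one ensemble score per exam; uniform weights stand in for learned ones.
weights = np.full(11, 1 / 11)
ensemble = weights @ member_scores

print(round(roc_auc_score(y, ensemble), 3))
```

AUROC computed this way on each cohort is what allows the cross-site comparison reported above (UCLA vs Kaiser Permanente Washington vs Karolinska Institute).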