Edge roughness quantifies impact of physician variation on training and performance of deep learning auto-segmentation models for the esophagus.
Yan, Yujie; Kehayias, Christopher; He, John; Aerts, Hugo J W L; Fitzgerald, Kelly J; Kann, Benjamin H; Kozono, David E; Guthier, Christian V; Mak, Raymond H.
Affiliation
  • Yan Y; Department of Radiation Oncology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA.
  • Kehayias C; Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA.
  • He J; Department of Radiation Oncology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA.
  • Aerts HJWL; Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA.
  • Fitzgerald KJ; Department of Radiation Oncology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA.
  • Kann BH; Department of Radiation Oncology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA.
  • Kozono DE; Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA.
  • Guthier CV; Department of Radiation Oncology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA.
  • Mak RH; Department of Radiation Oncology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA.
Sci Rep; 14(1): 2536, 2024 Jan 30.
Article in English | MEDLINE | ID: mdl-38291051
ABSTRACT
Manual segmentation of tumors and organs-at-risk (OAR) in 3D imaging for radiation-therapy planning is time-consuming and subject to inter-observer variation. Artificial intelligence (AI) can assist with segmentation, but challenges remain in ensuring high-quality segmentation, especially for small, variable structures such as the esophagus. We investigated the effect of physician variation in segmentation quality and style on the training of deep-learning models for esophagus segmentation, and proposed a new metric, edge roughness, to quantify slice-to-slice inconsistency.

This study includes a real-world cohort of 394 patients, each of whom received radiation therapy (mainly for lung cancer). Segmentation of the esophagus was performed by 8 physicians as part of routine clinical care. We evaluated the manual segmentations by comparing their length and edge roughness among physicians to analyze inconsistencies. In total, we trained eight multiple- and individual-physician segmentation models based on U-Net architectures with residual backbones, and measured the performance of each model with the volumetric Dice coefficient. The proposed edge-roughness metric quantifies the shift of the segmentation between adjacent slices by calculating the curvature of the edges of the 2D sagittal- and coronal-view projections.

The auto-segmentation model trained on multiple physicians (MD1-7) achieved the highest mean Dice of 73.7 ± 14.8%. The individual-physician model (MD7) with the highest edge roughness (mean ± SD 0.106 ± 0.016) showed significantly lower volumetric Dice on test cases than the other individual models (MD7 58.5 ± 15.8%, MD6 67.1 ± 16.8%, p < 0.001). A multiple-physician model trained after removing the MD7 data produced fewer outliers (e.g., Dice ≤ 40%: 4 cases for MD1-6 vs. 7 cases for MD1-7; Ntotal = 394). While we initially detected this pattern in a single clinician, we validated the edge-roughness metric across the entire dataset: the model trained on the lowest-quantile edge-roughness cases (MDER-Q1, Ntrain = 62) achieved significantly higher Dice (Ntest = 270) than the model trained on the highest-quantile cases (MDER-Q4, Ntrain = 62) (MDER-Q1 67.8 ± 14.8%, MDER-Q4 62.8 ± 15.7%, p < 0.001).

This study demonstrates that there is substantial variation in segmentation style and quality in routine clinical care, and that training AI auto-segmentation algorithms on real-world clinical datasets may yield unexpectedly under-performing algorithms when outliers are included. Importantly, this study provides a novel evaluation metric, edge roughness, to quantify physician variation in segmentation, which will allow developers to filter clinical training data to optimize model performance.
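The abstract does not give the exact formulation of edge roughness, so the following is only a minimal Python sketch of the two evaluation quantities it describes. It assumes binary NumPy masks ordered (z, y, x), and it approximates the curvature of a projection edge by the discrete second difference of the edge position along the superior-inferior axis; the function names (volumetric_dice, edge_roughness) and the averaging choices are illustrative assumptions, not the authors' published implementation.

import numpy as np

def volumetric_dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Volumetric Dice coefficient between two binary 3D masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def _projection_edge_curvature(projection: np.ndarray) -> np.ndarray:
    """Absolute discrete curvature (second difference) of the left and right
    edges of a 2D binary projection, traced row by row along the
    superior-inferior axis."""
    rows = np.where(projection.any(axis=1))[0]
    curvatures = []
    for take_edge in (lambda c: c[0], lambda c: c[-1]):  # left edge, right edge
        edge = np.array([take_edge(np.where(projection[r])[0]) for r in rows],
                        dtype=float)
        if edge.size >= 3:
            curvatures.append(np.abs(np.diff(edge, n=2)))
    return np.concatenate(curvatures) if curvatures else np.zeros(1)

def edge_roughness(mask: np.ndarray) -> float:
    """Edge roughness of a 3D binary mask (axes assumed z=slice, y, x):
    mean edge curvature of the coronal and sagittal projections, echoing the
    slice-to-slice shift described in the abstract."""
    coronal = mask.any(axis=1)   # collapse the anterior-posterior axis
    sagittal = mask.any(axis=2)  # collapse the left-right axis
    return float(np.mean(np.concatenate(
        [_projection_edge_curvature(coronal),
         _projection_edge_curvature(sagittal)])))

With a per-case score of this kind, the filtering experiment in the abstract could be reproduced in spirit by ranking training cases by edge_roughness and training separate models on the lowest and highest quartiles (as in MDER-Q1 vs. MDER-Q4).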
Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Main subject: Deep Learning | Type of study: Guideline / Prognostic studies | Limits: Humans | Language: English | Journal: Sci Rep | Year: 2024 | Document type: Article | Country of publication: United Kingdom