ABSTRACT
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint: a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.
Subject(s)
Algorithms, Computer-Assisted Image Processing, Machine Learning, Semantics
ABSTRACT
Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.
Subject(s)
Artificial Intelligence
ABSTRACT
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: while taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although the work focuses on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
ABSTRACT
Segmentation of abdominal organs has been a comprehensive, yet unresolved, research field for many years. In the last decade, intensive developments in deep learning (DL) introduced new state-of-the-art segmentation systems. Although these systems outperform existing ones in overall accuracy, the effects of DL model properties and parameters on performance are hard to interpret. This makes comparative analysis a necessary tool for interpretable studies and systems. Moreover, the performance of DL for emerging learning approaches such as cross-modality and multi-modal semantic segmentation tasks has rarely been discussed. To expand the knowledge on these topics, the CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation challenge was organized in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI), 2019, in Venice, Italy. Abdominal organ segmentation from routine acquisitions plays an important role in several clinical applications, such as pre-surgical planning or morphological and volumetric follow-ups for various diseases. These applications require a certain level of performance on a diverse set of metrics, such as maximum symmetric surface distance (MSSD) to determine the surgical error margin or overlap errors for tracking size and shape differences. Previous abdomen-related challenges mainly focused on tumor/lesion detection and/or classification with a single modality. Conversely, CHAOS provides both abdominal CT and MR data from healthy subjects for single and multiple abdominal organ segmentation. Five different but complementary tasks were designed to analyze the capabilities of participating approaches from multiple perspectives. The results were investigated thoroughly and compared with manual annotations and interactive methods.
The analysis shows that DL models for a single modality (CT / MR) can deliver reliable volumetric analysis performance (DICE: 0.98 ± 0.00 / 0.95 ± 0.01), but the best MSSD performance remains limited (21.89 ± 13.94 / 20.85 ± 10.63 mm). The performance of participating models decreases dramatically in cross-modality tasks for the liver (DICE: 0.88 ± 0.15; MSSD: 36.33 ± 21.97 mm). Despite contrary examples in other applications, multi-tasking DL models designed to segment all organs are observed to perform worse than organ-specific ones (performance drop of around 5%). Nevertheless, some of the successful models perform better in their multi-organ versions. We conclude that exploring these pros and cons in both single- vs multi-organ and cross-modality segmentation is poised to have an impact on further research into effective algorithms that would support real-world clinical applications. Finally, with more than 1500 participants and more than 550 submissions, another important contribution of this study is the analysis of shortcomings of challenge organization, such as the effects of multiple submissions and the peeking phenomenon.
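The gap between a high DICE and a large MSSD reported above can be illustrated with a small sketch. It assumes the standard definitions (DICE as twice the overlap divided by the summed mask sizes, MSSD as the largest nearest-neighbour distance between the two surfaces in either direction); the function names, the toy masks and the simplification of treating every voxel as a surface point are assumptions for illustration, not the challenge's evaluation code.

```python
from math import dist


def dice(pred, ref):
    """DICE overlap between two sets of foreground voxel coordinates."""
    if not pred and not ref:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def max_symmetric_surface_distance(surf_a, surf_b, spacing=(1.0, 1.0, 1.0)):
    """Largest nearest-neighbour distance between two point sets, in mm.

    Brute force, so only suitable for small toy inputs.
    """
    def to_mm(point):
        return tuple(c * s for c, s in zip(point, spacing))

    a = [to_mm(p) for p in surf_a]
    b = [to_mm(p) for p in surf_b]
    a_to_b = max(min(dist(p, q) for q in b) for p in a)
    b_to_a = max(min(dist(p, q) for q in a) for p in b)
    return max(a_to_b, b_to_a)


# Hypothetical reference mask: a 10 x 10 square of voxels on one slice.
ref = {(x, y, 0) for x in range(10) for y in range(10)}
# Hypothetical prediction: identical, plus one stray voxel far from the organ.
pred = ref | {(30, 0, 0)}

overlap = dice(pred, ref)
# Simplification: every voxel is treated as a surface point.
mssd = max_symmetric_surface_distance(pred, ref)
print(f"DICE = {overlap:.3f}, MSSD = {mssd:.1f} mm")
```

A single stray voxel barely changes the overlap but dominates the maximum distance, which is one reason the two metrics can disagree.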
Subject(s)
Algorithms, X-Ray Computed Tomography, Abdomen/diagnostic imaging, Humans, Liver
ABSTRACT
BACKGROUND: The DICOM standard has no modules that extend the capabilities of two-dimensional Presentation States to three dimensions (3D). Once the final 3D rendering is obtained, only video/image exporting or snapshots can be used. To increase the utility of 3D Presentation States in clinical practice and teleradiology, storing and transferring the segmentation results, which are obtained after tedious procedures, can be very effective. PURPOSE: To propose a strategy for preserving interaction and mobility of visualizations for teleradiology by storing and transferring only binary segmented data, which is effectively compressed by modern adaptive and context-based reversible methods. MATERIAL AND METHODS: A diverse set of segmented data is collected, comprising four abdominal organs (liver, spleen, right and left kidneys) from 20 T1-DUAL and 20 T2-SPIR MRI datasets, liver from 20 CT datasets, and abdominal aorta with aneurysms (AAA) from 19 computed tomography-angiography datasets. Each organ is segmented manually by expert physicians, and binary volumes are created. The well-established reversible binary compression methods PNG, JPEG-LS, JPEG-XR, CCITT-G4, LZW, JBIG2, and ZIP are applied to the medical datasets. Recently proposed context-based (3D-RLE) and adaptive (ABIC) algorithms are also employed. Performance is assessed in terms of the compression ratio (CR), a universal compression metric. RESULTS: Reversible compression of binary volumes results in substantial decreases in file size, such as from 254 to 2.14 MB for CT-AAA and from 56.7 to 0.3 MB for CT-liver. Moreover, compared with the well-established methods (mean 76.14%), CR increases significantly for all segmented organs from both CT and MRI datasets when ABIC (95.49%) and 3D-RLE (94.98%) are used.
The hypothesis is that morphological coherence of the scanning procedure and adaptation between the segmented organs, that is, bi-level images, contribute to compression performance. Although the performance of well-established techniques is satisfactory, the sensitivity of ABIC to modality type and the advantage of 3D-RLE when the spatial coherence between adjacent slices is high result in up to 10 times higher CR performance. CONCLUSION: Adaptive and context-based compression strategies allow effective storage and transfer of segmented binary data, which can be used to reproduce visualizations for better teleradiology practices while preserving all interaction mechanisms.
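As a rough illustration of reversible compression of a binary volume, the sketch below packs a toy bi-level volume into bits and compresses it with DEFLATE (the algorithm behind ZIP) from Python's standard library. The abstract does not define how its CR percentages are computed; the sketch assumes they express the relative size reduction. The helper names and the toy volume are hypothetical, and ABIC and 3D-RLE are not implemented here.

```python
import zlib


def space_saving_percent(original_size, compressed_size):
    """Percentage by which compression shrinks the data (assumed meaning of CR)."""
    return 100.0 * (1.0 - compressed_size / original_size)


def pack_bits(voxels):
    """Pack a flat sequence of 0/1 voxels into bytes, eight per byte, MSB first."""
    out = bytearray()
    for i in range(0, len(voxels), 8):
        chunk = voxels[i:i + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        out.append(byte << (8 - len(chunk)))  # left-align a short final chunk
    return bytes(out)


def toy_volume(size=64):
    """Hypothetical binary volume: the same filled square on every slice."""
    lo, hi = size // 4, 3 * size // 4
    return [
        1 if lo <= x < hi and lo <= y < hi else 0
        for _z in range(size)
        for y in range(size)
        for x in range(size)
    ]


packed = pack_bits(toy_volume())
compressed = zlib.compress(packed, 9)          # DEFLATE at the highest level
assert zlib.decompress(compressed) == packed   # reversible: nothing is lost

toy_saving = space_saving_percent(len(packed), len(compressed))
aaa_saving = space_saving_percent(254, 2.14)   # CT-AAA sizes from the abstract
print(f"toy volume: {toy_saving:.2f}% smaller")
print(f"CT-AAA (254 MB to 2.14 MB): {aaa_saving:.2f}% smaller")
```

The toy volume compresses well because neighbouring slices are identical, which is the kind of spatial coherence the abstract credits for the gains of context-based methods.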
Subject(s)
Data Compression/methods, Three-Dimensional Imaging, Information Storage and Retrieval/methods, Radiology, Telemedicine
ABSTRACT
PURPOSE: To compare the accuracy and repeatability of emerging machine learning based (i.e. deep) automatic segmentation algorithms with those of well-established semi-automatic (interactive) methods for determining liver volume in living liver transplant donors at computerized tomography (CT) imaging. METHODS: A total of 12 (6 semi-, 6 full-automatic) methods are evaluated. The semi-automatic segmentation algorithms are based on both traditional iterative models, including watershed, fast marching, region growing and active contours, and modern techniques, including robust statistical segmenter and super-pixels. These methods entail some sort of interaction mechanism such as placing initialization seeds on images or determining a parameter range. The automatic methods are based on deep learning and include three framework templates (DeepMedic, NiftyNet and U-Net), the first two of which are applied with default parameter sets, while the last two involve adapted novel model designs. For 20 living donors (6 training and 12 test datasets), a group of imaging scientists and radiologists created ground truths by performing manual segmentations on contrast material-enhanced CT images. Each segmentation is evaluated using five metrics (i.e. volume overlap and relative volume errors, average/RMS/maximum symmetrical surface distances). The results are mapped to a scoring system and a final grade is calculated by taking their average. Accuracy and repeatability were evaluated using slice-by-slice comparisons and volumetric analysis. Diversity and complementarity are observed through heatmaps. Majority voting and Simultaneous Truth and Performance Level Estimation (STAPLE) algorithms are used to fuse the individual results. RESULTS: The top four methods are automatic deep models with scores of 79.63, 79.46, 77.15 and 74.50. The intra-user score is 95.14.
Overall, deep automatic segmentation outperformed interactive techniques on all metrics. The mean liver volume of the ground truth is found to be 1409.93 mL ± 271.28 mL, while it is calculated as 1342.21 mL ± 231.24 mL using automatic and 1201.26 mL ± 258.13 mL using interactive methods, showing higher accuracy and less variation for the automatic methods. The qualitative analysis of segmentation results showed significant diversity and complementarity, supporting the idea of using ensembles to obtain superior results. The fusion of automatic methods reached 83.87 with majority voting and 86.20 using STAPLE, which are only slightly lower than the fusion of all methods, which achieved 86.70 (majority voting) and 88.74 (STAPLE). CONCLUSION: Use of the new deep learning based automatic segmentation algorithms substantially increases the accuracy and repeatability of segmentation and volumetric measurements of the liver. Fusion of automatic methods based on ensemble approaches gives the best results at almost no additional time cost, since multiple models can potentially be executed in parallel.
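The majority-voting fusion mentioned above can be sketched in a few lines. This is a minimal illustration that assumes each method's output is a flat binary mask of equal length and that ties count as background; the function name and the tie rule are assumptions, and STAPLE, which additionally weights each method by its estimated performance, is not shown.

```python
def majority_vote(masks):
    """Fuse binary masks voxel by voxel: foreground where more than half agree."""
    if not masks:
        raise ValueError("at least one mask is required")
    length = len(masks[0])
    if any(len(m) != length for m in masks):
        raise ValueError("all masks must have the same number of voxels")
    n = len(masks)
    # Strict majority; with an even number of masks a tie becomes background.
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*masks)]


# Three hypothetical segmentations of the same five voxels.
method_a = [1, 1, 0, 0, 1]
method_b = [1, 0, 0, 1, 1]
method_c = [0, 1, 0, 0, 1]

fused = majority_vote([method_a, method_b, method_c])
print(fused)
```

Because each voxel is decided independently, the individual models can run in parallel and the fusion step itself adds little cost, in line with the conclusion above.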
Subject(s)
Deep Learning, Computer-Assisted Image Processing/methods, Liver Transplantation, Liver/anatomy & histology, Living Donors, X-Ray Computed Tomography/methods, Humans, Liver/diagnostic imaging, Organ Size, Reproducibility of Results
ABSTRACT
PURPOSE: Precise extraction of the aorta and the vessels departing from it (i.e. coeliac, renal, and iliac) is vital for correct positioning of a graft prior to abdominal aortic surgery. To perform this task, most segmentation algorithms rely on seed points, and better-located seed points provide better initial positions for cross-sectional methods. Given the non-optimal acquisition characteristics of daily clinical routine and the complex morphology of these vessels, inserting seed points into all these small but critically important vessels is a tedious, time-consuming, and error-prone task. Thus, in this paper, a novel strategy is developed to generate pathways between user-inserted seed points in order to initialize segmentation methods effectively. METHOD: The proposed method requires only a single user-inserted seed for each vessel of interest for initialization. Starting from these initial seeds, it automatically generates pathways that span all vessels in between. To accomplish this, first, a geodesic mask is generated by adaptive thresholding, which ensures that the initial seeds are kept in the vascular tree. Then, a novel implementation of the 3D pairwise geodesic distance field (3D-PGDF) is used. It is shown that the minimal-valued geodesic of the 3D-PGDF successfully defines a path linking the initial seeds as the shortest geodesic. Moreover, the robustness of the minimum level set of the 3D-PGDF to local variations and regions of high curvature is increased by a region classification strategy, which adds partial geodesics to these critical regions. RESULTS: The proposed method was applied to 19 challenging CT data sets obtained from four different scanners and compared to two benchmark methods. The first method is a high-precision technique with very long processing time (subvoxel precise multi-stencil fast marching, MSFM), while the second is a very fast method with lower accuracy (3D fast marching).
The results, which are obtained using various measures, show that the pathways generated by the developed technique enable significantly higher segmentation performance than 3D fast marching and require much less computational power and time than MSFM. CONCLUSION: The developed technique offers a useful tool for generating pathways between seed points with minimal user interaction. It guarantees the inclusion of all important vessels in a computationally efficient manner and can thus be used to initialize segmentation methods for the abdominal aortic tree.
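The idea of linking two seeds by a shortest path that stays inside a vessel mask can be sketched with a discrete stand-in. The example below is not the 3D-PGDF method: it swaps in plain Dijkstra search over a small 2D binary mask with 4-connected unit steps, and it assumes both seeds lie inside the mask. The function name and the toy mask are hypothetical.

```python
import heapq


def shortest_path_in_mask(mask, start, goal):
    """Dijkstra over 4-connected foreground cells of a 2D binary mask.

    Returns the list of (row, col) cells from start to goal, or None if the
    two seeds are not connected inside the mask.
    """
    rows, cols = len(mask), len(mask[0])
    best = {start: 0}
    previous = {}
    heap = [(0, start)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == goal:
            path = [cell]
            while cell in previous:  # walk back to the start seed
                cell = previous[cell]
                path.append(cell)
            return path[::-1]
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc]:
                new_cost = cost + 1
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    previous[(nr, nc)] = cell
                    heapq.heappush(heap, (new_cost, (nr, nc)))
    return None


# Hypothetical thresholded "vessel" mask with a single winding branch.
vessel_mask = [
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
pathway = shortest_path_in_mask(vessel_mask, (0, 0), (3, 1))
print(pathway)
```

Restricting the search to the mask keeps the pathway inside the thresholded structure, loosely analogous to the role the geodesic mask plays in the method described above.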