ABSTRACT
MOTIVATION: A gold standard for perceptual similarity in medical images is vital to content-based image retrieval, but inter-reader variability complicates its development. Our objective was to develop a statistical model that predicts the number of readers (N) necessary to achieve acceptable levels of variability. MATERIALS AND METHODS: We collected 3 radiologists' ratings of the perceptual similarity of 171 pairs of CT images of focal liver lesions, rated on a 9-point scale. We modeled the readers' scores as bimodal distributions in additive Gaussian noise and estimated the distribution parameters from the scores using an expectation-maximization algorithm. We (a) sampled 171 similarity scores to simulate a ground truth and (b) simulated readers by adding noise with standard deviation (σ) between 0 and 5 for each reader. We computed the mean values of 2-50 readers' scores and calculated the agreement (AGT) between these means and the simulated ground truth, and the inter-reader agreement (IRA), using Cohen's kappa metric. RESULTS: IRA for the empirical data ranged from κ = 0.41 to 0.66. For σ between 1.5 and 2.5, IRA between three simulated readers was comparable to agreement in the empirical data. For these values of σ, AGT ranged from κ = 0.81 to 0.91. As expected, AGT increased with N, ranging from κ = 0.83 to 0.92 for N = 2 to 50, respectively, with σ = 2. CONCLUSION: Our simulations demonstrated that for moderate to good IRA, excellent AGT could nonetheless be obtained. This model may be used to predict the N required to accurately evaluate similarity in datasets of arbitrary size.
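The simulation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the bimodal mixture parameters (means 2.5 and 7.5, unit variance, equal weights) and the similar/dissimilar threshold of 5 on the 9-point scale are assumptions for the sketch; the paper estimates them from the empirical ratings via expectation maximization.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_agreement(n_readers, sigma, n_pairs=171, threshold=5):
    # Hypothetical ground truth: bimodal Gaussian mixture on a 9-point scale
    # (mixture parameters are illustrative assumptions, not the fitted values)
    low_mode = rng.random(n_pairs) < 0.5
    truth = np.where(low_mode,
                     rng.normal(2.5, 1.0, n_pairs),
                     rng.normal(7.5, 1.0, n_pairs))
    truth = np.clip(np.rint(truth), 1, 9)
    # Simulated readers: ground truth plus additive Gaussian noise
    readers = truth + rng.normal(0.0, sigma, (n_readers, n_pairs))
    mean_scores = np.clip(np.rint(readers.mean(axis=0)), 1, 9)
    # AGT via Cohen's kappa on binarized similar/dissimilar labels
    a = truth > threshold
    b = mean_scores > threshold
    po = np.mean(a == b)                                   # observed agreement
    pe = np.mean(a) * np.mean(b) + np.mean(~a) * np.mean(~b)  # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0
```

With sigma = 0 the readers reproduce the ground truth exactly and kappa is 1; increasing sigma or decreasing n_readers degrades AGT, which is the trade-off the model is used to explore.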
Subject(s)
Liver Neoplasms/diagnostic imaging , Models, Statistical , Tomography, X-Ray Computed/methods , Tomography, X-Ray Computed/statistics & numerical data , Adult , Aged , Aged, 80 and over , Algorithms , Artifacts , Female , Humans , Liver/diagnostic imaging , Male , Middle Aged , Observer Variation , Reproducibility of Results

ABSTRACT
We have developed a method to quantify the shape of liver lesions in CT images and evaluated its performance for retrieval of images with similarly shaped lesions. We employed a machine learning method to combine several shape descriptors, defining the similarity measure for a pair of shapes as a weighted combination of distances calculated from each feature. We created a dataset of 144 simulated shapes, established several reference standards for similarity, and computed the optimal weights so that the retrieval result agreed best with the reference standard. We then evaluated our method on a clinical database of 79 portal-venous-phase CT liver images, for which we derived a reference standard of similarity from radiologists' visual evaluations. Normalized Discounted Cumulative Gain (NDCG) was calculated to compare each retrieval ordering with the expected ordering based on the reference standard. For the simulated lesions, the mean NDCG values ranged from 91% to 100%, indicating that our methods for combining features were very accurate in representing true similarity. For the clinical images, the mean NDCG values were still around 90%, suggesting a strong correlation between the computed similarity and the independent similarity reference derived from the radiologists.
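The NDCG metric used above can be sketched as follows. This is a generic illustration, assuming the common linear-gain formulation (gain = relevance, discount = 1/log2(rank + 1)); the abstract does not specify which NDCG variant was used, and some implementations use an exponential gain of 2^rel - 1 instead.

```python
import numpy as np

def ndcg(retrieved_relevance, k=None):
    """NDCG of one retrieval: the discounted cumulative gain of the
    returned ordering divided by that of the ideal (relevance-sorted)
    ordering. `retrieved_relevance` holds the reference-standard
    relevance of each retrieved image, in retrieval order."""
    rel = np.asarray(retrieved_relevance, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # ranks 1..n
    dcg = np.sum(rel * discounts)
    ideal = np.sort(np.asarray(retrieved_relevance, dtype=float))[::-1][:k]
    idcg = np.sum(ideal * discounts)
    return dcg / idcg if idcg > 0 else 0.0
```

A retrieval that already orders images by decreasing reference relevance scores NDCG = 1; any misordering lowers the score, which is how agreement with the radiologist-derived reference standard is quantified per query.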
Subject(s)
Artificial Intelligence , Information Storage and Retrieval/methods , Liver/diagnostic imaging , Radiographic Image Interpretation, Computer-Assisted/methods , Tomography, X-Ray Computed/methods , Adult , Aged , Aged, 80 and over , Algorithms , Cohort Studies , Diagnostic Imaging/methods , Female , Humans , Liver/pathology , Liver Diseases/diagnostic imaging , Liver Diseases/pathology , Liver Neoplasms/diagnostic imaging , Liver Neoplasms/pathology , Male , Middle Aged , Reference Standards

ABSTRACT
In January 2016 the U.S. National Library of Medicine announced a challenge competition calling for the development and discovery of high-quality algorithms and software that rank how well consumer images of prescription pills match reference images of pills in its authoritative RxIMAGE collection. This challenge was motivated by the need to easily identify unknown prescription pills, both by healthcare personnel and by the general public. Potential benefits of this capability include confirmation of the pill in settings where the documentation and medication have been separated, such as in a disaster or emergency, and confirmation of a pill when the prescribed medication changes from brand to generic, or when the shape and color of the pill change for any other reason. The data for the competition consisted of two types of images: high-quality macro photographs serving as reference images, and consumer-quality photographs of the quality we expect users of a proposed application to acquire. A training dataset consisting of 2000 reference images and 5000 corresponding consumer-quality images acquired from 1000 pills was provided to challenge participants. A second dataset acquired from 1000 pills with similar distributions of shape and color was reserved as a segregated testing set. Challenge submissions were required to produce a ranking of the reference images, given a consumer-quality image as input. The winning teams were determined using the mean average precision quality metric, with the three winners obtaining mean average precision scores of 0.27, 0.09, and 0.08. In the retrieval results, the correct image was among the top five ranked images 43%, 12%, and 11% of the time, respectively, out of 5000 query/consumer images. This is an initial promising step towards development of an NLM software system and application-programming interface facilitating pill identification.
The training dataset will continue to be freely available online at: http://pir.nlm.nih.gov/challenge/submission.html.
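The two evaluation metrics reported above can be sketched as follows. This is an illustrative reading, not the official challenge scoring code: it assumes each query image has exactly one correct reference image, in which case average precision for a query reduces to the reciprocal of the rank at which that image appears.

```python
def mean_average_precision(ranks):
    """Mean average precision over queries, where `ranks` gives the
    1-based rank of each query's single correct reference image in the
    returned ordering. With one relevant item per query, per-query
    average precision is simply 1 / rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def top_k_accuracy(ranks, k=5):
    # Fraction of queries whose correct image appears among the top k results
    return sum(r <= k for r in ranks) / len(ranks)
```

Under this reading, a submission that always ranks the correct reference image first scores 1.0 on both metrics, and the reported top-five rates (43%, 12%, 11%) correspond to `top_k_accuracy` with k = 5.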
ABSTRACT
We aim to develop a better understanding of the perception of similarity in focal computed tomography (CT) liver images, to determine the feasibility of techniques for developing reference sets for training and validating content-based image retrieval systems. In an observer study, four radiologists and six nonradiologists assessed overall similarity and similarity in 5 image features in 136 pairs of focal CT liver lesions. We computed intra- and inter-reader agreement in these similarity ratings and examined the distributions of the ratings. The readers' ratings of overall similarity and of similarity in each feature appeared to be primarily bimodally distributed. Median kappa scores for intra-reader agreement ranged from 0.57 to 0.86 in the five features and from 0.72 to 0.82 for overall similarity. Median kappa scores for inter-reader agreement ranged from 0.24 to 0.58 in the five features and were 0.39 for overall similarity. There was no significant difference in agreement between radiologists and nonradiologists. Our results show that developing perceptual similarity reference standards is a complex task. Moderate to high inter-reader variability makes it difficult to divide the workload of rating perceptual similarity among many readers, while low intra-reader variability may make it possible to acquire large volumes of data by asking readers to view image pairs over many sessions.
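The kappa scores reported above measure agreement corrected for chance. A minimal sketch of Cohen's kappa for two readers' categorical ratings of the same image pairs follows; this is a generic unweighted implementation for illustration, and the study may have used a weighted variant for its ordinal rating scales.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa: observed agreement between two readers,
    corrected for the agreement expected by chance from each reader's
    marginal category frequencies."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    po = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n  # observed
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    pe = sum(counts_a[c] * counts_b[c]
             for c in counts_a.keys() | counts_b.keys()) / (n * n)  # chance
    return (po - pe) / (1 - pe) if pe < 1 else 1.0
```

Identical rating vectors yield kappa = 1, chance-level agreement yields kappa near 0, and systematic disagreement yields negative values, which is the scale on which the intra- and inter-reader results above are reported.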