1.
Med Phys ; 50(7): 4151-4172, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37057360

ABSTRACT

BACKGROUND: This study reports the results of a set of discrimination experiments using simulated images that represent the appearance of subtle lesions in low-dose computed tomography (CT) of the lungs. Noise in these images has a characteristic ramp-spectrum before apodization by noise control filters. We consider three specific diagnostic features that determine whether a lesion is considered malignant or benign, two system-resolution levels, and four apodization levels for a total of 24 experimental conditions. PURPOSE: The goal of the investigation is to better understand how well human observers perform subtle discrimination tasks like these, and the mechanisms of that performance. We use a forced-choice psychophysical paradigm to estimate observer efficiency and classification images. These measures quantify how effectively subjects can read the images, and how they use images to perform discrimination tasks across the different imaging conditions. MATERIALS AND METHODS: The simulated CT images used as stimuli in the psychophysical experiments are generated from high-resolution objects passed through a modulation transfer function (MTF) before down-sampling to the image-pixel grid. Acquisition noise is then added with a ramp noise-power spectrum (NPS), with subsequent smoothing through apodization filters. The features considered are lesion size, indistinct lesion boundary, and a nonuniform lesion interior. System resolution is implemented by an MTF with resolution (10% max.) of 0.47 or 0.58 cyc/mm. Apodization is implemented by a Shepp-Logan filter (Sinc profile) with various cutoffs. Six medically naïve subjects participated in the psychophysical studies, entailing training and testing components for each condition. Training consisted of staircase procedures to find the 80% correct threshold for each subject, and testing involved 2000 psychophysical trials at the threshold value for each subject. 
Human-observer performance is compared to the Ideal Observer to generate estimates of task efficiency. The significance of imaging factors is assessed using ANOVA. Classification images are used to estimate the linear template weights used by subjects to perform these tasks. Classification-image spectra are used to analyze subject weights in the spatial-frequency domain. RESULTS: Overall, average observer efficiency is relatively low in these experiments (10%-40%) relative to detection and localization studies reported previously. We find significant effects for feature type and apodization level on observer efficiency. Somewhat surprisingly, system resolution is not a significant factor. Efficiency effects of the different features appear to be well explained by the profile of the linear templates in the classification images. Increasingly strong apodization is found to both increase the classification-image weights and to increase the mean-frequency of the classification-image spectra. A secondary analysis of "Unapodized" classification images shows that this is largely due to observers undoing (inverting) the effects of apodization filters. CONCLUSIONS: These studies demonstrate that human observers can be relatively inefficient at feature-discrimination tasks in ramp-spectrum noise. Observers appear to be adapting to frequency suppression implemented in apodization filters, but there are residual effects that are not explained by spatial weighting patterns. The studies also suggest that the mechanisms for improving performance through the application of noise-control filters may require further investigation.
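The stimulus-generation pipeline described above (acquisition noise with a ramp power spectrum, smoothed by an apodization filter) can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the grid size, cutoff value, and the sinc-window form used for the Shepp-Logan filter are assumptions for the sketch.

```python
import numpy as np

def ramp_noise(n=128, cutoff=0.4, seed=None):
    """Simulate CT-like noise: white noise shaped by a ramp NPS, then
    apodized by a sinc (Shepp-Logan-style) window with the given cutoff
    in cycles/pixel. All parameter choices are illustrative."""
    rng = np.random.default_rng(seed)
    f = np.fft.fftfreq(n)                              # cycles/pixel
    fr = np.sqrt(f[:, None] ** 2 + f[None, :] ** 2)    # radial frequency
    apod = np.sinc(fr / (2 * cutoff)) * (fr <= cutoff) # apodization window
    filt = np.sqrt(fr) * apod                          # amplitude: sqrt(ramp NPS)
    white = np.fft.fft2(rng.standard_normal((n, n)))
    return np.real(np.fft.ifft2(white * filt))

img = ramp_noise(128, cutoff=0.4, seed=0)
```

Because the filter is zero at DC, the simulated noise field is zero-mean by construction; lesion profiles would be added to such fields before display.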


Subject(s)
Image Processing, Computer-Assisted , Tomography, X-Ray Computed , Humans , Image Processing, Computer-Assisted/methods , Phantoms, Imaging , Algorithms
2.
JNCI Cancer Spectr ; 6(1)2022 01 05.
Article in English | MEDLINE | ID: mdl-35699495

ABSTRACT

Medical image interpretation is central to detecting, diagnosing, and staging cancer and many other disorders. At a time when medical imaging is being transformed by digital technologies and artificial intelligence, understanding the basic perceptual and cognitive processes underlying medical image interpretation is vital for increasing diagnosticians' accuracy and performance, improving patient outcomes, and reducing diagnostician burnout. Medical image perception remains substantially understudied. In September 2019, the National Cancer Institute convened a multidisciplinary panel of radiologists and pathologists together with researchers working in medical image perception and adjacent fields of cognition and perception for the "Cognition and Medical Image Perception Think Tank." The Think Tank's key objectives were to identify critical unsolved problems related to visual perception in pathology and radiology from the perspective of diagnosticians, discuss how these clinically relevant questions could be addressed through cognitive and perception research, identify barriers and solutions for transdisciplinary collaborations, define ways to elevate the profile of cognition and perception research within the medical image community, determine the greatest needs to advance medical image perception, and outline future goals and strategies to evaluate progress. The Think Tank emphasized diagnosticians' perspectives as the crucial starting point for medical image perception research, with diagnosticians describing their interpretation process and identifying perceptual and cognitive problems that arise. This article reports the deliberations of the Think Tank participants to address these objectives and highlight opportunities to expand research on medical image perception.


Subject(s)
Artificial Intelligence , Radiology , Cognition , Diagnostic Imaging , Humans , Radiology/methods , Visual Perception
3.
J Med Imaging (Bellingham) ; 7(4): 042802, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32118094

ABSTRACT

A recent study reported on an in-silico imaging trial that evaluated the performance of digital breast tomosynthesis (DBT) as a replacement for full-field digital mammography (FFDM) for breast cancer screening. In this in-silico trial, the whole imaging chain was simulated, including breast phantom generation, the x-ray transport process, and computational readers for image interpretation. We focus on the design and performance characteristics of the computational reader in the above-mentioned trial. Location-known lesion (spiculated mass and clustered microcalcification) detection tasks were used to evaluate imaging system performance. The computational readers were designed based on the mechanism of a channelized Hotelling observer (CHO), and the reader models were selected to trend human performance. Parameters were tuned to ensure stable lesion detectability. A convolutional CHO that can adapt a round channel function to irregular lesion shapes was compared with the original CHO and was found to be suitable for detecting clustered microcalcifications but less optimal for detecting spiculated masses. A three-dimensional CHO that operated on the multiple slices was compared with a two-dimensional (2-D) CHO that operated on three versions of 2-D slabs converted from the multiple slices, and was found to be optimal for detecting lesions in DBT. Multireader multicase reader output analysis was used to analyze the performance difference between FFDM and DBT for various breast and lesion types. The results showed that, compared with FFDM, DBT was more beneficial for detecting masses than for detecting clustered microcalcifications, consistent with the finding in a clinical imaging trial. Statistical uncertainty smaller than 0.01 standard error for the estimated performance differences was achieved with a dataset containing approximately 3000 breast phantoms.
The computational reader design methodology presented provides evidence that model observers can be useful in-silico tools for supporting the performance comparison of breast imaging systems.
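The reader models above are built on the channelized Hotelling observer. The following is a minimal sketch of the CHO mechanism only, not the trial's reader: the difference-of-Gaussians channel set, white training noise, and Gaussian lesion profile are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16                                  # image side (illustrative)
npix = n * n

# Illustrative difference-of-Gaussians channel set (3 channels).
x = np.arange(n) - n / 2
r = np.hypot(*np.meshgrid(x, x))
channels = np.stack(
    [np.exp(-r**2 / (2 * s**2)) - np.exp(-r**2 / (2 * (2 * s)**2))
     for s in (1.0, 2.0, 4.0)]
).reshape(3, npix).T                    # npix x 3

signal = np.exp(-r**2 / 2).ravel()      # known lesion profile (assumed)

# Training images: white noise, with and without the signal.
noise = rng.standard_normal((500, npix))
v_absent = noise @ channels             # channel outputs, signal absent
v_present = (noise + signal) @ channels # channel outputs, signal present

dv = v_present.mean(0) - v_absent.mean(0)        # mean output difference
K = np.cov(np.vstack([v_absent, v_present - dv]).T)  # pooled channel covariance
w = np.linalg.solve(K, dv)              # Hotelling template in channel space
d_prime = float(np.sqrt(dv @ w))        # CHO detectability index
```

The Hotelling template prewhitens the channel outputs, so `d_prime` summarizes detectability for this channel set; the trial's convolutional and 3-D variants extend the same mechanism.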

4.
Article in English | MEDLINE | ID: mdl-33384465

ABSTRACT

We investigate a series of two-alternative forced-choice (2AFC) discrimination tasks based on malignant features of abnormalities in low-dose lung CT scans. Three tasks are evaluated: a size-discrimination task, a boundary-sharpness task, and an irregular-interior task. Target and alternative signal profiles for these tasks are modulated by one of two system transfer functions and embedded in ramp-spectrum noise that has been apodized for noise control in one of four different ways. This gives the resulting images statistical properties that are related to weak ground-glass lesions in axial slices of low-dose lung CT images. We investigate observer performance in these tasks using a combination of statistical efficiency and classification images. We report results of 24 2AFC experiments involving the three tasks. A staircase procedure is used to find the approximate 80% correct discrimination threshold in each task, with a subsequent set of 2,000 trials at this threshold. These data are used to estimate statistical efficiency with respect to the ideal observer for each task, and to estimate the observer template using the classification-image methodology. We find that efficiency varies between the different tasks, with the lowest efficiency in the boundary-sharpness task and the highest efficiency in the non-uniform interior task. All three tasks produce clearly visible patterns of positive and negative weighting in the classification images. The spatial-frequency plots of classification images show how apodization results in larger weights at higher spatial frequencies.
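The classification-image methodology used in these experiments can be sketched with a simulated linear observer: accumulate the noise fields from the chosen minus the rejected interval across 2AFC trials, and the average converges toward the observer's template. This 1-D toy (the template shape, signal contrast, and trial count are all assumptions) is only meant to show the estimator, not the experiments themselves.

```python
import numpy as np

rng = np.random.default_rng(2)
npix, ntrials = 64, 4000
template = np.exp(-0.5 * ((np.arange(npix) - 32) / 6.0) ** 2)  # true filter
signal = 0.4 * template                                        # weak target

# Simulated 2AFC trials: (target + noise) vs. noise alone.
ci = np.zeros(npix)
for _ in range(ntrials):
    n1, n2 = rng.standard_normal(npix), rng.standard_normal(npix)
    chose_target = template @ (signal + n1) > template @ n2    # linear observer
    # Classification image: noise from chosen minus rejected interval.
    ci += (n1 - n2) if chose_target else (n2 - n1)
ci /= ntrials

# The estimate should correlate with the observer's true template.
corr = float(np.corrcoef(ci, template)[0, 1])
```

With a real subject the "true" template is unknown; the same accumulation applied to the recorded trial noise yields the classification images reported above.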

5.
Stat Methods Med Res ; 29(6): 1592-1611, 2020 06.
Article in English | MEDLINE | ID: mdl-31456480

ABSTRACT

Evaluation of medical imaging devices often involves clinical studies where multiple readers (MR) read images of multiple cases (MC) for a clinical task, which are often called MRMC studies. In addition to sizing patient cases as is required in most clinical trials, MRMC studies also require sizing readers, since both readers and cases contribute to the uncertainty of the estimated diagnostic performance, which is often measured by the area under the ROC curve (AUC). Due to limited prior information, sizing of such a study is often unreliable. It is desired to adaptively resize the study toward a target power after an interim analysis. Although adaptive methods are available in clinical trials where only the patient sample is sized, such methodologies have not been established for MRMC studies. The challenge lies in the fact that there is a correlation structure in MRMC data and the sizing involves both readers and cases. We develop adaptive MRMC design methodologies to enable study resizing. In particular, we resize the study and adjust the critical value for hypothesis testing simultaneously after an interim analysis to achieve a target power and control the type I error rate in comparing AUCs of two modalities. Analytical results have been derived. Simulations show that the type I error rate is controlled close to the nominal level and the power is adjusted toward the target value under a variety of simulation conditions. We demonstrate the use of our methods in a real-world application comparing two imaging modalities for breast cancer detection.
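The interim-resizing idea can be illustrated with a deliberately simplified variance model. The decomposition below (reader, case, and interaction terms) only sketches the Obuchowski-Rockette correlation structure the paper actually works with, and all numeric inputs are illustrative; the paper additionally adjusts the critical value, which this toy omits.

```python
import math

def resize_mrmc(var_reader, var_case, var_inter, delta_auc, readers, cases):
    """Toy interim resizing: grow readers and cases by a common factor k
    until a z-test on the AUC difference reaches 80% power at two-sided
    alpha = 0.05. Uses the simplified decomposition
        SE^2 = var_reader/R + var_case/C + var_inter/(R*C)."""
    z_alpha, z_beta = 1.959964, 0.841621
    target_se = delta_auc / (z_alpha + z_beta)

    def se(k):
        return math.sqrt(var_reader / (k * readers) + var_case / (k * cases)
                         + var_inter / (k * readers * k * cases))

    k = 1.0
    while se(k) > target_se:
        k *= 1.05
    return math.ceil(k * readers), math.ceil(k * cases)

# Interim variance components and effect size (illustrative numbers only):
new_readers, new_cases = resize_mrmc(2e-4, 5e-3, 1e-3, delta_auc=0.02,
                                     readers=5, cases=100)
```

Because both readers and cases contribute to the SE, resizing one sample alone can never reach the target if the other sample's variance term already exceeds the target SE squared; scaling both avoids that floor.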


Subject(s)
Research Design , Area Under Curve , Computer Simulation , Humans , ROC Curve
6.
J Med Imaging (Bellingham) ; 6(1): 015501, 2019 Jan.
Article in English | MEDLINE | ID: mdl-30713851

ABSTRACT

We investigated effects of prevalence and case distribution on radiologist diagnostic performance as measured by area under the receiver operating characteristic curve (AUC) and sensitivity-specificity in lab-based reader studies evaluating imaging devices. Our retrospective reader studies compared full-field digital mammography (FFDM) to screen-film mammography (SFM) for women with dense breasts. Mammograms were acquired from the prospective Digital Mammographic Imaging Screening Trial. We performed five reader studies that differed in terms of cancer prevalence and the distribution of noncancers. Twenty radiologists participated in each reader study. Using split-plot study designs, we collected recall decisions and multilevel scores from the radiologists for calculating sensitivity, specificity, and AUC. Differences in reader-averaged AUCs slightly favored SFM over FFDM (biggest AUC difference: 0.047, SE = 0.023, p = 0.047), where standard error accounts for reader and case variability. The differences were not significant at a level of 0.01 (0.05/5 reader studies). The differences in sensitivities and specificities were also indeterminate. Prevalence had little effect on AUC (largest difference: 0.02), whereas sensitivity increased and specificity decreased as prevalence increased. We found that AUC is robust to changes in prevalence, while radiologists were more aggressive with recall decisions as prevalence increased.

7.
Med Phys ; 45(7): 3019-3030, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29704868

ABSTRACT

PURPOSE: The task-based assessment of image quality using model observers is increasingly used for the assessment of different imaging modalities. However, the performance computation of model observers needs standardization as well as a well-established trust in its implementation methodology and uncertainty estimation. The purpose of this work was to determine the degree of equivalence of the channelized Hotelling observer performance and uncertainty estimation using an intercomparison exercise. MATERIALS AND METHODS: Image samples to estimate model observer performance for detection tasks were generated from two-dimensional CT image slices of a uniform water phantom. A common set of images was sent to participating laboratories to perform and document the following tasks: (a) estimate the detectability index of a well-defined CHO and its uncertainty in three conditions involving different-sized targets, all at the same dose, and (b) apply this CHO to an image set where ground truth was unknown to participants (lower image dose). In addition, and on an optional basis, we asked the participating laboratories to (c) estimate the performance of real human observers from a psychophysical experiment of their choice. Each of the 13 participating laboratories was confidentially assigned a participant number, and image sets could be downloaded through a secure server. Results were distributed with each participant identified only by number, and each laboratory was then able to revise its results with justification, as model observer calculations are not yet routine and are potentially error prone. RESULTS: The detectability index increased with signal size for all participants and was very consistent for the 6 mm target, while showing higher variability for the 8 and 10 mm targets. The lowest and highest uncertainty estimates differed by an order of magnitude.
CONCLUSIONS: This intercomparison helped define the state of the art of model observer performance computation and with thirteen participants, reflects openness and trust within the medical imaging community. The performance of a CHO with explicitly defined channels and a relatively large number of test images was consistently estimated by all participants. In contrast, the paper demonstrates that there is no agreement on estimating the variance of detectability in the training and testing setting.
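One common way to attach an uncertainty to a detectability index is to bootstrap over the test images; the sketch below does this for surrogate test statistics standing in for a CHO applied to phantom slices. The distributions, sample sizes, and the equal-variance d' form are assumptions, and this is one resampling choice among the several approaches the participants could have used.

```python
import numpy as np

rng = np.random.default_rng(3)

# Surrogate test statistics for signal-absent / signal-present test images.
t_absent = rng.normal(0.0, 1.0, 400)
t_present = rng.normal(1.2, 1.0, 400)

def dprime(a, p):
    # Detectability index from test-statistic moments (equal-variance form).
    return (p.mean() - a.mean()) / np.sqrt(0.5 * (a.var(ddof=1) + p.var(ddof=1)))

# Bootstrap over test images to attach an uncertainty to d'.
boot = [dprime(rng.choice(t_absent, t_absent.size),
               rng.choice(t_present, t_present.size)) for _ in range(200)]
d_hat = float(dprime(t_absent, t_present))
d_se = float(np.std(boot, ddof=1))
```

Differences in how laboratories resample (over test images only, or over training and testing jointly) are exactly the kind of choice that produced the order-of-magnitude spread in uncertainty estimates noted above.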


Subject(s)
Image Processing, Computer-Assisted , Laboratories , Tomography, X-Ray Computed , Observer Variation , Uncertainty
8.
Med Phys ; 45(5): 1970-1984, 2018 May.
Article in English | MEDLINE | ID: mdl-29532479

ABSTRACT

PURPOSE: This study investigates forced localization of targets in simulated images with statistical properties similar to trans-axial sections of x-ray computed tomography (CT) volumes. A total of 24 imaging conditions are considered, comprising two target sizes, three levels of background variability, and four levels of frequency apodization. The goal of the study is to better understand how human observers perform forced-localization tasks in images with CT-like statistical properties. METHODS: The transfer properties of CT systems are modeled by a shift-invariant transfer function in addition to apodization filters that modulate high spatial frequencies. The images contain noise that is the combination of a ramp-spectrum component, simulating the effect of acquisition noise in CT, and a power-law component, simulating the effect of normal anatomy in the background, which are modulated by the apodization filter as well. Observer performance is characterized using two psychophysical techniques: efficiency analysis and classification image analysis. Observer efficiency quantifies how much diagnostic information is being used by observers to perform a task, and classification images show how that information is being accessed in the form of a perceptual filter. RESULTS: Psychophysical studies from five subjects form the basis of the results. Observer efficiency ranges from 29% to 77% across the different conditions. The lowest efficiency is observed in conditions with uniform backgrounds, where significant effects of apodization are found. The classification images, estimated using smoothing windows, suggest that human observers use center-surround filters to perform the task, and these are subjected to a number of subsequent analyses. When implemented as a scanning linear filter, the classification images appear to capture most of the observer variability in efficiency (r2 = 0.86). 
The frequency spectra of the classification images show that frequency weights generally appear bandpass in nature, with peak frequency and bandwidth that vary with statistical properties of the images. CONCLUSIONS: In these experiments, the classification images appear to capture important features of human-observer performance. Frequency apodization only appears to have a significant effect on performance in the absence of anatomical variability, where the observers appear to underweight low spatial frequencies that have relatively little noise. Frequency weights derived from the classification images generally have a bandpass structure, with adaptation to different conditions seen in the peak frequency and bandwidth. The classification image spectra show relatively modest changes in response to different levels of apodization, with some evidence that observers are attempting to rebalance the apodized spectrum presented to them.
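The efficiency figures quoted above come from comparing human and ideal detectability. A short sketch of that computation, using the standard 2AFC relation between percent correct and d' (the numeric inputs are illustrative, not values from the study):

```python
import numpy as np
from statistics import NormalDist

def efficiency(pc_human, d_ideal):
    """Observer efficiency: eta = (d'_human / d'_ideal)**2, with the human
    d' recovered from 2AFC percent correct via d' = sqrt(2) * z(PC)."""
    d_human = np.sqrt(2.0) * NormalDist().inv_cdf(pc_human)
    return float((d_human / d_ideal) ** 2)

# e.g. a subject at 80% correct against an ideal-observer d' of 2.0
# (both values illustrative):
eta = efficiency(0.80, 2.0)
```

Because efficiency is a squared ratio, a subject reaching 80% correct where the ideal observer has d' = 2 uses roughly a third of the available task information.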


Subject(s)
Image Processing, Computer-Assisted/methods , Signal-To-Noise Ratio , Statistics as Topic , Tomography, X-Ray Computed
9.
Stat Methods Med Res ; 27(5): 1394-1409, 2018 05.
Article in English | MEDLINE | ID: mdl-27507287

ABSTRACT

Scores produced by statistical classifiers in many clinical decision support systems and other medical diagnostic devices are generally on an arbitrary scale, so the clinical meaning of these scores is unclear. Calibration of classifier scores to a meaningful scale such as the probability of disease is potentially useful when such scores are used by a physician. In this work, we investigated three methods (parametric, semi-parametric, and non-parametric) for calibrating classifier scores to the probability of disease scale and developed uncertainty estimation techniques for these methods. We showed that classifier scores on arbitrary scales can be calibrated to the probability of disease scale without affecting their discrimination performance. With a finite dataset to train the calibration function, it is important to accompany the probability estimate with its confidence interval. Our simulations indicate that, when a dataset used for finding the transformation for calibration is also used for estimating the performance of calibration, the resubstitution bias exists for a performance metric involving the truth states in evaluating the calibration performance. However, the bias is small for the parametric and semi-parametric methods when the sample size is moderate to large (>100 per class).
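The parametric (Platt-style) calibration studied here fits a logistic map from score to probability of disease; because the map is strictly monotone, discrimination performance is untouched. A self-contained sketch with simulated scores (the class distributions, learning rate, and iteration count are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
scores = np.concatenate([rng.normal(0, 1, 300), rng.normal(1.5, 1, 300)])
truth = np.concatenate([np.zeros(300), np.ones(300)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parametric calibration: fit p(D|score) = sigmoid(a*score + b) by
# gradient ascent on the Bernoulli log-likelihood.
a, b = 0.0, 0.0
for _ in range(2000):
    p = sigmoid(a * scores + b)
    a += 0.01 * np.mean((truth - p) * scores)
    b += 0.01 * np.mean(truth - p)

def auc(s, y):
    # Wilcoxon-Mann-Whitney estimate of the area under the ROC curve.
    pos, neg = s[y == 1], s[y == 0]
    return (np.mean(pos[:, None] > neg[None, :])
            + 0.5 * np.mean(pos[:, None] == neg[None, :]))

# A strictly monotone calibration leaves discrimination (AUC) unchanged.
auc_raw = float(auc(scores, truth))
auc_cal = float(auc(sigmoid(a * scores + b), truth))
```

Confidence intervals for the calibrated probabilities, and the resubstitution bias discussed above, arise when the same finite sample both fits `(a, b)` and evaluates the calibration; a held-out set avoids the bias at the cost of sample size.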


Subject(s)
Calibration , Diagnosis , Disease/classification , Probability , Statistics as Topic , Confidence Intervals , Humans , Sample Size , Statistics, Nonparametric
10.
JAMA Netw Open ; 1(7): e185474, 2018 11 02.
Article in English | MEDLINE | ID: mdl-30646401

ABSTRACT

Importance: Expensive and lengthy clinical trials can delay regulatory evaluation of innovative technologies, affecting patient access to high-quality medical products. Simulation is increasingly being used in product development but rarely in regulatory applications. Objectives: To conduct a computer-simulated imaging trial evaluating digital breast tomosynthesis (DBT) as a replacement for digital mammography (DM) and to compare the results with a comparative clinical trial. Design, Setting, and Participants: The simulated Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) trial was designed to replicate a clinical trial that used human patients and radiologists. Images obtained with in silico versions of DM and DBT systems via fast Monte Carlo x-ray transport were interpreted by a computational reader detecting the presence of lesions. A total of 2986 synthetic image-based virtual patients with breast sizes and radiographic densities representative of a screening population and compressed thicknesses from 3.5 to 6 cm were generated using an analytic approach in which anatomical structures are randomly created within a predefined breast volume and compressed in the craniocaudal orientation. A positive cohort contained a digitally inserted microcalcification cluster or spiculated mass. Main Outcomes and Measures: The trial end point was the difference in area under the receiver operating characteristic curve between modalities for lesion detection. The trial was sized for an SE of 0.01 in the change in area under the curve (AUC), half the uncertainty in the comparative clinical trial. Results: In this trial, computational readers analyzed 31 055 DM and 27 960 DBT cases from 2986 virtual patients with the following Breast Imaging Reporting and Data System densities: 286 (9.6%) extremely dense, 1200 (40.2%) heterogeneously dense, 1200 (40.2%) scattered fibroglandular densities, and 300 (10.0%) almost entirely fat. 
The mean (SE) change in AUC was 0.0587 (0.0062) (P < .001) in favor of DBT. The change in AUC was larger for masses (mean [SE], 0.0903 [0.008]) than for calcifications (mean [SE], 0.0268 [0.004]), which was consistent with the findings of the comparative trial (mean [SE], 0.065 [0.017] for masses and -0.047 [0.032] for calcifications). Conclusions and Relevance: The results of the simulated VICTRE trial are consistent with the performance seen in the comparative trial. While further research is needed to assess the generalizability of these findings, in silico imaging trials represent a viable source of regulatory evidence for imaging devices.


Subject(s)
Mammography/methods , Mammography/standards , Breast/diagnostic imaging , Breast Neoplasms/diagnostic imaging , Calcinosis/diagnostic imaging , Computer Simulation , Female , Humans , ROC Curve
11.
Acad Radiol ; 24(11): 1436-1446, 2017 11.
Article in English | MEDLINE | ID: mdl-28666723

ABSTRACT

RATIONALE AND OBJECTIVES: In this paper we examine which comparisons of reading performance between diagnostic imaging systems made in controlled retrospective laboratory studies may be representative of what we observe in later clinical studies. The change in a meaningful diagnostic figure of merit between two diagnostic modalities should be qualitatively or quantitatively comparable across all kinds of studies. MATERIALS AND METHODS: In this meta-study we examine the reproducibility of relative measures of sensitivity, false positive fraction (FPF), area under the receiver operating characteristic (ROC) curve, and expected utility across laboratory and observational clinical studies for several different breast imaging modalities, including screen film mammography, digital mammography, breast tomosynthesis, and ultrasound. RESULTS: Across studies of all types, the changes in the FPFs yielded very small probabilities of having a common mean value. The probabilities of relative sensitivity being the same across ultrasound and tomosynthesis studies were low. No evidence was found for different mean values of relative area under the ROC curve or relative expected utility within any of the study sets. CONCLUSION: The comparison demonstrates that the ratios of areas under the ROC curve and expected utilities are reproducible across laboratory and clinical studies, whereas sensitivity and FPF are not.


Subject(s)
Breast Neoplasms/diagnostic imaging , Breast/diagnostic imaging , Mammography/methods , Ultrasonography, Mammary , Area Under Curve , Female , Humans , ROC Curve , Reproducibility of Results
12.
Int J Biostat ; 12(2)2016 11 01.
Article in English | MEDLINE | ID: mdl-27889706

ABSTRACT

Schatzkin et al. and other authors demonstrated that the ratios of some conditional statistics, such as the true positive fraction, are equal to the ratios of unconditional statistics, such as disease detection rates. As a consequence, these ratios can be calculated between two screening tests on the same population even if patients with negative tests are not followed with a reference procedure and the true- and false-negative rates are unknown. We demonstrate that this same property applies to an expected utility metric. We also demonstrate that, while simple estimates of relative specificities and relative areas under ROC curves (AUC) do depend on the unknown negative rates, these ratios can be written in terms of disease prevalence, and their dependence on a posited prevalence is often weak, particularly if that prevalence is small or the performance of the two screening tests is similar. Therefore, relative specificity or relative AUC can be estimated with little loss of accuracy using an approximate value of disease prevalence.
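The weak prevalence dependence can be checked numerically. Writing each test's specificity in terms of its observable positivity rate, its detection rate, and a posited prevalence gives the ratio directly; the rates below are illustrative, not data from the paper.

```python
def relative_specificity(det1, det2, pos1, pos2, prevalence):
    """Relative specificity of two screening tests on one population,
    from observable rates plus a posited prevalence.
    det_i: disease detection rate (prevalence * TPF_i);
    pos_i: overall test-positive rate.
    Uses FPF_i = (pos_i - det_i) / (1 - prevalence)."""
    spec1 = 1 - (pos1 - det1) / (1 - prevalence)
    spec2 = 1 - (pos2 - det2) / (1 - prevalence)
    return spec1 / spec2

# Vary the posited prevalence fourfold; the ratio barely moves:
lo = relative_specificity(0.004, 0.005, 0.10, 0.12, prevalence=0.005)
hi = relative_specificity(0.004, 0.005, 0.10, 0.12, prevalence=0.02)
```

At screening-level prevalences the `(1 - prevalence)` factor is close to one for any plausible posited value, which is why the ratio is so insensitive to it.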


Subject(s)
Area Under Curve , Diagnosis , Mass Screening , Disease , Humans , Prevalence , Sensitivity and Specificity
13.
Proc SPIE Int Soc Opt Eng ; 9787, 2016 Feb 27.
Article in English | MEDLINE | ID: mdl-27335532

ABSTRACT

Breast cancer risk prediction algorithms are used to identify subpopulations that are at increased risk for developing breast cancer. They can be based on many different sources of data such as demographics, relatives with cancer, gene expression, and various phenotypic features such as breast density. Women who are identified as high risk may undergo a more extensive (and expensive) screening process that includes MRI or ultrasound imaging in addition to the standard full-field digital mammography (FFDM) exam. Given that there are many ways that risk prediction may be accomplished, it is of interest to evaluate them in terms of expected cost, which includes the costs of diagnostic outcomes. In this work we perform an expected-cost analysis of risk prediction algorithms that is based on a published model that includes the costs associated with diagnostic outcomes (true-positive, false-positive, etc.). We assume the existence of a standard screening method and an enhanced screening method with higher scan cost, higher sensitivity, and lower specificity. We then assess the expected cost of using a risk prediction algorithm to determine who gets the enhanced screening method, under the strong assumption that risk and diagnostic performance are independent. We find that if risk prediction leads to a high enough positive predictive value, it will be cost-effective regardless of the size of the subpopulation. Furthermore, in terms of the hit-rate and false-alarm rate of the risk-prediction algorithm, iso-cost contours are lines with slope determined by properties of the available diagnostic systems for screening.
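The expected-cost structure can be made concrete with a toy model: a risk rule with a given hit rate and false-alarm rate routes part of the population to the enhanced exam, and total cost combines scan costs with the cost of missed cancers. All costs, sensitivities, and the prevalence below are illustrative placeholders, not the published model's values.

```python
def expected_cost(hit_rate, false_alarm, c_std=100.0, c_enh=400.0,
                  c_miss=20000.0, sens_std=0.80, sens_enh=0.90,
                  prevalence=0.01):
    """Toy expected screening cost per woman when a risk rule routes a
    subpopulation to an enhanced (costlier, more sensitive) exam.
    Assumes risk and diagnostic performance are independent."""
    # Fraction routed to enhanced screening:
    routed = hit_rate * prevalence + false_alarm * (1 - prevalence)
    scan = routed * c_enh + (1 - routed) * c_std
    # Cost of missed cancers under each screening pathway:
    miss = prevalence * (hit_rate * (1 - sens_enh)
                         + (1 - hit_rate) * (1 - sens_std)) * c_miss
    return scan + miss

base = expected_cost(0.0, 0.0)    # nobody routed to enhanced screening
good = expected_cost(0.6, 0.02)   # a selective risk predictor
```

Because cost is linear in both `hit_rate` and `false_alarm`, its iso-cost contours in that plane are straight lines, matching the observation above; the slope depends only on the scan-cost and sensitivity gaps between the two systems.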

15.
J Opt Soc Am A Opt Image Sci Vis ; 31(11): 2495-510, 2014 Nov 01.
Article in English | MEDLINE | ID: mdl-25401363

ABSTRACT

There is a lack of consensus in measuring observer performance in search tasks. To pursue a consensus, we set our goal to obtain metrics that are practical, meaningful, and predictive. We consider a metric practical if it can be implemented to measure both human and computer observers' performance. To be meaningful, we propose to discover intrinsic properties of search observers and formulate the metrics to characterize these properties. If the discovered properties allow verifiable predictions, we consider them predictive. We propose a theory and a conjecture toward two intrinsic properties of search observers: rationality in classification, as measured by the location-known-exactly (LKE) receiver operating characteristic (ROC) curve, and location uncertainty, as measured by the effective set size (M*). These two properties are used to develop search models in both single-response and free-response search tasks. To confirm whether these properties are "intrinsic," we investigate their ability to predict the search performance of both human and scanning channelized Hotelling observers. In particular, for each observer, we designed experiments to measure the LKE-ROC curve and M*, which were then used to predict the same observer's performance in other search tasks. The predictions were then compared to the experimentally measured observer performance. Our results indicate that modeling search performance using the LKE-ROC curve and M* leads to successful predictions in most cases.
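The role of the effective set size can be illustrated with a maximum-of-M* simulation: the observer's decision variable in each interval is the maximum response over M* candidate locations, so performance degrades as location uncertainty grows even though classification ability (the LKE d') is fixed. The 2AFC framing, d' value, and M* below are assumptions of this sketch, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(5)

def search_pc(d_prime, m_star, trials=20000):
    """2AFC-style search sketch: target interval scores the max of the
    signal response and (m*-1) noise responses; the alternative interval
    scores the max of m* noise responses."""
    sig_resp = rng.normal(d_prime, 1, trials)
    if m_star > 1:
        distract = rng.normal(0, 1, (m_star - 1, trials)).max(0)
        sig_resp = np.maximum(sig_resp, distract)
    noise_resp = rng.normal(0, 1, (m_star, trials)).max(0)
    return float(np.mean(sig_resp > noise_resp))

pc_lke = search_pc(1.5, 1)      # location known exactly (M* = 1)
pc_search = search_pc(1.5, 8)   # effective set size M* = 8
```

Fitting M* to a measured search-performance drop, with the LKE-ROC fixed, is the spirit of the prediction exercise described above.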


Subject(s)
Image Processing, Computer-Assisted/methods , Models, Theoretical , Humans , Observer Variation , Quality Control , ROC Curve
16.
Med Phys ; 41(7): 072102, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24989397

ABSTRACT

PURPOSE: Variance estimates for detector energy resolution metrics can be used as stopping criteria in Monte Carlo simulations for the purpose of ensuring a small uncertainty of those metrics and for the design of variance reduction techniques. METHODS: The authors derive an estimate for the variance of two energy resolution metrics, the Swank factor and the Fano factor, in terms of statistical moments that can be accumulated without significant computational overhead. The authors examine the accuracy of these two estimators and demonstrate how the estimates of the coefficient of variation of the Swank and Fano factors behave with data from a Monte Carlo simulation of an indirect x-ray imaging detector. RESULTS: The authors' analyses suggest that the accuracy of their variance estimators is appropriate for estimating the actual variances of the Swank and Fano factors for a variety of distributions of detector outputs. CONCLUSIONS: The variance estimators derived in this work provide a computationally convenient way to estimate the error or coefficient of variation of the Swank and Fano factors during Monte Carlo simulations of radiation imaging systems.
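Both metrics are simple functions of low-order moments of the detector-output distribution, which is what makes on-the-fly uncertainty tracking cheap. The sketch below computes them from simulated outputs and checks the estimator's spread across batches; the gamma distribution and its parameters are illustrative, not a detector model from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated detector outputs (e.g. optical quanta per absorbed x ray);
# the gamma shape/scale are purely illustrative.
outputs = rng.gamma(shape=50.0, scale=20.0, size=100_000)

m1, m2 = outputs.mean(), np.mean(outputs ** 2)
swank = float(m1**2 / m2)                 # Swank factor: m1^2 / (m0*m2), m0 = 1
fano = float(outputs.var(ddof=1) / m1)    # Fano factor: variance / mean

# Per-batch estimates give a quick empirical check of the estimator's
# coefficient of variation as a Monte Carlo run accumulates samples.
batches = outputs.reshape(100, 1000)
swank_per_batch = batches.mean(1) ** 2 / np.mean(batches ** 2, axis=1)
swank_cv = float(swank_per_batch.std(ddof=1) / swank_per_batch.mean())
```

For a gamma with shape k, the Swank factor is 1/(1 + 1/k) and the Fano factor equals the scale, so the estimates above can be checked against closed forms; the paper's contribution is an analytic variance for such estimators from accumulated moments alone.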


Subject(s)
Computer Simulation , Monte Carlo Method , Radiography , Algorithms , Radiography/instrumentation , Software
17.
Acad Radiol ; 21(4): 481-90, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24594418

ABSTRACT

RATIONALE AND OBJECTIVES: Our objective is to determine whether expected utility (EU) and the area under the receiver operator characteristic (AUC) are consistent with one another as endpoints of observer performance studies in mammography. These two measures characterize receiver operator characteristic performance somewhat differently. We compare these two study endpoints at the level of individual reader effects, statistical inference, and components of variance across readers and cases. MATERIALS AND METHODS: We reanalyze three previously published laboratory observer performance studies that investigate various x-ray breast imaging modalities using EU and AUC. The EU measure is based on recent estimates of relative utility for screening mammography. RESULTS: The AUC and EU measures are correlated across readers for individual modalities (r = 0.93) and differences in modalities (r = 0.94 to 0.98). Statistical inference for modality effects based on multi-reader multi-case analysis is very similar, with significant results (P < .05) in exactly the same conditions. Power analyses show mixed results across studies, with a small increase in power on average for EU that corresponds to approximately a 7% reduction in the number of readers. Despite a large number of crossing receiver operator characteristic curves (59% of readers), modality effects only rarely have opposite signs for EU and AUC (6%). CONCLUSIONS: We do not find any evidence of systematic differences between EU and AUC in screening mammography observer studies. Thus, when utility approaches are viable (i.e., an appropriate value of relative utility exists), practical effects such as statistical efficiency may be used to choose study endpoints.


Subject(s)
Breast Neoplasms/diagnostic imaging , Clinical Competence/statistics & numerical data , Mammography/statistics & numerical data , Mass Screening/statistics & numerical data , ROC Curve , Radiographic Image Interpretation, Computer-Assisted/methods , Clinical Laboratory Techniques/statistics & numerical data , Data Interpretation, Statistical , Female , Humans , Observer Variation , Reproducibility of Results , Sensitivity and Specificity
18.
J Med Imaging (Bellingham) ; 1(3): 031004, 2014 Oct.
Article in English | MEDLINE | ID: mdl-26158046

ABSTRACT

The evaluation of medical imaging devices often involves studies that measure the ability of observers to perform a signal detection task on images obtained from those devices. Data from such studies are frequently regressed ordinally using two-sample receiver operating characteristic (ROC) models. We applied some of these models to a number of randomly chosen data sets from medical imaging and evaluated how well they fit using the Akaike and Bayesian information criteria and cross-validation. We find that for many observer data sets, a single-parameter model is sufficient and that only some studies exhibit evidence for the use of models with more than a single parameter. In particular, the single-parameter power-law model frequently well describes observer data. The power-law model has an asymmetric ROC curve and a constant mean-to-sigma ratio seen in studies analyzed with the bi-normal model. It is identical or very similar to special cases of other two-parameter models.
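The power-law model described here has the closed form TPF = FPF^θ with a single parameter θ in (0, 1], which gives AUC = 1/(1 + θ). A small sketch with arbitrary example values shows the closed-form area, a one-operating-point fit for θ, and the curve's asymmetry about the negative diagonal (a symmetric curve would satisfy f(1 − f(x)) = 1 − x):

```python
import numpy as np

def powerlaw_tpf(fpf, theta):
    """Power-law ROC curve: TPF = FPF**theta, theta in (0, 1]."""
    return np.asarray(fpf) ** theta

def powerlaw_auc(theta):
    """Closed-form area: integral of x**theta over [0, 1]."""
    return 1.0 / (1.0 + theta)

def fit_theta(fpf_obs, tpf_obs):
    """One-point fit: solve tpf = fpf**theta for theta."""
    return np.log(tpf_obs) / np.log(fpf_obs)

theta = 0.25
x = np.linspace(0.0, 1.0, 100_001)
y = powerlaw_tpf(x, theta)
auc_numeric = float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))  # ~ 0.8

# Asymmetry check: reflect through the negative diagonal and compare.
asym = float(np.max(np.abs(powerlaw_tpf(1.0 - y, theta) - (1.0 - x))))

theta_hat = fit_theta(0.10, 0.80)       # hypothetical operating point
auc_implied = powerlaw_auc(theta_hat)
```

The nonzero reflection residual is the asymmetry the abstract refers to; the equal-variance binormal model, by contrast, satisfies the reflection identity exactly.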

19.
Med Phys ; 40(8): 087001, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23927365

ABSTRACT

Computer-aided detection and diagnosis (CAD) systems are increasingly being used as an aid by clinicians for detection and interpretation of diseases. Computer-aided detection systems mark regions of an image that may reveal specific abnormalities and are used to alert clinicians to these regions during image interpretation. Computer-aided diagnosis systems provide an assessment of a disease using image-based information alone or in combination with other relevant diagnostic data and are used by clinicians as a decision support in developing their diagnoses. While CAD systems are commercially available, standardized approaches for evaluating and reporting their performance have not yet been fully formalized in the literature or in a standardization effort. This deficiency has led to difficulty in the comparison of CAD devices and in understanding how the reported performance might translate into clinical practice. To address these important issues, the American Association of Physicists in Medicine (AAPM) formed the Computer Aided Detection in Diagnostic Imaging Subcommittee (CADSC), in part, to develop recommendations on approaches for assessing CAD system performance. The purpose of this paper is to convey the opinions of the AAPM CADSC members and to stimulate the development of consensus approaches and "best practices" for evaluating CAD systems. Both the assessment of a standalone CAD system and the evaluation of the impact of CAD on end-users are discussed. It is hoped that awareness of these important evaluation elements and the CADSC recommendations will lead to further development of structured guidelines for CAD performance assessment. 
Proper assessment of CAD system performance is expected to increase the understanding of a CAD system's effectiveness and limitations, which is expected to stimulate further research and development efforts on CAD technologies, reduce problems due to improper use, and eventually improve the utility and efficacy of CAD in clinical practice.


Subject(s)
Diagnosis, Computer-Assisted/methods , Consensus , Diagnosis, Computer-Assisted/standards , Humans , ROC Curve , Reference Standards , Retrospective Studies , Societies, Medical
20.
BMC Med Res Methodol ; 13: 98, 2013 Jul 29.
Article in English | MEDLINE | ID: mdl-23895587

ABSTRACT

BACKGROUND: The surge in biomarker development calls for research on statistical evaluation methodology to rigorously assess emerging biomarkers and classification models. Recently, several authors reported the puzzling observation that, in assessing the added value of new biomarkers to existing ones in a logistic regression model, statistical significance of new predictor variables does not necessarily translate into a statistically significant increase in the area under the ROC curve (AUC). Vickers et al. concluded that this inconsistency is because AUC "has vastly inferior statistical properties," i.e., it is extremely conservative. This statement is based on simulations that misuse the DeLong et al. method. Our purpose is to provide a fair comparison of the likelihood ratio (LR) test and the Wald test versus diagnostic accuracy (AUC) tests. DISCUSSION: We present a test to compare ideal AUCs of nested linear discriminant functions via an F test. We compare it with the LR test and the Wald test for the logistic regression model. The null hypotheses of these three tests are equivalent; however, the F test is an exact test whereas the LR test and the Wald test are asymptotic tests. Our simulation shows that the F test has the nominal type I error even with a small sample size. Our results also indicate that the LR test and the Wald test have inflated type I errors when the sample size is small, while the type I error converges to the nominal value asymptotically with increasing sample size as expected. We further show that the DeLong et al. method tests a different hypothesis and has the nominal type I error when it is used within its designed scope. Finally, we summarize the pros and cons of all four methods we consider in this paper. SUMMARY: We show that there is nothing inherently less powerful or disagreeable about ROC analysis for showing the usefulness of new biomarkers or characterizing the performance of classification models. 
Each statistical method for assessing biomarkers and classification models has its own strengths and weaknesses. Investigators need to choose methods based on the assessment purpose, the biomarker development phase at which the assessment is being performed, the available patient data, and the validity of assumptions behind the methodologies.
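A small simulation in the spirit of the comparison described above — not the authors' code, and the sample size, effect size, and ridge-stabilized Newton fit are all arbitrary assumptions. Under the null that a new biomarker adds nothing, the likelihood ratio statistic for the nested logistic models is compared with the asymptotic chi-square(1) critical value 3.841; at small n the rejection rate tends to exceed the nominal 5%, which is the type I error inflation the abstract describes.

```python
import numpy as np

def fit_logistic(X, y, ridge=1e-6, n_iter=25):
    """Newton-Raphson logistic regression; returns (beta, log-likelihood).
    A tiny ridge keeps the Hessian invertible in near-separated samples."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        H = X.T @ (X * (p * (1.0 - p))[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    eta = np.clip(X @ beta, -30.0, 30.0)
    ll = float(np.sum(y * eta - np.logaddexp(0.0, eta)))  # stable Bernoulli ll
    return beta, ll

def lr_statistic(x1, x2, y):
    """2*(ll_full - ll_reduced) for adding biomarker x2 to x1."""
    ones = np.ones(len(y))
    _, ll_red = fit_logistic(np.column_stack([ones, x1]), y)
    _, ll_full = fit_logistic(np.column_stack([ones, x1, x2]), y)
    return 2.0 * (ll_full - ll_red)

rng = np.random.default_rng(42)
n, n_sim, crit = 30, 400, 3.841   # chi-square(1) 95% critical value
rejections = 0
for _ in range(n_sim):
    x1 = rng.normal(size=n)       # established biomarker
    x2 = rng.normal(size=n)       # new biomarker, truly useless (null holds)
    p = 1.0 / (1.0 + np.exp(-0.5 * x1))
    y = (rng.random(n) < p).astype(float)
    rejections += lr_statistic(x1, x2, y) > crit
rate = rejections / n_sim         # typically above the nominal 0.05 at n = 30
```

Increasing n in this sketch pulls the rejection rate back toward 0.05, mirroring the asymptotic convergence the abstract reports for the LR and Wald tests.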


Subject(s)
Biomarkers , Models, Statistical , Predictive Value of Tests , Area Under Curve , Humans , Likelihood Functions , Logistic Models