Search | Biblioteca Virtual en Salud (Virtual Health Library)

1.

On the assessment of the added value of new predictive biomarkers.

Chen, Weijie; Samuelson, Frank W; Gallas, Brandon D; Kang, Le; Sahiner, Berkman; Petrick, Nicholas.

BMC Med Res Methodol ; 13: 98, 2013 Jul 29.

Article in English | MEDLINE | ID: mdl-23895587

ABSTRACT

BACKGROUND: The surge in biomarker development calls for research on statistical evaluation methodology to rigorously assess emerging biomarkers and classification models. Recently, several authors reported the puzzling observation that, in assessing the added value of new biomarkers to existing ones in a logistic regression model, statistical significance of new predictor variables does not necessarily translate into a statistically significant increase in the area under the ROC curve (AUC). Vickers et al. concluded that this inconsistency is because AUC "has vastly inferior statistical properties," i.e., it is extremely conservative. This statement is based on simulations that misuse the DeLong et al. method. Our purpose is to provide a fair comparison of the likelihood ratio (LR) test and the Wald test versus diagnostic accuracy (AUC) tests. DISCUSSION: We present a test to compare ideal AUCs of nested linear discriminant functions via an F test. We compare it with the LR test and the Wald test for the logistic regression model. The null hypotheses of these three tests are equivalent; however, the F test is an exact test whereas the LR test and the Wald test are asymptotic tests. Our simulation shows that the F test has the nominal type I error even with a small sample size. Our results also indicate that the LR test and the Wald test have inflated type I errors when the sample size is small, while the type I error converges to the nominal value asymptotically with increasing sample size as expected. We further show that the DeLong et al. method tests a different hypothesis and has the nominal type I error when it is used within its designed scope. Finally, we summarize the pros and cons of all four methods we consider in this paper. SUMMARY: We show that there is nothing inherently less powerful or disagreeable about ROC analysis for showing the usefulness of new biomarkers or characterizing the performance of classification models. 
Each statistical method for assessing biomarkers and classification models has its own strengths and weaknesses. Investigators need to choose methods based on the assessment purpose, the biomarker development phase at which the assessment is being performed, the available patient data, and the validity of assumptions behind the methodologies.
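The abstract above compares tests built on the AUC. As a reminder of what that quantity is, the following is a minimal sketch of the nonparametric (Mann-Whitney) AUC estimate. It is not the F test, LR test, or DeLong et al. method discussed in the paper; the function name and the scores are hypothetical and serve only as illustration.

```python
def empirical_auc(neg_scores, pos_scores):
    """Mann-Whitney estimate of AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive case correctly ranked higher
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores (not data from the paper).
neg = [0.1, 0.4, 0.35, 0.8]
pos = [0.3, 0.6, 0.7, 0.9]
print(empirical_auc(neg, pos))
```

Comparing two nested models on the same cases would additionally require accounting for the correlation between their AUC estimates, which this sketch does not attempt.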

Subject(s)

Biomarkers, Statistical Models, Predictive Value of Tests, Area Under Curve, Humans, Likelihood Functions, Logistic Models

2.

Discrimination tasks in simulated low-dose CT noise.

Abbey, Craig K; Samuelson, Frank W; Zeng, Rongping; Boone, John M; Myers, Kyle J; Eckstein, Miguel P.

Med Phys ; 50(7): 4151-4172, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37057360

ABSTRACT

BACKGROUND: This study reports the results of a set of discrimination experiments using simulated images that represent the appearance of subtle lesions in low-dose computed tomography (CT) of the lungs. Noise in these images has a characteristic ramp-spectrum before apodization by noise control filters. We consider three specific diagnostic features that determine whether a lesion is considered malignant or benign, two system-resolution levels, and four apodization levels for a total of 24 experimental conditions. PURPOSE: The goal of the investigation is to better understand how well human observers perform subtle discrimination tasks like these, and the mechanisms of that performance. We use a forced-choice psychophysical paradigm to estimate observer efficiency and classification images. These measures quantify how effectively subjects can read the images, and how they use images to perform discrimination tasks across the different imaging conditions. MATERIALS AND METHODS: The simulated CT images used as stimuli in the psychophysical experiments are generated from high-resolution objects passed through a modulation transfer function (MTF) before down-sampling to the image-pixel grid. Acquisition noise is then added with a ramp noise-power spectrum (NPS), with subsequent smoothing through apodization filters. The features considered are lesion size, indistinct lesion boundary, and a nonuniform lesion interior. System resolution is implemented by an MTF with resolution (10% max.) of 0.47 or 0.58 cyc/mm. Apodization is implemented by a Shepp-Logan filter (Sinc profile) with various cutoffs. Six medically naïve subjects participated in the psychophysical studies, entailing training and testing components for each condition. Training consisted of staircase procedures to find the 80% correct threshold for each subject, and testing involved 2000 psychophysical trials at the threshold value for each subject. 
Human-observer performance is compared to the Ideal Observer to generate estimates of task efficiency. The significance of imaging factors is assessed using ANOVA. Classification images are used to estimate the linear template weights used by subjects to perform these tasks. Classification-image spectra are used to analyze subject weights in the spatial-frequency domain. RESULTS: Overall, average observer efficiency is relatively low in these experiments (10%-40%) relative to detection and localization studies reported previously. We find significant effects for feature type and apodization level on observer efficiency. Somewhat surprisingly, system resolution is not a significant factor. Efficiency effects of the different features appear to be well explained by the profile of the linear templates in the classification images. Increasingly strong apodization is found to both increase the classification-image weights and to increase the mean-frequency of the classification-image spectra. A secondary analysis of "Unapodized" classification images shows that this is largely due to observers undoing (inverting) the effects of apodization filters. CONCLUSIONS: These studies demonstrate that human observers can be relatively inefficient at feature-discrimination tasks in ramp-spectrum noise. Observers appear to be adapting to frequency suppression implemented in apodization filters, but there are residual effects that are not explained by spatial weighting patterns. The studies also suggest that the mechanisms for improving performance through the application of noise-control filters may require further investigation.

Subject(s)

Computer-Assisted Image Processing, X-Ray Computed Tomography, Humans, Computer-Assisted Image Processing/methods, Imaging Phantoms, Algorithms

3.

Effect of training-sample size and classification difficulty on the accuracy of genomic predictors.

Popovici, Vlad; Chen, Weijie; Gallas, Brandon G; Hatzis, Christos; Shi, Weiwei; Samuelson, Frank W; Nikolsky, Yuri; Tsyganova, Marina; Ishkin, Alex; Nikolskaya, Tatiana; Hess, Kenneth R; Valero, Vicente; Booser, Daniel; Delorenzi, Mauro; Hortobagyi, Gabriel N; Shi, Leming; Symmans, W Fraser; Pusztai, Lajos.

Breast Cancer Res ; 12(1): R5, 2010.

Article in English | MEDLINE | ID: mdl-20064235

ABSTRACT

INTRODUCTION: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. METHODS: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. RESULTS: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. CONCLUSIONS: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.

Subject(s)

Algorithms, Breast Neoplasms/genetics, Gene Expression Profiling/methods, Area Under Curve, Breast Neoplasms/chemistry, Female, Humans, Estrogen Receptors/analysis, Sample Size

4.

Human observer templates for lesion discrimination tasks.

Abbey, Craig K; Samuelson, Frank W; Zeng, Rongping; Boone, John M; Eckstein, Miguel P; Myers, Kyle J.

Proc SPIE Int Soc Opt Eng ; 11316, 2020 Feb.

Article in English | MEDLINE | ID: mdl-33384465

ABSTRACT

We investigate a series of two-alternative forced-choice (2AFC) discrimination tasks based on malignant features of abnormalities in low-dose lung CT scans. A total of 3 tasks are evaluated, and these consist of a size-discrimination task, a boundary-sharpness task, and an irregular-interior task. Target and alternative signal profiles for these tasks are modulated by one of two system transfer functions and embedded in ramp-spectrum noise that has been apodized for noise control in one of 4 different ways. This gives the resulting images statistical properties that are related to weak ground-glass lesions in axial slices of low-dose lung CT images. We investigate observer performance in these tasks using a combination of statistical efficiency and classification images. We report results of 24 2AFC experiments involving the three tasks. A staircase procedure is used to find the approximate 80% correct discrimination threshold in each task, with a subsequent set of 2,000 trials at this threshold. These data are used to estimate statistical efficiency with respect to the ideal observer for each task, and to estimate the observer template using the classification-image methodology. We find efficiency varies between the different tasks with lowest efficiency in the boundary-sharpness task, and highest efficiency in the non-uniform interior task. All three tasks produce clearly visible patterns of positive and negative weighting in the classification images. The spatial frequency plots of classification images show how apodization results in larger weights at higher spatial frequencies.
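The abstract above estimates statistical efficiency relative to the ideal observer from 2AFC data. A minimal sketch of one common way to do this follows. It assumes equal-variance Gaussian decision variables, so that d' = sqrt(2) * Phi^-1(PC), and defines efficiency as the squared ratio of observer to ideal d'. The paper's exact definition may differ (for example, a ratio of squared thresholds), and the ideal-observer value used here is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def dprime_from_2afc(proportion_correct):
    """Detectability index implied by 2AFC proportion correct
    (assumes equal-variance Gaussian decision variables)."""
    return sqrt(2.0) * NormalDist().inv_cdf(proportion_correct)


def efficiency(dprime_observer, dprime_ideal):
    """Statistical efficiency as the squared ratio of detectability indices."""
    return (dprime_observer / dprime_ideal) ** 2


d_human = dprime_from_2afc(0.80)   # 80% correct, the staircase target
d_ideal = 2.0                      # hypothetical ideal-observer d'
print(d_human, efficiency(d_human, d_ideal))
```

Under these assumptions, 80% correct corresponds to a d' of about 1.19, so an ideal observer with d' = 2.0 on the same stimuli would imply an efficiency of roughly 35%.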

5.

Computational reader design and statistical performance evaluation of an in-silico imaging clinical trial comparing digital breast tomosynthesis with full-field digital mammography.

Zeng, Rongping; Samuelson, Frank W; Sharma, Diksha; Badal, Andreu; Graff, Christian G; Glick, Stephen J; Myers, Kyle J; Badano, Aldo.

J Med Imaging (Bellingham) ; 7(4): 042802, 2020 Jul.

Article in English | MEDLINE | ID: mdl-32118094

ABSTRACT

A recent study reported on an in-silico imaging trial that evaluated the performance of digital breast tomosynthesis (DBT) as a replacement for full-field digital mammography (FFDM) for breast cancer screening. In this in-silico trial, the whole imaging chain was simulated, including the breast phantom generation, the x-ray transport process, and computational readers for image interpretation. We focus on the design and performance characteristics of the computational reader in the above-mentioned trial. Location-known lesion (spiculated mass and clustered microcalcifications) detection tasks were used to evaluate the imaging system performance. The computational readers were designed based on the mechanism of a channelized Hotelling observer (CHO), and the reader models were selected to trend human performance. Parameters were tuned to ensure stable lesion detectability. A convolutional CHO that can adapt a round channel function to irregular lesion shapes was compared with the original CHO and was found to be suitable for detecting clustered microcalcifications but was less optimal in detecting spiculated masses. A three-dimensional CHO that operated on the multiple slices was compared with a two-dimensional (2-D) CHO that operated on three versions of 2-D slabs converted from the multiple slices and was found to be optimal in detecting lesions in DBT. Multireader multicase reader output analysis was used to analyze the performance difference between FFDM and DBT for various breast and lesion types. The results showed that DBT was more beneficial in detecting masses than detecting clustered microcalcifications compared with FFDM, consistent with the finding in a clinical imaging trial. Statistical uncertainty smaller than 0.01 standard error for the estimated performance differences was achieved with a dataset containing approximately 3000 breast phantoms. 
The computational reader design methodology presented provides evidence that model observers can be useful in-silico tools for supporting the performance comparison of breast imaging systems.

6.

Impact of prevalence and case distribution in lab-based diagnostic imaging studies.

Gallas, Brandon D; Chen, Weijie; Cole, Elodia; Ochs, Robert; Petrick, Nicholas; Pisano, Etta D; Sahiner, Berkman; Samuelson, Frank W; Myers, Kyle J.

J Med Imaging (Bellingham) ; 6(1): 015501, 2019 Jan.

Article in English | MEDLINE | ID: mdl-30713851

ABSTRACT

We investigated effects of prevalence and case distribution on radiologist diagnostic performance as measured by area under the receiver operating characteristic curve (AUC) and sensitivity-specificity in lab-based reader studies evaluating imaging devices. Our retrospective reader studies compared full-field digital mammography (FFDM) to screen-film mammography (SFM) for women with dense breasts. Mammograms were acquired from the prospective Digital Mammographic Imaging Screening Trial. We performed five reader studies that differed in terms of cancer prevalence and the distribution of noncancers. Twenty radiologists participated in each reader study. Using split-plot study designs, we collected recall decisions and multilevel scores from the radiologists for calculating sensitivity, specificity, and AUC. Differences in reader-averaged AUCs slightly favored SFM over FFDM (biggest AUC difference: 0.047, SE = 0.023, p = 0.047), where standard error accounts for reader and case variability. The differences were not significant at a level of 0.01 (0.05/5 reader studies). The differences in sensitivities and specificities were also indeterminate. Prevalence had little effect on AUC (largest difference: 0.02), whereas sensitivity increased and specificity decreased as prevalence increased. We found that AUC is robust to changes in prevalence, while radiologists were more aggressive with recall decisions as prevalence increased.
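The significance level of 0.01 quoted in the abstract above is a Bonferroni-style adjustment for five reader studies. The arithmetic can be sketched as follows; the function name is hypothetical, and the p-value is the one reported in the abstract.

```python
def bonferroni_level(family_alpha, n_comparisons):
    """Per-comparison significance level under a Bonferroni correction."""
    return family_alpha / n_comparisons


level = bonferroni_level(0.05, 5)   # five reader studies
reported_p = 0.047                  # smallest p-value quoted in the abstract
print(level, reported_p < level)
```

Because the reported p-value of 0.047 exceeds the adjusted level, the largest AUC difference does not count as significant once the five studies are considered together.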

7.

Classification images for localization performance in ramp-spectrum noise.

Abbey, Craig K; Samuelson, Frank W; Zeng, Rongping; Boone, John M; Eckstein, Miguel P; Myers, Kyle.

Med Phys ; 45(5): 1970-1984, 2018 May.

Article in English | MEDLINE | ID: mdl-29532479

ABSTRACT

PURPOSE: This study investigates forced localization of targets in simulated images with statistical properties similar to trans-axial sections of x-ray computed tomography (CT) volumes. A total of 24 imaging conditions are considered, comprising two target sizes, three levels of background variability, and four levels of frequency apodization. The goal of the study is to better understand how human observers perform forced-localization tasks in images with CT-like statistical properties. METHODS: The transfer properties of CT systems are modeled by a shift-invariant transfer function in addition to apodization filters that modulate high spatial frequencies. The images contain noise that is the combination of a ramp-spectrum component, simulating the effect of acquisition noise in CT, and a power-law component, simulating the effect of normal anatomy in the background, which are modulated by the apodization filter as well. Observer performance is characterized using two psychophysical techniques: efficiency analysis and classification image analysis. Observer efficiency quantifies how much diagnostic information is being used by observers to perform a task, and classification images show how that information is being accessed in the form of a perceptual filter. RESULTS: Psychophysical studies from five subjects form the basis of the results. Observer efficiency ranges from 29% to 77% across the different conditions. The lowest efficiency is observed in conditions with uniform backgrounds, where significant effects of apodization are found. The classification images, estimated using smoothing windows, suggest that human observers use center-surround filters to perform the task, and these are subjected to a number of subsequent analyses. When implemented as a scanning linear filter, the classification images appear to capture most of the observer variability in efficiency (r² = 0.86). 
The frequency spectra of the classification images show that frequency weights generally appear bandpass in nature, with peak frequency and bandwidth that vary with statistical properties of the images. CONCLUSIONS: In these experiments, the classification images appear to capture important features of human-observer performance. Frequency apodization only appears to have a significant effect on performance in the absence of anatomical variability, where the observers appear to underweight low spatial frequencies that have relatively little noise. Frequency weights derived from the classification images generally have a bandpass structure, with adaptation to different conditions seen in the peak frequency and bandwidth. The classification image spectra show relatively modest changes in response to different levels of apodization, with some evidence that observers are attempting to rebalance the apodized spectrum presented to them.

Subject(s)

Computer-Assisted Image Processing/methods, Signal-To-Noise Ratio, Statistics as Topic, X-Ray Computed Tomography

8.

Evaluation of Digital Breast Tomosynthesis as Replacement of Full-Field Digital Mammography Using an In Silico Imaging Trial.

Badano, Aldo; Graff, Christian G; Badal, Andreu; Sharma, Diksha; Zeng, Rongping; Samuelson, Frank W; Glick, Stephen J; Myers, Kyle J.

JAMA Netw Open ; 1(7): e185474, 2018 Nov 02.

Article in English | MEDLINE | ID: mdl-30646401

ABSTRACT

Importance: Expensive and lengthy clinical trials can delay regulatory evaluation of innovative technologies, affecting patient access to high-quality medical products. Simulation is increasingly being used in product development but rarely in regulatory applications. Objectives: To conduct a computer-simulated imaging trial evaluating digital breast tomosynthesis (DBT) as a replacement for digital mammography (DM) and to compare the results with a comparative clinical trial. Design, Setting, and Participants: The simulated Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) trial was designed to replicate a clinical trial that used human patients and radiologists. Images obtained with in silico versions of DM and DBT systems via fast Monte Carlo x-ray transport were interpreted by a computational reader detecting the presence of lesions. A total of 2986 synthetic image-based virtual patients with breast sizes and radiographic densities representative of a screening population and compressed thicknesses from 3.5 to 6 cm were generated using an analytic approach in which anatomical structures are randomly created within a predefined breast volume and compressed in the craniocaudal orientation. A positive cohort contained a digitally inserted microcalcification cluster or spiculated mass. Main Outcomes and Measures: The trial end point was the difference in area under the receiver operating characteristic curve between modalities for lesion detection. The trial was sized for an SE of 0.01 in the change in area under the curve (AUC), half the uncertainty in the comparative clinical trial. Results: In this trial, computational readers analyzed 31 055 DM and 27 960 DBT cases from 2986 virtual patients with the following Breast Imaging Reporting and Data System densities: 286 (9.6%) extremely dense, 1200 (40.2%) heterogeneously dense, 1200 (40.2%) scattered fibroglandular densities, and 300 (10.0%) almost entirely fat. 
The mean (SE) change in AUC was 0.0587 (0.0062) (P < .001) in favor of DBT. The change in AUC was larger for masses (mean [SE], 0.0903 [0.008]) than for calcifications (mean [SE], 0.0268 [0.004]), which was consistent with the findings of the comparative trial (mean [SE], 0.065 [0.017] for masses and -0.047 [0.032] for calcifications). Conclusions and Relevance: The results of the simulated VICTRE trial are consistent with the performance seen in the comparative trial. While further research is needed to assess the generalizability of these findings, in silico imaging trials represent a viable source of regulatory evidence for imaging devices.
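The reported change in AUC and its standard error can be related to the quoted P value with a simple normal-approximation check. The sketch below is illustrative only: it assumes a two-sided z-test, which may not be the inference procedure the trial actually used, and the function name is hypothetical.

```python
from statistics import NormalDist


def z_test(estimate, standard_error, confidence=0.95):
    """Two-sided z-test and confidence interval under a normal approximation."""
    z = estimate / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(0.5 + confidence / 2.0) * standard_error
    return z, p_value, (estimate - half_width, estimate + half_width)


# Overall change in AUC and SE as reported in the abstract.
z, p, ci = z_test(0.0587, 0.0062)
print(z, p < 0.001, ci)
```

With these inputs the z statistic is about 9.5, so the P value is far below .001 (it may print as 0.0 in floating point), and the approximate 95% interval excludes zero.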

Subject(s)

Mammography/methods, Mammography/standards, Breast/diagnostic imaging, Breast Neoplasms/diagnostic imaging, Calcinosis/diagnostic imaging, Computer Simulation, Female, Humans, ROC Curve

9.

Inter-laboratory comparison of channelized hotelling observer computation.

Ba, Alexandre; Abbey, Craig K; Baek, Jongduk; Han, Minah; Bouwman, Ramona W; Balta, Christiana; Brankov, Jovan; Massanes, Francesc; Gifford, Howard C; Hernandez-Giron, Irene; Veldkamp, Wouter J H; Petrov, Dimitar; Marshall, Nicholas; Samuelson, Frank W; Zeng, Rongping; Solomon, Justin B; Samei, Ehsan; Timberg, Pontus; Förnvik, Hannie; Reiser, Ingrid; Yu, Lifeng; Gong, Hao; Bochud, François O.

Med Phys ; 45(7): 3019-3030, 2018 Jul.

Article in English | MEDLINE | ID: mdl-29704868

ABSTRACT

PURPOSE: The task-based assessment of image quality using model observers is increasingly used for the assessment of different imaging modalities. However, the performance computation of model observers needs standardization as well as a well-established trust in its implementation methodology and uncertainty estimation. The purpose of this work was to determine the degree of equivalence of the channelized Hotelling observer performance and uncertainty estimation using an intercomparison exercise. MATERIALS AND METHODS: Image samples to estimate model observer performance for detection tasks were generated from two-dimensional CT image slices of a uniform water phantom. A common set of images was sent to participating laboratories to perform and document the following tasks: (a) estimate the detectability index of a well-defined CHO and its uncertainty in three conditions involving different sized targets all at the same dose, and (b) apply this CHO to an image set where ground truth was unknown to participants (lower image dose). In addition, and on an optional basis, we asked the participating laboratories to (c) estimate the performance of real human observers from a psychophysical experiment of their choice. Each of the 13 participating laboratories was confidentially assigned a participant number and image sets could be downloaded through a secure server. Results were distributed with each participant recognizable by its number, and each laboratory was then able to modify its results with justification, because model observer calculations are not yet routine and are potentially error-prone. RESULTS: Detectability index increased with signal size for all participants and was very consistent for the 6 mm target while showing higher variability for the 8 and 10 mm targets. The lowest and largest uncertainty estimates differed by one order of magnitude. 
CONCLUSIONS: This intercomparison helped define the state of the art of model observer performance computation and with thirteen participants, reflects openness and trust within the medical imaging community. The performance of a CHO with explicitly defined channels and a relatively large number of test images was consistently estimated by all participants. In contrast, the paper demonstrates that there is no agreement on estimating the variance of detectability in the training and testing setting.

Subject(s)

Computer-Assisted Image Processing, Laboratories, X-Ray Computed Tomography, Observer Variation, Uncertainty

10.

The Reproducibility of Changes in Diagnostic Figures of Merit Across Laboratory and Clinical Imaging Reader Studies.

Samuelson, Frank W; Abbey, Craig K.

Acad Radiol ; 24(11): 1436-1446, 2017 Nov.

Article in English | MEDLINE | ID: mdl-28666723

ABSTRACT

RATIONALE AND OBJECTIVES: In this paper we examine which comparisons of reading performance between diagnostic imaging systems made in controlled retrospective laboratory studies may be representative of what we observe in later clinical studies. The change in a meaningful diagnostic figure of merit between two diagnostic modalities should be qualitatively or quantitatively comparable across all kinds of studies. MATERIALS AND METHODS: In this meta-study we examine the reproducibility of relative measures of sensitivity, false positive fraction (FPF), area under the receiver operating characteristic (ROC) curve, and expected utility across laboratory and observational clinical studies for several different breast imaging modalities, including screen film mammography, digital mammography, breast tomosynthesis, and ultrasound. RESULTS: Across studies of all types, the changes in the FPFs yielded very small probabilities of having a common mean value. The probabilities of relative sensitivity being the same across ultrasound and tomosynthesis studies were low. No evidence was found for different mean values of relative area under the ROC curve or relative expected utility within any of the study sets. CONCLUSION: The comparison demonstrates that the ratios of areas under the ROC curve and expected utilities are reproducible across laboratory and clinical studies, whereas sensitivity and FPF are not.

Subject(s)

Breast Neoplasms/diagnostic imaging, Breast/diagnostic imaging, Mammography/methods, Mammary Ultrasonography, Area Under Curve, Female, Humans, ROC Curve, Reproducibility of Results

11.

A Utility/Cost Analysis of Breast Cancer Risk Prediction Algorithms.

Abbey, Craig K; Wu, Yirong; Burnside, Elizabeth S; Wunderlich, Adam; Samuelson, Frank W; Boone, John M.

Proc SPIE Int Soc Opt Eng ; 9787, 2016 Feb 27.

Article in English | MEDLINE | ID: mdl-27335532

ABSTRACT

Breast cancer risk prediction algorithms are used to identify subpopulations that are at increased risk for developing breast cancer. They can be based on many different sources of data such as demographics, relatives with cancer, gene expression, and various phenotypic features such as breast density. Women who are identified as high risk may undergo a more extensive (and expensive) screening process that includes MRI or ultrasound imaging in addition to the standard full-field digital mammography (FFDM) exam. Given that there are many ways that risk prediction may be accomplished, it is of interest to evaluate them in terms of expected cost, which includes the costs of diagnostic outcomes. In this work we perform an expected-cost analysis of risk prediction algorithms that is based on a published model that includes the costs associated with diagnostic outcomes (true-positive, false-positive, etc.). We assume the existence of a standard screening method and an enhanced screening method with higher scan cost, higher sensitivity, and lower specificity. We then assess expected cost of using a risk prediction algorithm to determine who gets the enhanced screening method under the strong assumption that risk and diagnostic performance are independent. We find that if risk prediction leads to a high enough positive predictive value, it will be cost-effective regardless of the size of the subpopulation. Furthermore, in terms of the hit-rate and false-alarm rate of the risk-prediction algorithm, iso-cost contours are lines with slope determined by properties of the available diagnostic systems for screening.

12.

Comparison of semiparametric receiver operating characteristic models on observer data.

Samuelson, Frank W; He, Xin.

J Med Imaging (Bellingham) ; 1(3): 031004, 2014 Oct.

Article in English | MEDLINE | ID: mdl-26158046

ABSTRACT

The evaluation of medical imaging devices often involves studies that measure the ability of observers to perform a signal detection task on images obtained from those devices. Data from such studies are frequently regressed ordinally using two-sample receiver operating characteristic (ROC) models. We applied some of these models to a number of randomly chosen data sets from medical imaging and evaluated how well they fit using the Akaike and Bayesian information criteria and cross-validation. We find that for many observer data sets, a single-parameter model is sufficient and that only some studies exhibit evidence for the use of models with more than a single parameter. In particular, the single-parameter power-law model frequently well describes observer data. The power-law model has an asymmetric ROC curve and a constant mean-to-sigma ratio seen in studies analyzed with the bi-normal model. It is identical or very similar to special cases of other two-parameter models.
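The abstract above compares ROC models using the Akaike and Bayesian information criteria and highlights a single-parameter power-law model. The sketch below illustrates both ideas under stated assumptions: the parameterization TPF = FPF^exponent with 0 < exponent <= 1 is assumed here and may differ from the paper's, and the log-likelihoods in the usage example are hypothetical, not fitted values.

```python
from math import log


def power_law_tpf(fpf, exponent):
    """Power-law ROC curve, TPF = FPF ** exponent (assumed parameterization)."""
    return fpf ** exponent


def power_law_auc(exponent):
    """Area under TPF = FPF ** exponent on [0, 1], i.e. 1 / (1 + exponent)."""
    return 1.0 / (1.0 + exponent)


def aic(n_params, log_likelihood):
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(n_params, n_obs, log_likelihood):
    """Bayesian information criterion; lower is better."""
    return n_params * log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits: a one-parameter and a two-parameter model on 50 cases.
print(power_law_auc(0.25))
print(aic(1, -100.0), aic(2, -99.5))
print(bic(1, 50, -100.0), bic(2, 50, -99.5))
```

In this made-up comparison the second parameter improves the log-likelihood by only 0.5, so both criteria favor the single-parameter model, which mirrors the kind of conclusion the abstract describes.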

13.

Comparative statistical properties of expected utility and area under the ROC curve for laboratory studies of observer performance in screening mammography.

Abbey, Craig K; Gallas, Brandon D; Boone, John M; Niklason, Loren T; Hadjiiski, Lubomir M; Sahiner, Berkman; Samuelson, Frank W.

Acad Radiol ; 21(4): 481-90, 2014 Apr.

Article in English | MEDLINE | ID: mdl-24594418

ABSTRACT

RATIONALE AND OBJECTIVES: Our objective is to determine whether expected utility (EU) and the area under the receiver operator characteristic curve (AUC) are consistent with one another as endpoints of observer performance studies in mammography. These two measures characterize receiver operator characteristic performance somewhat differently. We compare these two study endpoints at the level of individual reader effects, statistical inference, and components of variance across readers and cases. MATERIALS AND METHODS: We reanalyze three previously published laboratory observer performance studies that investigate various x-ray breast imaging modalities using EU and AUC. The EU measure is based on recent estimates of relative utility for screening mammography. RESULTS: The AUC and EU measures are correlated across readers for individual modalities (r = 0.93) and differences in modalities (r = 0.94 to 0.98). Statistical inference for modality effects based on multi-reader multi-case analysis is very similar, with significant results (P < .05) in exactly the same conditions. Power analyses show mixed results across studies, with a small increase in power on average for EU that corresponds to approximately a 7% reduction in the number of readers. Despite a large number of crossing receiver operator characteristic curves (59% of readers), modality effects only rarely have opposite signs for EU and AUC (6%). CONCLUSIONS: We do not find any evidence of systematic differences between EU and AUC in screening mammography observer studies. Thus, when utility approaches are viable (i.e., an appropriate value of relative utility exists), practical effects such as statistical efficiency may be used to choose study endpoints.

Subject(s)

Breast Neoplasms/diagnostic imaging, Clinical Competence/statistics & numerical data, Mammography/statistics & numerical data, Mass Screening/statistics & numerical data, ROC Curve, Radiographic Image Interpretation, Computer-Assisted/methods, Clinical Laboratory Techniques/statistics & numerical data, Data Interpretation, Statistical, Female, Humans, Observer Variation, Reproducibility of Results, Sensitivity and Specificity
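
Illustrative note on the record above: the two endpoints it compares can be sketched in a few lines of Python. This is a minimal sketch and not the authors' procedure. It assumes the AUC is estimated nonparametrically (the Mann-Whitney statistic) and that EU takes the form of the largest value of TPF - beta * FPF along the empirical ROC curve, where beta stands for an iso-utility slope. The function names, the value of beta, and the scores are all hypothetical.

```python
from typing import Sequence


def empirical_auc(neg: Sequence[float], pos: Sequence[float]) -> float:
    """Mann-Whitney estimate of AUC: fraction of (pos, neg) pairs in which
    the positive case scores higher, with ties counted as one half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_utility(neg: Sequence[float], pos: Sequence[float], beta: float) -> float:
    """Assumed EU form: max over empirical ROC points of TPF - beta * FPF.
    beta is a hypothetical iso-utility slope, not a value from the record."""
    best = 0.0  # operating point that calls no case positive: TPF = FPF = 0
    for t in sorted(set(neg) | set(pos)):
        tpf = sum(s >= t for s in pos) / len(pos)  # true-positive fraction
        fpf = sum(s >= t for s in neg) / len(neg)  # false-positive fraction
        best = max(best, tpf - beta * fpf)
    return best


# Made-up ratings for one reader (higher means more suspicious).
neg_scores = [1, 2, 2, 3]
pos_scores = [2, 3, 4, 4]
auc = empirical_auc(neg_scores, pos_scores)
eu = expected_utility(neg_scores, pos_scores, beta=1.0)
print(f"AUC = {auc:.3f}, EU = {eu:.3f}")
```

AUC summarizes the whole curve, whereas this EU form depends only on the single operating point favored by the chosen slope, which is one way two curves that cross can be ranked differently by the two measures.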

14.

Inference based on diagnostic measures from studies of new imaging devices.

Samuelson, Frank W.

Acad Radiol ; 20(7): 816-24, 2013 Jul.

Article in English | MEDLINE | ID: mdl-23643364

ABSTRACT

RATIONALE AND OBJECTIVES: Before using a new diagnostic imaging device regularly in a clinic, it should be studied using patients and radiologists. Often such studies report diagnostic performance in terms of sensitivity, specificity, area under the receiver operating characteristic curve (AUC), or differences thereof. In this report we look at how these studies differ from actual future clinical practice and how those differences may affect reported performance measures. MATERIALS AND METHODS: We review signal detection (receiver operating characteristic) theory and decision theory. We compare diagnostic measures from several published studies in medical imaging and examine how they relate to theory and each other. RESULTS: We see that clinical decisions can be modeled using signal detection and decision theories. Sensitivity and specificity are inextricably linked with clinical factors, such as prevalence and costs. Imaging devices are used in many different ways in clinical practice, so that sensitivities, specificities, and AUCs measured in studies of new diagnostic imaging devices will differ from those in actual future clinical use. CONCLUSIONS: Measured sensitivities, specificities, and the directions of changes thereof are not necessarily consistent or reproducible across studies of new diagnostic devices. A change in the AUC, which should be independent of clinical costs or prevalence, is a consistent measure across similar studies, and a positive change in AUC is indicative of additional diagnostic information that will be available to radiologists in a future clinical environment.

Subject(s)

Area Under Curve, Breast Neoplasms/diagnostic imaging, Mammography/statistics & numerical data, Signal Processing, Computer-Assisted, Diagnosis, Differential, Diagnostic Imaging/methods, Diagnostic Imaging/statistics & numerical data, Female, Humans, Mammography/methods, Reproducibility of Results, Sensitivity and Specificity

15.

Statistical power considerations for a utility endpoint in observer performance studies.

Abbey, Craig K; Samuelson, Frank W; Gallas, Brandon D.

Acad Radiol ; 20(7): 798-806, 2013 Jul.

Article in English | MEDLINE | ID: mdl-23611439

ABSTRACT

RATIONALE AND OBJECTIVES: The purpose of this investigation is to compare the statistical power of the most common measure of performance for observer performance studies, area under the ROC curve (AUC), to an expected utility (EU) endpoint. MATERIALS AND METHODS: We have modified a well-known simulation procedure developed by Roe and Metz for statistical power analysis in receiver operating characteristic (ROC) studies. Starting from a set of baseline simulations, we investigate the effects of three parameters that describe properties of the observers (iso-utility slope, unequal variance, and tendency to favor more aggressive or conservative actions) and three parameters that affect experimental design (number of readers, number of cases, and fraction of positive cases). RESULTS: The EU endpoint generally has good statistical power relative to AUC in our simulations. Of 396 total conditions simulated, EU had higher statistical power in 377 cases (95%). In 246 of these cases, EU power was 5 percentage points or more higher than AUC power. In simulation runs evaluating the effect of the number of readers and cases on the baseline simulations, the EU measure had equivalent power to AUC with fewer readers (9% to 28%) or fewer cases (18% to 41%). CONCLUSION: These simulation studies provide further motivation for considering EU in studies of screening mammography technology, and they motivate investigations of utility in other diagnostic tasks.

Subject(s)

Models, Statistical, ROC Curve, Radiology/methods, Radiology/statistics & numerical data, Statistics as Topic/methods, Computer Simulation

16.

Investigation of reading mode and relative sensitivity as factors that influence reader performance when using computer-aided detection software.

Paquerault, Sophie; Samuelson, Frank W; Petrick, Nicholas; Myers, Kyle J; Smith, Robert C.

Acad Radiol ; 16(9): 1095-107, 2009 Sep.

Article in English | MEDLINE | ID: mdl-19523855

ABSTRACT

RATIONALE AND OBJECTIVES: The aim of this study was to investigate the effects of relative sensitivity (reader without computer-aided detection [CAD] vs stand-alone CAD) and reading mode on reader performance when using CAD software. MATERIALS AND METHODS: Two sets of 100 images (low-contrast and high-contrast sets) were created by adding low-contrast or high-contrast simulated masses to random locations in 100 normal mammograms. This produced a relative sensitivity that was substantially less for the low-contrast set and similar for the high-contrast set. Seven readers reviewed every image in each set and specified location and probability scores using three reading modes (without CAD, second read with CAD, and concurrent read with CAD). Reader detection accuracy was analyzed using areas under free-response receiver operating characteristic curves, sensitivity, and the number of false-positive findings per image. RESULTS: For the low-contrast set, average differences in areas under free-response receiver operating characteristic curves, sensitivity, and false-positive findings per image without CAD were 0.02, 0.12, and 0.11, respectively, compared to second read and 0.05, 0.17, and 0.09 (not statistically significant), respectively, compared to concurrent read. For the high-contrast set, average differences were 0.002 (not statistically significant), 0.04, and 0.05, respectively, compared to second read and -0.004 (not statistically significant), 0.04, and 0.08 (not statistically significant), respectively, compared to concurrent read (all differences were statistically significant except as noted). Differences were greater in the low-contrast set than the high-contrast set. Differences between second read and concurrent read were not significant. CONCLUSIONS: Relative sensitivity is a critical factor that determines incremental improvement in reader performance when using CAD and appears to be more important than reading mode. Relative sensitivity may determine the clinical usefulness of CAD in different clinical applications and for different types of users.

Subject(s)

Algorithms, Artifacts, Breast Neoplasms/diagnostic imaging, Mammography/methods, Pattern Recognition, Automated/methods, Radiographic Image Interpretation, Computer-Assisted/methods, Software, Artificial Intelligence, Female, Humans, Observer Variation, Radiographic Image Enhancement/methods, Reproducibility of Results, Sensitivity and Specificity, Software Validation
