ABSTRACT
PURPOSE: The authors developed scaling methods that monotonically transform the output of one classifier to the "scale" of another. Such transformations affect the distribution of classifier output while leaving the ROC curve unchanged. In particular, they investigated transformations between radiologists and computer classifiers, with the goal of addressing the problem of comparing and interpreting case-specific values of output from two classifiers. METHODS: Using both simulated and radiologists' rating data of breast imaging cases, the authors investigated a likelihood-ratio-scaling transformation, based on "matching" classifier likelihood ratios. For comparison, three other scaling transformations were investigated that were based on matching classifier true positive fraction, false positive fraction, or cumulative distribution function, respectively. The authors explored modifying the computer output to reflect the scale of the radiologist, as well as modifying the radiologist's ratings to reflect the scale of the computer. They also evaluated how dataset size affects the transformations. RESULTS: When ROC curves of two classifiers differed substantially, the four transformations were found to be quite different. The likelihood-ratio scaling transformation was found to vary widely from radiologist to radiologist. Similar results were found for the other transformations. Our simulations explored the effect of database sizes on the accuracy of the estimation of our scaling transformations. CONCLUSIONS: The likelihood-ratio-scaling transformation that the authors have developed and evaluated was shown to be capable of transforming computer and radiologist outputs to a common scale reliably, thereby allowing the comparison of the computer and radiologist outputs on the basis of a clinically relevant statistic.
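The statement that a monotonic scaling transformation changes the distribution of classifier output but leaves the ROC curve unchanged can be checked numerically. The following is a minimal sketch, not the authors' likelihood-ratio scaling: the scores, the `rescale` function, and the helper name `wilcoxon_auc` are hypothetical, chosen only to show that any strictly increasing map preserves the rank ordering and therefore the empirical AUC.

```python
import math


def wilcoxon_auc(neg, pos):
    """Fraction of (negative, positive) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for n in neg:
        for p in pos:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(neg) * len(pos))


# Hypothetical classifier outputs on a 0-1 scale (not data from the study).
negatives = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55]
positives = [0.30, 0.45, 0.60, 0.70, 0.85, 0.95]


def rescale(x):
    """An arbitrary strictly increasing map (here, the logit) to another scale."""
    return math.log(x / (1.0 - x))


auc_before = wilcoxon_auc(negatives, positives)
auc_after = wilcoxon_auc([rescale(x) for x in negatives],
                         [rescale(x) for x in positives])
```

Because only the ordering of scores enters the calculation, `auc_before` and `auc_after` are identical even though the two sets of output values look very different case by case.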
Subjects
Breast Neoplasms/diagnosis; Diagnosis, Computer-Assisted/methods; Area Under Curve; Bayes Theorem; Computer Graphics; Humans; Likelihood Functions; Retrospective Studies; Sensitivity and Specificity
ABSTRACT
PURPOSE: To determine whether use of bone suppression (BS) imaging, used together with a standard radiograph, could improve radiologists' performance for detection of small lung cancers compared with use of standard chest radiographs alone and whether BS imaging would provide accuracy equivalent to that of dual-energy subtraction (DES) radiography. MATERIALS AND METHODS: Institutional review board approval was obtained. The requirement for informed consent was waived. The study was HIPAA compliant. Standard and DES chest radiographs of 50 patients with 55 confirmed primary nodular cancers (mean diameter, 20 mm) as well as 30 patients without cancers were included in the observer study. A new BS imaging processing system that can suppress the conspicuity of bones was applied to the standard radiographs to create corresponding BS images. Ten observers, including six experienced radiologists and four radiology residents, indicated their confidence levels regarding the presence or absence of a lung cancer for each lung, first by using a standard image, then a BS image, and finally DES soft-tissue and bone images. Receiver operating characteristic (ROC) analysis was used to evaluate observer performance. RESULTS: The average area under the ROC curve (AUC) for all observers was significantly improved from 0.807 to 0.867 with BS imaging and to 0.916 with DES (both P < .001). The average AUC for the six experienced radiologists was significantly improved from 0.846 with standard images to 0.894 with BS images (P < .001) and from 0.894 to 0.945 with DES images (P = .001). CONCLUSION: Use of BS imaging together with a standard radiograph can improve radiologists' accuracy for detection of small lung cancers on chest radiographs. Further improvements can be achieved by use of DES radiography but with the requirement for special equipment and a potential small increase in radiation dose.
Subjects
Lung Neoplasms/diagnostic imaging; Radiography, Dual-Energy Scanned Projection/instrumentation; Radiography, Thoracic/instrumentation; Adult; Aged; Aged, 80 and over; Female; Humans; Male; Middle Aged; ROC Curve; Radiographic Image Interpretation, Computer-Assisted; Subtraction Technique
ABSTRACT
PURPOSE: To provide a broad perspective concerning the recent use of receiver operating characteristic (ROC) analysis in medical imaging by reviewing ROC studies published in Radiology between 1997 and 2006 for experimental design, imaging modality, medical condition, and ROC paradigm. MATERIALS AND METHODS: Two hundred ninety-five studies were obtained by conducting a literature search with PubMed with two criteria: publication in Radiology between 1997 and 2006 and occurrence of the phrase "receiver operating characteristic." Studies returned by the query that were not diagnostic imaging procedure performance evaluations were excluded. Characteristics of the remaining studies were tabulated. RESULTS: Two hundred thirty-three (79.0%) of the 295 studies reported findings based on observers' diagnostic judgments or objective measurements. Forty-three (14.6%) did not include human observers, with most of these reporting an evaluation of a computer-aided diagnosis system or functional data obtained with computed tomography (CT) or magnetic resonance (MR) imaging. The remaining 19 (6.4%) studies were classified as reviews or meta-analyses and were excluded from our subsequent analysis. Among the various imaging modalities, MR imaging (46.0%) and CT (25.7%) were investigated most frequently. Approximately 60% (144 of 233) of ROC studies with human observers published in Radiology included three or fewer observers. CONCLUSION: ROC analysis is widely used in radiologic research, confirming its fundamental role in assessing diagnostic performance. However, the ROC studies reported in Radiology were not always adequate to support clear and clinically relevant conclusions.
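The percentages reported in the RESULTS can be reproduced from the stated counts. The short sketch below uses only the numbers given in the abstract; the variable and function names are illustrative.

```python
# Counts taken directly from the abstract.
total_studies = 295
with_observer_judgments = 233
without_human_observers = 43
reviews_or_meta_analyses = 19


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


shares = {
    "observer judgments or objective measurements": pct(with_observer_judgments, total_studies),
    "no human observers": pct(without_human_observers, total_studies),
    "reviews or meta-analyses": pct(reviews_or_meta_analyses, total_studies),
}

# "Approximately 60% (144 of 233)" of observer studies had three or fewer observers.
three_or_fewer_share = pct(144, with_observer_judgments)
```

The three categories sum to 295, and the "approximately 60%" figure corresponds to 61.8% when computed exactly.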
Subjects
Biomedical Research; Diagnostic Imaging/statistics & numerical data; ROC Curve; Radiology; Humans; Periodicals as Topic; Software
ABSTRACT
PURPOSE: To retrospectively determine the sensitivity of and number of false-positive marks made by a commercially available computer-aided detection (CAD) system for identifying lung cancers previously missed on chest radiographs by radiologists, with histopathologic results as the reference standard. MATERIALS AND METHODS: Institutional review board approval was obtained for this HIPAA-compliant study; the requirement for informed patient consent was waived. A CAD nodule detection program was applied to 34 posteroanterior digital chest radiographs obtained in 34 patients (21 men, 13 women; mean age, 69 years). All 34 radiographs showed a nodular lung cancer that was apparent in retrospect but had not been mentioned in the report. Two radiologists identified these radiologist-missed cancers on the chest radiographs and graded them for visibility, location, subtlety (extremely subtle to extremely obvious on a 10-point scale), and actionability (actionable or not actionable according to whether the radiologists probably would have recommended follow-up if the nodule had been detected). The CAD results were analyzed to determine the numbers of cancers and false-positive nodules marked and to correlate the CAD results with the nodule grades for subtlety and actionability. The χ² test or Fisher exact test for independence was used to compare CAD sensitivity between the very subtle (grade 1-3) and relatively obvious (grade > 3) cancers and between the actionable and not actionable cancers. RESULTS: The CAD program had an overall sensitivity of 35% (12 of 34 cancers), identifying seven (30%) of 23 very subtle and five (45%) of 11 relatively obvious radiologist-missed cancers (P = .21) and detecting two (25%) of eight missed not actionable and ten (38%) of 26 missed actionable cancers (P = .33). The CAD program made an average of 5.9 false-positive marks per radiograph. CONCLUSION: The described CAD system can mark a substantial proportion of visually subtle lung cancers that are likely to be missed by radiologists.
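The sensitivities quoted in the RESULTS follow from the counts in the same sentence. The sketch below recomputes them as a consistency check; it uses only numbers from the abstract, and the helper name is illustrative. It does not attempt to reproduce the reported P values.

```python
def sensitivity_pct(detected, total):
    """Sensitivity as a whole-number percentage."""
    return round(100.0 * detected / total)


overall = sensitivity_pct(12, 34)
very_subtle = sensitivity_pct(7, 23)
relatively_obvious = sensitivity_pct(5, 11)
not_actionable = sensitivity_pct(2, 8)
actionable = sensitivity_pct(10, 26)

# An average of 5.9 false-positive marks on each of 34 radiographs
# corresponds to roughly 200 false-positive marks in total.
approx_false_positive_marks = 5.9 * 34
```

Both subgroup splits (subtlety and actionability) add up to the same 12 detected cancers out of 34.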
Subjects
Diagnosis, Computer-Assisted; Lung Neoplasms/diagnostic imaging; Aged; Aged, 80 and over; False Positive Reactions; Female; Humans; Male; Middle Aged; Radiography, Thoracic; Retrospective Studies; Sensitivity and Specificity
ABSTRACT
RATIONALE AND OBJECTIVES: To investigate the effect of different reporting methods and performance measures on the assessment of the benefit of computer-aided diagnosis (CAD) in characterizing malignant and benign breast lesions on mammography and sonography. MATERIALS AND METHODS: In a previous study, 10 observers provided three types of reporting data (probability of malignancy [PM] estimates, Breast Imaging Reporting and Data System [BI-RADS] ratings, and biopsy decisions), both without and with CAD. The current study compares alternative performance measures computed from the three types of reporting data. The area under the receiver operating characteristic curve (AUC) was computed from both the PM estimates and the BI-RADS ratings, whereas sensitivity and specificity were computed from all three data types. Sensitivity and specificity values calculated from either the PM estimates or the BI-RADS ratings were determined by setting both constant and user-dependent thresholds. Student's t-tests were used to evaluate the statistical significance of the differences in the performance measures without and with CAD. RESULTS: The average AUC values of the 10 observers calculated from either PM estimates or BI-RADS ratings demonstrated statistically significant improvements in performance with CAD, increasing from 0.87 to 0.92 or 0.93, respectively. However, the statistical significance of improvements in sensitivity or specificity depended on the type of reporting data used. CONCLUSIONS: Use of different types of reporting data in the computation of sensitivity and specificity may result in different conclusions concerning the benefit of CAD. Meaningful determination of sensitivity and specificity from PM estimates requires the use of user-dependent thresholds.
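The difference between constant and user-dependent thresholds can be illustrated with a small sketch. All ratings below are hypothetical (they are not study data), and the threshold values are assumptions chosen for illustration: observer B ranks cases as well as observer A but uses a compressed part of the PM scale, so a single constant threshold misrepresents that observer's operating point.

```python
def sens_spec(scores, truth, threshold):
    """Sensitivity and specificity when 'score >= threshold' is called positive."""
    tp = sum(1 for s, t in zip(scores, truth) if t == 1 and s >= threshold)
    fn = sum(1 for s, t in zip(scores, truth) if t == 1 and s < threshold)
    tn = sum(1 for s, t in zip(scores, truth) if t == 0 and s < threshold)
    fp = sum(1 for s, t in zip(scores, truth) if t == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# 1 = malignant, 0 = benign (hypothetical cases).
truth = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical probability-of-malignancy estimates (percent) from two observers.
observer_a = [60, 80, 90, 40, 10, 20, 30, 50]
observer_b = [20, 30, 35, 12, 2, 5, 8, 15]   # same task, compressed scale

constant_a = sens_spec(observer_a, truth, 50)   # constant threshold
constant_b = sens_spec(observer_b, truth, 50)   # same constant threshold
user_b = sens_spec(observer_b, truth, 10)       # threshold tailored to observer B
```

With the constant threshold, observer B appears to have zero sensitivity; with a threshold matched to that observer's use of the scale, the operating point becomes comparable to observer A's.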
Subjects
Breast Neoplasms/diagnostic imaging; Diagnosis, Computer-Assisted/statistics & numerical data; Documentation/statistics & numerical data; Mammography; Observer Variation; Ultrasonography, Mammary; Area Under Curve; Biopsy; Humans; ROC Curve; Sensitivity and Specificity
ABSTRACT
RATIONALE AND OBJECTIVES: The Dorfman-Berbaum-Metz (DBM) method has been one of the most popular methods for analyzing multireader receiver-operating characteristic (ROC) studies since it was proposed in 1992. Despite its popularity, the original procedure has several drawbacks: it is limited to jackknife accuracy estimates, it is substantially conservative, and it is not based on a satisfactory conceptual or theoretical model. Recently, solutions to these problems have been presented in three papers. Our purpose is to summarize and provide an overview of these recent developments. MATERIALS AND METHODS: We present and discuss the recently proposed solutions for the various drawbacks of the original DBM method. RESULTS: We compare the solutions in a simulation study and find that they result in improved performance for the DBM procedure. We also compare the solutions using two real data studies and find that the modified DBM procedure that incorporates these solutions yields more significant results and clearer interpretations of the variance component parameters than the original DBM procedure. CONCLUSIONS: We recommend using the modified DBM procedure that incorporates the recent developments.
Subjects
Algorithms; ROC Curve; Radiography/statistics & numerical data; Adult; Analysis of Variance; Aortic Dissection/diagnosis; Aortic Aneurysm/diagnosis; Computer Simulation; Humans; Infant, Newborn; Infant, Newborn, Diseases/diagnosis; Radiology Information Systems
ABSTRACT
We have shown previously that an N-class ideal observer achieves the optimal receiver operating characteristic (ROC) hypersurface in a Neyman-Pearson sense. Due to the inherent complexity of evaluating observer performance even in a three-class classification task, some researchers have suggested a generally incomplete but more tractable evaluation in terms of a surface, plotting only the three "sensitivities." More generally, one can evaluate observer performance with a single sensitivity or misclassification probability as a function of two linear combinations of sensitivities or misclassification probabilities. We analyzed four such formulations including the "sensitivity" surface. In each case, we applied the Neyman-Pearson criterion to find the observer which achieves optimal performance with respect to each given set of "performance description variables" under consideration. In the unrestricted case, optimization with respect to the Neyman-Pearson criterion yields the ideal observer, as does maximization of the observer's expected utility. Moreover, during our consideration of the restricted cases, we found that the two optimization methods do not merely yield the same observer, but are in fact completely equivalent in a mathematical sense. Thus, for a wide variety of observers which maximize performance with respect to a restricted ROC surface in the Neyman-Pearson sense, that ROC surface can also be shown to provide a complete description of the observer's performance in an expected utility sense.
Subjects
Algorithms; Image Enhancement/methods; Image Interpretation, Computer-Assisted/methods; Imaging, Three-Dimensional/methods; ROC Curve; Observer Variation; Reproducibility of Results; Sensitivity and Specificity
ABSTRACT
RATIONALE AND OBJECTIVES: Estimation of ROC curves and their associated indices from experimental data can be problematic, especially in multireader, multicase (MRMC) observer studies. Wilcoxon estimates of area under the curve (AUC) can be strongly biased with categorical data, whereas the conventional binormal ROC curve-fitting model may produce unrealistic fits. The "proper" binormal model (PBM) was introduced by Metz and Pan to provide acceptable fits for both sturdy and problematic datasets, but other investigators found that its first software implementation was numerically unstable in some situations. Therefore, we created an entirely new algorithm to implement the PBM. MATERIALS AND METHODS: This paper describes in detail the new PBM curve-fitting algorithm, which was designed to perform successfully in all problematic situations encountered previously. Extensive testing was conducted also on a broad variety of simulated and real datasets. Windows, Linux, and Apple Macintosh OS X versions of the algorithm are available online at http://xray.bsd.uchicago.edu/krl/. RESULTS: Plots of fitted curves as well as summaries of AUC estimates and their standard errors are reported. The new algorithm never failed to converge and produced good fits for all of the several million datasets on which it was tested. For all but the most problematic datasets, the algorithm also produced very good estimates of AUC standard error. The AUC estimates compared well with Wilcoxon estimates for continuously distributed data and are expected to be superior for categorical data. CONCLUSION: This implementation of the PBM is reliable in a wide variety of ROC curve-fitting tasks.
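One commonly cited way in which a conventional binormal fit can look unrealistic is a "hook": when the slope parameter differs from 1, the fitted curve crosses the chance diagonal. The sketch below illustrates this with the conventional binormal parameterization, TPF = Φ(a + b·Φ⁻¹(FPF)) and AUC = Φ(a/√(1 + b²)). The parameter values are hypothetical, and the code does not implement the proper binormal model or the algorithm described in the abstract.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def binormal_tpf(fpf, a, b):
    """True-positive fraction on the conventional binormal ROC curve."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))


def binormal_auc(a, b):
    """Area under the conventional binormal ROC curve."""
    return _STD_NORMAL.cdf(a / (1.0 + b * b) ** 0.5)


# Hypothetical fitted parameters with a slope far from 1.
a, b = 0.5, 0.3

auc = binormal_auc(a, b)
tpf_near_corner = binormal_tpf(0.999, a, b)

# A curve that never falls below chance would satisfy TPF >= FPF everywhere.
has_hook = tpf_near_corner < 0.999
```

For these parameters the AUC is above 0.5, yet near the upper-right corner the curve dips below the diagonal, which is the kind of fit the proper binormal model is intended to avoid.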
Subjects
Algorithms; Computer Simulation; Models, Statistical; ROC Curve; Area Under Curve; Humans; Likelihood Functions; Reproducibility of Results; Software
ABSTRACT
This article reviews the central issues that arise in the assessment of diagnostic imaging and computer-assist modalities. The paradigm of the receiver operating characteristic (ROC) curve--the dependence of the true-positive fraction versus the false-positive fraction as a function of the level of aggressiveness of the reader/radiologist toward a positive call--is essential to this field because diagnostic imaging systems are used in multiple settings, including controlled laboratory studies in which the prevalence of disease is different from that encountered in a study in the field. The basic equation of statistical decision theory is used to display how readers can vary their level of aggressiveness according to this diagnostic context. Most studies of diagnostic modalities in the last 15 years have demonstrated not only a range of levels of reader aggressiveness, but also a range of level of reader performance. These characteristics require a multivariate approach to ROC analysis that accounts for both the variation of case difficulty and the variation of reader skill in a study. The resulting paradigm is called the multiple-reader, multiple-case ROC paradigm. Highlights of historic as well as contemporary work in this field are reviewed. Many practical issues related to study design and resulting statistical power are included, together with recent developments and availability of analytical software.
Subjects
Diagnosis, Computer-Assisted/instrumentation; Diagnostic Imaging/instrumentation; Diagnosis, Computer-Assisted/trends; Diagnostic Imaging/trends; Equipment Design; Humans; Models, Statistical; Observer Variation; ROC Curve; Sensitivity and Specificity; Software/trends; Technology Assessment, Biomedical/trends
ABSTRACT
Receiver operating characteristic (ROC) analysis is well established in the evaluation of systems involving binary classification tasks. However, medical tests often require distinguishing among more than two diagnostic alternatives. The goal of this work was to develop an ROC analysis method for three-class classification tasks. Based on decision theory, we developed a method for three-class ROC analysis. In this method, the objects were classified by making the decision that provided the maximal utility relative to the other two. By making assumptions about the magnitudes of the relative utilities of incorrect decisions, we found a decision model that maximized the expected utility of the decisions when using log-likelihood ratios as decision variables. This decision model consists of a two-dimensional decision plane with log likelihood ratios as the axes and a decision structure that separates the plane into three regions. Moving the decision structure over the decision plane, which corresponds to moving the decision threshold in two-class ROC analysis, and computing the true class 1, 2, and 3 fractions defined a three-class ROC surface. We have shown that the resulting three-class ROC surface shares many features with the two-class ROC curve; i.e., using the log likelihood ratios as the decision variables results in maximal expected utility of the decisions, and the optimal operating point for a given diagnostic setting (set of relative utilities and disease prevalences) lies on the surface. The volume under the three-class surface (VUS) serves as a figure-of-merit to evaluate different data acquisition systems or image processing and reconstruction methods when the assumed utility constraints are relevant.
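The core decision rule, choosing the decision with maximal expected utility among three alternatives, can be sketched as follows. This is a simplification under stated assumptions: it works with class posterior probabilities directly rather than with the log-likelihood ratios used in the abstract, and the utility matrices and probabilities are hypothetical.

```python
def decide(posteriors, utility):
    """Return (best decision index, expected utilities).

    utility[d][c] is the utility of making decision d when the true class is c.
    """
    expected = [sum(u * p for u, p in zip(row, posteriors)) for row in utility]
    best = max(range(len(expected)), key=expected.__getitem__)
    return best, expected


# Hypothetical posterior probabilities for classes 0, 1, 2 given the data.
posteriors = [0.2, 0.5, 0.3]

# Symmetric utilities: only correct decisions are rewarded.
symmetric = [[1, 0, 0],
             [0, 1, 0],
             [0, 0, 1]]

# Asymmetric utilities: missing class 0 is heavily penalized.
penalize_missing_class0 = [[1, 0, 0],
                           [-4, 1, 0],
                           [-4, 0, 1]]

decision_symmetric, _ = decide(posteriors, symmetric)
decision_asymmetric, _ = decide(posteriors, penalize_missing_class0)
```

Changing the relative utilities moves the decision for the same case from class 1 to class 0, which is the three-class analogue of moving a threshold along a two-class ROC curve.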
Subjects
Algorithms; Artificial Intelligence; Decision Support Systems, Clinical; Decision Support Techniques; Image Enhancement/methods; Image Interpretation, Computer-Assisted/methods; ROC Curve; Data Interpretation, Statistical; Information Storage and Retrieval/methods; Observer Variation
ABSTRACT
RATIONALE AND OBJECTIVES: Some computer-aided diagnosis (CAD) methods produce a quantitative diagnostic assessment (eg, likelihood of malignancy) based on computer image analysis that a radiologist who uses the computer aid must combine with his or her own assessment. Observer studies show that although CAD helps radiologists improve diagnostic performance, ad hoc use of computer aid can produce performance inferior to that of computer alone, indicating that radiologists are unable to incorporate computer assessment optimally into their final assessment. We describe a mathematical model for combining two correlated diagnostic assessments that may provide a basis for merging radiologists' ratings with computer assessments in a way that yields greater diagnostic accuracy than ad hoc merging by radiologists. MATERIALS AND METHODS: We calculate a likelihood ratio from the bivariate binormal model that describes joint probability density functions of latent decision variables of two correlated diagnostic assessments. To the extent that the bivariate binormal model is valid and that the model's parameters can be estimated reliably, results obtained in this way will be optimal because the likelihood ratio is the decision variable used by the ideal observer in any two-group classification task. We evaluated this method on two observer study datasets and in Monte Carlo simulations. RESULTS: This method produced better performance than achieved by radiologists when they incorporated computer assessment in an ad hoc way. Simulations show that with a large number of cases, this method can produce results indistinguishable from the ideal observer performance. CONCLUSIONS: This method can potentially help radiologists use quantitative computed diagnostic assessments optimally, thereby surpassing the computer in accuracy.
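The likelihood-ratio calculation can be sketched for a pair of correlated assessments. The sketch assumes the two latent decision variables are bivariate normal within each truth state; all parameter values below are hypothetical, and estimating them from observer data, which the abstract notes must be done reliably, is not shown.

```python
import math


def bivariate_normal_pdf(x, y, mean_x, mean_y, sd_x, sd_y, rho):
    """Density of a bivariate normal distribution with correlation rho."""
    zx = (x - mean_x) / sd_x
    zy = (y - mean_y) / sd_y
    q = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sd_x * sd_y * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * q) / norm


def likelihood_ratio(x, y, positive, negative):
    """Ratio of the joint density under 'positive' to that under 'negative'."""
    return bivariate_normal_pdf(x, y, *positive) / bivariate_normal_pdf(x, y, *negative)


# Hypothetical parameters: (mean_x, mean_y, sd_x, sd_y, rho), where x is the
# radiologist's latent rating and y is the computer's latent output.
negative_cases = (0.0, 0.0, 1.0, 1.0, 0.5)
positive_cases = (1.5, 1.0, 1.0, 1.0, 0.5)

lr_high = likelihood_ratio(2.0, 2.0, positive_cases, negative_cases)
lr_low = likelihood_ratio(-1.0, -1.0, positive_cases, negative_cases)
lr_mid = likelihood_ratio(0.75, 0.5, positive_cases, negative_cases)
```

Ranking cases by this single merged value uses both assessments and their correlation at once; with equal covariance in the two truth states, the ratio equals 1 midway between the two means.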
Subjects
Diagnosis, Computer-Assisted; Models, Theoretical; Radiology; User-Computer Interface; Analysis of Variance; Breast Neoplasms/diagnostic imaging; Computer Simulation/statistics & numerical data; Diagnosis, Computer-Assisted/statistics & numerical data; Female; Humans; Likelihood Functions; Mammography; Monte Carlo Method; Observer Variation; ROC Curve; Radiography, Thoracic; Radiology/statistics & numerical data; Research Design; Task Performance and Analysis
ABSTRACT
RATIONALE AND OBJECTIVES: The aim of the study is to compare independent double readings by radiologists and computer-aided diagnosis (CAD) in diagnostic interpretation of mammographic calcifications. MATERIALS AND METHODS: Ten radiologists independently interpreted 104 mammograms containing clustered microcalcifications. Forty-six of these were malignant and 58 were benign at biopsy. Radiologists read the images with and without a computer aid by using a counterbalanced study design. Sensitivity and specificity were calculated from observer biopsy recommendations, and receiver operating characteristic (ROC) curves were computed from their diagnostic confidence ratings. Unaided double-reading sensitivity and specificity values were derived post hoc by using three different objective rules and an additional rule of simulated-optimal double reading that assumed that consultations for resolving two radiologists' different independent diagnoses always produce the correct clinical recommendation. ROC curves of unaided double readings were obtained according to the literature. RESULTS: Single reading without computer aid yielded 74% sensitivity and 32% specificity, whereas CAD reading yielded 87% sensitivity and 42% specificity and appeared on a higher ROC curve (P < .0001). Three methods of formulating independent double readings generated sensitivities between 59% and 89%, specificities between 50% and 13%, and operating points that moved essentially along the average unaided single-reading ROC curve. ROC curves of unaided independent double readings showed small, statistically insignificant improvement over those of unaided single readings. Results of the simulated-optimal double reading were similar to CAD: 89% sensitivity and 50% specificity. CONCLUSION: Independent double readings of mammographic calcifications may not improve diagnostic performance. CAD reading improves diagnostic performance to an extent approaching the maximum possible performance.
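The finding that objective double-reading rules trade sensitivity against specificity, rather than improving both, can be illustrated with two simple combination rules. The abstract does not name its three rules, so the "either reader" and "both readers" rules below are assumptions for illustration, and the biopsy recommendations are hypothetical.

```python
def sens_spec(calls, truth):
    """Sensitivity and specificity of binary biopsy recommendations."""
    tp = sum(1 for c, t in zip(calls, truth) if t == 1 and c == 1)
    tn = sum(1 for c, t in zip(calls, truth) if t == 0 and c == 0)
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    return tp / n_pos, tn / n_neg


# 1 = malignant, 0 = benign (hypothetical cases).
truth = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical independent biopsy recommendations (1 = biopsy).
reader_a = [1, 1, 0, 0, 1, 0, 0, 0]
reader_b = [1, 0, 1, 0, 0, 1, 0, 0]

either = [max(a, b) for a, b in zip(reader_a, reader_b)]  # biopsy if either recommends
both = [min(a, b) for a, b in zip(reader_a, reader_b)]    # biopsy only if both recommend

single = sens_spec(reader_a, truth)
rule_either = sens_spec(either, truth)
rule_both = sens_spec(both, truth)
```

In this toy example the "either" rule raises sensitivity and lowers specificity, and the "both" rule does the opposite, so the operating point moves without necessarily reaching a higher ROC curve.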
Subjects
Breast Diseases/diagnostic imaging; Calcinosis/diagnostic imaging; Diagnosis, Computer-Assisted; Breast Diseases/pathology; Breast Neoplasms/diagnostic imaging; Diagnosis, Differential; Female; Humans; Image Processing, Computer-Assisted; Mammography; Observer Variation; ROC Curve; Sensitivity and Specificity
ABSTRACT
We have shown previously, in the context of computer-aided diagnosis (CAD), that information derived from multiple images of the same patient can be used to improve diagnostic performance. In that work, we ignored the correlation among multiple images of the same patient. In the present study, we investigate theoretically, within the framework of receiver operating characteristic (ROC) analysis, the effect of correlation on three methods for combining quantitative diagnostic information from two images: taking the average, the maximum, and the minimum of a pair of normally distributed decision variables. We assume, as in our previous work, that the quantitative diagnostic information obtained from the two images of a given patient can be transformed monotonically to two latent decision variables that are normally distributed. Similar to the situation of uncorrelated images, we found that (1) the average always improves the area under the ROC curve (AUC) compared to the single-view image; (2) the maximum and the minimum can also, but not always, improve the AUC; and (3) each method can be the best method in certain situations. In addition, as the correlation strength increases, the average performs the best less often, whereas the maximum and the minimum perform the best more often. These theoretical results are illustrated with analysis of a mammography study.
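The effect of correlation on the averaging method can be worked out in closed form for a simple special case. The sketch assumes two views with equal-variance, unit-variance normal decision variables, the same separation d' in each view, and the same correlation ρ in both truth states; these are simplifying assumptions, narrower than the setting analyzed in the abstract. Under them, the average has variance (1 + ρ)/2, so its effective separation is d'·√(2/(1 + ρ)).

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def auc_single(d_prime):
    """AUC of one view under the equal-variance binormal model."""
    return _STD_NORMAL.cdf(d_prime / 2 ** 0.5)


def auc_average(d_prime, rho):
    """AUC of the average of two equally informative views with correlation rho."""
    effective_d = d_prime * (2.0 / (1.0 + rho)) ** 0.5
    return _STD_NORMAL.cdf(effective_d / 2 ** 0.5)


d_prime = 1.0  # hypothetical separation between the two truth states
single = auc_single(d_prime)
averaged = {rho: auc_average(d_prime, rho) for rho in (0.0, 0.3, 0.6, 0.9, 1.0)}
```

In this special case the gain from averaging shrinks as ρ grows and vanishes at ρ = 1, which is consistent with the abstract's observation that the average is the best method less often as correlation increases.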
Subjects
Breast Neoplasms/diagnosis; Diagnosis, Computer-Assisted/methods; Mammography/methods; Area Under Curve; Breast Neoplasms/diagnostic imaging; Female; Humans; Image Enhancement; Image Interpretation, Computer-Assisted; Models, Statistical; Monte Carlo Method; Normal Distribution; Pattern Recognition, Automated; ROC Curve; Reproducibility of Results; Sensitivity and Specificity; Software; Subtraction Technique; Ultrasonography, Mammary/methods
ABSTRACT
We are attempting to develop expressions for the coordinates of points on the three-class ideal observer's receiver operating characteristic (ROC) hypersurface as functions of the set of decision criteria used by the ideal observer. This is considerably more difficult than in the two-class classification task, because the conditional probabilities in question are not simply related to the cumulative distribution functions of the decision variables, and because the slopes and intercepts of the decision boundary lines are not independent; given the locations of two of the lines, the location of the third will be constrained depending on the other two. In this paper, we attempt to characterize those constraining relationships among the three-class ideal observer's decision boundary lines. As a result, we show that the relationship between the decision criteria and the misclassification probabilities is not one-to-one, as it is for the two-class ideal observer.
Subjects
Algorithms; Decision Support Systems, Clinical; Decision Support Techniques; Image Interpretation, Computer-Assisted/methods; Pattern Recognition, Automated/methods; ROC Curve; Data Interpretation, Statistical; Diagnosis, Computer-Assisted/methods; Reproducibility of Results; Sensitivity and Specificity
ABSTRACT
We express the performance of the N-class "guessing" observer in terms of the N²-N conditional probabilities which make up an N-class receiver operating characteristic (ROC) space, in a formulation in which sensitivities are eliminated in constructing the ROC space (equivalent to using false-negative fraction and false-positive fraction in a two-class task). We then show that the "guessing" observer's performance in terms of these conditional probabilities is completely described by a degenerate hypersurface with only N-1 degrees of freedom (as opposed to the N²-N-1 required, in general, to achieve a true hypersurface in such a ROC space). It readily follows that the hypervolume under such a degenerate hypersurface must be zero when N > 2. We then consider a "near-guessing" task; that is, a task in which the N underlying data probability density functions (pdfs) are nearly identical, controlled by N-1 parameters which may vary continuously to zero (at which point the pdfs become identical). With this approach, we show that the hypervolume under the ROC hypersurface of an observer in an N-class classification task tends continuously to zero as the underlying data pdfs converge continuously to identity (a "guessing" task). The hypervolume under the ROC hypersurface of a "perfect" ideal observer (in a task in which the N data pdfs never overlap) is also found to be zero in the ROC space formulation under consideration. This suggests that hypervolume may not be a useful performance metric in N-class classification tasks for N > 2, despite the utility of the area under the ROC curve for two-class tasks.
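The counting argument for the "guessing" observer can be made concrete. A guessing observer ignores the data and decides class j with some fixed probability, so every misclassification probability P(decide j | truth i) depends only on j. The sketch below builds the ROC-space coordinates for N = 3; the decision probabilities and function names are hypothetical.

```python
def guessing_observer_matrix(decision_probs):
    """P(decide j | truth i) for an observer whose decisions ignore the data."""
    n = len(decision_probs)
    return [[decision_probs[j] for j in range(n)] for _ in range(n)]


def roc_coordinates(matrix):
    """The off-diagonal (misclassification) probabilities, one per (truth i, decision j)."""
    n = len(matrix)
    return {(i, j): matrix[i][j] for i in range(n) for j in range(n) if i != j}


# Hypothetical guessing strategy for N = 3 classes; probabilities sum to 1,
# so only N - 1 = 2 of them can be chosen freely.
decision_probs = [0.5, 0.3, 0.2]
n_classes = len(decision_probs)

coords = roc_coordinates(guessing_observer_matrix(decision_probs))
n_coordinates = len(coords)                  # N^2 - N
free_parameters_guessing = n_classes - 1     # N - 1
```

For N = 3 the ROC space has six coordinates, and a true hypersurface would need five degrees of freedom, but the guessing observer's operating points are fixed by only two free parameters.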
Subjects
Algorithms; Artificial Intelligence; Image Interpretation, Computer-Assisted/methods; Imaging, Three-Dimensional/methods; Models, Biological; Pattern Recognition, Automated/methods; ROC Curve; Computer Simulation; Image Enhancement/methods; Information Storage and Retrieval/methods; Models, Statistical; Observer Variation; Reproducibility of Results; Subtraction Technique
ABSTRACT
Variance of diagnostic information contained in an image degrades diagnostic accuracy. Acquiring multiple images of the same patient (e.g., mediolateral oblique and craniocaudal view mammograms) can, in principle, help reduce this degradation. We demonstrate how this can be accomplished in the context of computer-aided diagnosis (CAD). Assuming that computer outputs obtained from multiple images of the same patient can be transformed monotonically to the same pair of truth-conditional normal distributions and, for simplicity, ignoring correlation among images, we investigate theoretically four methods of combining the computer outputs: taking the average, the median, the maximum, or the minimum. We found, as one would expect, that both the average and the median always produce an improved area under the receiver operating characteristic (ROC) curve (AUC) compared to the single-view images, while the average always produces better performance than the median. However, the maximum and minimum also can produce improved AUCs in some situations, and under certain conditions can outperform the average. Surprisingly, we found that the maximum and minimum of normally-distributed decision variables produce nearly binormal ROC curves. These results can be used as a guide in attempting to increase the efficacy of CAD when multiple images are available from the same patient.
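The combination methods can be compared in a small Monte Carlo sketch. It assumes two independent views per patient with equal-variance normal computer outputs and a hypothetical separation d' = 1; the sample size and seed are arbitrary, and the simulation is an illustration rather than the theoretical analysis described in the abstract. With only two views the median coincides with the average, so it is omitted.

```python
import random
from statistics import NormalDist


def rank_auc(neg, pos):
    """Wilcoxon AUC from ranks; assumes continuous scores (ties are not handled)."""
    labelled = sorted([(s, 0) for s in neg] + [(s, 1) for s in pos])
    rank_sum = sum(rank for rank, (_, label) in enumerate(labelled, start=1) if label)
    n_pos, n_neg = len(pos), len(neg)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(d_prime=1.0, n_cases=4000, seed=0):
    """Empirical AUC of several ways of combining two independent views."""
    rng = random.Random(seed)

    def two_views(mean):
        return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n_cases)]

    neg, pos = two_views(0.0), two_views(d_prime)
    combine = {
        "single": lambda v: v[0],
        "average": lambda v: (v[0] + v[1]) / 2.0,
        "maximum": max,
        "minimum": min,
    }
    return {name: rank_auc([f(v) for v in neg], [f(v) for v in pos])
            for name, f in combine.items()}


results = simulate()

# Closed-form values for this special case, for comparison with the simulation.
theory_single = NormalDist().cdf(1.0 / 2 ** 0.5)
theory_average = NormalDist().cdf(1.0)
```

The simulated AUC for the average should sit close to `theory_average` and above the single-view value; the entries for the maximum and minimum can be inspected to see how they compare in this particular configuration.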
Subjects
Algorithms; Image Enhancement/methods; Image Interpretation, Computer-Assisted/methods; Information Storage and Retrieval/methods; Pattern Recognition, Automated/methods; ROC Curve; Subtraction Technique; Artificial Intelligence; Cluster Analysis; Computer Simulation; Imaging, Three-Dimensional/methods; Models, Biological; Models, Statistical; Numerical Analysis, Computer-Assisted; Reproducibility of Results; Sensitivity and Specificity; Signal Processing, Computer-Assisted
ABSTRACT
We have developed a model for FROC curve fitting that relates the observer's FROC performance not to the ROC performance that would be obtained if the observer's responses were scored on a per image basis, but rather to a hypothesized ROC performance that the observer would obtain in the task of classifying a set of "candidate detections" as positive or negative. We adopt the assumptions of the Bunch FROC model, namely that the observer's detections are all mutually independent, as well as assumptions qualitatively similar to, but different in nature from, those made by Chakraborty in his AFROC scoring methodology. Under the assumptions of our model, we show that the observer's FROC performance is a linearly scaled version of the candidate analysis ROC curve, where the scaling factors are just given by the FROC operating point coordinates for detecting initial candidates. Further, we show that the likelihood function of the model parameters given observational data takes on a simple form, and we develop a maximum likelihood method for fitting a FROC curve to this data. FROC and AFROC curves are produced for computer vision observer datasets and compared with the results of the AFROC scoring method. Although developed primarily with computer vision schemes in mind, we hope that the methodology presented here will prove worthy of further study in other applications as well.
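The statement that the FROC curve is a linearly scaled version of the candidate-analysis ROC curve can be sketched directly. The interpretation below is an assumption based on the abstract's wording: each ROC point (FPF, TPF) is multiplied by the FROC operating point of the initial candidate detection stage. The ROC points and scaling factors are hypothetical, and the maximum likelihood fitting step is not shown.

```python
def froc_from_candidate_roc(roc_points, candidate_fp_per_image, candidate_lesion_fraction):
    """Scale candidate-analysis ROC points into FROC coordinates.

    Returns (false positives per image, fraction of lesions detected) pairs.
    """
    return [(candidate_fp_per_image * fpf, candidate_lesion_fraction * tpf)
            for fpf, tpf in roc_points]


# Hypothetical ROC points for classifying initial candidates as positive or negative.
candidate_roc = [(0.0, 0.0), (0.1, 0.5), (0.5, 0.9), (1.0, 1.0)]

# Hypothetical operating point of the initial candidate detection stage:
# 4 false-positive candidates per image, 90% of lesions among the candidates.
froc_points = froc_from_candidate_roc(candidate_roc, 4.0, 0.9)
```

The FROC curve ends at the candidate stage's own operating point, since the second-stage classifier can at best keep every candidate it is given.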
Subjects
Biophysics/methods; Image Processing, Computer-Assisted/methods; Algorithms; Humans; Likelihood Functions; Lung Neoplasms/pathology; Models, Statistical; ROC Curve; Tomography, X-Ray Computed
ABSTRACT
We are using Bayesian artificial neural networks (BANNs) to classify mammographic masses in schemes for computer-aided diagnosis, and we are extending this methodology to a three-class classification task. We investigated whether a BANN can estimate ideal observer decision variables to distinguish malignant, benign, and false-positive computer detections. Five features were calculated for 63 malignant and 29 benign computer-detected mass lesions, and for 1049 false-positive computer detections, in 440 mammograms randomly divided into a training and testing set. A BANN was trained on the training set features and applied to the testing set features. We then used a known relation between three-class ideal observer decision variables and that used by a two-class ideal observer when two of three classes are grouped into one class, giving one decision variable for distinguishing malignant from nonmalignant detections, and a second for distinguishing true-positive from false-positive computer detections. For comparison, we grouped the training data into two classes in the same two ways and trained two-class BANNs for these two tasks. The three-class BANN decision variables were essentially identical in performance to the specifically trained two-class BANNs, with the average difference in area under the ROC curves being less than 0.0035 and no differences in area being statistically significant. Thus, the BANN outputs obey the same theoretical relationship as do the three-class and two-class ideal observer decision variables, which is consistent with the claim that the three-class BANN output can provide good estimates of the decision variables used by a three-class ideal observer.
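The grouping step can be sketched in terms of class posterior probabilities. This assumes the three-class network outputs can be read as estimates of the posterior probabilities of the malignant, benign, and false-positive classes; under that assumption, the posterior of a grouped class is simply the sum of its members' posteriors. The function name and example values are hypothetical, and the abstract's exact formulation of the relation is not reproduced here.

```python
def two_class_decision_variables(p_malignant, p_benign, p_false_positive):
    """Derive two two-class decision variables from three-class posteriors.

    Returns (malignant vs nonmalignant, true lesion vs false-positive detection).
    """
    total = p_malignant + p_benign + p_false_positive
    if abs(total - 1.0) > 1e-9:
        raise ValueError("posterior probabilities must sum to 1")
    malignant_vs_rest = p_malignant
    lesion_vs_false_positive = p_malignant + p_benign
    return malignant_vs_rest, lesion_vs_false_positive


# Hypothetical three-class output for one computer detection.
malignant_score, lesion_score = two_class_decision_variables(0.2, 0.3, 0.5)
```

Each derived value can then be used as the decision variable in an ordinary two-class ROC analysis of the corresponding grouped task.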