ABSTRACT
OBJECTIVE: To determine which bones and which grades had the highest inter-rater variability when employing the Tanner-Whitehouse (T-W) method. MATERIALS AND METHODS: Twenty-four radiologists were recruited and trained in the T-W classification of skeletal development. The consistency and skill of the radiologists in determining bone development status were assessed using 20 pediatric hand radiographs of children aged 1 to 18 years. Four radiologists had a poor concordance rate and were excluded. The remaining 20 radiologists undertook a repeat reading of the radiographs, and their results were analyzed against the mean assessment of two senior experts as the reference standard. Concordance rate, scoring, and Kendall's W were calculated to evaluate accuracy and consistency. RESULTS: Both the radius, ulna, and short finger bone (RUS) system (Kendall's W = 0.833) and the carpal (C) system (Kendall's W = 0.944) showed excellent consistency, with the RUS system outperforming the C system in terms of scores. The repeatability analysis showed that the second rating test, performed after 2 months of further bone age assessment (BAA) practice, was more consistent and accurate than the first. The capitate had the lowest average concordance rate and scoring, as well as the lowest overall concordance rate for its D classification. Moreover, the G classifications of all seven carpal bones had a concordance rate below 0.6. The bones with lower Kendall's W were likewise those with lower scores and concordance rates. CONCLUSION: The D grade of the capitate showed the highest variation, and the use of the Tanner-Whitehouse 3rd edition (T-W3) method to determine bone age (BA) was frequently inconsistent. A more comprehensive description focused on the bones and ratings most prone to inaccuracy, together with a modification of the T-W3 approach, would significantly advance BAA.
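Kendall's W, the concordance statistic reported above, can be computed directly from a raters-by-cases score matrix. A minimal sketch (not the study's own code; the tie correction is omitted for brevity):

```python
import numpy as np

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for an (m raters x n cases) matrix.

    Scores are converted to within-rater ranks; W = 1 means perfect agreement,
    W = 0 means no agreement. Assumes no tied scores within a rater.
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape                                  # m raters, n cases
    # Rank each rater's scores across the n cases (1 = lowest score).
    ranks = ratings.argsort(axis=1).argsort(axis=1) + 1
    rank_sums = ranks.sum(axis=0)                         # per-case rank totals
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()       # squared deviations
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Three raters ranking four cases identically gives perfect concordance.
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))  # -> 1.0
```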
Subjects
Age Determination by Skeleton , Hand Bones , Observer Variation , Humans , Age Determination by Skeleton/methods , Child , Adolescent , Child, Preschool , Female , Male , Reproducibility of Results , Infant , Hand Bones/diagnostic imaging

ABSTRACT
We present a model-based approach to the analysis of agreement between different raters in a situation where all raters have supplied ordinal ratings of the same cases in a sample. It is assumed that no "gold standard" is available. The model is an ordinal regression model with random effects, a so-called rating scale model. The model includes case-specific parameters that allow each case its own level (disease severity). It also allows raters to have different propensities to score a given set of individuals more or less positively, the rater level. Based on the model, we suggest quantifying the rater variation using the median odds ratio, which expresses the variation on the same scale as the observed ordinal data. An important example that serves to motivate and illustrate the proposed model is the study of breast cancer diagnosis based on screening mammograms, where the purpose of the assessment is to detect early breast cancer in order to improve cancer survival. In the study, mammograms from 148 women were evaluated by 110 expert radiologists. The experts were asked to rate each mammogram on a 5-point scale ranging from "normal" to "probably malignant."
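The median odds ratio used to quantify rater variation has a standard closed form once the rater random-effect variance sigma^2 has been estimated: MOR = exp(sqrt(2 * sigma^2) * Phi^{-1}(0.75)). A sketch assuming the rating scale model has already been fitted and sigma^2 extracted:

```python
import math
from statistics import NormalDist

def median_odds_ratio(rater_variance):
    """MOR = exp(sqrt(2 * sigma^2) * Phi^{-1}(0.75)) for a normally
    distributed rater effect with variance sigma^2.

    MOR = 1 means no between-rater variation; larger values mean a randomly
    chosen pair of raters tends to disagree more on the odds scale.
    """
    return math.exp(math.sqrt(2.0 * rater_variance) * NormalDist().inv_cdf(0.75))

# With zero rater variance the median odds ratio is exactly 1.
print(median_odds_ratio(0.0))  # -> 1.0
```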
Subjects
Observer Variation , Regression Analysis , Breast Neoplasms/diagnosis , Computer Simulation , Early Detection of Cancer , Female , Humans , Likelihood Functions , Mammography , Reproducibility of Results

ABSTRACT
OBJECTIVE: This study assesses inter-rater agreement and sensitivity of diagnostic criteria for amyotrophic lateral sclerosis (ALS). METHODS: Clinical and electrophysiological data of 399 patients with suspected ALS were collected by eleven experienced physicians from ten different countries. Eight physicians classified patients independently and blinded, according to the revised El Escorial Criteria (rEEC) and to the Awaji Criteria (AC). Inter-rater agreement was assessed by kappa coefficients; sensitivity, by majority diagnosis on 350 patients with follow-up data. RESULTS: Inter-rater agreement was generally low for both the rEEC and the AC. Agreement was best for the categories "Not-ALS", "Definite", and "Probable", and poorest for "Possible" and "Probable Laboratory-supported". Sensitivity was equal for the rEEC (64%) and the AC (63%), probably due to downgrading of "Probable Laboratory-supported" patients by the AC. However, the AC were significantly more effective in classifying patients as "ALS" versus "Not-ALS" (p < 0.0001). CONCLUSIONS: Inter-rater variation is high for both the rEEC and the AC, probably because the AC inherit the high complexity of the rEEC. The gain in diagnostic sensitivity from the AC is reduced by the omission of the "Probable Laboratory-supported" category. SIGNIFICANCE: The results highlight a need for initiatives to develop simpler and more reproducible diagnostic criteria for ALS in clinical practice and research.
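The kappa coefficient compares observed agreement with the chance agreement implied by each rater's marginal category frequencies. A minimal two-rater (Cohen's kappa) sketch; the example labels are illustrative, not data from the study:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters labelling the same cases.

    kappa = (p_o - p_e) / (1 - p_e); undefined when chance agreement p_e == 1.
    """
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n                 # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative diagnostic labels only:
r1 = ["Definite", "Probable", "Not-ALS", "Possible"]
r2 = ["Definite", "Probable", "Not-ALS", "Probable"]
print(round(cohens_kappa(r1, r2), 3))  # -> 0.667
```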
Subjects
Amyotrophic Lateral Sclerosis/diagnosis , Amyotrophic Lateral Sclerosis/physiopathology , Electromyography/standards , Internationality , Physician's Role , Aged , Electromyography/methods , Female , Follow-Up Studies , Humans , Male , Middle Aged , Observer Variation , Reproducibility of Results

ABSTRACT
PURPOSE: To explore the use of nonradiologists as a method to efficiently reduce bias in the assessment of radiologist performance, using a hepatobiliary tumor board as a case study. MATERIALS AND METHODS: Institutional review board approval was obtained for this HIPAA-compliant prospective quality assurance (QA) effort. Consecutive patients with CT or MR imaging reviewed at one hepatobiliary tumor board between February 2016 and October 2016 (n = 265) were included. All presentations were assigned prospective anonymous QA scores by an experienced nonradiologist hepatobiliary provider based on contemporaneous comparison of the imaging interpretation at the tumor board with the original interpretation(s): concordant, minor discordance, or major discordance. Major discordance was defined as a discrepancy that may affect clinical management; minor discordance, as a discrepancy unlikely to affect clinical management. All discordances and predicted management changes were retrospectively confirmed by the liver tumor program medical director. Logistic regression analyses were performed to determine which factors best predict discordant reporting. RESULTS: Approximately one-third (30% [79 of 265]) of reports were assigned a discordance, including 51 (19%) minor and 28 (11%) major discordances. The most common discordances related to mass size (41% [32 of 79]), tumor stage and extent (24% [19 of 79]), and assigned LI-RADS v2014 score (22% [17 of 79]). One radiologist had 11.8-fold greater odds of discordance (P = .002); nine other radiologists were similar (P = .10-.99). Radiologists presenting their own studies had 4.5-fold lower odds of discordance (P = .006). CONCLUSIONS: QA conducted in line with tumor board workflow can enable efficient assessment of radiologist performance. Discordant interpretations are commonly (30%) reported by nonradiologist providers.
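The 11.8-fold and 4.5-fold figures above are logistic-regression odds ratios, not probability ratios. A small helper (with illustrative numbers, not fitted model output) shows how an odds ratio shifts a baseline probability:

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Probability implied by multiplying the baseline odds by an odds ratio."""
    odds = p_baseline / (1.0 - p_baseline) * odds_ratio
    return odds / (1.0 + odds)

# Illustration: a 30% baseline discordance rate combined with 11.8-fold
# greater odds implies roughly an 83% discordance probability.
print(round(apply_odds_ratio(0.30, 11.8), 2))  # -> 0.83
```

Note how the gap between the odds ratio (11.8) and the probability ratio (roughly 2.8 here) widens as the baseline probability grows, which is why odds ratios from common outcomes should not be read as relative risks.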
Subjects
Biliary Tract Neoplasms/diagnostic imaging , Clinical Competence , Liver Neoplasms/diagnostic imaging , Magnetic Resonance Imaging/standards , Quality Assurance, Health Care , Radiologists/standards , Tomography, X-Ray Computed/standards , Biliary Tract Neoplasms/pathology , Humans , Liver Neoplasms/pathology , Neoplasm Staging , Prospective Studies

ABSTRACT
OBJECTIVE: To assess inter-rater agreement on EEG reactivity (EEG-R) in comatose patients and compare it with a quantitative method (QEEG-R). METHODS: Six 30-s stimulation epochs (noxious, visual, and auditory) were performed during EEG on 19 neurosurgical and 11 cardiac arrest patients. Six experts analysed the EEGs for reactivity using their habitual methods. QEEG-R was defined as present if ≥2/6 epochs were reactive (stimulation/rest power ratio exceeding the noise level). Three-month patient outcome was assessed by the Cerebral Performance Category Score (CPC), dichotomized into good (1-2) or poor (3-5). RESULTS: Agreement among experts on overall EEG-R varied from 53% to 83% (κ: 0.05-0.64) and reached 100% (κ: 1) between two QEEG-R calculators. For the experts, absence of EEG-R yielded sensitivities for poor outcome between 40% and 85% and specificities between 20% and 90%; for QEEG-R, sensitivity was 40% (CI: 23-68%) and specificity 100% (CI: 69-100%). CONCLUSIONS: There is large inter-rater variation among experts on EEG-R assessment in comatose patients. QEEG-R is a promising objective prognostic parameter with low inter-rater variation and a high specificity for prediction of poor outcome. SIGNIFICANCE: Clinicians should be cautious when using the traditional, qualitative method, in particular in end-of-life decisions. Implementation of the quantitative method in clinical practice may improve the reliability of reactivity assessments.
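The QEEG-R decision rule described above (reactive if ≥2 of 6 epochs have a stimulation/rest power ratio above the noise level) can be sketched as follows. The broadband power estimate and the threshold handling are simplifying assumptions, since the abstract does not specify the spectral bands or the noise estimator:

```python
import numpy as np

def power_ratio_db(stim_epoch, rest_epoch):
    """Broadband stimulation/rest power ratio, in decibels, for one epoch pair."""
    p_stim = np.mean(np.square(stim_epoch))   # mean signal power during stimulation
    p_rest = np.mean(np.square(rest_epoch))   # mean signal power at rest
    return 10.0 * np.log10(p_stim / p_rest)

def qeeg_reactive(ratios_db, noise_level_db, min_epochs=2):
    """QEEG-R positive if at least `min_epochs` epoch ratios exceed the noise level."""
    return sum(r > noise_level_db for r in ratios_db) >= min_epochs

# Two of six epochs clearly above a 1 dB noise level -> overall reactive.
print(qeeg_reactive([3.1, 2.8, 0.2, -0.1, 0.4, 0.0], noise_level_db=1.0))  # -> True
```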
Subjects
Coma/diagnosis , Coma/physiopathology , Electroencephalography/standards , Physicians/standards , Adult , Aged , Aged, 80 and over , Electroencephalography/methods , Female , Heart Arrest/diagnosis , Heart Arrest/physiopathology , Humans , Male , Middle Aged , Observer Variation , Reproducibility of Results

ABSTRACT
BACKGROUND: Observational data and funnel plots are routinely used outside of pathology to understand trends and improve performance. OBJECTIVE: To extract diagnostic rate (DR) information from free-text surgical pathology reports with synoptic elements, and to assess whether inter-rater variation and clinical history completeness information useful for continuous quality improvement (CQI) can be obtained. METHODS: All in-house prostate biopsies over a 6-year period at two large teaching hospitals were extracted and then diagnostically categorized using string matching, fuzzy string matching, and hierarchical pruning. DRs were then stratified by submitting physician and by pathologist. Funnel plots were created to assess for diagnostic bias. RESULTS: 3,854 prostate biopsies were found, and all could be diagnostically classified. Two audits, involving the review of 700 reports and a comparison of the synoptic elements with the free-text interpretations, suggest a categorization error rate of <1%. Twenty-seven pathologists each read >40 cases and together assessed 3,690 biopsies. There was considerable inter-rater variability, and a trend toward more World Health Organization/International Society of Urological Pathology Grade 1 cancers among older pathologists. Normalized deviation plots, constructed using the median DR and standard error, can elucidate associated over- and under-calls for an individual pathologist in relation to their practice group. Clinical history completeness varied significantly by submitting physician (from 100% to 22%). CONCLUSION: Free-text data analyses have some limitations; however, they could be used for data-driven CQI in anatomical pathology and could lead to the next generation in quality of care.
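The exact-then-fuzzy string matching step described above can be illustrated with Python's standard-library difflib. The diagnosis phrases, categories, and cutoff below are hypothetical examples, not the study's actual mapping or hierarchy:

```python
import difflib

# Hypothetical mapping from canonical diagnosis phrases to report categories.
CATEGORIES = {
    "prostatic adenocarcinoma": "carcinoma",
    "benign prostatic tissue": "benign",
    "high grade prostatic intraepithelial neoplasia": "HGPIN",
}

def categorize(diagnosis_text, cutoff=0.85):
    """Try an exact match first, then a fuzzy match against the known phrases."""
    text = diagnosis_text.lower().strip()
    if text in CATEGORIES:
        return CATEGORIES[text]
    close = difflib.get_close_matches(text, list(CATEGORIES), n=1, cutoff=cutoff)
    return CATEGORIES[close[0]] if close else None  # None -> flag for manual review

print(categorize("Prostatic adenocarcinoma"))   # exact after normalization
print(categorize("prostatic adenocarcinomaa"))  # caught by the fuzzy pass
```

In a real pipeline the unmatched (None) residue would feed the hierarchical pruning step and, ultimately, a manual audit like the 700-report review described above.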