ABSTRACT
Background: White blood cell (WBC) scintigraphy plays a major role in the diagnostic approach to periprosthetic infections. Although the procedure has been standardized by the publication of several guidelines, its interpretation may be susceptible to intra- and inter-rater variability. We aimed to assess the reproducibility of interpretation between nuclear medicine physicians and by the same physician, and to demonstrate that Cohen's coefficient is less stable than Gwet's coefficient, as the former is influenced by prevalence rates. Methods: We enrolled 59 patients who underwent Technetium-99m WBC (99mTc-WBC) scintigraphy for suspected hip or knee prosthesis infection. Three physicians, blinded to all patient clinical data, performed two image readings. Each WBC study was assessed both visually and semi-quantitatively according to the guidelines of the European Association of Nuclear Medicine (EANM). For the semi-quantitative analysis, readers drew an irregular region of interest (ROI) over the suspected infectious lesion and copied it to the normal contralateral bone. Mean counts per ROI were used to calculate lesion-to-reference tissue (LR) ratios for both delayed and late images. An increase in LR over time (LRlate > LRdelayed) of more than 20% was considered indicative of infection. Agreement between readers and between readings was assessed with the first-order agreement coefficient (Gwet's AC1). Reading time for each scan was compared among the three readers in both the first and second readings using a generalized linear mixed model. Results: Excellent agreement was found among all three readers: 0.90 for the first reading and 0.94 for the second. Both inter- and intra-rater variability showed values ≥0.86. Gwet's method proved more robust than Cohen's coefficient for assessing intra- and inter-rater variability, since it is not influenced by the prevalence rate. Conclusions: Studies such as this can help improve the reliability of nuclear medicine imaging techniques and evaluate the effectiveness of trainee preparation.
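For readers who want to reproduce the semi-quantitative criterion, the sketch below encodes the LR-ratio rule described in the abstract. It is a minimal illustration only: the ROI counts, variable names, and acquisition times are hypothetical, not data from the study.

```python
# Minimal sketch of the semi-quantitative EANM-style criterion described above.
# ROI counts are hypothetical; in practice they come from the workstation.

def lr_ratio(lesion_mean_counts: float, reference_mean_counts: float) -> float:
    """Lesion-to-reference tissue (LR) ratio from mean ROI counts."""
    return lesion_mean_counts / reference_mean_counts

def suggests_infection(lr_delayed: float, lr_late: float, threshold: float = 0.20) -> bool:
    """Rule used in the study: LR increasing over time by more than 20% suggests infection."""
    return (lr_late - lr_delayed) / lr_delayed > threshold

# Hypothetical example: delayed (3-4 h) and late (20-24 h) acquisitions
lr_d = lr_ratio(185.0, 120.0)   # delayed image
lr_l = lr_ratio(260.0, 118.0)   # late image
print(f"LR delayed={lr_d:.2f}, LR late={lr_l:.2f}, infection suspected: {suggests_infection(lr_d, lr_l)}")
```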
ABSTRACT
BACKGROUND: Direct oral anticoagulants (DOACs) may be involved in drug-drug interactions (DDIs) that potentially increase the risk of adverse drug reactions. This study aimed to evaluate the level of agreement among interaction checkers (ICs) and DOACs' summaries of product characteristics (SPCs) in listing DDIs and in attributing DDI severity. RESEARCH DESIGN AND METHODS: The level of agreement among five ICs (i.e., INTERCheck WEB, Micromedex, Lexicomp, Epocrates, and drugs.com) in identifying potential DDIs and in attributing severity categories was evaluated using Gwet's AC1 on all five ICs and by comparing four- and two-IC subsets. RESULTS: A total of 486 drugs potentially interacting with dabigatran, 556 with apixaban, 444 with edoxaban, and 561 with rivaroxaban were reported. The level of agreement among the ICs in identifying potential DDIs was poor (range: 0.12-0.16), and it remained low in the four- and two-IC subset analyses. The level of agreement among the ICs in classifying the severity of potential DDIs was also poor (range: 0.32-0.34), including in the subset analyses. CONCLUSIONS: The heterogeneity among different ICs and SPCs underscores the need to standardize DDI datasets and to conduct real-world studies to generate evidence on the frequency and clinical relevance of potential DOAC-related DDIs.
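As a companion to the abstract above, the following sketch shows how Gwet's AC1 can be computed for multiple raters (here, interaction checkers) giving binary "DDI listed / not listed" ratings. The data are simulated and the function is a plain reading of the standard multi-rater AC1 formula, not the authors' code.

```python
import numpy as np

def gwet_ac1_multi(ratings: np.ndarray) -> float:
    """ratings[i, j] in {0, 1}: rating of candidate DDI i by checker j (listed or not)."""
    n_subj, n_rater = ratings.shape
    r_pos = ratings.sum(axis=1)                 # checkers listing the DDI, per drug
    # pairwise observed agreement per candidate DDI, averaged over all candidates
    pa = np.mean((r_pos * (r_pos - 1) + (n_rater - r_pos) * (n_rater - r_pos - 1))
                 / (n_rater * (n_rater - 1)))
    pi = np.mean(ratings)                       # overall prevalence of 'listed'
    pe = 2 * pi * (1 - pi)                      # chance term for the binary (2-category) case
    return (pa - pe) / (1 - pe)

rng = np.random.default_rng(0)
sim = (rng.random((500, 5)) < 0.3).astype(int)  # 500 candidate DDIs x 5 checkers, invented
print(f"AC1 = {gwet_ac1_multi(sim):.3f}")
```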
Subjects
Anticoagulants, Drug-Related Side Effects and Adverse Reactions, Humans, Anticoagulants/adverse effects, Drug Interactions, Rivaroxaban, Dabigatran/adverse effects, Oral Administration
ABSTRACT
Accurate assessment of intimate partner violence (IPV) is crucial to guide public policy and intervention. The Revised Conflict Tactics Scales (CTS-2) is one of the most widely used instruments for doing so. Despite its good psychometric properties, research on interpartner agreement has pointed to low-to-moderate estimates, raising concerns about the validity of results obtained through single-partner reports. This cross-sectional study introduces indexes that have not previously been used to assess interpartner agreement. Both partners' reports of perpetration and victimization were analyzed in a community sample of 268 different-sex couples. Our results generally pointed to better agreement on IPV occurrence than on its frequency, suggesting that the proxy method (i.e., using a single-partner report) can be reliable for assessing IPV occurrence, but not its frequency, in this population. Findings are discussed, along with the advantages and constraints of different IPV assessment practices.
ABSTRACT
Gwet's AC1 has been proposed as an alternative to Cohen's kappa for evaluating the agreement between two binary ratings. This approach is becoming increasingly popular, and researchers have been criticized for still using Cohen's kappa. However, a rigorous discussion of the properties of Gwet's AC1 is still missing. In this paper, several basic properties of Gwet's AC1 are investigated and compared with those of Cohen's kappa, in particular the dependence on the prevalence of positive ratings for a given agreement rate and the behaviour in the case of no association or maximal disagreement. Both approaches compare the observed agreement rate with a comparative number: Cohen's kappa uses an expected agreement rate as comparator, whereas Gwet's AC1 uses an expected disagreement rate. Consequently, for a fixed agreement rate, Gwet's AC1 increases as the prevalence of positive ratings moves away from 0.5, whereas Cohen's kappa decreases. Gwet's AC1 can take positive and negative values in the case of no association between the two raters, whereas Cohen's kappa is 0. Due to these fundamental differences, Gwet's AC1 should not be seen as a substitute for Cohen's kappa. In particular, the verbal classification of kappa values by Landis & Koch should not be applied to Gwet's AC1.
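The prevalence dependence described above is easy to verify numerically. Below is a minimal sketch of both coefficients for two raters and binary ratings; the 2x2 tables are invented so that the observed agreement stays fixed at 90% while the prevalence of positive ratings varies.

```python
# Two-rater, binary-rating comparison of Cohen's kappa and Gwet's AC1.
# Formulas follow the standard definitions discussed above; the tables are invented.

def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """2x2 agreement table: a=both positive, b=R1+/R2-, c=R1-/R2+, d=both negative."""
    n = a + b + c + d
    pa = (a + d) / n                        # observed agreement rate
    p1, p2 = (a + b) / n, (a + c) / n       # each rater's prevalence of positive ratings
    pe = p1 * p2 + (1 - p1) * (1 - p2)      # expected agreement rate (comparator)
    return (pa - pe) / (1 - pe)

def gwet_ac1(a: int, b: int, c: int, d: int) -> float:
    n = a + b + c + d
    pa = (a + d) / n
    pi = ((a + b) / n + (a + c) / n) / 2    # mean prevalence of positive ratings
    pe = 2 * pi * (1 - pi)                  # disagreement-based comparator
    return (pa - pe) / (1 - pe)

# Same observed agreement (90%), different prevalence of positive ratings:
print(cohen_kappa(45, 5, 5, 45), gwet_ac1(45, 5, 5, 45))  # balanced: both = 0.80
print(cohen_kappa(85, 5, 5, 5), gwet_ac1(85, 5, 5, 5))    # skewed: kappa drops, AC1 rises
```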
ABSTRACT
Gwet's first-order agreement coefficient (AC1) is widely used to assess agreement between raters. This paper proposes several asymptotic statistics for a homogeneity test of stratified AC1 in large samples. These statistics may perform unsatisfactorily, especially with small samples and a high value of AC1. We therefore also propose three exact methods for small samples. Based on the numerical results, a likelihood ratio statistic is recommended for large sample sizes, while the exact E approaches under the likelihood ratio and score statistics are more robust in small-sample scenarios. Moreover, the exact E method remains effective for high values of AC1. Two real examples illustrate the proposed methods.
ABSTRACT
Cohen's kappa coefficient was originally proposed for two raters only; it was later extended to an arbitrarily large number of raters to become what is known as Fleiss' generalized kappa. Fleiss' generalized kappa and its large-sample variance are still widely used by researchers and have been implemented in several software packages, including, among others, SPSS and the R package "rel". The purpose of this article is to show that the large-sample variance of Fleiss' generalized kappa is systematically misused, is invalid as a precision measure for kappa, and cannot be used for constructing confidence intervals. A general-purpose variance expression is proposed, which can be used in any statistical inference procedure. A Monte Carlo experiment is presented, showing the validity of the new variance estimation procedure.
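The article's corrected variance expression is not reproduced in the abstract, so the sketch below instead illustrates a model-free alternative: estimating the variance of Fleiss' generalized kappa by a nonparametric bootstrap over subjects. The ratings are simulated, and the kappa function is the standard textbook formula rather than the authors' code.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, k] = number of raters assigning subject i to category k."""
    n = counts.sum(axis=1)[0]                   # raters per subject (assumed constant)
    p_k = counts.sum(axis=0) / counts.sum()     # overall category proportions
    pa_i = np.sum(counts * (counts - 1), axis=1) / (n * (n - 1))
    pa, pe = pa_i.mean(), np.sum(p_k ** 2)
    return (pa - pe) / (1 - pe)

rng = np.random.default_rng(0)
# 200 subjects, 4 raters, 3 categories; ratings drawn independently (no subject effect)
counts = rng.multinomial(4, [0.6, 0.3, 0.1], size=200)

# Bootstrap over subjects: resample rows, recompute kappa, take the spread as the SE
boot = [fleiss_kappa(counts[rng.integers(0, 200, 200)]) for _ in range(2000)]
print(f"kappa={fleiss_kappa(counts):.3f}, bootstrap SE={np.std(boot, ddof=1):.3f}")
```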
ABSTRACT
BACKGROUND: The quality of patient medical records is intrinsically related to patient safety, clinical decision-making, communication between health providers, and continuity of care. Additionally, their data are widely used in observational studies. However, the reliability of the information extracted from the records is a matter of concern in audit processes, which must ensure inter-rater agreement (IRA). The objective of this study was therefore to evaluate the IRA among members of the Patient Health Record Review Board (PHRRB) in routine auditing of medical records, and the impact of periodic discussions of results with raters. METHODS: A prospective longitudinal study was conducted between July 2015 and April 2016 at Hospital Municipal Dr. Moysés Deutsch, a large public hospital in São Paulo. The PHRRB was composed of 12 physicians, 9 nurses, and 3 physiotherapists who audited medical records monthly, with the number of raters changing throughout the study. PHRRB meetings were held to reach consensus on the rating criteria that members use in the auditing process. A review chart was created for raters to verify the registry of the patient's secondary diagnosis, chief complaint, history of presenting complaint, past medical history, medication history, physical exam, and diagnostic testing. The IRA was obtained every three months. Gwet's AC1 coefficient and the proportion of agreement (PA) were calculated to evaluate the IRA for each item over time. RESULTS: The study included 1884 items from 239 records, with overall full agreement among raters of 71.2%. A significant IRA increase of 16.5% (OR = 1.17; 95% CI = 1.03-1.32; p = 0.014) was found in the routine PHRRB auditing, with no significant differences between the PA and Gwet's AC1, which showed a similar evolution over time. The PA decreased by 27.1% when at least one of the raters was absent from the review meeting (OR = 0.73; 95% CI = 0.53-1.00; p = 0.048). CONCLUSIONS: Medical record quality has been associated with the quality of care and can be optimized and improved by targeted interventions. The PA and Gwet's AC1 are suitable agreement coefficients that are feasible to incorporate in the routine PHRRB evaluation process.
Subjects
General Hospitals, Medical Records/standards, Brazil, Humans, Longitudinal Studies, Observer Variation, Physical Examination, Prospective Studies, Registries, Reproducibility of Results
ABSTRACT
BACKGROUND: There is a growing trend in the use of mobile health (mHealth) technologies in traditional Chinese medicine (TCM) and telemedicine, especially during the coronavirus disease (COVID-19) outbreak. Tongue diagnosis is an important component of TCM, but it also plays a role in Western medicine, for example in dermatology. However, the procedure for obtaining tongue images has not been standardized, and the reliability of tongue diagnosis from smartphone tongue images has yet to be evaluated. OBJECTIVE: The first objective of this study was to develop an operating classification scheme for tongue coating diagnosis. The second and main objective was to determine the intra-rater and inter-rater reliability of tongue coating diagnosis using the operating classification scheme. METHODS: An operating classification scheme for tongue coating was developed using a stepwise approach and a quasi-Delphi method. First, tongue images (n=2023) were analyzed by 2 groups of assessors to develop the operating classification scheme for tongue coating diagnosis. Based on clinicians' (n=17) own interpretations as well as their use of the operating classification scheme, the results of tongue diagnosis on a representative tongue image set (n=24) were compared. After reaching consensus on the operating classification scheme, the clinicians were instructed to use the scheme to assess tongue features of their patients under direct visual inspection. At the same time, the clinicians took tongue images of the patients with smartphones and assessed the tongue features observed in the smartphone image using the same classification scheme. The intra-rater agreement between these two assessments was calculated to determine which features of tongue coating were better retained by the image. Using the finalized operating classification scheme, clinicians in the study group assessed representative tongue images (n=24) that they had taken, and the intra-rater and inter-rater reliability of their assessments was evaluated. RESULTS: Intra-rater agreement between direct subject inspection and tongue image inspection was good to very good (Cohen κ range 0.69-1.0). Additionally, when comparing the assessment of tongue images on different days, intra-rater reliability was good to very good (κ range 0.7-1.0), except for the color of the tongue body (κ=0.22) and slippery tongue fur (κ=0.1). Inter-rater reliability was moderate for tongue coating (Gwet AC2 range 0.49-0.55), and fair for color and other features of the tongue body (Gwet AC2=0.34). CONCLUSIONS: Taken together, our study has shown that tongue images collected via smartphone contain some reliable features, including tongue coating, that can be used in mHealth analysis. Our findings thus support the use of smartphones in telemedicine for detecting changes in tongue coating.
Subjects
Traditional Chinese Medicine, Photography, Smartphone, Telemedicine, Tongue Diseases/diagnosis, COVID-19, Coronavirus Infections, Delphi Technique, Humans, Observer Variation, Pandemics, Viral Pneumonia, Reproducibility of Results
ABSTRACT
BACKGROUND: Cohen's κ coefficient is often used as an index of inter-rater agreement. However, κ varies greatly depending on the marginal distribution of the target population and overestimates the probability of agreement occurring by chance. To overcome these limitations, an alternative and more stable agreement coefficient, referred to as Gwet's AC1, was proposed. When it is desired to combine results from multiple agreement studies, such as in a meta-analysis, or to perform stratified analysis with subject covariates that affect agreement, it is of interest to compare several agreement coefficients and present a common agreement index. A homogeneity test for κ has been developed; however, there are no reports on homogeneity tests for AC1 or on an estimator of a common AC1. In this article, a homogeneity score test for AC1 is therefore derived for the case of two raters with binary outcomes from K independent strata, and its performance is investigated. Estimation of the common AC1 across strata and its confidence intervals are also discussed. METHODS: Two homogeneity tests are provided: a score test and a goodness-of-fit test. Confidence intervals are derived by asymptotic, Fisher's Z transformation, and profile variance methods. Monte Carlo simulation studies were conducted to examine the validity of the proposed methods, and an example using clinical data is provided. RESULTS: Type I error rates of the proposed score test were close to the nominal level in simulations with small and moderate sample sizes. The confidence intervals based on Fisher's Z transformation and the profile variance method provided coverage close to nominal over a wide range of parameter combinations. CONCLUSIONS: The proposed methods are useful for summarizing evaluations of consistency performed in multiple or stratified inter-rater agreement studies, for meta-analysis of reports from multiple groups, and for stratified analysis.
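As a rough illustration of the Fisher's Z approach mentioned above, the sketch below builds a back-transformed interval for a per-stratum AC1. The variance estimator used here is a crude large-sample placeholder that treats the chance term as fixed; the paper's actual formulas may differ, and the 2x2 tables are invented.

```python
import math

def fisher_z_ci(est: float, var: float, z_crit: float = 1.96):
    """Back-transformed interval: the delta method gives Var(atanh(est)) ~ var/(1-est^2)^2."""
    half = z_crit * math.sqrt(var) / (1 - est ** 2)
    z = math.atanh(est)
    return math.tanh(z - half), math.tanh(z + half)

def ac1_with_ci(a: int, b: int, c: int, d: int):
    """AC1 for one stratum's 2x2 table, with a Fisher's Z-type 95% interval."""
    n = a + b + c + d
    pa = (a + d) / n
    pi = ((a + b) / n + (a + c) / n) / 2
    pe = 2 * pi * (1 - pi)
    ac1 = (pa - pe) / (1 - pe)
    var = pa * (1 - pa) / (n * (1 - pe) ** 2)   # crude: treats the chance term pe as fixed
    return ac1, fisher_z_ci(ac1, var)

# Invented 2x2 tables for two strata: (both+, R1+/R2-, R1-/R2+, both-)
for stratum in [(40, 4, 3, 13), (22, 2, 3, 33)]:
    est, (lo, hi) = ac1_with_ci(*stratum)
    print(f"AC1={est:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```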
Subjects
Biometry/methods, Meta-Analysis as Topic, Statistical Models, Observer Variation, Biomedical Research, Confidence Intervals, Statistical Data Interpretation, Humans, Monte Carlo Method, Reproducibility of Results
ABSTRACT
Cohen's kappa coefficient is commonly used for assessing agreement between classifications of two raters on a nominal scale. Three variants of Cohen's kappa that can handle missing data are presented. Data are considered missing if one or both ratings of a unit are missing. We study how well the variants estimate the kappa value for complete data under two missing-data mechanisms: missingness completely at random, and a form of missingness not at random. The kappa coefficient considered in Gwet (Handbook of Inter-rater Reliability, 4th ed.) and the kappa coefficient based on listwise deletion of units with missing ratings were found to have virtually no bias and mean squared error if missingness is completely at random, and small bias and mean squared error if missingness is not at random. Furthermore, the kappa coefficient that treats missing ratings as a regular category appears to be rather heavily biased and has a substantial mean squared error in many of the simulations. Because it performs well and is easy to compute, we recommend using the kappa coefficient based on listwise deletion of missing ratings if it can be assumed that missingness is completely at random or not at random.
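A minimal sketch of the recommended listwise-deletion variant follows: units with a missing rating from either rater are dropped before Cohen's kappa is computed. The ratings and category labels are invented for illustration.

```python
def kappa_listwise(r1, r2):
    """Cohen's kappa after listwise deletion: drop units where either rating is None."""
    pairs = [(x, y) for x, y in zip(r1, r2) if x is not None and y is not None]
    n = len(pairs)
    cats = sorted({v for pair in pairs for v in pair})
    pa = sum(x == y for x, y in pairs) / n
    # chance agreement from the marginal distributions of the complete pairs
    pe = sum((sum(x == k for x, _ in pairs) / n) * (sum(y == k for _, y in pairs) / n)
             for k in cats)
    return (pa - pe) / (1 - pe)

r1 = ["pos", "neg", None, "pos", "neg", "pos", "neg", "pos"]
r2 = ["pos", "neg", "pos", None, "neg", "pos", "pos", "pos"]
print(f"kappa (listwise) = {kappa_listwise(r1, r2):.3f}")  # computed on the 6 complete pairs
```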
ABSTRACT
AIMS: To assess the inter-observer agreement between two trained observers for detecting bovine digital dermatitis (BDD) lesions in digital colour photographs of the hind feet of cows, taken while the animals were standing to be milked. METHODS: Thirty-six photographs were selected from a total of 184 photographs held by the first author (R1), who had classified them as negative (n=11) or positive (n=25) for BDD. They were delivered to a technician (R2) who had previously visually inspected cattle for BDD lesions, and who then recorded each photograph as either BDD-positive or BDD-negative. The percentage agreement between R1 and R2 and two other inter-observer agreement statistics, Cohen's κ and Gwet's first-order chance-corrected agreement coefficient (AC1), were calculated. The cumulative membership probabilities of Cohen's κ and Gwet's AC1 were then calculated for different benchmark ranges of κ. RESULTS: The percentage agreement between R1 and R2 was 33/36 (92%), Cohen's κ was 0.80 (95% CI=0.57-1.0) and Gwet's AC1 was 0.86 (95% CI=0.69-1.0). Based on the cumulative membership probabilities for Gwet's AC1, there was a 75% probability that the two observers had almost perfect agreement (κ≥0.81). For both Cohen's κ and Gwet's AC1, there was a >95% probability that the two observers had at least substantial agreement (κ≥0.61). CONCLUSIONS: The two trained observers had at least substantial agreement in identifying from a digital photograph whether BDD lesions were present or absent; therefore, results from the two can be used interchangeably. CLINICAL RELEVANCE: Visual assessment for BDD lesions in the milking parlour can be subjective. However, the high agreement between these two trained BDD inspectors means that BDD prevalence reported from different regions of New Zealand by the two can be directly compared.
Subjects
Cattle Diseases/diagnosis, Digital Dermatitis/diagnosis, Photography/veterinary, Animals, Cattle, Cattle Diseases/epidemiology, Cattle Diseases/pathology, Digital Dermatitis/epidemiology, Digital Dermatitis/pathology, Humans, New Zealand/epidemiology, Observer Variation
ABSTRACT
BACKGROUND: Cohen's Kappa is the most used agreement statistic in the literature. However, under certain conditions it is affected by a paradox that returns biased estimates of the statistic itself. OBJECTIVE: The aim of this study is to provide sufficient information to allow the reader to make an informed choice of the correct agreement measure, by underlining some optimal properties of Gwet's AC1 in comparison with Cohen's Kappa, using a real-data example. METHOD: During a literature review, we asked a panel of three evaluators to judge the quality of 57 randomized controlled trials, assigning a score to each trial using the Jadad scale. Quality was evaluated along the following dimensions: adopted design, randomization unit, and type of primary endpoint. For each of these features, the agreement between the three evaluators was calculated using Cohen's Kappa statistic and Gwet's AC1 statistic, and the values were compared with the observed agreement. RESULTS: The values of Cohen's Kappa would lead one to believe that the agreement levels for the variables Unit, Design, and Primary Endpoint are totally unsatisfactory. The AC1 statistic, on the contrary, shows plausible values in line with the observed concordance. CONCLUSION: We conclude that it is always appropriate to adopt the AC1 statistic, thus bypassing the risk of incurring the paradox and drawing wrong conclusions from agreement analysis.
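The paradox is easy to reproduce numerically. In the invented 2x2 table below, the observed agreement is 90%, yet Cohen's Kappa is slightly negative because the marginals are heavily skewed, while Gwet's AC1 stays close to the observed concordance.

```python
# The kappa paradox in numbers (hypothetical 2x2 table, 90% observed agreement).
def cohen_kappa(a, b, c, d):
    n = a + b + c + d
    pa = (a + d) / n
    p1, p2 = (a + b) / n, (a + c) / n
    pe = p1 * p2 + (1 - p1) * (1 - p2)
    return (pa - pe) / (1 - pe)

def gwet_ac1(a, b, c, d):
    n = a + b + c + d
    pa = (a + d) / n
    pi = ((a + b) / n + (a + c) / n) / 2
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)

# 90 both-positive, 5+5 disagreements, 0 both-negative: observed agreement = 0.90
print(f"kappa = {cohen_kappa(90, 5, 5, 0):.3f}")  # ~ -0.053: 'no agreement' despite 90%
print(f"AC1   = {gwet_ac1(90, 5, 5, 0):.3f}")     # ~  0.890: in line with observed agreement
```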
ABSTRACT
The study was conducted to evaluate the influence of the true within-herd prevalence of small ruminant lentivirus (SRLV) infection on agreement beyond chance between three different types of commercial serological ELISAs. Blood samples were collected from 865 goats in 12 dairy goat herds. Serum samples were tested using three commercial ELISA kits: a whole-virus indirect ELISA (wELISA), an indirect ELISA based on recombinant TM and CA antigens (TM/CA-ELISA), and a competitive-inhibition ELISA based on the SU antigen (SU-ELISA). Herds were classified into three strata of high (>50%), moderate (10-50%), and low (<10%) true within-herd prevalence of SRLV infection, estimated from the wELISA results adjusted for the test's sensitivity and specificity. Agreement beyond chance between the three ELISAs was assessed at two levels. First, general agreement was determined using two chance-corrected coefficients: Cohen's kappa and Gwet's AC1. Then, agreement between tests was evaluated using Gwet's AC1 separately in the three prevalence strata and compared between them by computing 95% confidence intervals for differences, with a Bonferroni correction for multiple comparisons. The general agreement between the three tests was very good: wELISA vs. TM/CA-ELISA: Cohen's kappa of 81.8% (95% CI: 77.9% to 85.7%), Gwet's AC1 of 82.7% (95% CI: 79.0% to 86.4%); wELISA vs. SU-ELISA: Cohen's kappa of 83.2% (95% CI: 79.4% to 86.9%), Gwet's AC1 of 83.9% (95% CI: 80.4% to 87.5%); TM/CA-ELISA vs. SU-ELISA: Cohen's kappa of 86.0% (95% CI: 82.6% to 89.5%), Gwet's AC1 of 86.9% (95% CI: 83.6% to 90.1%). However, agreement between ELISAs was significantly related to the true within-herd prevalence: it was significantly lower (although still high) when the true within-herd prevalence was moderate (Gwet's AC1 between 67.2% and 78.7%), whereas it remained very high when the true within-herd prevalence was either >50% (Gwet's AC1 between 91.9% and 98.8%) or <10% (Gwet's AC1 between 94.7% and 98.4%). In conclusion, the three commercial ELISAs for SRLV infection in goats yield highly consistent results. However, their agreement is affected by the true within-herd prevalence in the tested population, and worse (although still high) agreement should be expected when the percentage of infected goats is moderate.
Subjects
Enzyme-Linked Immunosorbent Assay/veterinary, Goat Diseases/epidemiology, Lentivirus Infections/veterinary, Serologic Tests/veterinary, Animals, Enzyme-Linked Immunosorbent Assay/standards, Goats, Lentivirus Infections/epidemiology, Seroepidemiologic Studies, Serologic Tests/standards
ABSTRACT
This article addresses the problem of testing the difference between two correlated agreement coefficients for statistical significance. A number of authors have proposed methods for testing the difference between two correlated kappa coefficients, which require either resampling methods or advanced statistical modeling techniques. In this article, we propose a technique similar to the classical paired t-test for means, based on a large-sample linear approximation of the agreement coefficient. We illustrate the use of this technique with several known agreement coefficients, including Cohen's kappa, Gwet's AC1, Fleiss's generalized kappa, Conger's generalized kappa, Krippendorff's alpha, and the Brennan-Prediger coefficient. The proposed method is very flexible, can accommodate several types of correlation structure between coefficients, and requires neither advanced statistical modeling skills nor considerable computer programming experience. The validity of the method is tested with a Monte Carlo simulation.
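The paper's closed-form linear approximation is not given in the abstract, so the sketch below substitutes a subject-level jackknife to obtain paired differences between two coefficients (AC1 minus kappa) that can be fed into a classical paired-style t statistic. The data are simulated; this is a stand-in check under stated assumptions, not the authors' method.

```python
import numpy as np

def ac1(r1, r2):
    pa = np.mean(r1 == r2)
    pi = (np.mean(r1) + np.mean(r2)) / 2
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)

def kappa(r1, r2):
    pa = np.mean(r1 == r2)
    p1, p2 = np.mean(r1), np.mean(r2)
    pe = p1 * p2 + (1 - p1) * (1 - p2)
    return (pa - pe) / (1 - pe)

rng = np.random.default_rng(1)
truth = rng.random(80) < 0.8
r1 = truth ^ (rng.random(80) < 0.1)      # two raters observing the same 80 subjects
r2 = truth ^ (rng.random(80) < 0.1)

n = len(r1)
idx = np.arange(n)
full = ac1(r1, r2) - kappa(r1, r2)
# Leave-one-out pseudo-values of the coefficient difference, one per subject
pseudo = np.array([n * full - (n - 1) * (ac1(r1[idx != i], r2[idx != i]) -
                                         kappa(r1[idx != i], r2[idx != i]))
                   for i in range(n)])
t = pseudo.mean() / (pseudo.std(ddof=1) / np.sqrt(n))   # paired-style t statistic
print(f"diff = {full:.3f}, jackknife t = {t:.2f}")
```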
ABSTRACT
Objectives. To measure inter-rater agreement on the overall clinical appearance of febrile children aged less than 24 months and to compare methods for doing so. Study Design and Setting. We performed an observational study of the inter-rater reliability of the assessment of febrile children in a county hospital emergency department serving a mixed urban and rural population. Two emergency medicine healthcare providers independently evaluated the overall clinical appearance of children less than 24 months of age who had presented for fever. They recorded their initial 'gestalt' assessment of whether the child was ill appearing, not ill appearing, or they were unsure, and then repeated this assessment after examining the child. Each rater was blinded to the other's assessment. Our primary analysis was graphical. We also calculated Cohen's κ, Gwet's agreement coefficient, and other measures of agreement, along with weighted variants of these. We examined the effect of the time between exams and of patient and provider characteristics on inter-rater agreement. Results. We analyzed 159 of the 173 patients enrolled. Median age was 9.5 months (lower and upper quartiles 4.9 and 14.6), 99/159 (62%) were boys, and 22/159 (14%) were admitted. Overall, 118/159 (74%) and 119/159 (75%) were classified as well appearing on initial 'gestalt' impression by the two examiners, respectively. Summary statistics varied from 0.223 for weighted κ to 0.635 for Gwet's AC2. Inter-rater agreement was affected by the time interval between the evaluations and the age of the child, but not by the experience levels of the rater pairs. Classifications of 'not ill appearing' were more reliable than others. Conclusion. The inter-rater reliability of emergency providers' assessment of overall clinical appearance was adequate when described graphically and by Gwet's AC. Different summary statistics yield different results for the same dataset.