irrCAC-benchmarking

library(irrCAC)

Abstract

The irrCAC package provides several functions for calculating various chance-corrected agreement coefficients (see overview.html for a general overview) and their weighted versions using various sets of weights (see weighting.html for a more detailed discussion of the weighting of agreement coefficients). In this document, I show how to implement the benchmarking approach discussed in chapter 6 of Gwet (2014). The package closely follows the general framework of inter-rater reliability assessment presented by Gwet (2014).

In a nutshell, the problem consists of qualifying the magnitude of a given agreement coefficient as poor, good, very good, or something else. Essentially three benchmarking scales have been proposed in the inter-rater reliability literature, and all of them are covered in this package. These are the Landis-Koch, Altman, and Fleiss benchmarking scales, defined as follows:

landis.koch
#>   lb.LK ub.LK      interp.LK
#> 1   0.8   1.0 Almost Perfect
#> 2   0.6   0.8    Substantial
#> 3   0.4   0.6       Moderate
#> 4   0.2   0.4           Fair
#> 5   0.0   0.2         Slight
#> 6  -1.0   0.0           Poor
altman
#>   lb.AL ub.AL interp.AL
#> 1   0.8   1.0 Very Good
#> 2   0.6   0.8      Good
#> 3   0.4   0.6  Moderate
#> 4   0.2   0.4      Fair
#> 5  -1.0   0.2      Poor
fleiss
#>   lb.FL ub.FL            interp.FL
#> 1  0.75  1.00            Excellent
#> 2  0.40  0.75 Intermediate to Good
#> 3 -1.00  0.40                 Poor

These data frames become available to you as soon as you install the irrCAC package.
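Although the benchmarking functions shown in the next section do all of this work for you, the scales can also be queried directly like any other data frame. As a quick illustration (not part of the package API), the line below looks up the Landis-Koch interval that a point estimate of 0.775 (the \(\mbox{AC}_1\) value used in the next section) falls into, ignoring the uncertainty of the estimate; the next section shows how that uncertainty is taken into account.

  with(landis.koch, interp.LK[lb.LK <= 0.775 & 0.775 < ub.LK])  # picks out the "Substantial" row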

Interpreting the magnitude of agreement coefficients

Suppose that you computed Gwet’s \(\mbox{AC}_1\) coefficient using raw ratings from the dataset cac.raw4raters, and you now want to qualify the magnitude of this coefficient using one of the benchmarking scales. Although you would normally choose a single benchmarking scale, I will use all three for illustration purposes. You would proceed as follows:

  ac1 <- gwet.ac1.raw(cac.raw4raters)$est
  data.frame(ac1$coeff.val, ac1$coeff.se)
#>   ac1.coeff.val ac1.coeff.se
#> 1       0.77544      0.14295
  landis.koch.bf(ac1$coeff.val, ac1$coeff.se)
#>                 Landis-Koch CumProb
#> (0.8 to 1)   Almost Perfect 0.39674
#> (0.6 to 0.8)    Substantial 0.88336
#> (0.4 to 0.6)       Moderate 0.99542
#> (0.2 to 0.4)           Fair 0.99997
#> (0 to 0.2)           Slight       1
#> (-1 to 0)              Poor       1
  altman.bf(ac1$coeff.val, ac1$coeff.se)
#>                 Altman CumProb
#> (0.8 to 1)   Very Good 0.39674
#> (0.6 to 0.8)      Good 0.88336
#> (0.4 to 0.6)  Moderate 0.99542
#> (0.2 to 0.4)      Fair 0.99997
#> (-1 to 0.2)       Poor       1
  fleiss.bf(ac1$coeff.val, ac1$coeff.se)
#>                             Fleiss CumProb
#> (0.75 to 1)              Excellent 0.54414
#> (0.4 to 0.75) Intermediate to Good 0.99542
#> (-1 to 0.4)                   Poor       1

Each of the functions landis.koch.bf(ac1$coeff.val, ac1$coeff.se), altman.bf(ac1$coeff.val, ac1$coeff.se), and fleiss.bf(ac1$coeff.val, ac1$coeff.se) produces 2 columns: the agreement strength level and the associated cumulative membership probability (CumProb). CumProb represents the probability that the true agreement strength level is the one associated with that probability, or a better one. In Gwet (2014), I recommended retaining the agreement strength level associated with a CumProb that exceeds 0.95.

  * Landis-Koch Scale: Although \(\mbox{AC}_1=0.775\), this agreement coefficient is deemed Moderate according to the Landis-Koch benchmarking scale, since Moderate is the strongest level whose CumProb exceeds the threshold of 0.95. It cannot be qualified as Substantial because of its high standard error of 0.14295. See chapter 4 of Gwet (2014) for a more detailed discussion of this topic.
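For readers who want to see where these probabilities come from, the following sketch reproduces the CumProb column by hand. It is not part of the irrCAC API: the helper cum.membership is hypothetical, and it assumes that the estimated coefficient follows a normal distribution with the reported standard error, restricted to the [-1, 1] range of agreement coefficients; the exact computation inside the package may differ in its details.

  cum.membership <- function(coeff, se, scale = landis.koch) {
    lb <- scale[[1]]  # lower bounds of the scale intervals
    # Probability mass between each lower bound and +1, under a
    # Normal(coeff, se) distribution restricted to the (-1, 1) range
    num <- pnorm((coeff - lb) / se) - pnorm((coeff - 1) / se)
    den <- pnorm((coeff + 1) / se) - pnorm((coeff - 1) / se)
    data.frame(level = scale[[3]], CumProb = round(num / den, 5))
  }
  cum.membership(0.77544, 0.14295)  # should match the Landis-Koch CumProb column above
  # The retained qualification is the strongest level with CumProb above 0.95,
  # which for this example is Moderate:
  out <- cum.membership(0.77544, 0.14295)
  as.character(out$level[out$CumProb > 0.95][1])

Passing the altman or fleiss data frame as the scale argument would reproduce the other two tables in the same way.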

References:

  1. Gwet, K.L. (2014). “Handbook of Inter-Rater Reliability,” 4th Edition. Advanced Analytics, LLC. ISBN: 978-0970806284.