This post is mainly based on

The quality of a measuring instrument can be evaluated by:

  • Reliability: the extent to which the measures are consistent
  • Validity: the extent to which the measures are accurate

Inter-Rater Reliability (IRR) can be used to evaluate the consistency of ratings provided by multiple coders. Coder is used here as a generic term for the individuals who assign ratings in a study.

The rating / score from a coder can be decomposed in the following way:

\[X = T + E\]

where $X$ denotes the observed score, $T$ denotes the ground-truth score, and $E$ denotes the measurement error. If $T$ and $E$ are independent, the variance decomposes as:

\[Var(X) = Var(T) + Var(E)\]

And reliability is defined as:

\[\operatorname{Reliability} = \frac{Var(T)}{Var(X)}\]

i.e., the proportion of the observed variance that originates from the true score. Low reliability implies that measurement error is high, i.e., agreement between coders is low.
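To make the decomposition concrete, here is a minimal simulation sketch (the distributions and variances are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

T = rng.normal(loc=50, scale=10, size=n)  # ground-truth scores, Var(T) = 100
E = rng.normal(loc=0, scale=5, size=n)    # independent measurement error, Var(E) = 25
X = T + E                                 # observed scores

print(X.var())            # ~125, i.e. Var(T) + Var(E)
print(T.var() / X.var())  # reliability, ~100/125 = 0.8
```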

Since $T$ and $E$ are not observed, reliability cannot be computed directly; IRR must instead be estimated, for example with kappa for categorical variables or the Intra-Class Correlation for ordinal, interval, or ratio variables.

One may consider using the Pearson correlation as an estimator of IRR. However, the Pearson correlation requires ratings from a single pair of coders: it cannot handle cases where (1) there are more than 2 coders or (2) different subjects are rated by different coders.

Kappa

Cohen’s kappa and its variants are commonly used for assessing IRR for nominal / categorical variables. Variants have been developed to address:

  • Bias/prevalence correction
  • Non-fully crossed designs

Cohen’s Kappa

\[\kappa = \frac{P(a)-P(e)}{1-P(e)}\]

where $P(a)$ denotes the observed percentage of agreement, and $P(e)$ denotes the probability of agreement expected by chance. Consider the following agreement matrix for coders A and B:

                     Coder A - Absent   Coder A - Present
Coder B - Absent            42                  13
Coder B - Present            8                  37

$P(a) = (42 + 37)/100 = 0.79$
$P(e) = 0.5 \times 0.55 + 0.5 \times 0.45 = 0.5$ (each coder's marginal proportions multiplied, then summed over categories)
Hence, $\kappa = (0.79-0.5)/(1-0.5) = 0.58$
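To verify this arithmetic, the table can be expanded into per-subject labels and checked against sklearn.metrics.cohen_kappa_score (a sketch; 0/1 encode absent/present):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Expand the 2x2 agreement matrix into per-subject ratings.
# Cells: (B=absent, A=absent)=42, (B=absent, A=present)=13,
#        (B=present, A=absent)=8, (B=present, A=present)=37
coder_a = np.array([0]*42 + [1]*13 + [0]*8 + [1]*37)
coder_b = np.array([0]*42 + [0]*13 + [1]*8 + [1]*37)

p_a = np.mean(coder_a == coder_b)  # observed agreement: 0.79
p_e = sum(np.mean(coder_a == c) * np.mean(coder_b == c) for c in (0, 1))  # chance: 0.5

print((p_a - p_e) / (1 - p_e))              # 0.58
print(cohen_kappa_score(coder_a, coder_b))  # 0.58
```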

Properties

  • $\kappa \in [-1,1]$
  • $\kappa \in [0.61, 0.80]$: substantial agreement
  • $\kappa \in [0.81, 1.0]$: almost perfect or perfect agreement
  • When the marginal distributions of observed ratings largely fall under one category, kappa estimates can be unrepresentatively low (see the sketch after this list)
  • When the marginal distributions of specific ratings are substantially different between coders, kappa estimates can be unrepresentatively high
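A minimal sketch of the first issue, using a made-up skewed sample in which both coders label "absent" 90% of the time: raw agreement is 0.90, yet kappa comes out around 0.44.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical skewed sample; agreement matrix [[85, 5], [5, 5]].
coder_a = np.array([0]*85 + [1]*5 + [0]*5 + [1]*5)
coder_b = np.array([0]*85 + [0]*5 + [1]*5 + [1]*5)

print(np.mean(coder_a == coder_b))          # 0.90 raw agreement
print(cohen_kappa_score(coder_a, coder_b))  # ~0.44, pulled down by the skewed marginals
```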

Kappa Variants

[Figure: kappa variants]

Intra-Class Correlation (ICC)

Intra-class correlation (ICC) is commonly used for assessing IRR for ordinal, interval, and ratio variables. ICCs are suitable for:

  • Studies with two or more coders
  • Designs where all subjects are rated by multiple coders
  • Designs where only a subset of subjects is rated by multiple coders and the rest are rated by one coder
  • Fully crossed designs
  • Designs where coders are randomly selected for each subject

Consider a rating $X_{ij}$ provided to subject $i$ by coder $j$:

\[X_{ij} = \mu + r_i + e_{ij}\]

where $\mu$ is the mean of the ground-truth scores, $r_i$ is subject $i$'s deviation from the ground-truth mean (its variance corresponds to $Var(T)$), and $e_{ij}$ is the measurement error (its variance corresponds to $Var(E)$).

If we consider the possibility that some coders make systematic errors (e.g., are biased toward giving higher scores), the above equation can be expanded to:

\[X_{ij} = \mu + r_i + e_{ij} + c_j + rc_{ij}\]

where $c_j$ is the bias of coder $j$ (the degree to which coder $j$ systematically deviates from the ground-truth mean), and $rc_{ij}$ represents the interaction between subject deviation and coder bias.

The variances of the terms $r_i$, $c_j$, $rc_{ij}$, and $e_{ij}$ are used to compute ICC. There are many variants of ICC, and they are constructed from different combinations of the above terms.
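For example, under the one-way model (no $c_j$ or $rc_{ij}$ terms), the single-rater ICC is the share of the total variance attributable to subjects; this is a sketch of just one variant, ICC(1,1) in the Shrout and Fleiss notation:

\[\operatorname{ICC}(1,1) = \frac{\sigma_r^2}{\sigma_r^2 + \sigma_e^2}\]

where $\sigma_r^2 = Var(r_i)$ and $\sigma_e^2 = Var(e_{ij})$. The two-way variants add $\sigma_c^2$ and $\sigma_{rc}^2$ to the denominator, or exclude them, depending on whether coder bias is counted as error.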

ICC is defined under two conventions: the Shrout and Fleiss convention and the McGraw and Wong convention. The detailed calculations, and the conversion between the two conventions, can be found in Table 3 of the Koo and Li paper.
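In practice, statistical packages report all the variants at once. A minimal sketch with the pingouin package (the ratings below are made up; pingouin labels the six Shrout and Fleiss forms ICC1 through ICC3k):

```python
import pandas as pd
import pingouin as pg

# Made-up long-format data: 5 subjects, each rated by the same 3 coders.
df = pd.DataFrame({
    "subject": [s for s in range(1, 6) for _ in range(3)],
    "coder":   ["A", "B", "C"] * 5,
    "score":   [7, 8, 8, 5, 5, 6, 9, 9, 10, 4, 5, 4, 6, 7, 7],
})

icc = pg.intraclass_corr(data=df, targets="subject", raters="coder", ratings="score")
print(icc[["Type", "Description", "ICC"]])  # six variants, ICC1 .. ICC3k
```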

Koo and Li’s guideline for interpreting ICC is:

  • ICC < 0.50: poor
  • 0.50 ≤ ICC < 0.75: moderate
  • 0.75 ≤ ICC < 0.90: good
  • ICC ≥ 0.90: excellent

A visualization of the differences between the ICC variants:

[Figure: ICC variants]

A detailed description can be found on the Wikipedia page.