This post is mainly based on

The quality of a measuring instrument can be evaluated by:

  • Reliability: the extent to which the measures are consistent
  • Validity: the extent to which the measures are accurate

Inter-Rater Reliability (IRR) can be used to evaluate the consistency of ratings provided by multiple coders. Coder is used here as a generic term for the individuals who assign ratings in a study.

The rating / score from a coder can be decomposed in the following way:

\[X = T + E\]

where $X$ denotes the observed score, $T$ denotes the ground-truth score, and $E$ denotes the measurement error. If $T$ and $E$ are independent, the variance decomposes as:

\[Var(X) = Var(T) + Var(E)\]

And reliability is defined as:

\[\operatorname{Reliability} = \frac{Var(T)}{Var(X)}\]

i.e., the proportion of the observed variance that originates from the true score. Low reliability implies that measurement error is high, i.e., agreement between coders is low.
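To make the decomposition concrete, here is a minimal simulation sketch (the distributions and variances are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

T = rng.normal(loc=50, scale=10, size=n)  # ground-truth scores, Var(T) = 100
E = rng.normal(loc=0, scale=5, size=n)    # independent measurement error, Var(E) = 25
X = T + E                                 # observed scores

print(X.var())            # ~125, i.e. Var(T) + Var(E)
print(T.var() / X.var())  # reliability, ~100/125 = 0.8
```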

Since $T$ and $E$ are not observed, reliability cannot be computed directly; IRR must instead be estimated, for example with kappa for categorical variables or the Intra-Class Correlation for ordinal, interval, or ratio variables.

One may consider using the Pearson correlation as an estimator of IRR. However, the Pearson correlation requires ratings from a single pair of coders: it cannot handle cases where (1) there are more than 2 coders or (2) different subjects are rated by different coders.

Kappa

Cohen’s kappa and its variants are commonly used for assessing IRR for nominal / categorical variables. Variants have been developed to address:

  • Bias/prevalence correction
  • Non-fully crossed designs

Cohen’s Kappa

\[\kappa = \frac{P(a)-P(e)}{1-P(e)}\]

where $P(a)$ denotes the observed percentage of agreement, and $P(e)$ denotes the probability of agreement expected by chance. Consider the following agreement matrix for coders A and B:

                     Coder A - Absent   Coder A - Present
Coder B - Absent            42                  13
Coder B - Present            8                  37

$P(a) = (42 + 37)/100 = 0.79$
$P(e) = 0.5 \times 0.55 + 0.5 \times 0.45 = 0.5$ (each coder's marginal proportions multiplied, then summed over categories)
Hence, $\kappa = (0.79-0.5)/(1-0.5) = 0.58$
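To verify this arithmetic, the table can be expanded into per-subject labels and checked against sklearn.metrics.cohen_kappa_score (a sketch; 0/1 encode absent/present):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Expand the 2x2 agreement matrix into per-subject ratings.
# Cells: (B=absent, A=absent)=42, (B=absent, A=present)=13,
#        (B=present, A=absent)=8, (B=present, A=present)=37
coder_a = np.array([0]*42 + [1]*13 + [0]*8 + [1]*37)
coder_b = np.array([0]*42 + [0]*13 + [1]*8 + [1]*37)

p_a = np.mean(coder_a == coder_b)  # observed agreement: 0.79
p_e = sum(np.mean(coder_a == c) * np.mean(coder_b == c) for c in (0, 1))  # chance: 0.5

print((p_a - p_e) / (1 - p_e))              # 0.58
print(cohen_kappa_score(coder_a, coder_b))  # 0.58
```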

Properties

  • $\kappa \in [-1,1]$
  • $\kappa \in [0.61, 0.80]$: substantial agreement
  • $\kappa \in [0.81, 1.0]$: almost perfect or perfect agreement
  • When the marginal distributions of observed ratings largely fall under one category, kappa estimates can be unrepresentatively low (see the sketch after this list)
  • When the marginal distributions of specific ratings are substantially different between coders, kappa estimates can be unrepresentatively high
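A minimal sketch of the first issue, using a made-up skewed sample in which both coders label "absent" 90% of the time: raw agreement is 0.90, yet kappa comes out around 0.44.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical skewed sample; agreement matrix [[85, 5], [5, 5]].
coder_a = np.array([0]*85 + [1]*5 + [0]*5 + [1]*5)
coder_b = np.array([0]*85 + [0]*5 + [1]*5 + [1]*5)

print(np.mean(coder_a == coder_b))          # 0.90 raw agreement
print(cohen_kappa_score(coder_a, coder_b))  # ~0.44, pulled down by the skewed marginals
```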

Kappa Variants

[Figure: kappa variants]

Intra-Class Correlation (ICC)

Intra-class correlation (ICC) is commonly used for assessing IRR for ordinal, interval, and ratio variables. ICCs are suitable for:

  • Studies with two or more coders
  • Designs where all subjects are rated by multiple coders
  • Designs where only a subset of subjects is rated by multiple coders and the rest are rated by one coder
  • Fully crossed designs
  • Designs where coders are randomly selected for each subject

Consider a rating $X_{ij}$ provided to subject $i$ by coder $j$:

\[X_{ij} = \mu + r_i + e_{ij}\]

where $\mu$ is the mean of the ground-truth scores, $r_i$ is subject $i$'s deviation from the ground-truth mean (its variance corresponds to $Var(T)$), and $e_{ij}$ is the measurement error (its variance corresponds to $Var(E)$).

If we consider the possibility that some coders make systematic errors (e.g., are biased toward giving higher scores), the above equation can be expanded to:

\[X_{ij} = \mu + r_i + e_{ij} + c_j + rc_{ij}\]

where $c_j$ is the bias of coder $j$ (the degree to which coder $j$ systematically deviates from the ground-truth mean), and $rc_{ij}$ represents the interaction between subject deviation and coder bias.

The variances of the terms $r_i$, $c_j$, $rc_{ij}$, and $e_{ij}$ are used to compute ICC. There are many variants of ICC, and they are constructed from different combinations of the above terms.
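For example, under the one-way model (no $c_j$ or $rc_{ij}$ terms), the single-rater ICC is the share of the total variance attributable to subjects; this is a sketch of just one variant, ICC(1,1) in the Shrout and Fleiss notation:

\[\operatorname{ICC}(1,1) = \frac{\sigma_r^2}{\sigma_r^2 + \sigma_e^2}\]

where $\sigma_r^2 = Var(r_i)$ and $\sigma_e^2 = Var(e_{ij})$. The two-way variants add $\sigma_c^2$ and $\sigma_{rc}^2$ to the denominator, or exclude them, depending on whether coder bias is counted as error.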

ICC is defined under two conventions: the Shrout and Fleiss convention and the McGraw and Wong convention. The detailed calculations, and the conversion between the two conventions, can be found in Table 3 of the Koo and Li paper.
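In practice, statistical packages report all the variants at once. A minimal sketch with the pingouin package (the ratings below are made up; pingouin labels the six Shrout and Fleiss forms ICC1 through ICC3k):

```python
import pandas as pd
import pingouin as pg

# Made-up long-format data: 5 subjects, each rated by the same 3 coders.
df = pd.DataFrame({
    "subject": [s for s in range(1, 6) for _ in range(3)],
    "coder":   ["A", "B", "C"] * 5,
    "score":   [7, 8, 8, 5, 5, 6, 9, 9, 10, 4, 5, 4, 6, 7, 7],
})

icc = pg.intraclass_corr(data=df, targets="subject", raters="coder", ratings="score")
print(icc[["Type", "Description", "ICC"]])  # six variants, ICC1 .. ICC3k
```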

Koo and Li’s guideline for interpreting ICC is:

  • ICC < 0.50: poor
  • 0.50 ≤ ICC < 0.75: moderate
  • 0.75 ≤ ICC < 0.90: good
  • ICC ≥ 0.90: excellent

A visualization of the differences between the ICC variants:

[Figure: ICC variants]

A detailed description can be found on the Wikipedia page.