Research Methods in Anthropology: Qualitative and Quantitative Approaches

# INTERCODER RELIABILITY

It is quite common in content analysis to have more than one coder mark up a set of texts. The idea is to see whether the constructs being investigated are shared—whether multiple coders reckon that the same constructs apply to the same chunks of text. There is a simple way to measure agreement between a pair of coders: you just line up their codes and calculate the percentage of agreement. This is shown in table 19.5 for two coders who have coded 10 texts for a single theme, using a binary code, 1 or 0.

Both coders have a 0 for texts 1, 4, 5, 7, and 10, and both coders have a 1 for text 2. These two coders agree a total of 6 times out of 10—5 times that the theme, whatever it is, does not appear in the texts, and 1 time that the theme does appear. On 4 out of 10

Table 19.5 Measuring Simple Agreement between Two Coders on a Single Theme

| Units of Analysis (documents/observations) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Coder 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Coder 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |

texts, the coders disagree. On text 9, for example, coder 1 saw the theme in the text, but coder 2 didn’t. Overall, these two coders agree 60% of the time.
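As a quick check, simple percent agreement can be computed directly from the two rows of table 19.5. Here is a minimal Python sketch (the language is my choice for illustration, not something the text prescribes):

```python
# Codes from table 19.5: 1 = theme present, 0 = theme absent
coder1 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
coder2 = [0, 1, 1, 0, 0, 1, 0, 1, 0, 0]

# Proportion of texts on which the two coders assigned the same code
observed_agreement = sum(c1 == c2 for c1, c2 in zip(coder1, coder2)) / len(coder1)
print(observed_agreement)  # 0.6
```

The result, 0.6, matches the 60% observed agreement described above.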

The total observed agreement, though, is not a good measure of intercoder reliability because people can agree that a theme is present or absent in a text just by chance. To adjust for this possibility, many researchers use a statistic called Cohen’s kappa (Cohen 1960), or k.

## Cohen's kappa

Kappa is a statistic that measures how much better than chance the agreement is between a pair of coders on the presence or absence of binary (yes/no) themes in texts. Here is the formula for kappa:

k = (observed agreement − chance agreement) / (1 − chance agreement)   (formula 19.1)

When k is 1.0, there is perfect agreement between coders. When k is zero, agreement is what might be expected by chance. When k is negative, the observed level of agreement is less than what you’d expect by chance. And when k is positive, the observed level of agreement is greater than what you’d expect by chance. Table 19.6 shows the data in table 19.5 rearranged so that we can calculate kappa.

Table 19.6 The Coder-by-Coder Agreement Matrix for the Data in Table 19.5

|  | Coder 2: Yes | Coder 2: No | Coder 1 totals |
|---|---|---|---|
| Coder 1: Yes | 1 (a) | 1 (b) | 2 |
| Coder 1: No | 3 (c) | 5 (d) | 8 |
| Coder 2 totals | 4 | 6 | 10 (n) |

The observed agreement between Coder 1 and Coder 2 is:

(a + d) / n = (1 + 5) / 10 = .60

Here, Coder 1 and Coder 2 agreed that the theme was present in the text once (cell a) and they agreed that the theme was absent five times (cell d), for a total of 6, or 60% of the 10 texts.

The probability that Coder 1 and Coder 2 agree by chance is the sum, over the Yes and No rows, of the products of the two coders' marginal proportions:

(2/10)(4/10) + (8/10)(6/10) = .08 + .48 = .56

Here, the probability that Coder 1 and Coder 2 agreed by chance is .08 + .48 = .56. Using formula 19.1, we calculate kappa:

k = (.60 − .56) / (1 − .56) = .04 / .44 = .09

In other words, the 60% observed agreement between the two coders for the data in table 19.5 is about 9% better than we’d expect by chance. Whether we’re talking about agreement between two people who are coding a text or two people who are coding behavior in a time allocation study, 9% better than chance is nothing to write home about.
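The full calculation above can be sketched in Python, building the cells of the agreement matrix in table 19.6 from the raw codes and then applying formula 19.1 (again, the code is an illustration, not part of the original text):

```python
# Codes from table 19.5: 1 = theme present, 0 = theme absent
coder1 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
coder2 = [0, 1, 1, 0, 0, 1, 0, 1, 0, 0]
n = len(coder1)

# Cells of the 2 x 2 matrix in table 19.6
a = sum(1 for x, y in zip(coder1, coder2) if x == 1 and y == 1)  # both say yes
b = sum(1 for x, y in zip(coder1, coder2) if x == 1 and y == 0)  # only coder 1 says yes
c = sum(1 for x, y in zip(coder1, coder2) if x == 0 and y == 1)  # only coder 2 says yes
d = sum(1 for x, y in zip(coder1, coder2) if x == 0 and y == 0)  # both say no

observed = (a + d) / n                                           # .60
# Chance agreement: products of the marginal proportions for Yes and for No
chance = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)  # .56
kappa = (observed - chance) / (1 - chance)                       # about .09
print(round(kappa, 2))  # 0.09
```

Running this reproduces the worked numbers: observed agreement .60, chance agreement .56, and kappa about .09.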

Carey et al. (1996) asked 51 newly arrived Vietnamese refugees in New York State 32 open-ended questions about tuberculosis. Topics included knowledge and beliefs about TB symptoms and causes as well as beliefs about susceptibility to the disease, prognosis for those who contract the disease, skin-testing procedures, and prevention and treatment methods. The researchers read the responses and built a code list based simply on their own judgment. The initial codebook contained 171 codes.

Then Carey et al. broke the text into 1,632 segments. Each segment was the response by 1 of the 51 respondents to 1 of the 32 questions. Two coders independently coded 320 of the segments, marking as many of the themes as they thought appeared in each segment. A segment was counted as reliably coded if both coders used the same codes on it. If one coder left off a code or assigned an additional code, then this was considered a coding disagreement.

On their first try, only 144 (45%) out of 320 responses were coded the same by both coders. The coders discussed their disagreements and found that some of the 171 codes were redundant, some were vaguely defined, and some were not mutually exclusive. In some cases, coders simply had different understandings of what a code meant. When these problems were resolved, a new, streamlined codebook was issued, with only 152 themes, and the coders marked up the data again. This time they were in agreement 88.1% of the time.

To see if this apparently strong agreement was a fluke, Carey et al. tested intercoder reliability with kappa. The coders agreed perfectly (k = 1.0) on 126 out of the 152 codes that they’d applied to the 320 sample segments. Only 17 (11.2%) of the codes had final k values ≤ 0.89. As senior investigator, Carey resolved any remaining intercoder discrepancies himself (Carey et al. 1996).

How much intercoder agreement is enough? As with so much in real life, the correct answer, I think, is: It depends. It depends, for example, on the level of inference required. If you have texts from single mothers about their efforts to juggle home and work, it’s easier to code for the theme ‘‘works full time’’ (a low-inference theme) than it is to code for the theme ‘‘enjoys her job’’ (a high-inference theme).

It also depends on what’s at stake. X-rays are texts, after all, and I’d like a pretty high level of intercoder agreement if a group of physicians were deciding on whether a particular anomaly meant my going in for surgery or not. In text analysis, the standards are still evolving. Many researchers are satisfied with kappa values of around .70; others like to shoot for .80 and higher (Gottschalk and Bechtel 1993; Krippendorf 2004b) (Further Reading: interrater reliability).
