# Inter-Rater Reliability

As planners, we are constantly in search of data. After all, we celebrate the decennial census because it gives us the latest data to test our assumptions against. Unfortunately, data simply does not exist for much of what we want to study. When you are faced with this problem, you have little choice but to go out and create your own dataset.

However, in the process of gathering data from the world around us, we all look through the various lenses of our individual personalities, backgrounds, educations, and experiences. For instance, if five researchers were to observe kids playing in a park to determine how kids use parks and improvements thereon, each of the five would observe different subtleties within the same environment. You could understand how a single college-aged male might record different observations than a mother of four would.

There are steps that can be taken to account for this variation and to make sure that it does not undermine the quality of our data. Inter-rater reliability is simply a measure of whether raters or observers produce similar or consistent observations of the same phenomena. There are several ways to address inter-rater reliability, and while the simplest are fairly rudimentary, it is important to understand them when you feel there could be inter-rater reliability issues.

**Joint Probability of Agreement:** A number of inter-rater reliability tests are available to the planning professional. When the data being created is nominal in nature, the inter-rater reliability can be tested using a joint probability of agreement test. This test is essentially the number of times an observation is given the same rating or score by two or more raters, divided by the total number of observations rated.
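As a minimal sketch of the joint probability of agreement for two raters, the share of agreement can be computed as the number of observations rated identically divided by the number of observations rated. The function name `percent_agreement` and the walkability judgments below are hypothetical, chosen only for illustration.

```python
def percent_agreement(rater_a, rater_b):
    """Share of observations on which two raters chose the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of observations.")
    # Count the observations where the two raters picked the same category.
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


# Hypothetical yes/no walkability judgments for eight site plans.
rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]

agreement = percent_agreement(rater_a, rater_b)
print(agreement)  # 6 of 8 site plans match, so 0.75
```

Note that this figure says nothing about how much of the agreement could have arisen by chance.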

The weakness of this test is that it does not take into account agreement that occurs by chance. If you and I are asked to grade whether a site plan for a proposed development is walkable, we would more likely agree if the choices were between *yes* and *no* than if we had eight different response categories to choose from. The other weakness of this test is that the data is assumed to be nominal.

**Cohen’s Kappa:** Cohen’s Kappa takes the joint probability of agreement test and improves it by accounting for the chance agreement that would take place with a limited number of choices available to the raters. This approach, while an improvement over the joint probability of agreement test, still assumes that the data is nominal. If your data is ordinal in nature (e.g., walkability rating on a scale of five), neither of these tests will do an adequate job of measuring inter-rater reliability.
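The chance correction can be sketched for two raters using the standard formula kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed share of agreement and p_e is the agreement expected by chance given how often each rater uses each category. The function name `cohens_kappa` and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("Both raters must score the same non-empty set of observations.")
    # Observed agreement: share of observations rated identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each category, the product of the two raters'
    # marginal shares, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if p_chance == 1:
        raise ValueError("Kappa is undefined when both raters use a single category.")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical yes/no walkability judgments for eight site plans.
rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # observed 0.75, chance 0.53125, kappa about 0.467
```

In this example the raters agree on 75 percent of the site plans, but more than half of that agreement would be expected by chance alone, which is why kappa is much lower than the raw share.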

**Correlation Coefficients:** Correlation is the subject of Chapter 9. If the selections available to your raters are ordinal or continuous in nature, you can use a correlation coefficient to measure the extent to which your raters agree. For ordinal data, a Spearman’s rank correlation coefficient can be used to measure the degree to which two raters agree or disagree. For continuous variables, a Pearson’s coefficient would be better. When more than two raters or observers are being used, the appropriate correlation coefficient will need to be calculated for every possible combination of raters. The mean coefficient then would be used in explaining how you have measured the degree to which your raters agree.
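The pairwise procedure for more than two raters can be sketched as follows. This is an illustration under stated assumptions: Spearman’s coefficient is computed as a Pearson correlation on ranks (tied ratings share their average rank), and the helper names and the five-point ratings are hypothetical.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined when a rater gives constant ratings.")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise(ratings_by_rater, coefficient):
    """Average a coefficient over every possible pair of raters."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(coefficient(a, b) for a, b in pairs) / len(pairs)


# Hypothetical walkability ratings (scale of five) for six street segments.
ratings_by_rater = [
    [1, 2, 3, 4, 5, 3],  # rater 1
    [2, 2, 3, 5, 5, 3],  # rater 2
    [1, 3, 3, 4, 4, 2],  # rater 3
]

mean_rho = mean_pairwise(ratings_by_rater, spearman)
print(round(mean_rho, 3))
```

For continuous ratings, passing `pearson` instead of `spearman` to `mean_pairwise` gives the corresponding mean Pearson coefficient.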

Perhaps better than either the Spearman or Pearson correlation coefficient for measuring inter-rater reliability is the intraclass correlation coefficient (ICC). Returning to the example of experts rating street scenes, Ewing et al. (2006) and Ewing and Handy (2009) had ten expert panelists rate 48 street scenes with respect to nine urban design qualities. ICCs were computed from their ratings, and then compared to nominal standards of reasonable agreement. From their ICC values, most urban design qualities demonstrated moderate inter-rater reliability among panelists (0.6 > ICCs > 0.4); the exceptions—linkage, coherence, and legibility—showed only fair reliability (0.4 > ICCs > 0.2) (Landis & Koch, 1977). The latter were dropped from further consideration. See Chapter 9 for more on the ICC.
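To make the ICC more concrete, the sketch below computes one common form, the two-way random-effects, single-rater coefficient often written ICC(2,1), from the mean squares of a two-way analysis of variance. The text does not say which form of the ICC was used in the street scene study, so this choice is an assumption, and the ratings are illustrative rather than data from that study.

```python
def icc_two_way_random(ratings):
    """ICC(2,1) for a table with one row per rated scene and one column per rater."""
    n = len(ratings)        # number of scenes (rows)
    k = len(ratings[0])     # number of raters (columns)
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for scenes, raters, and the residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings: six scenes (rows) scored by four raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

icc = icc_two_way_random(ratings)
print(round(icc, 2))  # about 0.29
```

Under the nominal standards cited above, a value of about 0.29 would fall in the fair range (0.4 > ICC > 0.2).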

# Equivalency Reliability

Another type of reliability considered by researchers is equivalency reliability. This measure of reliability considers the extent to which two variables measure the same construct. This is also known as equivalent form reliability or parallel form reliability.

In considering equivalency reliability, you look at the different measures of a construct and see if your results are similar using different approaches. If they are dissimilar, your test suffers from a lack of equivalency reliability. The challenge with testing equivalency reliability is that you are forced to come up with numerous ways to define or test a complicated construct. This is no easy feat.

In the aforementioned study of urban design qualities and pedestrian activity, Ewing and Clemente (2013) wanted to test whether the field pedestrian counts were, in fact, reliable indicators of pedestrian activity. The counts had been taken through two passes up and down a block, rather than standardized counts for an extended period. Four consecutive counts represented a small sample, so they conducted a test of equivalency reliability.

They compared field counts for block faces in New York City to pedestrian counts from Google Street View, Bing, and Everyscape for the same block faces (Figure 6.4). To compare field counts to web-based counts, equivalency reliability was judged with Cronbach’s alpha. Cronbach’s alpha is widely used in the social sciences to see if items (questions, raters, indicators) measure the same thing. If independent counts (four based on field work and three based on street imagery) agreed, they could assume that the field counts were reliable measures of pedestrian activity. Some professionals require a reliability of 0.70 or higher before they will use an instrument. The alpha values Ewing and Clemente obtained were consistent with this guideline for two out of three websites (Google Street View and Bing).
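As a rough sketch of how such an alpha is computed, the standard formula is alpha = k / (k - 1) × (1 - sum of item variances / variance of the total score), where each independent count is treated as an item. The function name `cronbach_alpha` and the pedestrian counts below are hypothetical and are not the study’s data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table with one row per case and one column per item."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("Alpha needs at least two items.")
    # Sample variance of each item (column) across cases.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the total score (row sum) across cases.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical pedestrian counts for five block faces (rows) from
# three independent sources (columns), e.g., one field count and two web counts.
counts = [
    [12, 14, 11],
    [30, 28, 33],
    [8, 9, 7],
    [21, 19, 24],
    [15, 17, 14],
]

alpha = cronbach_alpha(counts)
print(round(alpha, 2))  # about 0.98, above the 0.70 guideline
```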

# Internal Consistency

Internal consistency is similar to the concept of equivalency reliability. Internal consistency is the ability of a test to measure the same concept through different questions or approaches and yet yield consistent data. Put another way, internal consistency involves measuring two different versions or approaches within the same test and determining the consistency of the responses.

As a researcher, you will want to approach a concept or question from various perspectives because each perspective will provide additional data that will help shape your opinions. However, if you begin to ask questions on the same concept that produce wildly differing responses, you know that your method has a problem.

Generally, the best way to address internal consistency is to divide the test in half and compare the responses from one half to the other half. A high correlation coefficient between the two halves is desirable up to a point. If the two halves are perfectly correlated, you have not added any information by asking additional questions.
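The half-to-half comparison can be sketched as below. As a simplifying assumption, the test is split into odd-numbered and even-numbered questions rather than at random, each respondent’s answers are summed within each half, and the two sets of half scores are correlated. The function names and survey responses are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_correlation(responses):
    """Correlate respondents' scores on odd-numbered and even-numbered questions."""
    first_half = [sum(row[0::2]) for row in responses]   # questions 1, 3, ...
    second_half = [sum(row[1::2]) for row in responses]  # questions 2, 4, ...
    return pearson(first_half, second_half)


# Hypothetical answers from five respondents (rows) to four questions
# (columns), each scored on a scale of five.
responses = [
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 1],
]

r_halves = split_half_correlation(responses)
print(round(r_halves, 2))  # about 0.90
```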

*Figure 6.4* Use of Google Street View, Bing, and Everyscape Imagery (from top to bottom) to Establish Equivalency Reliability (East 19th Street, New York, NY)


There are three common tests to measure internal consistency: the split-halves test, the Kuder-Richardson test, and Cronbach’s alpha test. The split-halves test randomly splits the answers of the test into two halves and uses a simple correlation to measure consistency. This was a more common technique decades ago, but with the proliferation of statistical software, the ease and advantages of other methods have made this simple approach less common. The Kuder-Richardson test is a more complex version of the split-halves test in which a statistical software package runs every possible split-half test and gives an average correlation. Cronbach’s alpha test is similar to the Kuder-Richardson test and gives the researcher the ability to do a sophisticated split-half test when the questions have different scoring weights.

For instance, if the survey asked, “Do you walk to work daily?” the possible answers are *yes* and *no*, and the appropriate internal consistency test for this type of nominal data would be the Kuder-Richardson test. If the question was “How likely are you to walk to work?” with a range of answers, the data would be ordinal and the Cronbach’s alpha test would be the best measure of internal consistency.
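For yes/no questions, the Kuder-Richardson test is usually computed with the formula known as KR-20: k / (k - 1) × (1 - sum of p × q / variance of the total score), where p is the share of respondents answering *yes* to a question and q = 1 - p. The sketch below assumes answers are coded 1 for *yes* and 0 for *no*. The function name `kr20` and the responses are hypothetical.

```python
from statistics import pvariance


def kr20(rows):
    """Kuder-Richardson Formula 20 for a table of 0/1 answers (rows are respondents)."""
    n = len(rows)
    k = len(rows[0])
    if k < 2:
        raise ValueError("KR-20 needs at least two questions.")
    # p is the share answering yes (1) to each question; p * (1 - p) is its variance.
    p_values = [sum(row[j] for row in rows) / n for j in range(k)]
    item_variance_sum = sum(p * (1 - p) for p in p_values)
    # Population variance of total scores, matching the p * q item variances.
    total_variance = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical answers from four respondents to three yes/no questions
# (1 = yes, 0 = no).
answers = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

reliability = kr20(answers)
print(reliability)  # 1.5 * (1 - 0.625 / 1.25) = 0.75
```

KR-20 is the special case of Cronbach’s alpha for questions scored 0 or 1, which is why the two formulas have the same shape.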