Statistical Methods for Evaluating AIG Items

Substantive methods can be used to assess the quality of the cognitive, item, and distractor models, as well as the generated items. Statistical analyses, by comparison, can only be used to evaluate the quality of the generated items. Hence, in this third section of the chapter, we present methods that rely on statistical outcomes to assess the quality of the generated items. Three types of statistical analyses can be conducted using samples of generated items. The first type of analysis focuses on the quality of the correct option. The second type of analysis is used to evaluate the quality of the incorrect options. These first two types of analysis are also used to evaluate item quality in traditional item development. The third type of analysis is used to assess the similarity of the generated items using the Cosine Similarity Index (CSI). This analysis is also used to evaluate item quality, but it is unique to AIG.

Statistical Analyses of the Correct Option

Item quality can be measured for the correct option using the difficulty and discrimination indices. In classical test theory (CTT), item difficulty is measured as the proportion of examinees who correctly answer the item. A higher proportion correct indicates an easier test item. Difficulty does not typically have an item quality interpretation other than to say that it is desirable to have items with a range of difficulty levels when creating a test. Item discrimination can be measured using either the point-biserial or the biserial correlation. The point-biserial correlation is a measure of the relationship between a dichotomous item score and a total test score that is assumed to be a continuous variable. The biserial correlation is a measure of the relationship between a dichotomous item score that is assumed to reflect an underlying continuous distribution and a total test score that is also assumed to be continuous. For both indices, a higher correlation indicates a stronger relationship between item and total test score performance, which is characteristic of a more discriminating and, therefore, higher-quality item. The index of discrimination, which is the proportion correct for examinees in the upper group minus the proportion correct for examinees in the lower group, can also be computed. The index of discrimination should be positive for a high-quality correct option.
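As a concrete illustration of these CTT indices, the sketch below computes the proportion correct, the point-biserial correlation, and the index of discrimination from dichotomous item scores. The function names and the upper/lower grouping fraction are our own illustrative choices, not prescribed by this chapter.

```python
# Illustrative CTT item statistics for the correct option; function names
# and the grouping fraction are assumptions of this sketch.
import numpy as np

def item_difficulty(item_scores):
    """Proportion of examinees answering correctly; higher means easier."""
    return float(np.mean(item_scores))

def point_biserial(item_scores, total_scores):
    """Correlation between a dichotomous item score and the total test score."""
    return float(np.corrcoef(np.asarray(item_scores, dtype=float),
                             np.asarray(total_scores, dtype=float))[0, 1])

def index_of_discrimination(item_scores, total_scores, fraction=0.27):
    """Proportion correct in the upper group minus that in the lower group."""
    items = np.asarray(item_scores, dtype=float)
    order = np.argsort(total_scores)        # rank examinees by total score
    n = max(1, int(len(order) * fraction))  # size of each extreme group
    return float(items[order[-n:]].mean() - items[order[:n]].mean())
```

For a well-functioning correct option, both the point-biserial correlation and the index of discrimination should be positive.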

In item response theory (IRT), the model that is often used for correct option analysis with the selected-response format is the three-parameter logistic model (Lord, 1980). This model is used to estimate the probability of choosing the correct response. It can be expressed as

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]},$$

where P_i(θ) is the probability of choosing the correct option for item i given the examinee's ability θ, a_i is the discrimination estimate for correct option i, b_i is the difficulty estimate for correct option i, and c_i is the pseudo-guessing estimate for correct option i. In this model, item difficulty is estimated using the b-parameter, where the majority of values range from -4.00 to 4.00. A high positive b-value indicates a more difficult item. As with the CTT interpretation, difficulty does not have an item quality interpretation in IRT other than to say that it is desirable to have items with a range of difficulty levels. Item discrimination is estimated using the a-parameter, in which a majority of the values range from 0 to 4. A higher a-value indicates a more discriminating and, thus, higher-quality item.
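The three-parameter logistic model can be sketched in a few lines of code. The scaling constant D = 1.7 (a conventional addition that makes the logistic approximate the normal ogive) and the parameter values in the test are illustrative choices, not values from the chapter.

```python
# Minimal sketch of the three-parameter logistic (3PL) model.
# D = 1.7 is the conventional scaling constant; parameters are illustrative.
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of choosing the correct option given ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

At theta = b the probability is midway between the pseudo-guessing floor c and 1.0; for very low ability it approaches c, and for very high ability it approaches 1.0.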

Statistical Analyses of the Incorrect Options

Item quality can also be measured for the incorrect options. In CTT, the simplest distractor analysis focuses on the percentage of examinees who select each distractor. The purpose of this analysis is to identify low-frequency distractors. These distractors are considered to be non-functioning because they are selected by very few examinees. Haladyna and Downing (1993) claimed that a distractor is non-functioning when it is selected by less than 5% of the examinees. Trace line plots can be used to visualize and statistically test the relationship between total test performance and distractor selection percentages in order to identify non-functioning distractors. A non-functioning distractor should produce a relatively flat trace line. A chi-square goodness-of-fit test can be used to measure whether the slope of a trace line is significantly different from 0. The chi-square statistic is computed using the difference between the observed and expected frequencies of the distractors as follows:

$$\chi^2 = \sum_{c=1}^{C} \frac{(O_{ck} - E_{ck})^2}{E_{ck}}, \qquad df = C - 1,$$

where O_ck is the observed frequency of examinees with ability level c who choose option k in the item, E_ck is the expected frequency of examinees with ability level c who choose option k, df is the degrees of freedom, and C is the total number of examinee ability levels. If the chi-square value is statistically significant, then the distractor is considered to be discriminating and, therefore, of high quality.
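A minimal version of this chi-square computation might look as follows. Here the expected counts are derived by assuming a flat trace line (the distractor is chosen at the same rate in every ability group); the function name and this particular way of forming expected counts are assumptions of the sketch.

```python
# Sketch of the chi-square test for a flat distractor trace line.
# Expected counts assume the distractor is chosen at the same overall rate
# in every ability group; names are assumptions of this sketch.

def distractor_chi_square(observed, group_sizes):
    """Return (chi-square statistic, degrees of freedom).

    observed[c]    -- examinees at ability level c who chose the distractor
    group_sizes[c] -- total examinees at ability level c
    """
    overall_rate = sum(observed) / sum(group_sizes)
    expected = [n * overall_rate for n in group_sizes]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1  # df = C - 1
```

A statistic near zero (identical selection rates across ability groups) indicates a flat, non-discriminating trace line; a statistic that exceeds the chi-square critical value for C - 1 degrees of freedom suggests the distractor discriminates.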

As with the correct option, the point-biserial and biserial correlations can be computed for the incorrect options. These correlations should be negative for the distractors, suggesting that a higher proportion of responses are associated with lower-performing examinees, which means the item is discriminating and, therefore, of higher quality. The index of discrimination (i.e., the proportion of the upper group that selected the option minus the proportion of the lower group that selected it) can also be computed for the incorrect options. The index of discrimination should be close to zero or negative for a high-quality item, indicating that a higher proportion of low-performing examinees selected the option compared to high-performing examinees.
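Combining the frequency and correlation checks, a simple distractor report might be sketched as follows. The 5% threshold follows Haladyna and Downing (1993), while the function and field names are our own illustrative choices.

```python
# Illustrative distractor report: selection proportion, point-biserial, and a
# non-functioning flag using the 5% rule; names are assumptions of this sketch.
import numpy as np

def distractor_report(choices, total_scores, distractors, threshold=0.05):
    choices = np.asarray(choices)
    totals = np.asarray(total_scores, dtype=float)
    report = {}
    for d in distractors:
        picked = (choices == d).astype(float)
        prop = float(picked.mean())
        # Point-biserial between "chose this distractor" and total score;
        # undefined when nobody (or everybody) picks the option.
        rpb = float(np.corrcoef(picked, totals)[0, 1]) if 0.0 < prop < 1.0 else 0.0
        report[d] = {"proportion": prop,
                     "point_biserial": rpb,
                     "non_functioning": prop < threshold}
    return report
```

A functioning distractor should show a selection proportion at or above the threshold and a negative point-biserial correlation.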

In IRT, two models are commonly used for distractor analysis. The first is Bock's (1972) nominal response model. This model is used to analyze distractors in multiple-choice items. The nominal response model estimates the probability of choosing each option, where it is assumed there is no ordering among the options. The nominal response model is expressed as

$$P(x_j = k \mid \theta) = \frac{\exp[a_k(\theta - b_k)]}{\sum_{h=1}^{m_j} \exp[a_h(\theta - b_h)]},$$

where P(x_j = k | θ) is the probability of choosing option k in item j given the examinee's ability θ, a_k is the item discrimination for distractor k, b_k is the difficulty of distractor k, and m_j is the total number of options for item j. The second IRT model that is commonly used for distractor analysis is Samejima's (1979) graded response model. Unlike Bock's model, Samejima's graded response model assumes an ordering among the options. The graded response model also takes into account the proportion of examinees who are assumed to guess the correct option and thereby allows for a better measure of how low-performing examinees respond to multiple-choice items. It is assumed that there exists a latent category of "don't know" (DK). The graded response model is expressed as

$$P(x_j = k \mid \theta) = \frac{\exp[a_k(\theta - b_k)] + d_k\,\exp[a_0(\theta - b_0)]}{\sum_{h=0}^{m_j} \exp[a_h(\theta - b_h)]},$$

where d_k is fixed to 1/m_j to represent the assumption that examinees will randomly guess if they belong to latent category DK, and

$$\frac{\exp[a_0(\theta - b_0)]}{\sum_{h=0}^{m_j} \exp[a_h(\theta - b_h)]}$$

represents the probability that an examinee belongs to latent category DK. For both the nominal and graded response models, distractors are evaluated using the item characteristic curve (ICC) plot. As examinee ability increases, the probability of choosing the correct option should increase, and the probability of choosing the incorrect options should decrease for a high-quality item. Non-functioning distractors of low quality yield an ICC that is relatively flat across all ability levels.
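The option probabilities for the nominal response model can be sketched directly from its equation. The a_k(θ - b_k) parameterization mirrors the notation used in this section (Bock's original formulation uses slope and intercept terms with identification constraints), and the parameter values in the test are illustrative.

```python
# Sketch of nominal response model option probabilities, following the
# a_k(theta - b_k) notation used above; parameter values are illustrative.
import math

def nominal_probs(theta, a, b):
    """Probability of choosing each option, given per-option a_k and b_k."""
    z = [math.exp(a_k * (theta - b_k)) for a_k, b_k in zip(a, b)]
    total = sum(z)
    return [zk / total for zk in z]
```

Evaluating these probabilities across a range of theta values traces out the ICC described above: for a well-behaved item, the correct option's curve rises with ability while each functioning distractor's curve falls.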

To illustrate the use of some of these statistical item analyses, Gierl, Lai, Pugh, Touchie, Boulais, and De Champlain (2016) conducted a study to evaluate the quality of generated medical items. The three-step AIG method with systematic distractor generation was used to create the items. Models were written to generate items in three areas: neonatal jaundice, upper gastrointestinal bleed, and liver disease in adults (one cognitive model and one item model were created for each content area). A sample of the generated items was then randomly assigned to nine different field test forms, each containing 52 items, and the forms were administered to 455 medical students. The outcomes from CTT and IRT were used to analyze the performance of the correct and incorrect options using generated items from the three content areas.

The results indicate that the generated items produced a range of difficulty levels for the correct option while providing a consistently high level of discrimination. Moreover, using a single model for each of the three different content areas produced items with difficulty levels ranging from very easy to very difficult. This finding helped address one common misconception about AIG, namely, that a single model will produce items with the same statistical characteristics because the generated items are similar to one another. Gierl et al. (2016) demonstrated that one model can be used to produce items with a range of difficulty levels. The Gierl et al. (2016) study also demonstrated the effectiveness of systematic distractor generation, as described in Chapter 5. One hundred and ten different distractors were generated for the items produced using the three medical cognitive models. Of the 110 generated distractors, only three were categorized as non-functioning based on the statistical results. This finding helped to demonstrate that systematic distractor generation can be used to produce large pools of plausible distractors that differentiate medical students at different ability levels.