Substantive Methods for Evaluating AIG Items
AIG versus Traditional Item Review: Item Quality
The second type of substantive review we describe focuses on the quality of the generated items (Pugh, De Champlain, Gierl, Lai, & Touchie, 2020; see also Embretson & Kingston, 2018). While we contend that model-based reviews serve as the best method for generating high-quality items, item reviews are also required. In our experience, two important questions arise when testing organizations consider AIG as a methodology for item development. The first question is, do AIG items meet the standards of quality required for an operational test administration? The implicit point of comparison in this question is the current set of operational items created using a traditional item development approach. Therefore, to answer this question using a substantive review, a sample of generated items is compared to a sample of operational items. This type of substantive review allows the SMEs to evaluate and compare items created using two different item-writing methods. To implement this review, SMEs rate the quality of the items created using the AIG and traditional methods. The sample should be carefully balanced so that the items represent the same content areas and cognitive skills in the test specifications. The SMEs should be carefully selected to ensure that they have adequate content expertise and at least some experience writing operational test items. These qualifications help ensure that the judgements of the SMEs who conduct the reviews are sound because they have the experience required to evaluate item quality. But most importantly, the SMEs should be blind to the item development method so that biases in favor of or against the AIG methodology do not affect their ratings. We recommend the use of a standardized rating scale to conduct the review, as described in the previous section.
The only difference between the current and the prior review is that the focus is on item rather than model quality. Again, a 4-point rating scale could be used because it provides a succinct but relevant indicator of quality. It also yields results that are easily interpretable for researchers and practitioners. The 4-point scale ranges from a low of 1 (acceptable; item is fine, no changes needed) to a high of 4 (reject; item is flawed and should be rejected outright). The mid-range values are 2 (minor revisions; item requires minor revisions that can be made in-house) and 3 (major revisions; item requires major revisions and must be returned to the item developer for changes). This rating scale has the benefit of classifying items into two categories, acceptable or unacceptable. Items rated as acceptable outright or as requiring only minor revisions are classified as acceptable. Conversely, items that require major revisions or that are rejected are classified as unacceptable.
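The dichotomization just described can be sketched in a few lines of code. The ratings below are invented for illustration; only the 4-point scale and the acceptable/unacceptable split come from the review procedure itself.

```python
# Minimal sketch: collapsing hypothetical 4-point SME ratings into the
# two-category outcome (1 = acceptable, 2 = minor revisions,
# 3 = major revisions, 4 = reject). All ratings below are invented.

def classify(rating: int) -> str:
    """Map a 4-point quality rating to the two-category outcome."""
    if rating not in (1, 2, 3, 4):
        raise ValueError(f"rating must be 1-4, got {rating}")
    return "acceptable" if rating <= 2 else "unacceptable"

ratings = [1, 2, 2, 3, 4, 1, 2, 3]          # hypothetical SME ratings
counts = {"acceptable": 0, "unacceptable": 0}
for r in ratings:
    counts[classify(r)] += 1

print(counts)  # {'acceptable': 5, 'unacceptable': 3}
```

Keeping the full 4-point rating alongside the dichotomy preserves the operational distinction between revisions that can be made in-house (rating 2) and those that must go back to the item developer (rating 3).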
An example of a study that used this type of substantive evaluation to assess item quality was conducted by Gierl, Latifi, Lai, Matovinovic, and Boughton (2016). They used this 4-point scale to evaluate the quality of 16 generated items and 16 traditionally created items in junior high school science. To conduct the review, AIG was used to create items in eight different Physics content areas. Two items from each of the eight content areas were randomly selected from a generated item set containing more than 9,000 items. A random sample of 16 traditional items was also drawn from the eight Physics areas using an existing bank of items created at a large commercial testing company. The items in this bank were written using the traditional approach by experienced SMEs and then reviewed by a committee of SMEs to ensure that the items met the content and item development standards at the company. The items created using the AIG and traditional methods were combined, and their order in the form was randomized to create a 32-item evaluation test. A three-member panel of science content and test development specialists independently rated each item using the 4-point scale described in the previous paragraph. Gierl et al. (2016) first conducted an independent-samples t-test using the 4-point item quality indicator as the dependent variable, while the item development method (AIG vs. traditional) served as the independent variable. A lower mean score indicates higher item quality. The means for the AIG and traditional conditions were 2.12 and 2.33, respectively, which were not statistically different from one another. Hence the overall ratings by the panelists indicate that the AIG and traditional items were of comparable quality. The ratings in the Gierl et al. (2016) study were also evaluated by item type across the three panelists.
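The style of analysis reported by Gierl et al. (2016) can be illustrated with a short sketch. The per-item ratings below are invented (the study's actual rating data are not reproduced here), and the Welch version of the independent-samples t statistic is used as one standard choice when group variances may differ.

```python
# Sketch of an independent-samples (Welch) t-test comparing mean item
# quality ratings for two development methods. Ratings are invented;
# lower means indicate higher item quality.
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return the Welch t statistic and approximate degrees of freedom."""
    va, vb = variance(a), variance(b)            # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb                      # squared standard error
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

aig = [2, 2, 1, 3, 2, 2, 3, 2]    # hypothetical ratings, AIG items
trad = [2, 3, 2, 3, 2, 2, 3, 2]   # hypothetical ratings, traditional items
t, df = welch_t(aig, trad)
print(f"t = {t:.3f}, df = {df:.1f}")
```

A t statistic near zero, as in the published comparison of mean ratings of 2.12 versus 2.33, is consistent with the conclusion that the two item development methods yield items of comparable quality.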
Overall, 65% (31 of 48) of the AIG item ratings were considered acceptable (i.e., a rating of 1 or 2), while 52% (25 of 48) of the traditional item ratings were judged to be acceptable. Conversely, 35% (17 of 48) of the AIG item ratings were unacceptable, while 48% (23 of 48) of the traditional item ratings were unacceptable, despite the extensive development and review process that items must satisfy to qualify for inclusion in the operational item bank at the commercial testing company.
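The denominator of 48 follows from the three panelists each rating 16 items per method, and the reported percentages can be verified directly from the counts:

```python
# Verifying the acceptability percentages: 3 panelists x 16 items = 48
# ratings per method; counts are taken from the reported results.
aig_acceptable, trad_acceptable, total = 31, 25, 48

aig_pct = 100 * aig_acceptable / total
trad_pct = 100 * trad_acceptable / total
print(f"AIG: {aig_pct:.0f}% acceptable, {100 - aig_pct:.0f}% unacceptable")
print(f"Traditional: {trad_pct:.0f}% acceptable, "
      f"{100 - trad_pct:.0f}% unacceptable")
```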
AIG versus Traditional Item Review: Predictive Accuracy
The second question that often arises when testing organizations consider AIG as a methodology for developing items is, can AIG items be differentiated from traditional items? The root of this question resides in an implicit concern about the quality of the generated items. As we noted in Chapter 3, generated items—more specifically, generated items from 1-layer models—are sometimes described pejoratively as "clones" because they are considered to be too similar to a parent item. As a result, generated items are seen as easily produced and overly simplistic. To evaluate this concern, predictive accuracy can be assessed by asking SMEs to classify the generated and traditional items within a randomized sample. For this type of substantive review, SMEs must be familiar with the AIG methodology. Hence a workshop or training session that describes AIG should be delivered to each participant in this type of substantive evaluation because each SME should clearly understand the process used to generate AIG items, as well as the process used to create the traditional items. The items included in this type of review must also be presented to the SMEs in a random order. The dependent variable for this analysis is percent correct, meaning the percentage of items for which the SME correctly identifies the item development method. Rater reliability is also an important measure because it provides information on the consistency of this judgement across the SMEs.
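The two summary statistics for this review, percent correct and rater agreement, can be sketched as follows. The text does not name a specific reliability index, so Fleiss' kappa is used here as one common choice for agreement among more than two raters, and all classifications below are invented.

```python
# Sketch of percent correct and inter-rater agreement for a
# predictive-accuracy review. All data are invented for illustration.

def percent_correct(predicted, actual):
    """Percentage of items whose development method was identified."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100 * hits / len(actual)

def fleiss_kappa(table):
    """Fleiss' kappa; table[i][j] = raters assigning item i to category j."""
    n = sum(table[0])                          # raters per item
    n_items = len(table)
    n_cats = len(table[0])
    p_j = [sum(row[j] for row in table) / (n_items * n)
           for j in range(n_cats)]             # category proportions
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    p_bar = sum(p_i) / n_items                 # mean observed agreement
    p_e = sum(p * p for p in p_j)              # chance agreement
    return (p_bar - p_e) / (1 - p_e)

actual = ["AIG", "TRAD", "AIG", "TRAD"]        # true development method
sme_calls = ["AIG", "AIG", "AIG", "TRAD"]      # one SME's invented guesses
print(percent_correct(sme_calls, actual))      # 75.0

# Rows: items; columns: how many of 3 raters chose [AIG, Traditional].
table = [[3, 0], [2, 1], [1, 2], [0, 3]]
print(round(fleiss_kappa(table), 3))           # 0.333
```

Percent correct near 50% for a two-category judgement, combined with low agreement, would indicate that SMEs cannot reliably tell the two item types apart.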
Gierl et al. (2016) also evaluated the predictive accuracy of the three-member SME panel in the junior high school science study. Once the panelists were briefed on the purpose of the study and provided with an AIG workshop, they reviewed all 32 items with the goal of differentiating the AIG from the traditional items. The panelists correctly identified 57% of the items and incorrectly identified the remaining 43%. In addition, rater reliability was low at 0.13, indicating that there was little agreement across the three SMEs on which items were created using which method.