Methods for Writing Items to Performance Standards
Targeting items to discriminate at specific points of the scale, namely between two performance levels, is difficult work that is gaining traction in operational practice, particularly at the College Board and in the two Race to the Top Assessment consortia. One process the College Board used in AP history was first to map the claims onto a performance continuum. The content experts noted that the claims themselves varied in difficulty and could be modified to change that difficulty further. They then identified “difficulty drivers” within each claim that could be used in item writing. In this way, they ensured that multiple items written to a claim would vary across difficulty and thus across the PLDs. (See Plake, Huff, & Reshetar, 2010, for more details.)
In the consortia and in other state-level assessments written to distinguish among performance levels, item writers tag each item with the claim, standard(s), and PLD to which it is written. Once the items are administered, the alignment between an item’s expected location on the score scale and its actual location can be examined. Intended and actual difficulty often diverge because of differential opportunity to learn or item characteristics unrelated to the key content measured. Test design and item development thus form a fluid process, as information from one year’s administration can inform the next year’s development.
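The alignment check described above can be sketched programmatically. The following is a minimal, hypothetical illustration (the item records, performance-level labels, and p-value bands are illustrative assumptions, not an operational rule): each item carries the performance level it was written to, and items whose observed difficulty falls outside the band expected for that level are flagged for review in the next development cycle.

```python
# Hypothetical sketch: flag items whose empirical difficulty diverges from
# the difficulty implied by the PLD they were written to.

# Each item carries its intended performance level and an empirical
# p-value (proportion of examinees answering correctly) from the
# operational administration. All values here are invented.
items = [
    {"id": "M01", "intended_level": "Proficient", "p_value": 0.62},
    {"id": "M02", "intended_level": "Advanced",   "p_value": 0.71},
    {"id": "M03", "intended_level": "Basic",      "p_value": 0.85},
]

# Assumed p-value bands per level (harder items -> lower p-values).
expected_band = {
    "Basic":      (0.70, 1.00),
    "Proficient": (0.45, 0.70),
    "Advanced":   (0.00, 0.45),
}

def flag_misaligned(items, bands):
    """Return ids of items whose observed p-value falls outside the
    band expected for their intended performance level."""
    flagged = []
    for item in items:
        lo, hi = bands[item["intended_level"]]
        if not (lo <= item["p_value"] < hi):
            flagged.append(item["id"])
    return flagged

# M02 was written to Advanced but proved easier than expected.
print(flag_misaligned(items, expected_band))  # -> ['M02']
```

In practice such screening would use an IRT-based item location rather than a raw p-value, but the feedback loop is the same: misaligned items inform revisions to task models and item-writing guidance for the next year.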
The challenge is having a learning model that is salient throughout the claims, evidence, task models, and PLDs, so that when item writers work with a particular task model, they can be confident that its features were purposefully selected to discriminate performance between, say, a Proficient and an Advanced student. What has been described in this chapter is a potential coherence among the targets of measurement (as embodied by the claims, evidence, and PLDs), the task models and items, and the desired scale characteristics (e.g., maximizing measurement information at the intended cut scores). That coherence is the goal of all good assessment design, but it is not always realized in practice without an explicit design model.
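The idea of maximizing measurement information at an intended cut score can be made concrete with a small sketch. Assuming a two-parameter logistic (2PL) IRT model (an assumption for illustration; the chapter does not commit to a model), the information an item contributes at ability theta is I(theta) = a² · P(theta) · (1 − P(theta)), which peaks where the item difficulty b sits near theta. The cut score and item parameters below are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta,
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information contributed by one 2PL item at theta:
    I(theta) = a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical cut score separating, say, Proficient from Advanced.
cut = 1.0

# Candidate items (name, a, b): information at the cut peaks when the
# item's difficulty b is located at the cut itself.
candidates = [("easy", 1.2, -0.5), ("on_cut", 1.2, 1.0), ("hard", 1.2, 2.5)]
for name, a, b in candidates:
    print(name, round(item_information(cut, a, b), 3))
# -> easy 0.175 / on_cut 0.36 / hard 0.175
```

The sketch shows why a design that deliberately places item difficulty near the intended cut scores yields a scale that measures most precisely exactly where classification decisions are made.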