The Problem of Scaling Item Development
Educational testing is undergoing a noteworthy period of transition: computer-based testing (CBT) is replacing paper-based testing, thereby creating the foundation for the widespread use of technology-based systems. Computer delivery systems are being used to implement test designs that permit educators to collect information that supports both formative and summative inferences as students acquire 21st-century skills. These developments are unfolding on a global stage, which means that educational testing is expanding to accommodate students who speak different languages as they become educated in diverse cultures, geographic regions, and economic systems.
But these transitions are also accompanied by formidable new challenges, particularly in the area of item development. To implement CBT, educators must have access to large numbers of diverse, multilingual, high-quality test items, because the items are used to produce tests that serve multiple purposes and cater to large numbers of students who speak many different languages. Thousands or, possibly, millions of new items are therefore needed to develop the banks necessary for CBT so that testing can be conducted in these different testing conditions and across these diverse educational environments. A bank is a repository of test items, which includes both the individual items and data about their characteristics. These banks must be developed initially from scratch and then replenished constantly to ensure that examinees receive a continuous supply of new items during each test administration. Test items, as they are currently created, are time-consuming and expensive to develop because each individual item is written by a subject matter expert (SME; also called test developer, content specialist, or item writer). Hence, item development can easily be identified as one of the most important problems that must be solved before we can migrate to a modern testing system capable of serving different purposes, like formative and summative assessment, and suitable for a large and diverse population composed of students from different cultural and linguistic groups.
As of today, these large, content-specific, multilingual item banks are not available. Moreover, the means by which large numbers of new items can be quickly developed to satisfy these complex banking requirements is unclear (Karthikeyan, O'Connor, & Hu, 2019). The traditional approach used to create the content for these banks relies on a method in which the SME creates items individually. Under the best condition, traditional item development is an iterative process where highly trained groups of SMEs use their experiences and expertise to produce new items. Then, after these new items are created, they are edited, reviewed, and revised by another group of highly trained SMEs until they meet the appropriate standard of quality (Lane, Raymond, Haladyna, & Downing, 2016). Under what is likely the more common condition—particularly in classroom assessment at both the K-12 and post-secondary education levels—traditional item development is a solitary process where one SME with limited training uses her or his experiences to produce new test items, and these items, in turn, are administered to examinees with little, if any, additional review or refinement. In both conditions, the SME bears significant responsibility for identifying, organizing, and evaluating the content required for this complex and creative process. Item development is also a subjective practice because an item is an expression of the SME's understanding of knowledge and skill within a specific content area. This expression is distinctive for each SME and, as a result, each item is unique. For this reason, traditional item development has often been described as an "art" because it relies on the knowledge, experience, and insight of the SME to produce unique test items (Schmeiser & Welch, 2006).
But the traditional approach to item development has two noteworthy limitations. First, item development is time-consuming and expensive because it relies on the item as the unit of analysis (Drasgow et al., 2006). Each item in the process is unique, and, therefore, each item must be individually written and, under the best condition, edited, reviewed, and revised. Many different components of item quality can be identified. Item quality can focus on content. For example, is the content in the item appropriate for measuring specific outcomes on the test? Item quality can focus on logic. For example, is the logic in the item appropriate for measuring the knowledge and skills required by examinees to solve problems in a specific domain? Item quality can also focus on presentation. For example, is the item presented as a task that is grammatically and linguistically accurate? Because each element in an item is unique, each component of item quality must be reviewed and, if necessary, revised. This view of an item where every element is unique, both within and across items, was highlighted by Drasgow et al. (2006, p. 473) when they stated,
The demand for large numbers of items is challenging to satisfy because the traditional approach to test development uses the item as the fundamental unit of currency. That is, each item is individually hand-crafted—written, reviewed, revised, edited, entered into a computer, and calibrated—as if no other like it had ever been created before.
In high-stakes testing situations, writing and reviewing are conducted by highly trained SMEs using a comprehensive development and evaluation process. As a result, the traditional approach to item development is expensive. Rudner (2010) estimated that the cost of developing one operational item for a high-stakes test using the traditional approach ranged from US$1,500 to $2,500.
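To make the scale of this expense concrete, the per-item estimate can be multiplied out for banks of different sizes. The following is a minimal sketch, not a costing model: the two cost constants come from the Rudner (2010) range cited above, while the function name `bank_cost_range` and the bank sizes of 100, 1,000, and 10,000 items are assumptions chosen only for illustration.

```python
# Per-item cost range for a high-stakes test, in US dollars (Rudner, 2010).
COST_LOW = 1_500
COST_HIGH = 2_500


def bank_cost_range(n_items: int) -> tuple[int, int]:
    """Return the (low, high) cost of writing n_items individually.

    Hypothetical helper: it assumes every item costs the same and that
    cost grows linearly with the number of items, which is the situation
    when each item is written one at a time.
    """
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    return n_items * COST_LOW, n_items * COST_HIGH


# Illustrative bank sizes only; they are not taken from the text.
for n in (100, 1_000, 10_000):
    low, high = bank_cost_range(n)
    print(f"{n:>6,} items: US${low:,} to US${high:,}")
```

Under these assumptions, a bank of 1,000 items would cost between US$1,500,000 and US$2,500,000 to write, and a bank of 10,000 items would cost ten times that amount.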
Second, the traditional approach to item development is challenging to scale efficiently and economically. The scalability of the traditional approach is linked, again, to the item as the unit of analysis. When one item is required, one item is written by the SME because each item is unique. When 100 items are required, 100 items must be written by the SMEs. Hence, large numbers of SMEs who can write unique items are needed to scale the process. Using a traditional approach can result in an increase in item production when large numbers of SMEs are available. But item development is a time-consuming process due to the human effort needed to create large numbers of new items. As a result, it is challenging to meet the content demands of modern testing systems using the traditional approach because it is not easily scaled.