Mark J. Gierl and Hollis Lai

The principles and practices that guide the design and development of test items are changing because our assessment practices are changing. Educational visionary Randy Bennett (2001) anticipated that computers and the Internet would become two of the most powerful forces of change in educational measurement. Bennett's prediction was spot-on. Internet-based computerized testing has dramatically changed educational measurement because new test administration procedures, combined with the growing popularity of digital media and the explosion in Internet use, have created the foundation for different types of tests and test items. As a result, many educational tests that were once given in a paper format are now administered by computer over the Internet. Many common and well-known exams in the domain of certification and licensure testing can be cited as examples, including the Graduate Management Admission Test (GMAT), the Graduate Record Exam (GRE), the Test of English as a Foreign Language (TOEFL iBT), the American Institute of Certified Public Accountants Uniform CPA Examination (CBT-e), the Medical Council of Canada Qualifying Exam Part I (MCCQE I), the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the National Council Licensure Examination for Practical Nurses (NCLEX-PN). This rapid transition to computerized testing is also occurring in K-12 education. As early as 2009, Education Week's "Technology Counts" reported that educators in more than half of the U.S. states (49 of the 50 states had educational achievement testing at that time) administered some form of computerized testing. The move toward the Common Core State Standards will only accelerate this transition, given that the two largest consortia, PARCC and Smarter Balanced, are using technology to develop and deliver computerized tests and to design constructed-response items and performance-based tasks that will be scored using computer algorithms.

Computerized testing offers many advantages to examinees and examiners compared to more traditional paper-based tests. For instance, computers support the development of technology-enhanced item types that allow examiners to use more diverse item formats and measure a broader range of knowledge and skills. Computer algorithms can also be developed so these new item types are scored automatically and with limited human intervention, thereby eliminating the need for costly and time-consuming marking and scoring sessions. Because items are scored immediately, examinees receive instant feedback on their strengths and weaknesses. Computerized tests also permit continuous and on-demand administration, thereby allowing examinees more choice about where and when they write their exams.

But the advent of computerized testing has also raised new challenges, particularly in the area of item development. Large numbers of items are needed to support the banks necessary for computerized testing when items are continuously administered and, therefore, exposed. As a result, banks must be frequently replenished to minimize item exposure and maintain test security. Breithaupt, Ariel and Hare (2010) claimed that a high-stakes 40-item computerized adaptive test, a commonly used administrative format for certification and licensure testing, with two administrations per year would require, at minimum, a bank of 2,000 items. The costs associated with developing banks of this size are substantial. For instance, Rudner (2010) estimated that the cost of developing one operational item using the traditional approach, in which content experts use test specifications to individually author each item, ranged from $1,500 to $2,500. If we combine the Breithaupt et al. (2010) bank size estimate with Rudner's cost-per-item estimate, then we can project that it would cost between $3,000,000 and $5,000,000 just to develop the item bank for a computerized adaptive test.
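The arithmetic behind this projection can be sketched in a few lines. The figures below come directly from the estimates cited above; the function name is illustrative.

```python
# Bank-cost projection: a 2,000-item bank (Breithaupt et al., 2010)
# priced at Rudner's (2010) estimate of $1,500 to $2,500 per item.

def bank_cost(bank_size, cost_low, cost_high):
    """Return the (low, high) projected cost of developing an item bank."""
    return bank_size * cost_low, bank_size * cost_high

low, high = bank_cost(2000, 1500, 2500)
print(f"${low:,} to ${high:,}")  # prints "$3,000,000 to $5,000,000"
```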

One way to address the challenge of creating more items is to hire large numbers of developers who can scale up the traditional, one-item-at-a-time content specialist approach to ensure more items are available. But we know this option is costly. An alternative method that may help address the growing need to produce large numbers of new testing tasks is automatic item generation (AIG). AIG (Embretson & Yang, 2007; Gierl & Haladyna, 2013; Irvine & Kyllonen, 2002) is an evolving research area in which cognitive and psychometric theories are used to produce tests that contain items created using computer technology. AIG, an idea described by Bormuth (1969) more than four decades ago, is gaining renewed interest because it addresses one of the most pressing and challenging issues facing educators today: the rapid and efficient production of high-quality, content-specific test items. This production is needed, in part, to support the current transition to computerized testing.

AIG has at least four important benefits for test developers. First, AIG permits the test developer to create a single item model that, in turn, yields many test items. An item model is a template that highlights the features in an assessment task that can be manipulated to produce new items. Multiple models can be developed that will yield hundreds or possibly thousands of new test items. These items are then used to populate item banks. Computerized tests draw on a sample of the items from the bank to create new tests.
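The template logic of an item model can be made concrete with a short sketch. The stem, element names and values below are hypothetical examples, not items from the source; the point is only that one model with a few variable elements yields many items.

```python
# A minimal, illustrative item model: fixed stem text plus variable
# elements that are systematically manipulated to generate new items.
from itertools import product

# Hypothetical model (not from the source): a simple dosage-calculation stem.
stem = ("A patient weighing {weight} kg requires {dose} mg/kg of a drug. "
        "What is the total dose in mg?")
elements = {
    "weight": [50, 60, 70, 80],
    "dose": [2, 4, 5],
}

def generate_items(stem, elements):
    """Instantiate the stem template for every combination of element values."""
    keys = list(elements)
    items = []
    for values in product(*(elements[k] for k in keys)):
        bindings = dict(zip(keys, values))
        items.append({
            "stem": stem.format(**bindings),
            "key": bindings["weight"] * bindings["dose"],  # correct answer
        })
    return items

items = generate_items(stem, elements)
print(len(items))  # prints 12: one model yields a dozen items
```

Scaling the same logic to more elements, wider value ranges and multiple models is what allows a small set of templates to populate a bank with hundreds or thousands of items.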

Second, AIG can lead to more cost-effective development because the item model is continually reused to yield many test items, compared with developing each item individually and, often, from scratch. In the process, costly yet common errors in item development (e.g., inadvertently including or excluding words, phrases or expressions, along with spelling, grammatical, punctuation, capitalization, typeface and formatting problems) can be avoided because only specific elements in the stem and options are manipulated across large numbers of items (Schmeiser & Welch, 2006). In other words, the item model serves as a template in which the test developer manipulates only specific, well-defined elements. The remaining elements are not altered during development. The view of an item model as a template with both fixed and variable elements contrasts with the more conventional view of a single item where every element is unique, both within and across items. Drasgow, Luecht and Bennett (2006, p. 473) provide this description of the traditional content specialist approach to item development:

The demand for large numbers of items is challenging to satisfy because the traditional approach to test development uses the item as the fundamental unit of currency. That is, each item is individually hand-crafted—written, reviewed, revised, edited, entered into a computer, and calibrated—as if no other like it had ever been created before.

Third, AIG treats the item model as the fundamental unit of currency, where a single model is used to generate many items, compared with a more traditional approach, where the item is treated as the unit of analysis, as noted by Drasgow et al. (2006). Hence, AIG is a scalable process because one item model can generate many test items, whereas with a more traditional approach each item is created individually. Because of this unit-of-analysis shift, the cost per item should decrease because test developers are producing models that yield multiple items rather than producing single, unique items. The item models can also be reused, particularly when only a small number of the generated items are used on a specific test form, which, again, could yield economic benefits.

Fourth, AIG may enhance test security. Security benefits could be realized when large numbers of items are available, simply by decreasing the per-item exposure rate. In other words, when item volume increases, item exposure decreases, even with continuous testing, because a large bank of operational items is available during test assembly. Security benefits can also be found within the generative logic of item development because the elements in an item model are constantly manipulated and, hence, varied, thereby making it difficult for examinees to memorize and reproduce items.
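The exposure argument can be illustrated with a simple back-of-the-envelope model. Assuming, for illustration only, that each administered form samples its items uniformly from the bank, the expected per-item exposure is just test length divided by bank size, so doubling the bank halves the exposure rate.

```python
# Illustrative exposure model (an assumption, not from the source):
# each form draws its items uniformly at random from the operational bank.

def expected_exposure(test_length, bank_size):
    """Probability that a given bank item appears on a randomly sampled form."""
    return test_length / bank_size

# A 40-item test drawn from a 2,000-item bank exposes each item on
# about 2% of forms; doubling the bank halves that rate.
print(expected_exposure(40, 2000))  # prints 0.02
print(expected_exposure(40, 4000))  # prints 0.01
```

Operational adaptive tests select items by difficulty rather than uniformly, so real exposure rates are uneven and are typically controlled by dedicated exposure-control methods; the sketch only shows why a larger generated bank lowers the baseline rate.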
