Automatic Item Generation: An Augmented Intelligence Approach to Item Development
Researchers and practitioners require an efficient and cost-effective method for item development. The solution will not be found in the traditional approach because of its two inherent limitations; consequently, an alternative is needed. This alternative must support modern test design and delivery initiatives, which are used to evaluate large numbers of students who are educated in different educational systems and who speak different languages. One approach that can help address the growing need to produce large numbers of new test items efficiently and economically is automatic item generation (AIG; Gierl & Haladyna, 2013; Irvine & Kyllonen, 2002).
AIG is the process of using models to generate items using computer technology. It can be considered a form of augmented intelligence (Zheng et al., 2017). Augmented intelligence is an area within artificial intelligence that deals with how computer systems emulate and extend human cognitive abilities, thereby helping to improve human task performance. The interaction between a computer system and a human is required for the computer system to produce an output or solution. Augmented intelligence combines the strength of modern computing using computational analysis and data storage with the human capacity for judgment to solve complex unstructured problems. Augmented intelligence can, therefore, be characterized as any process or system that improves the human capacity for solving complex problems by relying on a partnership between a machine and a human (Pan, 2016).
AIG can be distinguished from traditional item development in two important ways. The first distinction relates to the definition of what constitutes an item. Our experience working with SMEs and other testing specialists has demonstrated that many different definitions and conceptualizations surround the word "item". If we turn to the educational testing literature, it is surprising to discover that the term "item" is rarely, if ever, defined. When a definition is offered, it tends to be a "black box", meaning that an input and output are presented with no description of the internal mechanism for transforming the input to the output. For example, Ebel (1951, p. 185), in his chapter titled "Writing the Test Item" in the first edition of the famous handbook Educational Measurement, offered an early description where an item was merely referred to as "a scoring unit". Osterlind (2010), also noting the infrequency with which the term "item" was defined in the literature, offered this definition:
A test item in an examination of mental attributes is a unit of measurement with a stimulus and a prescriptive form for answering; and, it is intended to yield a response from an examinee from which performance in some psychological construct (such as ability, predisposition, or trait) may be inferred.
One of the most recent definitions is provided in the latest edition of the Standards for Educational and Psychological Testing (2014). The Standards—prepared by the Joint Committee of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education—serves as the most comprehensive statement on best practices in educational and psychological testing that is currently available. The Standards defines the term "item" as "a statement, question, exercise, or task on a test for which the test taker is to select or construct a response, or perform a task" (p. 220). The authors of the Standards also direct the reader to the term "prompt", which is described as "the question, stimulus, or instruction that elicits a test taker's response" (p. 222). The definitions from Osterlind and the Standards share common characteristics. An item contains an input in the form of a statement, question, exercise, task, stimulus, or instruction that produces an output which is the examinee's response or performance. In addition, the output of the examinee is in a prescriptive form that is typically either selected or constructed. But no description of the internal workings or characteristics of the item is included. Haladyna and Rodriguez (2013), in their popular text Developing and Validating Test Items, offer a different take on the term by stating that "a test item is a device for obtaining information about a test taker's domain of knowledge and skills or a domain of tasks that define a construct" (p. 3). They claim that one of the most important distinctions for this term is whether the item is formatted as selected or constructed. To overcome the limitations of these definitions, we offer a definition of the term "item" that will be used to guide AIG in our book. An item is an explicit set of properties that include the parameters, constraints, and instructions used to elicit a response from the examinee.
Our definition specifies the contents in the black box, thereby overcoming the limitations in previous definitions by describing the input as a set of parameters, constraints, and instructions. In addition, we require that the input be represented in a way that can be replicated and evaluated. Replication is an important requirement because it means that the properties of the item are so explicit, detailed, and clear that the item can be independently reproduced. Evaluation is an important requirement because it means that the properties used to produce the item for addressing a specific purpose can be scrutinized. Our definition does not include a format requirement, and it does not specify the representation for the parameters, constraints, and instructions.
The second distinction relates to the workflow required to create an item. Item development is one piece within the much larger test development puzzle. Lane et al. (2016), for example, described 12 components of test development in their introductory chapter to the Handbook of Test Development (2nd edition). AIG occurs in the item writing and review subcomponents, which fall within the item development component (component 4 of 12). While not explicitly stated, item writing and review in Lane et al. are synonymous with the traditional approach, as described earlier in this chapter. The traditional approach relies on a method where the SME creates each item individually using an iterative process with highly trained groups of SMEs who produce new items, as well as review and revise existing items, until the items all meet specific standards of quality. Traditional item development relies heavily on the SMEs to identify, organize, and evaluate content using their knowledge, experience, and expertise. AIG, by way of comparison, uses an augmented intelligence workflow that combines the expertise of the SME with the power of modern computing to produce test items. AIG is characterized as a three-step process in which models are first created by SMEs, a template for the content is then specified by the SMEs, and, finally, the content is placed in the template using computer-based assembly. AIG can, therefore, be characterized as an augmented intelligence approach because large numbers of new items can be manufactured using the coordinated inputs created by humans with outputs produced from computers.
Gierl and Lai (2013) described a three-step workflow for generating test items. This workflow differs from the traditional approach to item development because it requires the coordinated efforts of humans and computers to create items. In step 1, the SME identifies the content that will be used to produce new items. This content is identified using a framework that highlights the knowledge, skills, and abilities required to solve problems in a specific domain. Gierl, Lai, and Turner (2012) called this framework a cognitive model for AIG. The cognitive model also organizes the cognitive- and content-specific information into a coherent whole, thereby presenting a succinct representation of how examinees think about and solve problems.
With the content identified in step 1, it must then be positioned within an item model in step 2. An item model (LaDuca, Staples, Templeton, & Holzman, 1986) is like a mould, template, or rendering of the assessment task that specifies which parts and which content in the task can be manipulated to create new test items. The parts include the stem, the options, and the auxiliary information. The stem contains the content or question the examinee is required to answer. The options include a set of alternative answers with one correct option and one or more incorrect options. The stem and correct option are generated for a constructed-response item. The stem, correct option, and incorrect options are generated for the selected-response item. Auxiliary information includes any material, such as graphs, tables, figures, or multimedia exhibits, that supplements the content presented in the stem and/or options. The content specified in the cognitive model highlights the knowledge, skills, and abilities required to solve problems in a specific domain. The item model in step 2 provides a template for the parts of an assessment task that can be manipulated using the cognitive model so that new items can be created.
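The stem-and-options structure of an item model can be made concrete with a short sketch. The example below is hypothetical and invented purely for illustration (a simple linear-equation item, not drawn from the models cited above): the stem is a template with manipulable parts, the correct option is computed from those parts, and the distractors are derived from plausible errors.

```python
# A minimal, hypothetical item model for a selected-response item.
# The stem is a template with manipulable elements (a, b, c); the
# options are one correct answer (the key) plus derived distractors.

def render_item(a, b, c):
    """Fill the item-model template with one set of element values."""
    stem = f"Solve for x: {a}x + {b} = {c}"
    key = (c - b) / a                 # correct option
    distractors = [                   # plausible incorrect options
        (c + b) / a,                  # sign error on b
        (c - b) * a,                  # multiplied instead of divided
        c - b,                        # forgot to divide by a
    ]
    return {"stem": stem, "key": key, "options": [key] + distractors}

item = render_item(a=2, b=3, c=11)
print(item["stem"])   # Solve for x: 2x + 3 = 11
print(item["key"])    # 4.0
```

Each distinct set of element values produces one rendered item, which is what makes the model a "mould" rather than a single fixed task.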
After the SME identifies the content in the cognitive model and creates the item model, the outcomes from steps 1 and 2 are combined to produce new items in step 3. This step focuses on item assembly using the instructions specified in the cognitive model. Assembly can be conducted manually by asking the SME to place the content from step 1 into the model created for step 2 (e.g., Pugh, De Champlain, Gierl, Lai, & Touchie, 2016). But a more efficient way to conduct the assembly step is with a computer-based assembly system because it is a complex combinatorial task. Different types of software have been written to assemble test items. For instance, Singley and Bennett (2002) introduced the Math Test Creation Assistant to generate items involving linear systems of equations. Higgins (2007) described Item Distiller, a tool that could be used to generate sentence-based test items. Gierl, Zhou, and Alves (2008) described software called IGOR (Item GeneratOR) designed to assemble test items by placing different combinations of elements specified in the cognitive model into the item model. While the details for test assembly may differ across these programs, the task remains the same: Combine the content from the cognitive models into specific parts of an item model to create new test items subject to rules and content constraints which serve as the instructions for the assembly task.
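The combinatorial nature of the assembly step can be illustrated with a small sketch. This is a hypothetical example, not the actual Math Test Creation Assistant, Item Distiller, or IGOR code: element values that a cognitive model might specify are combined exhaustively, a content constraint filters out inadmissible combinations, and each surviving combination is placed into the item-model template.

```python
from itertools import product

# Hypothetical element values a cognitive model might specify for the
# item-model template "Solve for x: {a}x + {b} = {c}".
elements = {"a": [2, 3, 4], "b": [1, 3, 5], "c": [7, 11, 13]}

def admissible(a, b, c):
    """Content constraint: require a whole-number answer greater than zero."""
    return (c - b) % a == 0 and (c - b) // a > 0

# Assembly: take the Cartesian product of the element values, apply the
# constraint, and fill the template for each surviving combination.
items = [
    f"Solve for x: {a}x + {b} = {c}"
    for a, b, c in product(elements["a"], elements["b"], elements["c"])
    if admissible(a, b, c)
]

print(len(items))   # number of generated items
print(items[0])
```

Even this toy model yields a batch of distinct items from 27 candidate combinations, which hints at why a computer-based assembly system scales so much better than manual placement when models contain many elements and constraints.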
Taken together, this three-step process serves as a workflow (see Figure 1.1) that can be used to systematically generate new items from a model of thinking, reasoning, and problem solving. It requires three steps where the data in each step are transformed from one state to another. We consider this workflow to be an item production system. The system is used to create a product that is consistent with our definition of "item". In step 1, the content required for item generation is identified and specified as a cognitive model. In step 2, the content is positioned in the item model: it is extracted from the cognitive model and placed as individual values in an item model. In step 3, the instructions for assembling the content are implemented: the individual values from the item model, as specified in the cognitive model, are assembled using rules to create new test items. Because each step is explicit, the input and outcome from the system can be replicated and evaluated. The importance of this workflow in the item development component described by Lane et al. (2016) is also quite clear: It can be used to generate hundreds or thousands of new test items.