Item Development

Test developers are responsible for designing tests and items that are accessible to all test takers, using principles such as universal design (AERA et al., 2014; Elliott & Kettler, this volume). Universal design strives “to minimize access challenges by taking into account test characteristics that may impede access to the construct for certain test takers, such as the choice of content, test tasks, response procedures, and testing procedures” (AERA et al., 2014, p. 58). Fairness begins with the design of items and is verified through review procedures once they have been developed. The chapters by Abedi (this volume) and Elliott and Kettler (this volume) discuss test and item development procedures that give English language learners and students with disabilities, respectively, better access to the construct being measured.

The item formats to be used are specified during the overall planning of the test and depend on the domain to be measured and the claims to be made on the basis of test scores. While many KSAs can be measured using selected-response items, the measurement of other KSAs requires the use of constructed-response items and performance tasks to sample the domain with sufficient fidelity. The challenge for the test developer is to select an item format that faithfully measures the construct of interest.

Selected-response items are used in most testing programs, and for good reason: They are efficient. That is, they permit the reliable measurement of a broad range of cognitive skills in a limited amount of time—assuming that they are carefully developed. The chapter by Rodriguez (this volume) identifies several common selected-response formats and offers guidelines for ensuring that test items measure the intended KSAs while minimizing factors irrelevant to the construct of interest (e.g., test-wiseness). With more tests being administered by computer, variants of multiple-choice item types are becoming more prominent. For example, the chapter by Sireci and Zenisky (this volume) describes item formats that require examinees to select text in a reading passage in response to a question, and then complete a summary of the passage by selecting additional text from a pool of sentences. In many cases these variants measure aspects of the domain that are not easily measured by typical multiple-choice items. An attractive feature of these items is that they can be objectively scored. Other variants of multiple-choice item types that can be delivered on paper or online include items that ask students to choose more than one response.

Short constructed-response items that can be administered and scored relatively easily by the computer are also used by some testing programs. Such items may require students to provide a numeric value for a mathematical problem or a brief written response. There are occasions for which selected-response items and short constructed-response items will not reproduce the tasks required to elicit the constructs of interest with sufficient fidelity. In such instances testing programs may decide to develop performance tasks. Performance assessments used by both educational and credentialing programs may range from students explaining their mathematical reasoning on a constructed-response item to case-based simulations in which examinees respond to a “live” or “computerized standardized” patient. As indicated in the Standards (AERA et al., 2014), “Performance assessments require examinees to demonstrate the ability to perform tasks that are often complex in nature and generally require the test takers to demonstrate their abilities or skills in settings that closely resemble real-life situations” (p. 77). Computer administration of some performance tasks is attractive because of capabilities that, for example, allow students to construct a graph or write a response to a question that requires them to integrate a number of texts or other information stored online.

For yet other occasions, more complex performance tasks may be required to ensure that the construct of interest is adequately represented. The chapter by Lane and Iwatani (this volume) discusses the development of performance tasks in achievement testing, including writing samples, scientific inquiry items, and other types of tasks that allow students to demonstrate not only what they know but also what they can do. The chapter by Swygert and Williamson (this volume) describes performance testing in credentialing, where the tasks oftentimes simulate or sample the types of tasks encountered in the workplace. Both chapters offer recommendations on matters such as rater training and strategies for determining the number of tasks to sample. For items that require scoring rubrics, the design of the scoring rubrics should coincide with the design of the items.

The rationale for the choice of item formats is a critical source of validity evidence and should include both theoretical and empirical support. In practice, the choice among selected-response items, short constructed-response items, and performance tasks, including extended constructed-response items, will also be influenced by practical considerations, such as scoring costs and the amount of time allocated for test administration.

 