Generating Target Test Information Functions (TIF)
Using a TIF implies rather specific and intentional placement of prescribed amounts of measurement precision along the scale. The target can then be used to build test forms as described in van der Linden (this volume). In the present section, we show how appropriate target TIFs can be generated using some of the analytical IRT methods developed in the previous section. These methods have the benefit of allowing us to explore a wide variety of different TIF targeting scenarios to evaluate which ones feasibly meet our needs and keep the magnitude of decision errors within boundaries that we are willing to tolerate given the purpose of the test and other policy considerations.
Some Considerations for TIF Targeting
Measurement information needs to be viewed as a valuable commodity that can and should be placed where it is most needed. This is where engineering design comes into play. We can engineer a statistical test design using one of two targeting strategies. The first strategy is to explicitly specify the amount of measurement precision we desire along the proficiency scale—at the cut score and elsewhere along the scale. This absolute targeting approach typically uses one or more explicit target test information functions to define the core item selection demands. Absolute targeting is most commonly used for fixed forms. It can also be implemented with and for multistage test designs (Luecht, 2014; Luecht & Burgin, 2003; Luecht & Nungester, 1998; Zenisky, Hambleton & Luecht, 2010). Absolute targeting is the primary strategy discussed in this chapter. A relative targeting strategy attempts to maximize the measurement precision until some stopping rule is reached (e.g., a fixed test length or a prescribed level of decision accuracy). Under a relative targeting strategy, examinees might even be administered different-length tests. These two targeting strategies can be extended to a variety of different fixed and computerized adaptive test delivery models and test designs (e.g., Luecht, 2005).
There are two types of targeting strategies: (1) absolute targeting and (2) relative targeting (van der Linden, 2005; also see van der Linden, this volume). Absolute targeting employs a specific TIF as the primary statistical specification for building parallel test forms, all of which ideally will match the target TIF. Relative targeting maximizes a TIF at a specific θ or minimizes the decision errors at a fixed score relative to the items or assessment tasks available in the item bank. Most of the methods presented in this chapter will create absolute targets.
There are various ways to generate absolute target test information functions (Kelderman, 1987; Luecht, 1992). One rather obvious option is to select a previously administered test form and use its TIF as the target for generating all future forms. However, that strategy can ignore various quality limitations of the prior forms and the item bank. For example, what if the item bank is not as discriminating as we would like? What if the average item difficulty of the prior form is off-target from the mastery cut score? What if a better target could be devised? If the usual assumptions of IRT modeling hold—that is, local independence of the response functions and unidimensionality—and if we successfully calibrate all of the items in the bank to a common metric, there is certainly no technical reason to limit ourselves to continually reusing a substandard target TIF, especially if the item bank will support the construction of improved test forms.
Relative targeting is often used in computer-adaptive mastery testing, such as the NCLEX (National Council of State Boards of Nursing), which selects items to maximize information at the examinee’s apparent proficiency. Another example is sequential mastery testing (Lewis & Sheehan, 1990; Luecht & Nungester, 1998; Spray & Reckase, 1996; Vos & Glas, 2010; Wald, 1947), which typically administers items or item modules until a prescribed level of decision accuracy is obtained. There is no explicit target. Rather, we choose the items to be maximally informative either at a fixed cut score or at a provisional estimate of the examinee’s score. Because of the tendency of these decision theoretic optimization algorithms to choose the same test items, most relative targeting strategies must be combined with item exposure controls based on simple random sampling or conditional randomization mechanisms to better balance the exposure risks across an entire item bank.
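One common way to combine maximum-information selection with a randomization-based exposure control is the "randomesque" approach: at each step, choose at random among the k most informative remaining items rather than always taking the single best item. The sketch below is a simplified illustration of that idea using 2PL item information at a fixed cut score; it is not the NCLEX algorithm or any specific operational procedure.

```python
import math
import random

def info_2pl(theta, a, b, D=1.7):
    """2PL item information at theta."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def randomesque_select(bank, theta_cut, n_items, k=5, seed=0):
    """Relative targeting with a simple exposure control: at each step,
    pick at random among the k most informative remaining items."""
    rng = random.Random(seed)
    remaining = list(bank)
    chosen = []
    for _ in range(n_items):
        remaining.sort(key=lambda it: info_2pl(theta_cut, *it), reverse=True)
        pick = rng.choice(remaining[:k])
        remaining.remove(pick)
        chosen.append(pick)
    return chosen
```

Larger values of k spread exposure across more of the bank at the cost of some information at the cut score, which is exactly the trade-off exposure controls are meant to manage.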
A hybrid targeting strategy is to use a relative targeting mechanism to construct an absolute target TIF. To implement this strategy, we mimic a computerized adaptive test (CAT) and draw a fixed number of test forms without replacement from an item bank to provide maximum item information at one or more values of θ. The first test forms will typically have the choice of the statistically best items. Due to the "no replacement" rule, forms constructed later in the sequence may have less information at the specified score points. Nonetheless, we can then average the resulting TIFs across all of the forms to get the target. The number of forms drawn should approximate the number of unique forms we might expect to create in practice. The rationale for this approach is to generate a TIF that will provide maximum precision relative to the item bank. Van der Linden (this volume) introduced a somewhat more elegant relative strategy using automated test assembly to directly build the forms relative to the maximum information in an item bank and avoid altogether the use of absolute target TIFs.
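The hybrid procedure can be sketched in a few lines. The version below is a simplified, greedy illustration (not the author's operational implementation): each simulated form takes the remaining items with the most summed information at the target θ points, and the target TIF is the average of the resulting form TIFs. The item parameters and tuning values are assumptions for the example.

```python
import math

def info_2pl(theta, a, b, D=1.7):
    """2PL item information at theta."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def hybrid_target(bank, theta_points, n_forms, form_len):
    """Draw n_forms forms without replacement, each greedily maximizing
    summed information at the target theta points, then average the
    form TIFs at each theta point to produce the absolute target."""
    remaining = list(bank)
    form_tifs = []
    for _ in range(n_forms):
        remaining.sort(key=lambda it: sum(info_2pl(t, *it) for t in theta_points),
                       reverse=True)
        form, remaining = remaining[:form_len], remaining[form_len:]
        form_tifs.append({t: sum(info_2pl(t, *it) for it in form)
                          for t in theta_points})
    return {t: sum(f[t] for f in form_tifs) / n_forms for t in theta_points}

# Hypothetical bank of 30 items; target information near a cut at theta = 0
bank = [(0.4 + 0.05 * i, -1.5 + 0.1 * i) for i in range(30)]
target = hybrid_target(bank, [0.0], n_forms=3, form_len=8)
```

Because the first forms skim off the best items, the averaged target sits below what a single maximally informative form could achieve, which is the point: it is a target the bank can sustain across all of the forms expected in practice.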
Of course, there are also many situations where a TIF or similar statistical test specification simply does not exist. For a start-up testing program, or when an improved TIF-targeting strategy is being entertained, we cannot always depend on the existing item bank or on the characteristics of previously administered test forms. The methods described below can be applied in those situations.
If we do not have or want to use an existing test form, how can we go about getting a target TIF to achieve a particular degree of decision accuracy? There is a direct relationship between test information functions (TIFs) and decision accuracy on a mastery test. If the value of the TIF increases in the neighborhood of a cut score, θcut, decision accuracy also goes up. For example, a test with I(θcut) = 10 will produce more accurate scores than a test with I(θcut) = 7, regardless of the test length. It would therefore seem to make good psychometric sense to always select for a mastery test the items that provide the most information at the cut score. However, that statistically motivated strategy ignores some important item overexposure and related security risks for testing over time (e.g., continually reusing a relatively small number of the most discriminating items in the item bank overexposes those items to examinee collaboration networks intent on memorizing and sharing as much of the item bank as possible). There is also inherent waste in terms of item production if large segments of an item bank—items otherwise properly vetted for quality using other criteria—are underutilized in building mastery tests based solely on their limited contribution to measurement precision at the cut score. Finally, the role of content and other nonstatistical test specifications, or other demands for measurement information elsewhere than at the cut score (e.g., having multiple cut scores, wanting to report the most reliable scores possible, in addition to making a mastery decision), can compete during item selection with the goal of maximizing measurement precision at the cut score.
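The link between I(θcut) and decision accuracy runs through the conditional standard error of measurement, SEM(θ) = 1/√I(θ). Assuming the proficiency estimate is approximately normally distributed around the true θ with standard deviation SEM, the probability of a correct master/nonmaster classification can be computed directly, as in this sketch (the normality assumption and the particular θ values are illustrative):

```python
import math

def classification_accuracy(theta_true, theta_cut, tif_at_cut):
    """Probability of a correct master/nonmaster decision for an examinee
    at theta_true, assuming theta_hat ~ N(theta_true, SEM^2) with
    SEM = 1 / sqrt(TIF at the cut score)."""
    sem = 1.0 / math.sqrt(tif_at_cut)
    z = (theta_true - theta_cut) / sem
    p_classified_master = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    if theta_true >= theta_cut:
        return p_classified_master        # true master correctly passed
    return 1.0 - p_classified_master      # true nonmaster correctly failed

# An examinee slightly above the cut: more information -> higher accuracy
acc_10 = classification_accuracy(0.2, 0.0, 10.0)
acc_7 = classification_accuracy(0.2, 0.0, 7.0)
```

Running the comparison confirms the claim in the text: the I(θcut) = 10 test yields a higher probability of a correct decision than the I(θcut) = 7 test for examinees near the cut score, and the advantage shrinks as true proficiency moves away from the cut.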
Of course, conspicuously missing from this discussion of measurement information targeting strategies is test content and the role of subject-matter expert (SME) committees in the test construction process. Fortunately, our test designs can incorporate whatever content-based test specifications we care to include (e.g., item counts meeting various frequency distribution requirements). Those types of specifications can be readily incorporated in the test assembly process (see the section on test assembly). Therefore, while certainly not intentionally trivializing the role of content in test design and assembly, the focus in this chapter is on the psychometric aspects of test design and the associated impact on decision accuracy.