Relationship between Testing Time and Testing Outcomes

In many testing situations, the primary reasons for imposing time limits are administrative convenience and reduced cost. In these cases, whether test takers can work quickly as well as accurately is not assumed to be part of the construct being assessed, and the impact of the time limit on scores would be considered a source of construct-irrelevant variance. Alternatively, if the ability to work both quickly and accurately is part of the construct being assessed (e.g., on an educational or psychological test), time limits can and should be expected to affect test scores. Descriptions of the constructs to be assessed by high-stakes admissions tests such as the SAT, the ACT, and the Graduate Record Examinations General Test (GRE) suggest that speed is at best a minimal part of the construct being assessed. The claim for the current version of the SAT Reading Test is as follows:

The redesigned SAT’s Reading Test is intended to collect evidence in support of a broad claim about student performance: Students can demonstrate college and career readiness proficiency in reading and comprehending a broad range of high-quality, appropriately challenging literary and informational texts in the content areas of U.S. and world literature, history/social studies, and science (College Board, 2018a, p. 41).

For the SAT Math Test, the claim is:

The redesigned SAT’s Math Test is intended to collect evidence in support of the following claim about student performance: Students have fluency with, understanding of, and the ability to apply the mathematical concepts, skills, and practices that are most strongly prerequisite and central to their ability to progress through a range of college courses, career training, and career opportunities (College Board, 2018a, p. 132).

Although the word “fluency” in the Math claim might suggest a speed component, it is not addressed in any of the subsequent descriptions.

Similarly, the ACT Technical Manual says very little about the importance of rapid responding, but in the section on item tryouts it indicates that “The time limits for the tryout units permit the majority of students to respond to all items” (ACT, 2019). This suggests that speed of responding is not part of the intended construct. A description of the version of the GRE introduced in 2011 notes that “… it is specified that no test section should be delivered under speeded conditions” (Robin & Steffen, 2014). Given that (1) speed is not part of the intended construct for these high-stakes admissions tests but that (2) the tests have time limits such that at least some students struggle to finish within the allotted time, it is critical for test publishers to establish the effects of the time limits on item statistics and, most crucially, on students’ scores.
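
To make the completion criterion concrete, the following minimal sketch (in Python, with an assumed response-matrix layout in which trailing not-reached items are coded as missing) computes two conventional completion statistics: the percent of examinees who reach the last item and the percent who attempt at least 75% of the items. These are the quantities behind the classic rule of thumb that a test is effectively unspeeded when virtually all examinees complete 75% of the items and at least 80% reach the final item; the function name and coding conventions here are illustrative, not any publisher's actual procedure.

```python
import numpy as np

def completion_statistics(responses):
    """Summarize completion from an (examinees x items) scored matrix.

    np.nan marks an item the examinee never reached (trailing omits);
    items skipped mid-test are assumed to have been attempted.
    """
    n_examinees, n_items = responses.shape
    attempted = ~np.isnan(responses)
    # Index of the last attempted item for each examinee (-1 if none).
    last = np.where(attempted.any(axis=1),
                    n_items - 1 - np.argmax(attempted[:, ::-1], axis=1),
                    -1)
    pct_reaching_last = 100.0 * np.mean(last == n_items - 1)
    pct_attempting_75 = 100.0 * np.mean((last + 1) >= 0.75 * n_items)
    return pct_reaching_last, pct_attempting_75

# Toy example: 5 examinees, 8 items; nan = not reached.
resp = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0, np.nan, np.nan],
    [0, 1, 1, 0, 1, 1, 1, 1],
    [1, 0, 0, 1, np.nan, np.nan, np.nan, np.nan],
    [1, 1, 1, 0, 1, 1, 0, 1],
])
print(completion_statistics(resp))  # -> (60.0, 80.0)
```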

This chapter focuses primarily on the effects of time limits on admissions tests and K-12 accountability assessments in which rapid responding is not part of the construct that the test is intended to measure. Different considerations apply to licensing tests, in which speed of responding can be a legitimate part of the construct (e.g., would you want to give a pilot’s license to someone who came up with the appropriate action in an emergency situation only after taking a few minutes to respond?), and such tests are not covered here. The complex issues in a medical licensing context also are not addressed here, but see Chapter 6 for a discussion of the relevant issues in that context. Similarly, speededness concerns for essay tests are discussed in Chapter 7.

The organization of this chapter is essentially chronological. It begins in the 1940s, when completion statistics were the primary tool for addressing speededness; moves on to the 1970s, when concerns about group fairness on speeded tests became a major issue; and then turns to the new speededness issues that emerged with the introduction of computerized adaptive tests (CATs). Issues with the speededness of state accountability assessments are then briefly discussed. Finally, some suggestions for future research efforts are presented.

Early Research on the Impact of Time Limits on Scores

Research from the 1940s and 1950s

Educational Testing Service (ETS) was founded in 1947, and one of its first research reports was entitled An Experimental Study of the Effects on Item-Analysis Data of Changing Item Placement and Test Time Limit (Mollenkopf, 1949). The study reached the wholly unremarkable conclusion that items in speeded tests are more difficult (i.e., have a lower proportion correct) than the same items in unspeeded administrations. Specifically, the author concluded, “The proportion right of those attempting the item, the Delta index, and the biserial r were all found to have undesirable characteristics for items appearing late in a speeded test.” Although this conclusion is obvious today, such a study was likely important in the early days of large-scale testing; at that time, some may have believed that parameters such as item difficulty were inherent properties of items in specific populations rather than dependent on where in the test an item was placed. The issue of parameter determination and item placement remains equally relevant today, because incorrect specification of item difficulty can affect final examinee scores.
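
For readers who want the statistics Mollenkopf examined in computable form, the sketch below calculates, for a single dichotomous item, the proportion correct among those attempting it, the ETS delta index (the conventional 13 + 4z normal-deviate transformation of difficulty, under which harder items receive higher deltas), and the biserial correlation with a criterion score. This is a generic illustration of the standard formulas, not a reconstruction of Mollenkopf's analysis.

```python
import numpy as np
from scipy.stats import norm

def item_statistics(item, criterion):
    """Classical statistics for one dichotomous item.

    item      : 0/1 scores of the examinees who attempted the item
    criterion : continuous criterion (e.g., total score) for the same examinees
    Returns (proportion correct, ETS delta, biserial r).
    """
    item = np.asarray(item, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    p = item.mean()                      # proportion correct among those attempting
    q = 1.0 - p
    delta = 13.0 + 4.0 * norm.ppf(q)     # ETS delta: p = .50 maps to 13
    # Biserial r: mean-difference form, using the normal ordinate at the
    # point that cuts the distribution into proportions p and q.
    m_p = criterion[item == 1].mean()
    m_q = criterion[item == 0].mean()
    y = norm.pdf(norm.ppf(p))
    r_bis = (m_p - m_q) / criterion.std() * (p * q / y)
    return p, delta, r_bis

# Toy data: ability drives both the item response and the criterion.
rng = np.random.default_rng(0)
theta = rng.normal(size=500)
item = (theta + rng.normal(size=500) > 0.5).astype(int)
criterion = 5.0 * theta + 25.0 + rng.normal(size=500)
print(item_statistics(item, criterion))
```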

Another relatively early ETS research report (Lord, 1954) explored the impact of speed factors on test validity. Tests of vocabulary, spatial relations, and arithmetic reasoning were administered to students at the U.S. Naval Academy; the battery included speeded and unspeeded, but otherwise parallel, versions of each test. The unspeeded vocabulary test included 15 items in 7 minutes, while the most highly speeded version had 75 items in 5 minutes. The percent of examinees finishing was 97% for the unspeeded version and 2% for the speeded version, but Lord pointed out that those finishing the speeded version likely included rapid random guessers. (Note that with modern computer administrations, such rapid random responding can be tracked accurately; this technology was not available in 1954.) The speeded and unspeeded tests were analyzed together in a maximum-likelihood factor analysis, and ten factors were extracted. Most factors did not have a speed component (e.g., the unspeeded tests of verbal reasoning and mathematical reasoning), but factors containing the speeded verbal tests and the speeded perceptual tests were identified. Small positive correlations were found between the speed factors and grades, suggesting that speededness could produce a small improvement in the prediction of Naval Academy grades.
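
Lord could only speculate about rapid random guessing; with item-level response times from computer delivery, such behavior can be flagged directly. The sketch below applies one common heuristic from the response-time literature, flagging any response faster than a fixed fraction of the item's typical (median) time, with a cap on the threshold. The specific constants and data layout are illustrative assumptions rather than a standard.

```python
import numpy as np

def flag_rapid_guesses(rt, frac=0.10, cap=10.0):
    """Flag likely rapid guesses in an (examinees x items) matrix of
    response times in seconds.

    A response is flagged when it is faster than `frac` of the item's
    median response time, with the per-item threshold capped at `cap`
    seconds. The constants are illustrative, not a standard.
    """
    rt = np.asarray(rt, dtype=float)
    thresholds = np.minimum(frac * np.nanmedian(rt, axis=0), cap)
    return rt < thresholds               # boolean matrix of flags

rt = np.array([[42.0, 35.0, 58.0],
               [ 2.1, 40.0, 61.0],      # first response looks like a rapid guess
               [39.0, 33.0,  1.4]])
print(flag_rapid_guesses(rt).sum(axis=1))  # rapid-guess count per examinee -> [0 1 1]
```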

 