Time Limits as a Source of Systematic and Random Errors
Tests used to make high-stakes decisions tend to be highly standardized in the kinds of tasks included, in the response modes, in the contexts in which the test is taken, and in the rubrics and procedures used for scoring. They also are administered under fixed time limits. Standardization promotes fairness and the appearance of fairness by subjecting all test takers to essentially the same challenges (Porter, 2003), and it helps to control errors of measurement by eliminating the irrelevant variability in scores that would result if different test takers had to perform under different conditions. That being said, standardization also introduces various kinds of systematic error, because even fixed testing conditions can have differential effects on test takers’ performance (Kane, 1982).
If the time limits are tight enough that some test takers do not have time to complete tasks that they otherwise could complete successfully, the competence level of these test takers will be systematically underestimated. Other test takers may not suffer any disadvantage from the time limits, because they tend to work fast (or because they have practiced working quickly in preparing for the test). Chapter 5 provides a particularly interesting review of research on the effects of speededness on test scores and concludes that the impacts of time limits depend in complicated ways on the context, on the severity of the time limits, and on the test formats, and therefore that simple generalizations about the impact of time limits are not possible. Interestingly, the author of that chapter notes that it is not necessarily the lowest-scoring or the highest-scoring test takers who are most affected by time limits.
I will refer to the differences between hypothetical unlimited-time scores and the corresponding time-limited scores as time-limit errors (TLEs). These TLEs do not generally have a zero mean, and to the extent that speed is an enduring characteristic, the TLEs can be correlated across test forms; they therefore are systematic errors rather than random errors (by definition, random errors have a mean of zero and are uncorrelated with each other and with other variables).
The average TLE for a population, the TLEp, is a general systematic error for the time limit, the test, and the population. The average TLE for a group (e.g., racial or ethnic groups, students with a disability, gender) is a group-level systematic error, or TLEg. To the extent that it is consistent across test administrations, the TLE for an individual test taker, the TLEi, is a specific systematic error. As a practical matter, it is not generally possible to estimate TLEs for individual test takers. To the extent that an individual’s speed in completing test tasks varies from one test administration to another, the errors associated with the speed of performance would be considered random (e.g., in the context of test-retest reliability), while more stable differences in speed would be systematic. Systematic errors tend to be more serious than random errors, because they do not cancel out over replications of the assessment.
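The decomposition just described can be sketched numerically. A minimal illustration in Python (the group labels and TLE values are invented for the example):

```python
# Hypothetical TLEs (score points lost to the time limit) for members of two groups.
tle_by_group = {
    "group_a": [4.0, 2.0, 3.0, 5.0],
    "group_b": [1.0, 0.0, 2.0, 1.0],
}

all_tles = [t for vals in tle_by_group.values() for t in vals]

# Population-level systematic error: the average TLE over everyone.
tle_p = sum(all_tles) / len(all_tles)

# Group-level systematic errors: the average TLE within each group.
tle_g = {name: sum(vals) / len(vals) for name, vals in tle_by_group.items()}
```

Here the groups differ in their average TLE (a group-level systematic error) even though each individual's TLE also varies around the group mean.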
As the comment following Standard 4.14 (AERA, APA, & NCME, 2014) suggests:
... When speed is not a meaningful part of the target construct, time limits should be determined so that examinees will have adequate time to demonstrate the targeted knowledge and skill. (p. 90)
We want the time limits to be loose enough that they do not interfere too much with the intended interpretation and use of the scores.
The impacts of these systematic errors tend to depend on a number of factors, including how tight or loose the limits are, the content and task types in the assessment, the assessment design, the population being assessed, and test takers’ levels of motivation and test preparation. Further, in Chapter 8, Camara and Harris conclude that the modes and devices in technologically supported tests can have substantially different time requirements. So the impact of time limits will generally need to be evaluated separately for each testing program (see Chapters 5 and 8).
Impact of Population-Level Time-Limit Errors (TLEps) on Validity
The TLEp is a general systematic error estimated as the average value of the TLE over the population and is taken to have the same value for all members of the population. For norm-referenced interpretations, a test taker’s score is interpreted in terms of how it compares to the distribution of scores in some population, or equivalently, in terms of the differences between the test taker’s score and the scores of other members of the population. The average time-limit error for the population, TLEp, is irrelevant for norm-referenced interpretations, because it has the same effect on all scores. If we subtract some number of points from everyone’s score, the difference between any two scores remains the same. Note that there would be a problem if the time limits were changed and scores were compared across administrations with different time limits, but as long as time limits and the time-limit effect for the population are the same, the TLEp would not interfere with norm-referenced interpretations.
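This invariance is easy to verify: subtracting the same constant from every score leaves all pairwise differences, and hence all norm-referenced comparisons, untouched. A minimal sketch with invented scores:

```python
def pairwise_diffs(xs):
    """Differences between adjacent scores (the basis of norm-referenced comparison)."""
    return [b - a for a, b in zip(xs, xs[1:])]

scores = [50.0, 62.0, 71.0]   # hypothetical observed scores
tle_p = 3.0                   # the same systematic error for every test taker
shifted = [s - tle_p for s in scores]
```

The differences among the shifted scores are identical to the differences among the originals, so relative standing is unaffected.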
For criterion-referenced interpretations, a test taker’s score is interpreted in absolute terms as indicating some level of performance (against some performance criteria, defined, for example, as a learning progression or as performance benchmarks). The TLEp is a general systematic error that reflects the average decrease in score levels associated with the time limit on the test.
If the test scores are used to make pass/fail decisions by comparing scores to a fixed passing score (or cut score), the TLEp tends to cause fewer test takers to pass by depressing the average score for the population. If the magnitude of the TLEp were known, the observed scores could be adjusted (increased) or the passing score could be adjusted (decreased) to correct for the TLEp. More generally, if we have multiple cut scores, the TLEp effect could be mitigated by adjusting either the scores or the cut scores if the TLEp were known with sufficient confidence and precision (see Chapter 8).
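If an estimate of the TLEp is trusted, the correction can go either way: raise every score by the TLEp, or lower the cut score by the same amount. A sketch with invented numbers:

```python
scores = [58, 60, 63, 67, 70]   # hypothetical observed (time-limited) scores
cut = 65                        # fixed passing score
tle_p = 3                       # assumed known average time-limit error

passed_raw = sum(s >= cut for s in scores)           # no correction applied
passed_adj = sum(s >= cut - tle_p for s in scores)   # cut score lowered by TLEp
# Equivalently: sum(s + tle_p >= cut for s in scores) raises the scores instead.
```

With the cut score lowered, a test taker whose observed score fell just short recovers the points attributable to the time limit.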
The TLEp can be estimated directly, for example, by having a sample of test takers complete the assessment under both standard and substantially extended time conditions (with counterbalancing to control for order effects) and then comparing their performances. In this single-group design, the average value of the differences between these two scores would provide a reasonable estimate of the TLEp (see Chapter 5).
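With the paired scores from such a design in hand (hypothetical numbers below; a real study would also counterbalance the order of conditions), the estimate is just the mean of the within-person differences:

```python
# Each test taker's score under the standard and the extended time limit.
standard = [61.0, 70.0, 55.0, 64.0]
extended = [66.0, 71.0, 60.0, 67.0]

gains = [e - s for s, e in zip(standard, extended)]
tle_p_hat = sum(gains) / len(gains)   # estimated average time-limit error
```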
Alternately, the TLEp could be estimated by dividing a sample of test takers into two randomly equivalent subsamples and either (1) administering the test to one subgroup under the standard time limit while administering the test to the other subgroup under a substantially extended time limit, or (2) administering the standard test and a shortened version of the test under the standard time limit. The difference between the average scores for the two subgroups also would provide a reasonable estimate of the TLEp (see Chapter 5).
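For the randomly-equivalent-groups version of the design, the estimate is the difference between the two subgroup means (the scores here are again invented for illustration):

```python
standard_group = [60.0, 64.0, 58.0]   # subsample tested under the standard time limit
extended_group = [65.0, 66.0, 63.0]   # subsample tested under an extended time limit

def mean(xs):
    return sum(xs) / len(xs)

tle_p_hat = mean(extended_group) - mean(standard_group)
```

Because the subsamples are randomly equivalent, any difference in their average scores (beyond sampling error) is attributable to the time-limit condition.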
We can also get a less direct indication of whether time limits are having any substantial impact on scores by analyzing patterns of performance across the tasks/items on the test. For example, if a substantial number of test takers have a string of omitted items, apparently random responses, or partially informed rapid responses at the end of an objective test, it is reasonable to suspect that these test takers ran out of time before completing the test (see Chapter 6).
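One simple screen of this kind flags response strings that end in a long run of omitted items. A sketch (the response coding, with "O" marking an omitted item, is invented for the example):

```python
def trailing_omits(responses, omit="O"):
    """Length of the run of omitted items at the end of a response string."""
    n = 0
    for r in reversed(responses):
        if r != omit:
            break
        n += 1
    return n

def flag_possibly_speeded(response_strings, min_run=3):
    """Indices of test takers whose strings end in at least min_run omits."""
    return [i for i, resp in enumerate(response_strings)
            if trailing_omits(resp) >= min_run]
```

A fuller analysis would also use item response times to distinguish not-reached items from rapid guessing near the end of the test.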
The TLEp typically is not a major problem. First, if the interpretation is norm referenced, the TLEp does not generally interfere with the proposed interpretation. Second, if the interpretation is criterion referenced and the TLEp is found to be negligible (i.e., it does not interfere with the proposed interpretation), it can be ignored. If the TLEp is found to be significant, the time limits can be increased, the test can be shortened, and/or it may be possible to adjust the scores (or the cut scores) if the magnitude of the TLEp is known (see Chapter 8).