Timing Comparability across Testing Modes and Devices
In the previous sections, our goal was to introduce some of the considerations associated with the use of technology in testing and the extent to which scores can be considered comparable when digital assessment features differ across testing modes and devices. We now turn to a more focused discussion of one specific aspect of digital assessment that may affect testing outcomes and the comparability of scores across modes and devices: timing.
Mead and Drasgow (1993) conducted a meta-analysis of studies comparing computer-based and paper-and-pencil versions of 123 timed power tests and 36 speeded tests. After correcting the 159 correlations for measurement error, they reported estimated cross-mode correlations of 0.97 for power tests and 0.72 for speeded tests, concluding that mode affected speeded tests, probably because of the additional time required to read text from screens. Similar results have been reported in other studies of speeded tests, although there have been exceptions reporting no differences between speeded and power tests (Lesson, 2009).
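Correcting a correlation for measurement error in both measures is conventionally done with the correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The sketch below uses hypothetical values (an observed cross-mode correlation of 0.85 and reliabilities of 0.88 and 0.90), not figures from the meta-analysis.

```python
import math

def disattenuated_correlation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correct an observed correlation r_xy for measurement error,
    given the reliabilities r_xx and r_yy of the two measures."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical: observed cross-mode correlation 0.85,
# reliabilities 0.88 (computer) and 0.90 (paper).
print(round(disattenuated_correlation(0.85, 0.88, 0.90), 2))  # → 0.96
```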
Response Time Research
As discussed earlier, all quantitative studies of devices have been conducted under untimed or generously timed conditions, and studies reporting item response time, latency, or rapid-guessing behavior have primarily used these factors as covariates when examining performance differences. Research on how these timing factors interact with item type, item difficulty, device familiarity, scrolling or content display, and input options is therefore rare. Testing variations that result from different devices clearly affect the test-taker experience in terms of the presentation, display, input, and processing of information. As noted earlier, some variations increase cognitive load and demand greater recall, while others require different fine motor skills; all of these factors can interact with test timing. Because so much device research has been conducted on untimed assessments, and because score differences have been the primary outcome of interest, timed testing programs cannot rely on these results as evidence of score comparability.
Response time, the elapsed time (in seconds) between when an item is presented and when the test taker responds to it, is an important outcome in studies of timed tests. Rapid guessing occurs when a test taker responds to an item so quickly that he or she could not have read and considered it (Schnipke & Scrams, 1997). In high-stakes testing, rapid guessing is considered a reliable indicator of speededness for computer-based tests; in low-stakes or untimed tests, it is more often associated with lack of effort or motivation (see Chapter 11).
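A rapid-guessing flag follows directly from this definition: a response is flagged when its response time falls below a threshold. The fixed 3-second threshold below is purely illustrative; operational programs typically set thresholds per item, for example from the item's response-time distribution or its surface features.

```python
def flag_rapid_guesses(response_times, threshold=3.0):
    """Flag responses faster than a rapid-guessing threshold (in seconds).

    The default threshold is illustrative only; operational thresholds
    are usually set item by item.
    """
    return [t < threshold for t in response_times]

# Hypothetical response times in seconds for five items.
times = [1.2, 14.8, 2.5, 31.0, 45.6]
print(flag_rapid_guesses(times))  # → [True, False, True, False, False]
```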
Kong et al. (2017) used a random-equivalent groups design to examine results from 964 high school students completing a low-stakes assessment on either a tablet or a computer. Response time effort (RTE), the percentage of items on which students exhibited solution behavior, was used to measure student engagement; because the test allowed 80 minutes for 59 items, it was considered generously timed and not speeded. Overall, there were no significant or practically important differences in RTE between devices for any of six item types. That said, RTE decreased for sections administered at the end of the test, which may be attributable to fatigue, and there was a gender effect: males were twice as likely to be excluded from RTE analyses because of much lower engagement levels. Students testing on tablets did require approximately 1 minute longer for each section (about 3 to 4 seconds per item) than students testing on computers. Effect size differences favoring computers over tablets ranged from 0.29 for hot-spot items and 0.25 for multiple-choice items to 0.08 for drag-and-drop items. The authors concluded that “it appears that the reduced precision resulting from using the finger as the input device rather than a mouse may have created a small degree of challenge for working with on-screen objects” (2017, p. 22).
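RTE as described above can be sketched as the proportion of items whose response time meets or exceeds the item's rapid-guessing threshold. All numbers below are hypothetical.

```python
def response_time_effort(item_times, thresholds):
    """Proportion of items on which the examinee showed solution behavior,
    i.e., response time at or above the item's rapid-guessing threshold."""
    if len(item_times) != len(thresholds):
        raise ValueError("one threshold is needed per item")
    solution = sum(t >= th for t, th in zip(item_times, thresholds))
    return solution / len(item_times)

# Hypothetical: five items with per-item thresholds in seconds.
times = [12.0, 1.1, 25.4, 8.3, 0.9]
thresholds = [3.0, 3.0, 5.0, 4.0, 3.0]
print(response_time_effort(times, thresholds))  # → 0.6
```

An examinee with RTE below some cutoff (e.g., 0.90) would typically be excluded from analyses as insufficiently engaged, which is how exclusions like those reported by Kong et al. arise.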
Davis et al. (2016) examined response time differences for a mix of item types administered on tablets and computers across reading, science, and math content areas and found that students testing on tablets consistently used more time to respond to items. Effect sizes calculated from data reported in the study were 0.14, 0.22, and 0.18 for reading, science, and math, respectively. Ling (2016) found no main-effect differences in scores or response times for eighth-grade students on iPads versus computers for multiple-choice or constructed-response items.
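Effect sizes like these can be computed from summary statistics as a standardized mean difference (Cohen's d with a pooled standard deviation). The means, standard deviations, and sample sizes below are hypothetical, not values from either study.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical: mean section time in minutes, tablet vs. computer.
d = cohens_d(mean1=48.0, sd1=10.0, n1=300, mean2=46.0, sd2=10.0, n2=300)
print(round(d, 2))  # → 0.2
```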
Measuring response time is not always straightforward. ACT found that response time may be measured and reported differently across platforms and interfaces, which can result in small but systematic differences that affect the calculation of latency and response time. For example, response time can be captured in different ways, such as (a) timers (captured using the countdown clock on the client device), (b) server time (which requires all servers to be continuously synchronized to obtain accurate response latency), or (c) client time (captured by the client device as a recorded timestamp). Capturing examinee responses using countdown clocks (timers) on the client device avoids the pauses caused by servers, proctors, and internet connectivity, as well as the need for network synchronization (R. Zhu, personal communication, March 21, 2018). Assessment programs seeking to understand item response latency and speededness need to understand how response times are captured across devices to determine whether they are comparable and accurate.
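The distinction between timer-based and timestamp-based capture can be illustrated with the two kinds of clocks available on a client device: a monotonic (elapsed-time) clock, which cannot jump, and a wall clock, which can shift if the device's time is resynchronized mid-item. This sketch is a general illustration of the clock behavior, not ACT's delivery software.

```python
import time

# (a) Timer-style capture: elapsed time from a monotonic clock.
#     Monotonic time never moves backward, so the difference is a
#     reliable elapsed duration even if the wall clock is adjusted.
start = time.monotonic()
# ... examinee reads and answers the item ...
latency_timer = time.monotonic() - start

# (c) Timestamp-style capture: difference of wall-clock timestamps.
#     This can be distorted if the device clock is resynchronized
#     (e.g., by NTP) between presentation and response.
t_presented = time.time()
# ... examinee reads and answers the item ...
latency_stamps = time.time() - t_presented

print(latency_timer >= 0.0)  # → True
```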
There are numerous ways to evaluate the comparability of test scores across testing modes and devices; these range from comparisons of mean differences and correlations to IRT-based approaches. When mode or device effects are found to be significant, alternative scoring tables are usually generated to put scores from different conditions on the same scale (Way et al., 2016).
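The idea behind an alternative scoring table can be illustrated with a rough equipercentile sketch: a raw score in the new mode is mapped to the reference-mode score at the same percentile rank. The data and function below are hypothetical, and the sketch omits the smoothing and continuization steps an operational equating would require.

```python
def equipercentile_map(new_scores, ref_scores, score):
    """Map a raw score from the new mode to the reference-mode scale
    by matching percentile ranks (rough equipercentile sketch)."""
    new_sorted = sorted(new_scores)
    ref_sorted = sorted(ref_scores)
    # Percentile rank of `score` in the new-mode distribution.
    rank = sum(s <= score for s in new_sorted) / len(new_sorted)
    # Reference-mode score at (approximately) the same percentile rank.
    idx = max(round(rank * len(ref_sorted)) - 1, 0)
    return ref_sorted[idx]

# Hypothetical: tablet scores run about 2 points below computer scores,
# so a tablet 14 maps to a computer-scale 16.
tablet = [10, 12, 14, 16, 18]
computer = [12, 14, 16, 18, 20]
print(equipercentile_map(tablet, computer, 14))  # → 16
```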
The next section reviews: (1) analyses and approaches employed by ACT to examine the comparability of scores for large-scale testing programs administered across different modes and devices, and (2) statistical adjustments that may be used to mitigate differences. Analyses are conducted at both the total test score level and the item level using classical statistics (e.g., p-values) and also include some IRT approaches (e.g., comparison of item parameters, differential item functioning [DIF]).
Methods Used in ACT Analyses of Testing Time
When mode comparisons are made in large-scale assessment programs, the typical scenario is that the program was established in one mode or on one device and a second mode or device was added later. In such instances, many issues need to be addressed. The first is ensuring that the construct being assessed has not changed. The user experience will likely differ across modes and some devices, but research should ensure that the construct itself is not altered.
The ACT has been a paper-based test since its inception in 1959, with the current battery being introduced in 1989 (ACT, 2017). Online administration of the ACT for states and districts participating in school-based testing was desired, and preliminary studies for rendering (how the items look on a computer screen) and timing were conducted. The timing study included about 3,000 examinees from 58 different schools and included multiple timings for each of the four tests in the ACT battery. The selected time limits were implemented in a large-scale mode comparability study involving more than 5,500 students from 80 high schools.