Randomly Equivalent Groups versus Common Students
Two basic data collection designs are used in timing research. The single-group design tests the same students under multiple conditions, ideally using counterbalancing and creating an environment in which students are equally motivated under all conditions. The equivalent-groups design establishes multiple groups, one for each condition, that are comparable on all characteristics likely to interact with timing. In practice, it is probable that neither method will work ideally. Random assignment within classrooms, across a large number of classrooms, is likely to come close to producing randomly equivalent groups, but at times that is impractical in operational settings; covariates therefore are used to try to account for any differences between intact groups (see Maxwell, O’Callaghan, & Delaney, 1993).
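The covariate adjustment mentioned above can be illustrated with a minimal ANCOVA-style sketch. All data, group labels, and variable names below are invented for illustration (they are not from any ACT study), and this is only one of several ways such an adjustment can be made:

```python
from statistics import mean

# Hypothetical (covariate, score) pairs for two intact groups testing in
# different modes; the covariate is a prior-achievement measure.
group_1 = [(45, 20), (50, 22), (55, 24), (60, 26)]
group_2 = [(55, 23), (60, 25), (65, 27), (70, 29)]

def within_slope(groups):
    # Pooled within-group regression slope of score on covariate:
    # sum of within-group cross-products over sum of within-group
    # covariate sums of squares.
    sxy = sxx = 0.0
    for g in groups:
        cx, cy = mean(p[0] for p in g), mean(p[1] for p in g)
        sxy += sum((x - cx) * (y - cy) for x, y in g)
        sxx += sum((x - cx) ** 2 for x, _ in g)
    return sxy / sxx

def adjusted_mean(g, slope, grand_cov_mean):
    # ANCOVA-style adjusted mean: pull each group's mean score toward what
    # it would be if the group sat at the grand covariate mean.
    return mean(p[1] for p in g) - slope * (mean(p[0] for p in g) - grand_cov_mean)

groups = [group_1, group_2]
b = within_slope(groups)
grand = mean(x for g in groups for x, _ in g)
adj = [adjusted_mean(g, b, grand) for g in groups]
print(adj)  # adjusted group means; their difference estimates the mode effect
```

Here the observed 3-point gap between group means shrinks to 1 point after adjustment, because part of the raw difference reflects the groups' unequal covariate standing rather than mode.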
ACT typically has adopted a randomly equivalent groups design for studying mode differences. A participating test site sends in a roster of students, and examinees are assigned to a mode. This information is communicated back to the test centers, which then ensure that students test in the assigned mode. Motivation was a concern for some studies: examinees who saw no value in an assessment might perform equally poorly across modes, whereas motivated examinees would do their best, and mixing the two could obscure or distort true mode differences. For this reason, some of the studies involving the ACT resulted in operational (college-reportable) scores for the participants (see, e.g., Li, Yi, & Harris, 2017, for additional details). In his synthesis of 81 studies dealing with comparability across modes, Kingston (2008) states that the most informative study design was the one in which students were randomly assigned to different modes and had similar motivation to do their best. Random assignment tends to result in groups that are equivalent in ability, which simplifies many of the comparisons across modes.
There are logistical difficulties associated with using a random groups design, such as not being able to test intact classrooms with the same mode. However, the utility of the results offsets the difficulties in situations where the results have important implications, such as adjusting scores from different modes, devices, or testing conditions so that they can be reported and used interchangeably. Using a common student design—in which the same examinees are administered a test in each mode—has some theoretical advantages, but this approach often is problematic in practice. The same student cannot be administered the same test form twice, so multiple forms per mode are needed. In addition, motivation typically is lower on a student’s second administration. Intact classrooms could be tested by mode, but counterbalancing would be better done at the individual level.
ACT has supplemented its random groups studies with additional studies; some consisted of assigning mode by classroom or school, and some were post hoc analyses of existing data collected as part of an operational administration or study in which either examinees or sites determined the testing mode. For example, in one study, a participating school had a 1:1 Chromebook initiative for students in the grade being tested. After administration, that school was matched to a number of other schools on available demographic characteristics known to correlate with test performance (e.g., per-pupil expenditure and the percentage of students eligible for free or reduced-price lunch), and the performance of students who tested on Chromebooks was compared to the performance of students testing on other devices.
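The matching step can be sketched as follows. The school names, covariates, and values are invented for illustration; the actual procedure may have used different covariates and a different matching rule:

```python
import math
from statistics import mean, stdev

# Hypothetical school-level covariates: per-pupil expenditure (in $1,000s)
# and percent of students on free/reduced-price lunch.
schools = {
    "focal": (11.2, 48.0),   # the 1:1 Chromebook school
    "s1":    (11.0, 50.0),
    "s2":    (15.5, 20.0),
    "s3":    (10.8, 46.0),
    "s4":    (8.0, 75.0),
}

def standardize(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Standardize each covariate across schools so neither dominates the distance.
names = list(schools)
cols = list(zip(*schools.values()))
z = list(zip(*[standardize(c) for c in cols]))
zmap = dict(zip(names, z))

def distance(a, b):
    # Euclidean distance in standardized covariate space.
    return math.dist(zmap[a], zmap[b])

# Rank candidate comparison schools by covariate distance to the focal school.
matches = sorted((n for n in names if n != "focal"),
                 key=lambda n: distance("focal", n))
print(matches[:2])  # → ['s1', 's3'], the two demographically closest schools
```

Students from the matched schools would then serve as the comparison group for the device contrast.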
Occasionally, timing studies have not been conducted at all when introducing a new mode of testing. In one program where sufficient testing time was allowed for all examinees to complete the assessment on paper, additional time was simply added to the new online assessment to provide ample time for scrolling or lack of examinee familiarity with responding on a computer. The examinee experience was then monitored over time through observation and analysis of latency data. In the case of a software upgrade for WorkKeys (an assessment of career readiness used by schools, colleges, and vocational training programs), ACT staff were administered the assessments with and without the upgrade, and expert judgment was used to determine that timing would not be affected (this was confirmed by subsequent monitoring; Liu, Zhu, & Gao, 2016).
Data analysis occurs at both the item and the test level. As is sometimes true with item context and position effects, some items may become easier and some may become more difficult, yet the overall impact on test scores used for decision-making could be negligible. For mode studies for WorkKeys, the ACT, and ACT Aspire, ACT typically looks at the distributions of raw test information, such as total testing time and raw scores. If the data are collected using either a random groups or a single-group design, the distributions should differ only to the extent that sampling error and measurement error are factors. Often benchmark data are used. For example, if the form of a WorkKeys or ACT test being used in a special study examining different devices was previously used as an operational online form before being retired, that form may have been seen by, say, 8,000 examinees. Two samples of 2,000 examinees each can be randomly drawn from that 8,000, and the distributions of total test time used can be compared across the two samples. The two samples from the special study across modes (or across devices) then are compared to the results from the two samples from the same mode (or same device). If the differences across modes are similar to the differences across samples within mode, that supports a negligible difference at the overall test score level. Note that this is supportive—but not conclusive—evidence. The essential point is that having a baseline of sample differences within mode can be helpful in interpreting observed differences.
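The within-mode baseline idea can be sketched with simulated data. The times below are randomly generated stand-ins, not ACT data:

```python
import random
from statistics import mean

random.seed(0)

# Simulated total testing times (minutes) for 8,000 operational online
# examinees, forming the benchmark pool.
benchmark_pool = [random.gauss(50, 8) for _ in range(8000)]

# Within-mode baseline: two random samples of 2,000 from the same pool.
pool = benchmark_pool[:]
random.shuffle(pool)
sample_a, sample_b = pool[:2000], pool[2000:4000]
baseline_diff = abs(mean(sample_a) - mean(sample_b))

# Special-study samples across devices (here simulated from the same
# distribution, i.e., no true device effect).
device_1 = [random.gauss(50, 8) for _ in range(2000)]
device_2 = [random.gauss(50, 8) for _ in range(2000)]
study_diff = abs(mean(device_1) - mean(device_2))

# If the across-device difference is of the same order as the within-mode
# baseline, that supports (but does not prove) a negligible device effect.
print(f"baseline diff: {baseline_diff:.2f}  study diff: {study_diff:.2f}")
```

In practice the comparison would extend beyond means to full distributions of time used, as described above.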
Other score-level analyses compare the means and standard deviations of time used, examining (1) the reliability and conditional standard error of measurement of test scores, and (2) combinations of data, such as latency, number of omits, and test score together. Examining both summary statistics and graphical illustrations is optimal. Test characteristic curves, distributions of theta scores, and bivariate plots of raw score by theta score and reported scale score also have been used in ACT studies of timing and mode/device comparability. It is possible to observe large differences at some places on the score scale that appear to balance out over the full sample because of how the sample is distributed. For example, there might be large differences in one direction at the high end and equally large differences in the other direction at the low end. High-scoring examinees in math might work much faster on one device because they tend to be more familiar with it (e.g., because of the advanced math apps available for it), whereas mid-level examinees might do better on a different device they know well. If the sample used in the mode study has similar numbers of high- and mid-level examinees, the group-level statistics may suggest that the time needed by device is the same because the overall means of time used are the same. Evaluating comparability therefore requires more than a cursory look at mean latency values.
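A small constructed example shows how identical overall means can hide offsetting device differences within score levels (the numbers are invented for illustration):

```python
from statistics import mean

# Illustrative (score_band, time_used) records for two devices, constructed
# so device-level means match while score bands diverge.
device_a = [("high", 40), ("high", 42), ("mid", 60), ("mid", 58)]
device_b = [("high", 50), ("high", 52), ("mid", 50), ("mid", 48)]

def mean_time(records, band=None):
    # Mean time used, optionally restricted to one score band.
    return mean(t for b, t in records if band is None or b == band)

# Overall means agree, suggesting no device effect...
print(mean_time(device_a), mean_time(device_b))
# ...but the bands tell the opposite story: high scorers are faster on
# device A, mid-level examinees faster on device B.
print(mean_time(device_a, "high"), mean_time(device_b, "high"))
print(mean_time(device_a, "mid"), mean_time(device_b, "mid"))
```

This is why the conditional (score-level) view, not just the marginal means, is part of the comparability analyses described above.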
Test dimensionality, raw-to-scale-score conversions, and survey results on whether examinees felt they had sufficient time also have been examined at the test level. Generalizability analyses were conducted for some studies, particularly for WorkKeys and the ACT across modes. Kong et al. (2017) used analysis of variance (ANOVA) in their study of response time across computers and tablets to examine mean response times by device and ethnicity. Additional analyses used to examine mode comparability—not specific to timing factors—included the Kolmogorov-Smirnov (KS) test of distributional equivalence to look for statistically significant mode effects for all raw and scale scores, scale score correlations and effective weights, and exploratory factor analysis. In addition, irregularity reports, phone logs, test booklets presented side-by-side with online renderings of the same items, and notes from staff who observed onsite administrations have been reviewed as part of comparability analyses.
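The KS test compares two empirical distributions by the largest gap between their cumulative distribution functions. Below is a minimal stdlib sketch of the two-sample statistic on invented scale scores; operational work would use a library implementation such as scipy.stats.ks_2samp, which also supplies the p-value:

```python
def ks_statistic(x, y):
    # Two-sample Kolmogorov-Smirnov statistic: the maximum absolute gap
    # between the two empirical CDFs, evaluated at every observed value.
    x, y = sorted(x), sorted(y)
    nx, ny = len(x), len(y)
    d = 0.0
    for v in sorted(set(x + y)):
        cdf_x = sum(1 for xi in x if xi <= v) / nx
        cdf_y = sum(1 for yi in y if yi <= v) / ny
        d = max(d, abs(cdf_x - cdf_y))
    return d

# Identical scale-score distributions give D = 0; a shifted one does not.
a = [20, 21, 22, 23, 24, 25]
b = [20, 21, 22, 23, 24, 25]
c = [25, 26, 27, 28, 29, 30]
print(ks_statistic(a, b))  # 0.0
print(ks_statistic(a, c))  # large D, flagging a distributional mode effect
```

A small D across modes, together with the other evidence described above, supports treating the score distributions as comparable.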