Tasks and Items
An enduring problem in the design of high-stakes tests is the construction of tasks or items that generate consistent results across forms of the test in terms of what they assess and how difficult they are (Weir & Wu, 2006). If there is a lack of comparability, it is impossible to know whether a test taker would have received a higher or lower score on another form of the test, which is a synchronic threat to score meaning. If diachronic comparability is required as well, it may be essential to maintain control across versions as well as forms. This is achieved partly through the use of test specifications to guide version and form construction, and partly through statistical procedures to uncover and control variation in task difficulty (Eckes, 2011). While test specifications (or rather “assessment task specifications”) may be used in LOA as a tool for collaborative continuing professional development in schools (Fulcher, 2019), they are purely a way to help teachers relate tasks both to the learning objectives and to the assessment of learning. Comparability of tasks is of little concern, however; what does matter is that suitable tasks or items are designed to challenge learners to take the next step towards their goals without being so demanding as to cause demotivation (Poehner, 2008). Of particular importance in attempting to reflect “real world” language use is the exploitation of integrated task types, in which reading and listening may lead to speaking and/or writing (Plakans, 2013) in different patterns, or iteratively. The skill dependency that can be used in LOA is largely eschewed in traditional validity theory, which prefers “unmuddied” score interpretation in terms of a single skill or construct. A critical element of LAL is, therefore, creative goal-driven task design that integrates skills in a variety of contexts.
Roles in Design and Evaluation
In traditional validation theory, each stakeholder has a single clearly defined role. The test taker is the person who is evaluated; the rater is the person who awards a score. While consulting test takers about their views on the tests they take has become more frequent in recent years as one component of response validity and washback (Cumming, 2004), and teachers have become a source of evidence for content relevance (Cumming et al., 2005), they remain providers of information for validity arguments constructed under an instrumentalist paradigm. The power and social distance between the assessed and the assessor are great, with the former often having little awareness of who the latter is, or how decisions are made. In LOA, on the other hand, the assessor may be the teacher, but could equally be a peer, the learner, or even a family member. Indeed, if online portfolio assessment is being used, all of these people may be asked to comment and provide feedback (Yastibas & Yastibas, 2015). Peer- and self-assessment are particularly valued as learners are encouraged to become self-aware and self-critical evaluators of their own performance with the aim of becoming independent learners through an ability to identify the current level of performance and compare it with their desired goal (Black et al., 2003). Meta-studies of peer- and self-assessment have shown statistically significant benefits for learning (Sanchez et al., 2017). Similarly, learners and others may be involved in the design of learning tasks, co-creating new learning activities based around their understanding of the next learning goal. Assessment design is, therefore, no longer the domain of the testing professional. In each critical area of activity, the role boundaries are blurred or cease to exist. LAL programs, therefore, need to include the design of systems that encourage the expansion of potential assessors for more abundant feedback.
Most high-stakes language tests use closed-response test items exclusively, or have only a small open-response section to assess speaking or writing. The primary reason is to increase psychometric reliability. The multiple-choice question is 100 years old, and its technology is very well understood (Haladyna, 2004). Each item contributes a piece of information of known difficulty that helps to discriminate between test takers. Cumulatively, multiple-choice and other closed-response items add to reliability coefficients such as Cronbach’s alpha, so long as their difficulty and discrimination are controlled through pre-testing or online calibration. As reliability is interpreted as score consistency, it is treated as a critical measure of test fairness, and lower reliability can sometimes result in legal challenges to decisions made on the basis of score interpretations (Fulcher, 2014). A speaking or writing component provides only one piece of information, and its reliability is often calculated in terms of levels of inter-rater agreement, which frequently requires the training (or “cloning”) of raters so that they agree on the classification of speech or writing samples (Davis, 2016) generated by highly controlled prompts. Little is gained in LOA by using multiple-choice questions, although they can sometimes serve as the basis for discussion of why the distractors are false and the key is true. A more strategic pedagogic approach is to devise performance tasks that require discussion, analysis, and response to reading or listening texts to reveal the ability to interpret and use language for practical purposes (Davis & Vehabovic, 2018). Scoring integrated performance-based activities may not meet the psychometric criterion of reliability, but in a classroom context, this does not matter. Learning takes priority over consistency of judgment, and the success of judgment is evaluated by the quality of learner change, as described in LOA validity below.
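Since the argument here turns on how individual closed-response items cumulate into a reliability coefficient, a minimal sketch of Cronbach’s alpha may make the mechanism concrete. The response matrix below is invented purely for illustration (rows are test takers, columns are dichotomously scored items); it is a worked example, not a claim about any particular test:

```python
def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(scores[0])  # number of items

    def variance(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented responses: 5 test takers x 4 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))  # → 0.8
```

The sketch shows why each well-behaved item raises the coefficient: adding items of controlled difficulty and discrimination increases the variance of total scores faster than the sum of item variances, pushing alpha towards 1. It also shows why a single extended speaking or writing performance contributes so little to this kind of index.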
LAL for LOA, therefore, focuses on the nature of performance in context.