The statistical analysis, standardization, and reporting of high-stakes test scores require the normal distribution of scores. This is true whether the test developer is using classical or modern test theory as the underlying psychometric model (Crocker & Algina, 1986). The problem with this in LOA is the assumption not only that some learners must (by definition) score below the average, but also that there must be an average, and indeed that there must be scores at all. If performances were to be given numerical grades, the aim in LOA would be a negatively skewed distribution, in which most scores cluster towards the higher end of the curve. This relates to the fundamental purpose of all assessment for learning: it is a set of interventions designed to improve performance. It is, therefore, a profoundly different paradigm from that which governs the construction of high-stakes tests. In LAL programs for LOA, any statistical component would therefore focus upon criterion-related measures that may help in assessing learner progress (Brown & Hudson, 2002).
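The distributional contrast described above can be sketched numerically. The following minimal illustration computes the Fisher-Pearson skewness coefficient for two invented score sets: one roughly symmetric (as a norm-referenced test expects) and one clustered towards the top of the scale (the outcome LOA aims for). The score values are hypothetical, chosen purely for demonstration.

```python
from statistics import mean, pstdev

def skewness(scores):
    """Fisher-Pearson coefficient of skewness: the mean of cubed z-scores.
    Negative values indicate a long tail of low scores (most scores high)."""
    m, s = mean(scores), pstdev(scores)
    return sum(((x - m) / s) ** 3 for x in scores) / len(scores)

# Invented score sets for illustration only.
norm_referenced = [40, 45, 50, 50, 55, 55, 60, 60, 65, 70]  # roughly symmetric
loa_outcome = [55, 70, 80, 85, 88, 90, 92, 94, 95, 96]      # most scores high

print(round(skewness(norm_referenced), 2))  # near 0: symmetric distribution
print(round(skewness(loa_outcome), 2))      # clearly negative: skewed left
```

A test built on the normal-distribution assumption treats the second pattern as a defect (poor discrimination); in LOA it is the intended result of effective intervention.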
Ideally, there would be no scores at all in LOA. The intention is not to interpret scores in terms of the distribution of a larger population of test takers, but to benchmark learners against a criterion (Fulcher & Svalberg, 2013). This may be a set of hierarchical performance descriptors that represent hypothesized levels of L2 development (e.g., Isaacs et al., 2018), or, more radically, the current ability level of an individual learner. In the latter case, the learner is their own criterion, and the intervention of the assessment is designed to “construct a future with the learners during the assessment itself” (Poehner et al., 2019, p. 52). The interpretation of performance is, therefore, radically local, focusing on the individual and their own learning needs. To strengthen interpretation at the level of the individual, LAL needs to include learner-centered techniques such as portfolios, problem-based learning, learner-created achievement checklists, and learning diaries for reflection.
Generalization and Extrapolation
Kane et al. (1999) suggested that test score meaning is determined by two types of inferences (Fulcher, 2015b, p. 4). The first is a generalizability inference: that the score achieved on one form of the test would be comparable to the score achieved on any other form, including a similar score across all the facets of the test, such as interlocutor, rater, and task. The second type of inference is termed extrapolation, which concerns the meaning and relevance of the score for a real-world language use context beyond the test, to which we wish to make a prediction. In other words, what does the test score tell us about the likely performance of a test taker in non-test conditions? These inferences are fundamental to the construction of validity arguments for high-stakes tests. But in LOA, we are not concerned with whether a learner performs similarly across task types or interlocutors. Nor are we particularly concerned with whether or not they can perform language tasks in the real world at the present time. The inference of primary concern is that made by both the teacher and the learner about how the learner’s language ability is developing through engaging with a language-rich environment created by innovative teachers. In a sense, this is the only purely negative critical variance between the two paradigms.
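The generalizability inference can be pictured as a consistency check across test facets: the score should not depend on which task or which rater happened to be involved. The following sketch computes per-condition mean scores for each facet of a single test taker's performance; the ratings, task names, and rater labels are invented for illustration.

```python
from statistics import mean

# Invented ratings for one test taker: (task, rater) -> score on a 1-5 scale.
ratings = {
    ("interview", "rater_A"): 4.0, ("interview", "rater_B"): 4.5,
    ("narration", "rater_A"): 4.0, ("narration", "rater_B"): 4.0,
    ("role_play", "rater_A"): 3.5, ("role_play", "rater_B"): 4.0,
}

def facet_means(ratings, facet_index):
    """Mean score per condition of one facet (0 = task, 1 = rater)."""
    by_condition = {}
    for key, score in ratings.items():
        by_condition.setdefault(key[facet_index], []).append(score)
    return {cond: mean(scores) for cond, scores in by_condition.items()}

# The generalizability inference holds to the extent that these means
# stay close to one another across tasks and across raters.
print(facet_means(ratings, 0))  # per-task means
print(facet_means(ratings, 1))  # per-rater means
```

In a high-stakes validity argument, large gaps between these facet means would threaten score interpretation; in LOA, as the paragraph above argues, such cross-facet stability is not the inference of primary concern.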
The Meaning of Validity in LOA
LAL in LOA requires teachers to understand the seven variances so far described so that they may apply validity concepts appropriately to each paradigm in both research and practice. The single validity concept that distinguishes the LOA paradigm from the high-stakes standardized paradigm is “change” as a validity criterion. The assumption underlying the seven variances is that in high-stakes testing there will be no change in outcomes across test facets (including time) if no significant learning has taken place. Learning over time (like language attrition) is also a threat to score interpretation. This is why the score on many high-stakes tests has a limited recognition period, after which the test must be taken again.
This is diametrically opposed to the central validity claim that is made in LOA, namely that LOA is valid if, and only if, the individual learner changes as a result of the assessment, which is also a learning intervention. This is the point at which assessment and learning become fused in a way that is absent from traditional validity theory, for very good reasons relating to its role in maintaining a meritocratic society (Fulcher, 2015a, pp. 145-168). But change is core to Pragmatic (with a capital “P”) learning theory, which is most clearly articulated in Dewey’s educational theory:
If education is growth, it must progressively realize present possibilities, and thus make individuals better fitted to cope with later requirements. Growth is not something which is completed in odd moments; it is a continuous leading into the future.
(Dewey, 1916, p. 56)
The assessment of the “present possibilities” through self-assessment or scaffolded assessment is the basis for personal growth and learning. A theory of Pragmatic validity is not tied to a particular validation methodology, but proposes that those involved in any assessment paradigm define the effect they wish their assessment/testing practices to have, and upon whom. This has been termed “effect-driven testing” (Fulcher & Davidson, 2007, pp. 144, 177). In LOA, the intended effect is “change and growth,” and the effect is designed to impact upon each individual learner. The validation question in LOA is therefore: What evidence is there for individual growth as a result of our assessment interventions? While it is possible to answer this question through the use of more traditional tests as an independent measure of the intended changes (Poehner & Van Compernolle, 2018), qualitative assessments of individual growth would provide more detailed and targeted evidence that would in itself also become an iterative intervention (e.g., Travers et al., 2015). Such evidence would include comments on performances by peers, teachers, and others with an interest in the individual’s learning. Reflective writing or speech recordings in response to these comments, and records of new personal goals, are also valuable in the construction of a portfolio of performance, feedback, and reflection, to evidence personal growth.