What Does the Future Hold for Item Development?
Educational testing is in an unprecedented state of change. Significant change is occurring in how we deliver educational tests, why we design educational tests, and who takes educational tests. One important consequence of these changes is that educators must have access to large numbers of diverse, multilingual, high-quality test items. Currently, these types of item banks are not readily available. Moreover, the traditional approach to item development cannot be used to produce large numbers of new items quickly or efficiently because it is not scalable. AIG, on the other hand, is scalable. AIG is the process of using models to generate items with the aid of computer technology. It represents an alternative approach to creating test items. Template-based AIG can be implemented using a three-step method to systematically generate new items from a model of thinking, reasoning, and problem solving. It can be characterized as a scalable item production system that is used to generate hundreds or thousands of new test items.
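To make the scalability claim concrete, the following is a minimal sketch of the generation step only. It assumes a hypothetical item model (the stem, the element names `distance` and `hours`, their values, and the function `generate_items` are all invented for illustration) and is not the AIG system described in this book. It shows why the number of generated items grows multiplicatively with the number of values assigned to each element.

```python
from itertools import product

# Hypothetical item model: a stem with placeholders plus the values
# each placeholder (element) may take. All content is illustrative.
STEM = ("A train travels {distance} km in {hours} hours. "
        "What is its average speed in km/h?")
ELEMENTS = {
    "distance": [120, 180, 240],
    "hours": [2, 3, 4],
}


def generate_items(stem, elements, key_fn):
    """Fill the stem with every combination of element values.

    key_fn computes the correct answer for one combination of values.
    """
    names = list(elements)
    items = []
    for values in product(*(elements[name] for name in names)):
        binding = dict(zip(names, values))
        items.append({"stem": stem.format(**binding), "key": key_fn(binding)})
    return items


items = generate_items(STEM, ELEMENTS, lambda b: b["distance"] / b["hours"])
print(len(items))        # 3 distance values x 3 hours values
print(items[0]["stem"])
```

Adding a third element with four values would quadruple the output without any additional item writing, which is the sense in which a single model can yield hundreds or thousands of items. An operational system would also apply constraints to exclude implausible combinations; that step is omitted here.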
In addition to the significant changes that now affect how we deliver tests, why we design tests, and who we test, another important change that we foresee emerging is related to what we test. Educators are now beginning to change the constructs they measure on their educational tests. For example, in an influential report recently published by OECD called "The Future of Education and Skills: Education 2030" (2018), the authors assert that the students who started school in 2018 and, therefore, who are expected to graduate in 2030 must be prepared for jobs that haven't been created, for technologies that haven't been invented, and for problems that haven't yet emerged. To address these unforeseen opportunities and challenges, students will require a much broader set of knowledge, skills, attitudes, and values that should, according to OECD, be introduced, developed, and evaluated as they progress through our current educational system. Educational accountability systems of the past have focused on narrow measures of academic achievement when attempting to predict student success and to prepare students for their future. But the authors of "The Future of Education and Skills: Education 2030" claim that a narrow set of achievement constructs is inadequate to characterize success for students of the future. Instead, they advocate for an expanded list of constructs that should be used to characterize the competencies that will be required in the future. These competencies provide the foundation for an expanded list of educational constructs that will guide both curriculum and instruction:
The concept of competency implies more than just the acquisition of knowledge and skills; it involves the mobilization of knowledge, skills, attitudes and values to meet complex demands. Future-ready students will need both broad and specialised knowledge. Disciplinary knowledge will continue to be important, as the raw material from which new knowledge is developed, together with the capacity to think across the boundaries of disciplines and "connect the dots". Epistemic knowledge, or knowledge about the disciplines, such as knowing how to think like a mathematician, historian or scientist, will also be significant, enabling students to extend their disciplinary knowledge. Procedural knowledge is acquired by understanding how something is done or made - the series of steps or actions taken to accomplish a goal. Some procedural knowledge is domain-specific, some transferable across domains. It typically develops through practical problem-solving, such as through design thinking and systems thinking. Students will need to apply their knowledge in unknown and evolving circumstances. For this, they will need a broad range of skills, including cognitive and meta-cognitive skills (e.g. critical thinking, creative thinking, learning to learn and self-regulation); social and emotional skills (e.g. empathy, self-efficacy and collaboration); and practical and physical skills (e.g. using new information and communication technology devices). The use of this broader range of knowledge and skills will be mediated by attitudes and values (e.g. motivation, trust, respect for diversity and virtue). The attitudes and values can be observed at personal, local, societal and global levels. While human life is enriched by the diversity of values and attitudes arising from different cultural perspectives and personality traits, there are some human values (e.g. respect for life and human dignity, and respect for the environment, to name two) that cannot be compromised.
(OECD, 2018, p. 5)
An expanded educational construct list, as advocated by OECD, will have a dramatic effect on our educational testing practices. Large numbers of items will be required to create a testing system that allows educators to measure a comprehensive list of knowledge, skills, behaviours, and competencies. One way to address this challenge is to use a traditional item development approach, in which a large number of SMEs are hired to write individual items that measure these constructs. This strategy, and its weaknesses, should now be familiar to the reader. An alternative way to produce large numbers of items efficiently and economically is to implement a scalable process. In our view, template-based AIG could be the scalable method that is used to meet the content demands for our testing systems of the future. But new item development practices would be required.
To begin, we suggest that a team-based approach be used. As we noted in Chapter 2, cognitive model development is best conducted using a two-member SME team where one SME develops the model, and a second SME provides feedback. For development in operational testing situations, a third team member should be added. This member is a content manager from the testing organization, such as an editor or AIG model bank developer (see the answer to the question, "How do you organize large numbers of generated items?"), who is responsible for organizing the type of content that needs to be produced in the testing program and who can also provide guidance on the formatting and presentation conventions used in the program. On an intermittent item development schedule, the team works for two days, four times per year. The work can be conducted face-to-face, remotely, or with some combination of both modes. Once the team members are trained, it is reasonable to expect that they can produce eight high-quality models per day. By high-quality we mean that the models are created, evaluated, and edited during the development session so that the generated items meet the expected standards of practice for the testing organization. Three of these teams could work in one item development session. With this configuration, 3 (teams) x 2 (days) x 8 (models per day) = 48 models are produced in one two-day AIG development session. If each model produces, on average, 1,248 items (see Chapter 1 example), then one AIG development session can produce 59,904 new items. When four sessions are conducted per year, 239,616 items could be created annually. We have conducted AIG sessions in operational testing programs using this configuration to produce these kinds of item generation outcomes. On a regular rather than intermittent schedule, the team works more frequently, thereby producing a larger number of items more quickly.
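The yield arithmetic in this configuration can be checked, and re-run under other staffing assumptions, with a short calculation. This is a sketch only: the function name `projected_yield` and its parameter names are ours, and the figures are the ones quoted above (three teams, two days, eight models per team per day, an average of 1,248 items per model, four sessions per year).

```python
def projected_yield(teams, days, models_per_team_per_day,
                    items_per_model, sessions_per_year):
    """Return (models per session, items per session, items per year)."""
    models_per_session = teams * days * models_per_team_per_day
    items_per_session = models_per_session * items_per_model
    items_per_year = items_per_session * sessions_per_year
    return models_per_session, items_per_session, items_per_year


# Figures quoted in the text for the intermittent schedule.
models, per_session, per_year = projected_yield(
    teams=3, days=2, models_per_team_per_day=8,
    items_per_model=1_248, sessions_per_year=4,
)
print(models, per_session, per_year)  # 48 59904 239616
```

Because the total is a simple product, the estimate is linear in every input: halving the average number of items per model, for example, halves the annual total. The average of 1,248 items per model comes from a single example, so the annual figure is best read as an order-of-magnitude projection.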
Our suggestion for scaling the item development process is unique because it implements an AIG methodology which focuses on model-level outcomes using a team-based approach. It can be implemented flexibly using either a face-to-face or a remote working environment.
AIG is an example of how a systematic and scalable item development method can be used to increase the efficiency and productivity in a testing organization. But when the purpose of the test is also modified, a more complex problem must be addressed: How do you create large numbers of new test items in an efficient and cost-effective manner to measure complex, dynamic, and, potentially, ill-structured tasks characteristic of new and evolving constructs? To address this question, we need to draw on a more radical approach: We recommend using a problem-solving ecosystem for the purpose of creating large numbers of items to measure new constructs.
Data science competitions provide one example of a problem-solving ecosystem. A familiar and influential illustration of a data science competition in educational testing that used a problem-solving ecosystem was the Automated Student Assessment Prize sponsored by the William and Flora Hewlett Foundation in 2012. This prize, which was open to any competitor, focused on the application and effectiveness of automated essay scoring (AES) technology as it applies to evaluating students' written-response essays (Shermis & Hamner, 2013). The competitors were provided with the scoring guides used to train the human raters and then given four weeks to train their AES systems using a sample of data from the original data set. Once the training phase was complete, the competitors were given 59 hours to classify the human scores using the results from a sample of essays provided in the validation data. The competitor with the most accurate classification results was the winner. This competition served as the most comprehensive, independent, comparative evaluation of AES that had ever been conducted. The lasting impact of the competition remains with us today, as the data and the results are still used by researchers to evaluate AES systems (e.g., Shin & Gierl, 2020).
Crowdsourcing is another example of a problem-solving ecosystem. In this ecosystem, large complex tasks are decomposed into smaller, simpler tasks. Participants drawn from a broad range of content areas and backgrounds contribute to solving the tasks. The tasks are embedded in a workflow, which ensures that the participants in each step of the process use and augment solutions provided by participants from the previous step. This workflow, in turn, is embedded in an ecosystem designed to systematically address a larger, more complex, multi-faceted problem.
In an item development ecosystem, participants could be used to create diverse cognitive and item models for measuring complex task performance in a unique construct, such as "global mind-set", "gratitude", "human dignity", "pro-activeness", and "trust" (OECD, 2018, p. 17) or "information and communication technology interest" and "perceived information and communication technology competence" (Kunina-Habenicht & Goldhammer, 2020). These cognitive and item modelling outputs could then be supplied as inputs for an AIG computer system that uses the information from the models to produce large numbers of new items to measure these novel constructs. The ecosystem requires an online workspace where participants can connect, contribute, combine, revise, evaluate, and integrate their ideas and data using a common framework (Michelucci & Dickinson, 2016). The workspace must also include a well-designed user interface so that participants can easily and readily interact with the AIG system. An item development ecosystem must include strategies for scaling the item development process and for monitoring the quality of the generated content. But most of all, the ecosystem needs a broad range of human participants who have diverse backgrounds and experiences. The art of cognitive and item modelling provides an expression of the SME's understanding of how examinees use their knowledge and skill to solve problems in a specific content area. But cognitive and item models can be created from many diverse perspectives to capture different types of understanding. In our book, we demonstrated how this understanding is expressed in two well-defined content areas: mathematics and medicine. But it can also be expressed in novel but emerging content areas, such as "global mind-set", "gratitude", "human dignity", "pro-activeness", and "trust". Each cognitive and item model is distinct for each participant, meaning that different but equally valid models can be produced by different participants in the ecosystem. Some of the models will be characteristic of the SME perspective. This perspective is predictable and expected because SMEs are content experts who have item development experience.
But an item development ecosystem is not just composed of SMEs from testing organizations. Other models characteristic of different and likely unanticipated but equally valuable perspectives could also be created. For instance, the cognitive and item models designed by a team of psychologists to capture educational constructs may differ from the models created by a team of mathematics SMEs. Similarly, the cognitive and item models designed by a recent university graduate who has just entered the workforce to measure constructs such as "global mind-set" and "trust" may differ from the models created by an experienced SME who has worked at the same testing organization for two decades. In short, item development ecosystems in which the unique contextual knowledge and rich cognitive capabilities of humans interact seamlessly with powerful computing systems to produce both predictable and novel outcomes may provide the most promising environment for solving the complex item development challenges that await researchers and practitioners in educational testing. An item development ecosystem could be framed either as a competition or a crowdsourcing task.
We end our book by repeating the sentence we used to start our book: The field of educational testing is in the midst of dramatic changes. In addition to the changes we have already described, we predict that item development ecosystems will play an important role in defining new constructs. We also predict that these ecosystems will contribute to the production of content needed by testing organizations to measure new constructs, which means that these ecosystems will impact and may even supplant our current item development activities. We acknowledge that item development ecosystems have yet to emerge. But the conceptual foundations, the item development methods, and the computational tools now exist to permit the participants in these ecosystems to generate large numbers of new items that measure the complex, novel, and evolving constructs that could be used to guide educational testing into the future.
American Educational Research Association, American Psychological Association, National Council on Measurement in Education (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Brynjolfsson, E., & McAfee, A. (2013). The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies. New York, NY: Norton.
Drasgow, F., Luecht, R. M., & Bennett, R. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 471-516). Washington, DC: American Council on Education.
Gierl, M. J., Lai, H., Hogan, J., & Matovinovic, D. (2015). A method for generating test items that are aligned to the Common Core State Standards. Journal of Applied Testing Technology, 16, 1-18.
Irvine, S. H., & Kyllonen, P. C. (2002). Item Generation for Test Development. Hillsdale, NJ: Lawrence Erlbaum.
Kunina-Habenicht, O., & Goldhammer, F. (2020). ICT Engagement: A new construct and its assessment in PISA 2015. Large-Scale Assessments in Education, 8, 1-21.
Lane, S., Raymond, M., & Haladyna, T. (2016). Test development process. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of Test Development (2nd ed., pp. 3-18). New York, NY: Routledge.
Michelucci, P., & Dickinson, J. (2016). The power of crowds. Science, 351, 32-33.
Organization for Economic Co-Operation and Development (2018). The Future of Education and Skills: Education 2030. Paris: OECD.
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 307-353). Westport, CT: National Council on Measurement in Education and American Council on Education.
Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art automated scoring of essays. In M. D. Shermis & J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions (pp. 313-346). New York, NY: Routledge.
Shin, J., & Gierl, M. J. (2020). More efficient processes for creating automated essay scoring frameworks: A demonstration of two algorithms. Language Testing. doi:10.1177/0265532220937830.