Summative Assessment and High-Stakes Testing
Summative assessment, such as paper-and-pencil tests, is an efficient and objective assessment approach that yields quick, reliable results. The goal of summative assessment is to evaluate student learning at the end of a specified period or event, such as the end of an instructional unit. Summative assessment includes direct testing and may feature the following question types: multiple-choice, matching, true or false, fill-in-the-blank, short answer, or a combination of these. Other examples of summative assessment include a final paper or a music recital. Although this approach can be informative, when used in isolation it may not assess the development of the whole child.
Summative assessments are often high-stakes, meaning they carry a high point value. Given the high point value, there can be significant consequences for students who do not perform well. Although summative assessment practices can be less time-consuming than formative assessment methods, some children may learn to associate negative feelings such as fear with tests, which may lead to test anxiety. Therefore, teachers are encouraged to create a friendly, low-stress testing environment, for example by playing relaxing music, giving tests fun names, and allowing children to participate in high-interest activities after testing.
When administering summative assessments, it is important to consider whether students need accommodations. Testing accommodations are changes to the regular testing environment and auxiliary aids and services that allow individuals with disabilities to demonstrate their knowledge (U.S. Department of Justice Civil Rights Division, n.d.). Examples of testing accommodations include extended time, testing in a distraction-free room, use of scribes, having the test read aloud when reading is not being assessed, and physical prompts such as redirection.
Judging the Quality and Utility of Assessments
Given that assessment instruments are used to inform instructional and diagnostic decisions, it is imperative that practitioners select quality instruments that demonstrate adequate psychometric properties. Reliability and validity are the two main psychometric properties.
Reliability is the ability to reproduce a result consistently. Three common forms of reliability are internal reliability, test-retest reliability, and inter-rater reliability. Internal reliability assesses the consistency of scores across items within a test. Salvia, Ysseldyke, and Witmer (2017) recommended a minimum internal reliability standard of .80 when making screening decisions and a standard of .90 when making diagnostic decisions. Test-retest reliability is obtained by administering the same test twice to the same individuals, with a time interval between administrations, and examining the consistency of the two sets of scores. Inter-rater reliability assesses the degree to which different raters assign similar scores to the same items. Inter-rater reliability is especially important when assessment involves observations of behavior, including rating scales, because observers may not interpret items the same way.
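As a concrete illustration, two of the reliability indices above can be estimated with a short script: internal reliability via Cronbach's alpha (a standard index of internal consistency) and test-retest reliability via a Pearson correlation between two administrations. The item scores below are invented for illustration; the .80 cutoff follows the Salvia, Ysseldyke, and Witmer (2017) screening recommendation cited above. A minimal sketch in Python:

```python
from math import sqrt

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    n, m = len(xs), sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / (n - 1)

def cronbach_alpha(scores):
    """Internal reliability: rows are examinees, columns are test items."""
    k = len(scores[0])
    item_var_sum = sum(variance(list(col)) for col in zip(*scores))
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)

def pearson_r(x, y):
    """Test-retest reliability: correlate scores from two administrations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

# Hypothetical item scores for five examinees on a four-item screener.
scores = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
alpha = cronbach_alpha(scores)
print(f"internal reliability (alpha) = {alpha:.2f}")
print("meets .80 screening standard:", alpha >= 0.80)
```

The same `pearson_r` function would serve for test-retest reliability by passing the first- and second-administration scores of the same examinees.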
Validity encompasses the extent to which scores reflect what the test is designed to measure. The three main forms of validity are face validity, construct validity, and criterion-related validity. Face validity addresses whether the instrument appears to be assessing what it is designed to measure, and may draw on expert consensus. Construct validity assesses whether the instrument measures the construct it is designed to measure and not some other construct. For example, an assessment instrument designed to assess disruptive behavior should actually assess disruptive behavior and not reading skills or intelligence. Construct validity can be assessed by analyzing the relationship of scores between two or more assessment instruments (Messick, 1995). Messick (1995) suggested that both convergent and discriminant correlation patterns are important to investigate when exploring construct validity. Convergent patterns indicate a correspondence between measures of the same construct, whereas discriminant patterns indicate distinctness from measures of other constructs (Messick, 1995). Finally, criterion-related validity is used to predict current or future outcomes and takes two forms. Concurrent validity compares scores on the instrument with an outcome measured at the same time, whereas predictive validity compares scores on the instrument with an outcome measured later.
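The convergent and discriminant patterns described above can be illustrated numerically: scores on a new measure should correlate strongly with an established measure of the same construct and weakly with a measure of a different construct. The student scores below are invented for illustration, echoing the disruptive-behavior example in the text; correlations are computed with a plain Pearson formula. A minimal sketch in Python:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

# Invented scores for six students on three measures.
new_disruptive = [4, 7, 2, 9, 5, 8]        # new disruptive-behavior scale
est_disruptive = [5, 6, 3, 9, 4, 7]        # established scale, same construct
reading_test   = [70, 85, 62, 77, 90, 66]  # reading test, different construct

convergent = pearson_r(new_disruptive, est_disruptive)  # expected: high
discriminant = pearson_r(new_disruptive, reading_test)  # expected: low
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```

In this made-up data, the new scale correlates highly with the established same-construct scale (convergent evidence) and only weakly with the reading test (discriminant evidence), the pattern Messick (1995) describes.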
When selecting quality instruments for both formative and summative assessment, adequate reliability alone is insufficient; test scores must also demonstrate adequate validity. When reviewing assessments, these properties are typically reported in the technical section of the test manual. Additionally, the Mental Measurements Yearbook is a resource that includes test reviews, including measures of reliability and validity. By selecting quality tools, professionals can have confidence that the tools they are using are useful for obtaining information related to a specific area.