What Are Current Practices in Technical Reporting and Documentation?
In this section we describe the current technical documentation and reporting landscape. We synthesize topics covered in technical reports and other documentation for K-12 educational testing programs and for professional certification and licensure programs. Technical reports typically are organized as collections of evidence about the psychometric quality of a testing program, described in separate chapters. Whether or not this is stated explicitly, the intended audiences are expected to be technically savvy. Other, less technical documentation, such as candidate bulletins, addresses nontechnical audiences, such as certification and licensure test candidates.
Technical Reporting and Documentation Practices in K-12 Educational Testing Programs
In order to create an accurate portrayal of technical documentation practices in K-12 educational testing programs, we surveyed technical reports for statewide reading tests required for No Child Left Behind reporting and for high school end-of-course and graduation examinations in mathematics. We retrieved technical reports provided on state testing program websites. By choice, we do not address technical documentation for K-12 educational tests offered commercially because, we believe, technical documentation practices for K-12 state testing programs are most representative of the state of the art.
We sampled 10 state testing programs systematically. We listed all 50 states and the District of Columbia in alphabetical order and arbitrarily selected the state in the nth position in the list, and then every nth subsequent state in the list. We then reviewed the technical documentation that we could find on those 10 state websites. We were able to locate recent technical reports for Grades 3-8 reading tests and high school end-of-course and graduation mathematics tests for six of the states with little difficulty. We were not able to locate technical reports for four states. Instead, for one state we located information for the immediately preceding state in the list, and we requested and received information via e-mail and phone calls for the other three states. These three states provided their technical reports. One of these states explained that it does not post its technical reports because the reports contain secure information, like positions of field test and linking items; it provided only high school reports, with our agreement not to divulge secure information. Along with those high school reports, we reviewed the Grades 3-8 technical report for the immediately preceding state in the list.
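The systematic selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chapter does not report the sampling interval or the starting position, so the values `step=5` and `start=4` below are hypothetical (chosen because they happen to yield 10 selections from a list of 51), and the `jurisdiction_NN` labels are placeholders rather than actual state names.

```python
def systematic_sample(items, step, start=0):
    """Return every `step`-th item, beginning at 0-based index `start`."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    if not 0 <= start < len(items):
        raise ValueError("start must be a valid index into items")
    return items[start::step]


# Placeholder sampling frame: 51 labels standing in for the alphabetical list
# of the 50 states and the District of Columbia.
frame = [f"jurisdiction_{i:02d}" for i in range(1, 52)]

# Hypothetical interval and starting position (assumptions, not values
# reported in the chapter).
sample = systematic_sample(frame, step=5, start=4)
print(len(sample), sample[:3])
```

Substitutions of the kind described above (using the immediately preceding state when a report could not be located) would be applied to the selected list afterward and are not modeled here.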
Some states provide a single technical report for all testing programs; other states provide multiple technical reports. In the end, our analyses are based on 19 technical reports from 11 states for the test administration years 2006-2012. The reports in our sample were produced by most of the familiar state testing program contractors: American Institutes for Research, CTB/McGraw-Hill, Data Recognition Corporation, ETS, Harcourt, Human Resources Research Organization (HumRRO), Measured Progress, Measurement Incorporated, Pearson, Questar and Riverside. No one company’s style of technical reporting dominates the sample: One company produced four reports in three states; two companies produced three reports each, one in three states, the other in one state; two companies produced two reports each; and six companies produced one report each.
In the chapter on technical documentation in the first edition of the Handbook of Test Development, Becker and Pomplun (2006) provided a general outline for “some of the better examples of technical documentation” (p. 715). They described six sections in such technical reports (see Table 31.1): overview and purpose of the test, description of the test, technical characteristics, validity evidence, score reporting and research services, and references. Indeed, the technical reports we reviewed address these six areas with varying degrees of detail and rigor and also provide information on test administration procedures, test-taker performance, performance standards and, in some cases, special studies and the role of technical advisory committees. Twelve of the 19 reports explicitly state the purposes and intended audiences for the report.
More specifically, these reports address many of the following topics to some degree and in various ways: (a) test purpose; (b) the targeted content standards; (c) test design (typically using tabular blueprints); (d) typical item review processes (e.g., bias and sensitivity); (e) test form assembly procedures and criteria; (f) test administration requirements (e.g., training for administrators, timing, and accommodations in test administration and materials); (g) test-taker participation rates; (h) constructed-response item scoring; (i) item analysis results; (j) IRT item calibration, scaling, and linking and equating; (k) performance standards (i.e., cut scores and often some discussion of standard-setting procedures); (l) test-taker performance; (m) some information on score reports, interpretations and uses; (n) reliability evidence, including standard errors of measurement and classification consistency and accuracy; and (o) validity evidence (most often, coverage of content standards and subscore correlations). A small number of reports include special studies of a technical concern for a given administration year or of proposed changes to the testing program. These special studies address, for example, (p) comparability of scores from test forms with and without a writing score included, (q) benefits and drawbacks of transitioning to IRT scaling and (r) effects on scorer accuracy of requiring scorers to provide diagnostic comments for test takers as part of scoring responses.
Some important technical issues are overlooked in this sample of technical reports. For example, only one quarter of the reports address managing year-to-year drift in scoring of constructed-response items (i.e., four of the 15 reports on programs with constructed-response items). Approximately three quarters of the reports include information on item omit rates or discussion about speededness, which is relevant to the validity of score interpretations (e.g., fatigue and motivation effects at the end of a test) and to the trustworthiness of item statistics, especially for constructed-response items, where omit rates as high as 5%-10% can be observed. And while three quarters of the reports address item fit (the programs in all 19 reports use an IRT model for item calibration and scaling) and linking item stability, only one quarter address the local item dependence and unidimensionality assumptions on which IRT models rely. Forty-four percent of the reports address test security requirements, often in only one or two paragraphs. While procedures and requirements for maintaining security of test content and valid test administration procedures may be best addressed in separate school test coordinator and administrator manuals, these technical reports provide no indication that all test scores are as free as possible from exposure of test content, cheating and other security violations that would undermine validity of score interpretations and uses (e.g., Fremer & Ferrara, 2013).
The technical reports are inconsistent in using the technical evidence they provide to support intended inferences about examinee achievement from test scores and decisions based on those inferences. For example, more than three quarters of the reports state the purpose of the test for which technical evidence is provided and may provide a broad statement of intended interpretations of test scores (e.g., the test is “designed to measure student achievement” of the target content standards and identify “students who fail to master content”). However, only one third acknowledge efforts undertaken to minimize construct irrelevance and underrepresentation and just under half summarize the technical evidence as an approximation of an interpretation and use validity argument (e.g., Kane, 2006, 2013, this volume). (For example, one report refers to the alignment between test content and state content standards and involvement of educators in item development as evidence that test results are valid measures of what students know and can do in tested domains.) Further, only seven of the reports mention the depth of knowledge (see http://wat.wceruw.org/index.aspx) targeted and achieved by test items, and only eight refer to performance level descriptors—both No Child Left Behind peer review criteria, which we discuss later; only 10 provide information on score reports, intended interpretations and uses of test scores, and related matters. One report summarizes the relevance of the evidence in each report section to specific standards in the 1999 Standards, perhaps to facilitate the U.S. Department of Education (USDE) peer review process (and to avoid queries by reviewers who may overlook the relevance of the evidence).
These technical reports are similarly inconsistent in addressing issues related to the appropriateness of these tests for examinee subgroups and providing validity evidence on important examinee subgroups. While more than three quarters provide information on allowable test administration accommodations, just over one quarter explicitly indicate how test development procedures attempt to enhance accessibility to test item content (see Abedi, this volume; Elliot & Kettler, this volume) by referring to universal design, a commonly used framework for enhancing accessibility. Almost three quarters of the reports provide summary statistics on the performance of racial/ethnic and other examinee subgroups and reliability coefficients or standard error of measurement estimates for these subgroups. However, only two reports provide evidence to support interpretations and uses for these subgroups (i.e., beyond subgroup score reliabilities). For example, one report summarizes confirmatory factor analysis results as evidence that each test represents a single major factor for all examinees, for all examinees who were provided test administration accommodations, and separately for English language learners and students with disabilities who were provided accommodations.
Evidence of the validity of interpretations and uses of test scores related to the “interpretation/use argument” (Kane, 2013, p. 2) is, for the most part, documented in these reports as claims of technical and procedural soundness and collections of supporting evidence rather than as a line of argument supported with technical evidence. For example, several reports simply organize sources of validity evidence as proposed in the Standards (AERA et al., 2014, pp. 11-16): evidence based on test content (e.g., by referring to the content standards targeted in test blueprints), evidence based on internal structure (e.g., correlations among subtest scores) and evidence based on relations to other variables (e.g., correlations among reading, writing, mathematics and science test scores as divergent and convergent validity evidence). The reports are virtually silent regarding the other two proposed sources: evidence based on response processes and evidence based on consequences of testing.
Ferrara and DeMauro (2006) conducted a comprehensive and detailed review of technical reports for educational achievement tests in Grades K-12. They reviewed the information and arguments in technical reports from 11 state testing programs and one commercial norm-referenced test administered during 1998-2003. The results from their analysis represent technical reporting practices before No Child Left Behind peer review requirements, which appeared in 2004 (see U.S. Department of Education, 2007, p. 1), influenced documentation of validity evidence. They evaluated all sources of validity evidence in the 1999 Standards: evidence based on content, response processes, test internal structure, relations to other variables and consequences of testing. They reached “rather disappointing” conclusions (Ferrara & DeMauro, 2006, p. 616) about the state of the art in availability and quality of evidence relevant to test score interpretation validity. And they observed that “technical reports tend to describe evidence without integrating it into statements about the validity of various interpretations and uses” of test scores (Ferrara & DeMauro, 2006, p. 616). Apparently, despite the influence of No Child Left Behind (NCLB) peer review guidance on supporting an interpretative/use argument using technical information, little has changed since the review of these technical reports published prior to 2004.