Handbook of Test Development

PREFACE

Part I. Foundations

TEST DEVELOPMENT PROCESS
    Overall Plan
    Domain Definition and Claims Statements
    Content Specifications
    Item Development
    Item Writing and Review
    Item Tryouts/Field Testing
    Item Banking
    Test Design and Assembly
    Test Production
    Test Administration
    Scoring
    Cut Scores
    Test Score Reports
    Test Security
    Test Documentation
    Conclusion
    References

TEST DESIGN AND DEVELOPMENT FOLLOWING THE STANDARDS FOR EDUCATIONAL AND PSYCHOLOGICAL TESTING
    Standards for Test Design
    Validity Considerations in Test Design
    Alignment Evidence: Content and Cognitive Processes
    Predictive Evidence
    Evidence Based on Internal and External Relationships
    Reliability/Precision Considerations in Test Design
    Fairness Considerations in Test Design
    Consideration of Relevant Groups
    Removing Construct-Irrelevant Variance in Item and Test Development
    Providing Appropriate Access and Use for All
    Test Development and Implementation
    Item Development
    Test Assembly
    Test Administration Instructions
    Specification and Monitoring of Scoring Processes
    Scaling and Equating Test Forms
    Score Reporting
    Documentation
    Ongoing Checks on Interpretation and Use
    Conclusion
    Note
    References

EVIDENCE-CENTERED DESIGN
    Defining Assessment
    Evidentiary Reasoning and Assessment Arguments
    Knowledge Representations
    A Layered Approach
    Examples and Applications
    The ECD Layers
    Domain Analysis
    Domain Modeling
    Conceptual Assessment Framework
    Student Model: What Are We Measuring?
    Evidence Model: How Do We Measure It?
    Task Model: Where Do We Measure It?
    Assembly Model: How Much Do We Need to Measure It?
    Sample Knowledge Representations
    Assessment Implementation
    Assessment Delivery
    Conclusion
    Notes
    References

VALIDATION STRATEGIES: Delineating and Validating Proposed Interpretations and Uses of Test Scores
    The Evolution of Validity Theory
    The Argument-Based Approach to Validation
    The Interpretation/Use Argument (IUA)
    The Validity Argument
    Developing the IUA, the Test and the Validity Argument
    Fallacies
    Some Common Inferences, Warrants and Backing
    Scoring
    Generalization
    Extrapolation
    Theory-Based Inferences
    Score Uses
    Necessary and Sufficient Conditions for Validity
    Two Examples
    Licensure Tests and Employment Tests
    Monitoring Programs and Accountability Programs
    Concluding Remarks
    References

DEVELOPING FAIR TESTS
    Purpose
    Overview
    Validity, Constructs and Variance
    Definitions of Fairness in Assessment
    Bias, Sensitivity and Fairness
    Various Definitions of Fairness
    Impartiality
    Score Differences
    Prediction and Selection
    Fairness Guidelines
    Sources of Guidelines
    Guidelines Concerning Cognitive Sources of Construct-Irrelevant Variance
    Guidelines Concerning Affective Sources of Construct-Irrelevant Variance
    Guidelines Concerning Physical Sources of Construct-Irrelevant Variance
    Exceptions to Guidelines
    Test Design
    Focus on Validity
    Evidence-Centered Design
    Universal Design and Accessible Portable Item Protocol
    Selection of Constructs
    Diversity of Input
    Item Writing and Review
    Item Review for Fairness
    Procedures
    Represent Diversity
    Avoid Stereotypes
    Review Tests
    Test Administration and Accommodation/Modification
    Preparation and Administration
    Accommodations/Modifications for People With Disabilities
    Accommodations/Modifications for English-Language Learners
    Item and Test Analyses
    Meaning of DIF
    Procedures for Using DIF
    Test Analysis
    Scoring and Score Reporting
    Scoring
    Score Reporting
    Test Use
    Allegations of Misuse
    Opportunity to Learn
    Fairness Arguments
    Conclusion
    Notes
    References

CONTRACTING FOR TESTING SERVICES
    Different Ways to Contract for Testing Services
    Overview of This Chapter
    Planning for the Request for Proposals/Invitation to Bid
    Define the Scope of the Project
    Determine Available Resources
    Products and Services Needed
    Type of Requisition
    Single or Multiple Contractors
    Prequalification and/or Precontact With Potential Vendors
    Identify and Address Risks Associated With the Project
    Crafting the Request for Proposals/Invitation to Bid
    Summarize the Program History
    Determine Whether to Specify Level of Resources Available
    Decide on the Level of Specificity of the RFP
    Determine Whether Bidders Can Suggest Changes to the Desired Outcomes, Processes and Services
    Describe Desired Products and Services in Detail
    Proposed Management Plans
    Staffing
    Software Development
    Budget
    The Process of Bidding
    Proposal Development Time
    Determine How Bidders Can Raise Questions
    Pre-Bid Meeting
    Describe the Proposal Format
    Describe the Submission Process
    Bid Submission Process to Be Used
    Define the Proposal Evaluation Criteria
    Define the Proposal Evaluation Process
    Evaluating the Proposals
    Identify the Proposal Reviewers
    Check Bidders' References
    Carry Out the Proposed Review Process
    Determine If a "Best and Final Offer" Will Be Used
    Review of the BAFO(s)
    Make Final Decision
    Prepare Summary Notes and Report on Bidding Process
    Anticipating and Dealing With Award Protests
    Summary
    References

Part II. Content

DETERMINING CONTENT AND COGNITIVE DEMAND FOR ACHIEVEMENT TESTS
    Evolving Models of Assessment Design
    Approach to Assessment Design
    Conduct Domain Analysis and Modeling
    Articulate Knowledge and Skills
    Draft Claims
    Develop PLDs
    Develop Test Specifications
    Write Items to Measure Claims and Targeted Performance Standards
    Methods for Writing Items to Performance Standards
    Benefits and Challenges of Evidence-Centered Approach to Determining Content and Cognitive Demand of Achievement Tests
    Benefits
    Challenges
    Notes
    References

JOB ANALYSIS, PRACTICE ANALYSIS AND THE CONTENT OF CREDENTIALING EXAMINATIONS
    Methods of Job and Practice Analysis
    Practice Analysis Questionnaires
    Questionnaire Planning and Design
    Types of Rating Scales
    Development of Content Specifications
    Deciding on an Assessment Format
    Organization of Content Specifications
    From Practice Analysis to Topics and Weights
    Process-Oriented Specifications
    Content-Oriented Specifications
    SME Panel Meetings
    Knowledge Elicitation
    Principled Test Design and Construct Maps
    Linkage Exercise
    Verifying the Quality of Content Specifications
    Concluding Comments
    Notes
    References

LEARNING PROGRESSIONS AS A GUIDE FOR DESIGN: Recommendations Based on Observations From a Mathematics Assessment
    Definitions of Learning Progressions
    Validation of Learning Progressions
    Examples of Learning Progressions
    Using Learning Progressions in the Design of Assessments: An Example
    The Linear Functions Learning Progression
    The Moving Sidewalks Tasks
    Empirical Recovery of a Learning Progression for Linear Functions
    Scoring Using the Linear Functions Learning Progression
    Selected Findings
    Summary of Findings
    Conclusions and Recommendations
    Acknowledgments
    Note
    References

DESIGNING TESTS TO MEASURE PERSONAL ATTRIBUTES AND NONCOGNITIVE SKILLS
    Background
    Recent Frameworks for Specifying Personal Attributes and Noncognitive Skills
    Five-Factor Model
    Beyond the Big 5
    21st-Century Skills Frameworks
    Chicago Schools Consortium
    Collaborative for Academic, Social and Emotional Learning
    Large-Scale Assessment Frameworks
    Methods for Assessing Personal Attributes and Noncognitive Skills
    Self-Ratings
    Response Style Effects
    Anchoring Vignettes
    Forced-Choice and the Faking Problem in High-Stakes Testing
    Ipsative Scoring
    Item-Response Theory (IRT) Scoring
    Biodata and Personal Statements
    Passive Self-Report Data
    Ratings by Others
    Letters of Recommendation
    Situational Judgment Tests
    Interviews
    Noncognitive Tests (Performance Measures)
    Summary and Conclusions
    References

SETTING PERFORMANCE STANDARDS ON TESTS
    What Is Standard Setting?
    Standard-Setting Standards
    Common Considerations in Standard Setting
    Standard-Setting Methods
    The Angoff Method
    The Bookmark Method
    Examinee-Centered Methods
    The Contrasting Groups Method
    The Body of Work Method
    Methods Grounded in External Data
    Methods for Adjusting Cut Scores
    Vertically Moderated Standard Setting
    What Is Vertically Moderated Standard Setting?
    Approaches to VMSS
    Frontiers and Conclusions
    Notes
    References

Part III. Item Development and Scoring

WEB-BASED ITEM DEVELOPMENT AND BANKING
    Remote Authoring: Content Creation and Storage
    Administrative Features
    Metadata and Queries
    Test Assembly, Packaging and Interoperability
    Maintenance and Security
    Conclusions and Future Considerations
    Acknowledgment
    Notes
    References

SELECTED-RESPONSE ITEM DEVELOPMENT
    The Context of Item Writing
    Choosing the SR Item Format
    Item Writing: A Collaborative Effort
    A Current Taxonomy of SR Item Formats
    Multiple-Choice Formats
    Fill-in-the-Blank MC
    True-False
    Matching
    Testlet-Based Item Sets
    Guidelines for SR Item Writing
    Empirical Evidence for SR Item Writing Guidelines
    Content Concerns
    Evidence Regarding the Number of Options
    Evidence From Item and Test Accessibility Studies
    Gathering Validity Evidence to Support SR Item Development
    The Role of Items in the Interpretation/Use Argument
    Future of the Science of Item Writing
    Recommendations for the Test Developer
    Note
    References

DESIGN OF PERFORMANCE ASSESSMENTS IN EDUCATION
    Characteristics of Performance Assessments
    Design and Scoring of Performance Assessments
    Argument-Based Approach to Validity as the Foundation for Assessment Design
    Design of Performance Assessments
    Use of Principled Approaches to Test Design
    Specification of Task Demands
    Use of Task Models
    Use of Computer-Based Simulation Tasks
    Scoring Specifications for Performance Tasks
    Specification of Scoring Criteria
    Scoring Procedures
    Human and Automated Scoring
    Design of Administration Guidelines
    Psychometric Considerations in the Design of Performance Assessments
    Construct-Irrelevant Variance and Construct Underrepresentation
    Comparability
    Generalizability of Scores
    Rater Effects
    Local Item Dependency
    Differential Item Functioning
    Conclusion
    Note
    References

USING PERFORMANCE TASKS IN CREDENTIALING TESTS
    Distinguishing Features of Credentialing Tests
    Identifying the Important Performance Constructs
    Moving From Constructs to Tasks
    Why Use Performance Tasks?
    Common Types of Performance Tasks Used in Credentialing
    Examples of Current Credentialing Tests That Use Performance Tasks
    Scoring Performance Tasks for Credentialing Tests
    Selection of Data
    Scoring Procedures, Raters and Methods
    Scoring Resources and Cost
    The Impact of Performance Tasks on Reliability and Validity
    Reliability
    Potential Threats to Reliability and Generalizability
    Validity
    Potential Threats to Validity
    Conclusion
    Note
    References

COMPUTERIZED INNOVATIVE ITEM FORMATS
    Why Computer-Based Item Formats?
    Review of Current Computerized Item Formats
    Selection: Multiple-Choice and Its CBT Variants
    Reading
    Selection/Identification
    Reordering/Rearrangement
    Substitution/Correction
    Completion
    Construction
    Structural Considerations: Multiple Format Sets
    Validity Issues for Digital Item Formats
    Construct Representation and Construct-Irrelevant Variance
    Anxiety, Engagement and Other Psychological Factors
    Adaptive Testing and Test Anxiety
    Automated Scoring
    Test Speededness
    Test Security
    Intended and Unintended Consequences
    Quality Control
    Testing Students With Disabilities and English Learners
    Reducing Threats to Validity
    Summary of Validity Issues
    Benefits and Challenges of Computerized Item Formats
    Summary and Conclusions
    Note
    References

RECENT INNOVATIONS IN MACHINE SCORING OF STUDENT- AND TEST TAKER-WRITTEN AND -SPOKEN RESPONSES
    Machine Scoring: Definition, History and the Current Wave
    Expansion of Automated Essay Evaluation
    Limits to Machine Scoring
    Automated Essay Evaluation
    Background
    E-rater Features and Advisories
    Model Building and Evaluation
    AEE Applications and Future Directions
    Automated Student Assessment Prize Competitions on Essay Scoring
    C-rater: Educational Testing Service's Short-Answer System
    Concept Elicitation and Formalization
    Sentence Matching
    Making the Model More Robust
    Automated Student Assessment Prize Competition (Short-Answer)
    Speech Evaluation (SpeechRater)
    Reliability and Validity
    Guidance for Test Developers
    Notes
    References

LANGUAGE ISSUES IN ITEM DEVELOPMENT
    Perspective
    Methodologies for Identifying Multidimensionality Due to Linguistic Factors
    Linguistic Modification of Test Items: Practical Implications
    Linguistic Features That May Hinder Student Understanding of Test Items
    Word Frequency and Familiarity
    Word Length
    Sentence Length
    Voice of Verb Phrase
    Length of Nominals
    Complex Question Phrases
    Comparative Structures
    Prepositional Phrases
    Sentence and Discourse Structure
    Subordinate Clauses
    Conditional Clauses
    Relative Clauses
    Concrete Versus Abstract or Impersonal Presentations
    Negation
    Procedures for Linguistic Modification of Test Items
    Familiarity/Frequency of Nonmath Vocabulary
    Voice of Verb Phrase
    Length of Nominals
    Conditional Clauses
    Relative Clauses
    Complex Question Phrases
    Concrete Versus Abstract or Impersonal Presentations
    A Rubric for Assessing the Level of Linguistic Complexity of the Existing Test Items
    Analytical Rating
    Holistic Rating
    Instructions for the Incorporation of Linguistic Modification When Developing New Test Items
    Summary and Discussion
    References

ITEM AND TEST DESIGN CONSIDERATIONS FOR STUDENTS WITH SPECIAL NEEDS
    Key Terms and Concepts
    Students With Disabilities
    Achievement of Students With Disabilities
    Measurement Precision and Students With Disabilities
    Research on Key Instructional and Inclusive Testing Practices
    Item and Test Accessibility
    Testing Accommodations
    Changes in Performance Across Years
    Guidelines for Designing and Using Large-Scale Assessments for Students With Special Needs
    Conclusions
    References

ITEM ANALYSIS FOR SELECTED-RESPONSE TEST ITEMS
    Purposes of Item Analysis
    Dimensionality
    Coefficient Alpha
    Item Factor Analysis
    Subscore Validity
    Recommendation
    Estimating Item Difficulty and Discrimination
    Sample Composition
    Omits (O) and Not-Reached (NR) Responses
    Key Balancing and Shuffling
    Item Difficulty
    Item Discrimination
    Statistical Indices
    Tabular Methods
    Graphical Methods
    IRT Discrimination
    Criteria for Two Types of Evaluation of Difficulty and Discrimination
    Criterion-Referenced Evaluation
    Norm-Referenced Evaluation
    Item Discrimination and Dimensionality
    Criteria for Evaluating Difficulty and Discrimination
    Distractor Analysis
    Guessing
    Distractor Response Patterns
    Low-Frequency Distractor
    Point-Biserial of a Distractor
    Choice Mean
    Expected/Observed: A Chi-Squared Approach
    Trace Lines
    Special Topics Involving Item Analysis
    Using Item Response Patterns in the Evaluation and Planning of Student Learning
    Instructional Sensitivity
    Cheating
    Item Drift (Context Effects)
    Differential Item Functioning
    Person Fit
    Summary
    Note
    References

AUTOMATIC ITEM GENERATION
    Purpose of Chapter
    AIG Three-Step Method
    Step 1: Cognitive Model Development
    Step 2: Item Model Development
    Step 3: Generating Items Using Computer Technology
    Evaluating Word Similarity of Generated Items
    Multilingual Item Generation
    Summary
    The New Art and Science of Item Development
    Limitations and Next Steps
    Acknowledgments
    References

Part IV. Test Design and Assembly

PRACTICAL ISSUES IN DESIGNING AND MAINTAINING MULTIPLE TEST FORMS
    Design
    Score Use
    Test Validation Plan
    Test Content Considerations
    Psychometric Considerations
    Test Delivery Platform
    Implement
    Test Inventory Needs
    Item Development Needs
    Building Equivalent Forms
    Test Equating
    Test Security Issues
    Sustaining the Development of Equivalent Forms
    Maintaining Scale Meaning
    Practical Guidelines and Concluding Comments
    Acknowledgment
    References

VERTICAL SCALES
    Defining Growth and Test Content for Vertical Scales
    Data Collection Designs
    Common Item Designs
    Common Person Design
    Equivalent-Groups Design
    Choosing a Data Collection Design
    Methodologies for Linking Test Forms
    Evaluating Item Response Theory Assumptions
    Item Response Theory Scaling Models
    Multidimensional IRT Models
    Estimation Strategies
    Person Ability Estimation
    Choosing a Linking Methodology
    Evaluating Vertical Scales
    Maintaining Vertical Scales Over Time
    Using Horizontal Links
    Using Vertical Links
    Combining Information From Horizontal and Vertical Links
    Developing Vertical Scales in Practice: Advice for State and School District Testing Programs
    State Your Assumptions
    The Choice of Data Collection Design
    The Choice of Linking Methodology
    Tying the Vertical Scale to Performance Standards
    References

DESIGNING COMPUTERIZED ADAPTIVE TESTS
    Considerations in Adopting CAT
    Changed Measurement
    Improved Measurement Precision and Efficiency
    Increased Operational Convenience for Some, Decreased for Others
    Stakes and Security
    Test Taker Volume
    CAT Concepts and Methods
    Test Specifications
    Item Types and Formats
    Item Pools
    Pool Size
    Pool Composition
    Item Selection and Test Scoring Procedures
    Measurement Considerations
    Content Considerations
    Exposure Control
    Proficiency Estimation and Test Scoring
    Implementing an Adaptive Test
    Developing Test Specifications
    Can Test Specifications Be Trusted to Assemble Proper Forms?
    Do Test Specifications Guarantee That Different Tests Measure the Same Trait?
    How Does IRT Scoring Impact Test Specifications?
    Test Precision and Length
    Choosing Item Selection and Test Scoring Procedures
    Item Banks, Item Pools, Item Calibration and Pretesting
    Establishing an Item Bank
    Calibrating and Scaling an Item Bank
    Evaluating Test Designs and Item Pools Through Simulation
    Evaluating Item Pools Through Simulation
    Evaluating Test Designs
    Fairly Comparing Test Designs
    Conclusion
    Notes
    References

APPLICATIONS OF ITEM RESPONSE THEORY: Item and Test Information Functions for Designing and Building Mastery Tests
    IRT Item and Test Characteristic and Information Functions
    IRT Information Functions
    Some Useful Extensions of IRT Information Functions for Mastery Testing
    Generating Target Test Information Functions (TIF)
    Some Considerations for TIF Targeting
    The Analytical TIF Generating Method
    Automated Test Assembly
    Item Bank Inventory Management
    Some Recommended Test Development Strategies
    Notes
    References

OPTIMAL TEST ASSEMBLY
    Birnbaum's Method
    First Example of an OTA Problem
    MIP Solvers
    Test Specifications
    Definition of Test Specification
    Attributes
    Requirements
    Standard Form
    A Few Common Constraints
    Test-Assembly GUI
    Examples of OTA Applications
    Assembly of an Anchor Form
    Multiple-Form Assembly
    Formatted Test Forms
    Adaptive Testing
    Newer Developments
    Conclusion
    References

Part V. Production, Preparation, Administration, Reporting, Documentation and Evaluation

TEST PRODUCTION
    Adopting a Publishing Perspective
    Start-Up Phase
    Pretest Phase
    Publication Phase
    Test Format and Method of Delivery
    General Considerations
    Paper-and-Pencil Tests
    Computer-Based Tests
    Delivery Presentation
    Implementing the CBT Format
    Procedures and Quality Control
    Typical Procedure
    Quality Control
    References

PREPARING EXAMINEES FOR TEST TAKING: Guidelines for Test Developers
    Controversial Issues in Test Preparation
    Terminology
    Construct-Irrelevant Variance (CIV)
    Accessibility
    Focus and Format of Test Preparation
    Efficacy of Test Preparation
    Efficacy of Preparation for College Admission Testing
    Equity Issues in College Admission Testing
    Efficacy of Preparation for Essays
    Caveat Emptor
    Standards Related to Test Preparation
    Research That Can Inform Practices and Policies
    Summary of Recommendations
    References

TEST ADMINISTRATION
    Test Administration
    Test Administration Threats to Validity
    Test Administration and CU
    Test Administration and CIV
    CIV and Test Delivery Format
    Administration-Related Sources of CIV
    Physical Environment
    Instructions, Equipment and Support
    Time Limits and Speededness
    Alterations Change What Test Measures
    Test Administrator Effects
    Fraud and Security
    Efforts That Enhance Accuracy and Comparability of Scores
    Efforts That Enhance Standardization
    Selecting Appropriate Test Administrators
    Test Administrator Training
    Detecting and Preventing Administration Irregularities
    Quality Control Checks
    Check Preparedness
    Check Authenticity
    Check for Unauthorized Materials/Devices
    Check Proper Test
    Check Active Monitoring
    Check End-of-Session Activities
    Minimizing Risk Exposures
    Test Administrator Job Aid
    Check-In
    Test Session
    Check-Out
    Summary
    References

A MODEL AND GOOD PRACTICES FOR SCORE REPORTING
    Background on Reports, Report Delivery and Report Contents
    Paper and Digital Delivery of Reports
    Individual and Group Reporting
    Report Contents
    The Hambleton and Zenisky (2013) Model
    Evaluating Reports: Process, Appearance and Contents
    Promising Directions for Reporting
    Subscore Reporting
    Confidence Bands
    Growth Models and Projections
    Conclusions
    References

DOCUMENTATION TO SUPPORT TEST SCORE INTERPRETATION AND USE
    What Is Documentation to Support Test Score Interpretation and Use?
    Requirements and Guidance for Testing Program Documentation in the Standards for Educational and Psychological Testing
    Requirements for Testing Program Documentation in the No Child Left Behind Peer Review Guidance
    What Are Current Practices in Technical Reporting and Documentation?
    Technical Reporting and Documentation Practices in K-12 Educational Testing Programs
    Other Technical Documentation
    Technical Reporting and Documentation Practices in Certification and Licensure Testing Programs
    Other Technical Documentation
    Constructing Validity Arguments
    Validity Arguments and Current Validity Theory
    Using Evidence From Technical and Other Documentation to Construct a Validity Argument
    A Proposal: The Interpretation/Use Argument Report, or IUA Report
    Developing Interpretative Arguments to Support the Validity Argument
    Sources of Validity Evidence, Research Questions, Challenges to Validity and Topics for the IUA Report
    Discussion and Conclusion
    Acknowledgments
    Notes
    References

TEST EVALUATION
    The History of Test Evaluation in the U.S.
    Types of Test Evaluations: Reviews, Accreditation and Certification
    Professional Standards and the Basis for Test Evaluation
    Dimensions Upon Which Test Evaluation Is Based
    Validity
    Reliability
    Fairness
    Utility
    The Internationalization of Test Reviewing
    Test Reviewers
    Test Reviews
    Volume of Reviews
    Limitations and Challenges in Test Review and Evaluation
    Conclusion
    References