Cosine Similarity Index (CSI)

The similarity of the generated items can be evaluated using the CSI (Lai, 2013; Singhal, 2001). The CSI provides a statistical measure of syntactic similarity (i.e., item heterogeneity) using a sample of generated items. In Chapter 3, we presented item models that contain different layers of information. The purpose of generation using a 1-layer model is to produce items by manipulating a small number of elements at a single layer in the model. The benefit of this approach is that the models are easy to create; the drawback is that the generated items may be quite similar to one another. To overcome this limitation, generation using an n-layer model can be conducted to produce more diverse and heterogeneous items. Models with two or more layers contain elements at two or more levels in the model. The benefit of n-layer modelling is that the generated items tend to be much more diverse because a larger number of elements are manipulated at more than one layer in the model. The CSI is a quantitative measure of syntactic similarity that can be used to evaluate the success of the item modelling approach when the goal is to produce more diverse items. To measure and compare the similarity of items created using models with different layers of content, intra-model differences (i.e., differences among items generated within the same model) can be assessed. The CSI measures the syntactic similarity between two vectors of co-occurring text. It is computed as the cosine of the angle between the two vectors in a multidimensional space of unique words. The CSI is expressed as

CSI = cos(θ) = (A · B) / (‖A‖ ‖B‖)

where cos(θ) is the cosine of the angle, while A and B are two items expressed in a binary vector of word occurrences. For example, if A is a list of three words (e.g., run, jog, crawl) and B is a list of three words (e.g., move, jog, motion), then the length of both binary vectors is the number of unique words used across both lists (i.e., run, jog, crawl, move, motion). To express A and B as vectors so that the words can be compared, the occurrence of each word in the list is assigned a value of 1. The resulting vectors for A and B in our example are [1,1,1,0,0] and [0,1,0,1,1]. The CSI has a minimum value of 0, meaning that no word overlapped between the two vectors, and a maximum of 1, meaning that the text represented by the two vectors is identical.

To compare the word similarity of items created using a 1-layer and a 2-layer model, we return to the examples introduced in Chapter 2. Items were generated for both the logical structures math model and the key features medical model. Then, to compare the word similarity among the generated items, a sample of 100 items from each model was randomly selected and analyzed using the CSI. To compute the CSI, the items were compiled into a matrix of word occurrences for each item model, where each row represents the vector of a generated item, each column represents a unique word in the pool of generated items, and each row-by-column cell is coded to indicate whether a given item contains a given word. The CSI was calculated for each unique item pair within the same item model. The outcome is a CSI mean and standard deviation for each model that, in turn, can be compared between models.
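The matrix-and-pairwise procedure described above can be sketched as follows. This is a hypothetical implementation: the naive whitespace tokenizer and the function name are our assumptions, not details from the text:

```python
from itertools import combinations
from statistics import mean, stdev

def model_csi_summary(items):
    """Mean and standard deviation of the CSI over all unique item
    pairs generated from a single item model.

    `items` is a list of generated item texts; each is tokenized
    naively on whitespace (an assumption for illustration)."""
    word_sets = [set(text.lower().split()) for text in items]
    vocab = sorted(set().union(*word_sets))
    # Each row is an item's binary vector; each column is a unique word.
    matrix = [[1 if w in ws else 0 for w in vocab] for ws in word_sets]

    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / ((sum(x * x for x in u) ** 0.5) *
                      (sum(y * y for y in v) ** 0.5))

    # CSI for every unique pair of items within the model.
    scores = [cos(matrix[i], matrix[j])
              for i, j in combinations(range(len(matrix)), 2)]
    return mean(scores), stdev(scores)
```

Calling `model_csi_summary` on a sample of items from each model yields the per-model mean and standard deviation that are compared in Table 7.4.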

A comparison of the CSI results from 1-layer and 2-layer item models in mathematics and medicine is presented in Table 7.4. One important advantage of using an n-layer item model is that more elements can be manipulated simultaneously, thereby expanding the generative capacity of the model. Another important advantage is that the generated items will likely be quite different from one another because more content in the model is manipulated. The results in Table 7.4 provide a statistical summary of the outcome from the layering process. For mathematics, the 1-layer model with only the correct option produced a mean of 0.88 and a standard deviation of 0.04. Because the CSI ranges from 0 (no similarity) to 1 (perfect similarity), a high mean for the 1-layer math model indicates that the generated items are quite similar, while the low standard deviation reveals that the items are relatively homogeneous. The addition of the distractors does produce more heterogeneous items. The 1-layer model with the correct option and the distractors produced a mean of 0.77 (lower CSI mean compared to correct option only) and a standard deviation of 0.07 (higher CSI standard deviation compared to correct option only). A dramatic change occurs in the similarity measure with the use of n-layering. The 2-layer model with only the correct option produced a mean of 0.37 and a standard deviation of 0.32. With the addition of the distractors, the 2-layer model produced an even lower mean of 0.34 and a higher standard deviation of 0.38.

Table 7.4 Summary of CSI as a Function of Content Area and Item Model

                          Mathematics                     Medicine
                  Correct    Correct Option       Correct    Correct Option
                  Option     + Distractors        Option     + Distractors

1-Layer Model
  Mean            0.88       0.77                 0.78       0.77
  SD              0.04       0.07                 0.11       0.13

2-Layer Model
  Mean            0.37       0.34                 0.58       0.52
  SD              0.32       0.38                 0.20       0.25
For medicine, the 1-layer model with only the correct option produced a mean of 0.78 and a standard deviation of 0.11. As with mathematics, the addition of the distractors yields more item diversity even with a single layer. The 1-layer model with the correct option and the distractors produced a mean of 0.77 and a standard deviation of 0.13. A noteworthy difference is produced with n-layering. The 2-layer model for medicine with only the correct option produced a mean of 0.58 and a standard deviation of 0.20. With the addition of the distractors, the 2-layer model produced a lower mean of 0.52 and a higher standard deviation of 0.25. In sum, by introducing layered elements, more diverse items can be generated because the 1-layer model produces items that form a subset of the items created using the 2-layer model. This diversity can be measured using the CSI. In our example, the logical structures math model produced more diverse items than the key features medical model because logical structures contained more variation among the elements in the stem and options. The selected-response items are more diverse than the constructed-response items because the inclusion of the distractors increases the variability of the generated items.

The Key to Validating Generated Items

In this chapter, we presented alternative methods for validating the generated items. To ensure that the models used for generation are of high quality, a validation table can be reviewed because it contains all of the relevant information required for producing the generated outcomes. To ensure the items that are generated meet high standards of quality, an item review can be implemented so that the SME can evaluate the generated outcomes. A comparison between the traditional and generated items—a contentious comparison that is often requested—can also be included as a part of the item review to confirm that the two item development methods produce comparable results. To ensure that the models produce generated items with acceptable statistical properties, psychometric and computational linguistic methods can be used to evaluate the generated outcomes. These validation methods provide critical sources of evidence that are often needed as testing organizations move from a traditional item development process to AIG.

While many different methods can be used to evaluate item quality, the methods we present have one important commonality—model reviews, item reviews, and statistical analyses all rely on the interpretation and scrutiny of the SME. Model reviews depend on the SME's judgements about whether the content is organized appropriately in the cognitive and item models. Item reviews depend on the SME's judgements about whether the items meet the appropriate standards of quality. Statistical analyses depend on the SME's judgements about whether the generated items yield appropriate psychometric outcomes. Hence SMEs who are competent in the subject matter and who are familiar with the AIG development process are critical for implementing and interpreting the methods necessary to validate generated items.


Bejar, I. I. (2011). A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy & Practice, 18, 319-341.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

Embretson, S. E., & Kingston, N. M. (2018). Automatic item generation: A more efficient process for developing mathematics achievement items? Journal of Educational Measurement, 55, 112-131.

Gierl, M. J., & Lai, H. (2016). A process for reviewing and evaluating generated test items. Educational Measurement: Issues and Practice, 35, 6-20.

Gierl, M. J., Lai, H., Pugh, D., Touchie, C., Boulais, A.-P., & De Champlain, A. (2016). Evaluating the psychometric characteristics of generated multiple-choice test items. Applied Measurement in Education, 29, 196-210.

Gierl, M. J., Latifi, F., Lai, H., Matovinovic, D., & Boughton, K. (2016). Using automated processes to generate items to measure K-12 science skills. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Handbook of Research on Computational Tools for Real-World Skill Development (pp. 590-610). Hershey, PA: IGI Global.

Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice item? Educational and Psychological Measurement, 53, 999-1010.

Lai, H. (2013). Developing a Framework and Demonstrating a Systematic Process for Generating Medical Test Items (Doctoral Dissertation). Retrieved from https://doi.org/10.7939/R3C93H

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Mahwah, NJ: Erlbaum.

Pugh, D., De Champlain, A., Gierl, M. J., Lai, H., & Touchie, C. (2016). Using cognitive models to develop quality multiple-choice questions. Medical Teacher, 38, 838-843.

Pugh, D., De Champlain, A., Gierl, M. J., Lai, H., & Touchie, C. (2020). Can automated item generation be used to develop high quality MCQs that assess application of knowledge? Research and Practice in Technology Enhanced Learning, 15.

Samejima, F. (1979). A new family of models for the multiple choice item (Research Rep. No. 79-4). Knoxville, TN: University of Tennessee.

Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 307-353). Westport, CT: National Council on Measurement in Education and American Council on Education.

Singhal, A. (2001). Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24, 35-43.
