
Computing forward associations


As discussed in the introduction, we assume that there is a relationship between word associations as collected from human subjects and word co-occurrences as observed in a corpus. As our source of human data, we use the Edinburgh Associative Thesaurus (EAT; [KIS 73, KIS 75]), which is the largest classical collection of its kind[1]. The EAT comprises about 100 associative responses, collected from British students, for each of its 8,400 stimulus terms. Since some of these stimulus terms are multiword units, which we did not want to include here, we removed them from the thesaurus, leaving 8,210 items.

To obtain the required co-occurrence counts, we aimed for a corpus that is as representative as possible of the language environment of the EAT’s British test subjects. We therefore chose the British National Corpus (BNC), a 100-million-word corpus of written and spoken language, which was compiled with the intention of providing a balanced sample of British English [BUR 98]. For our purpose, it is also an advantage that the texts in the BNC are not very recent (from 1960 to 1993) and thus cover the time period when the EAT data was collected (between June 1968 and May 1971).

Since function words were not considered important for our analysis of word semantics, we decided to remove them from the text to reduce memory requirements and processing time. This was done on the basis of a list of approximately 200 English function words. We also decided to lemmatize the corpus using the lexicon of full forms provided by Karp et al. [KAR 92]. This not only alleviates the problem of data sparseness, but also significantly reduces the size of the co-occurrence matrix to be computed. Since most word forms are unambiguous with respect to their possible lemmas, we only conducted a partial lemmatization that does not take the context of a word into account and thus leaves the relatively few words with several possible lemmas unchanged. For consistency, we applied the same lemmatization procedure to the whole EAT. Note that, as the EAT contains only isolated words, a lemmatization procedure that takes the context of a word into account would not be possible in this case.
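The preprocessing described above can be illustrated with a minimal sketch. The function-word set and the full-form lexicon below are tiny hypothetical stand-ins (the actual 200-word list and the lexicon of Karp et al. are not reproduced here), and the order of the two steps (filtering first, then lemmatizing) is an assumption made for illustration.

```python
# Hypothetical toy resources, for illustration only.
FUNCTION_WORDS = {"the", "a", "of", "on", "and"}
FULL_FORM_LEXICON = {
    "cats": {"cat"},
    "sat": {"sit"},
    "mats": {"mat"},
    "saw": {"see", "saw"},  # ambiguous: form of "see" or the noun "saw"
}

def preprocess(tokens):
    """Drop function words, then lemmatize only unambiguous word forms."""
    result = []
    for token in tokens:
        word = token.lower()
        if word in FUNCTION_WORDS:
            continue
        # Unknown word forms are treated as their own lemma.
        lemmas = FULL_FORM_LEXICON.get(word, {word})
        # Context-free partial lemmatization: replace the form only if it
        # has exactly one possible lemma; otherwise leave it unchanged.
        result.append(next(iter(lemmas)) if len(lemmas) == 1 else word)
    return result
```

For example, `preprocess("The cats sat on the mats".split())` yields the lemma sequence for "cat", "sit" and "mat", whereas the ambiguous form "saw" passes through unchanged.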

For counting word co-occurrences, as in most other studies, a fixed window size is chosen and it is determined how often each pair of words occurs within a text window of this size. Choosing a window size usually means a trade-off between two parameters: specificity versus the sparse-data problem. The smaller the window, the more salient the associative relations between the words inside the window, but the more severe the problem of data sparseness. In our case, with ±2 words, the window size looks rather small. However, this can be justified since we have reduced the effects of data sparseness by using a large corpus and by lemmatizing the corpus. It should also be noted that a window size of ±2 applied after elimination of the function words is comparable to a window size of ±4 applied to the original texts (assuming that roughly every second word is a function word).

Based on the window size of ±2, we computed the co-occurrence matrix for the corpus. By storing it as a sparse matrix, it was feasible to include all of the approximately 375,000 lemmas occurring in the BNC.
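As a rough sketch of window-based counting with sparse storage, the following function counts pairs within a ±2 window over an already preprocessed token sequence. Using a `Counter` keyed by word pairs as the sparse representation, and counting each ordered pair separately, are assumptions for illustration; the chapter does not specify its data structure.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count, for each ordered pair (w, c), how often c occurs within
    +/- `window` positions of w. Only observed pairs are stored."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts
```

Because only pairs that actually occur are stored, memory grows with the number of observed pairs rather than with the square of the vocabulary size, which is the property that makes a vocabulary of several hundred thousand lemmas manageable.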

Although word associations can be successfully computed based on raw word co-occurrence counts, the results can be improved when the observed co-occurrence frequencies are transformed by a function that reduces the effects of absolute word frequency. Because it is well established, we decided to use the log-likelihood ratio [DUN 93] as our association measure. It compares the observed co-occurrence counts with the expected co-occurrence counts, thus strengthening significant word pairs and weakening incidental word pairs. In the remainder of this paper, we refer to co-occurrence vectors and matrices that have been transformed in this way as association vectors and matrices.
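The comparison of observed and expected counts can be sketched with the common 2×2 contingency-table formulation of the log-likelihood ratio. How the chapter derives the four cells from the windowed counts is not stated, so the cell definitions below (`k11`, `k12`, `k21`, `k22`) are assumptions for illustration.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of observed counts.

    Assumed cell meanings:
      k11: both words co-occur      k12: first word without second
      k21: second word without first k22: neither word
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

When the observed counts match what independence predicts, the score is zero; the more the observed co-occurrence count deviates from its expected value, the larger the score, which is the sense in which significant pairs are strengthened and incidental ones weakened.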

  • [1] An even larger, though possibly more noisy, association database has been collected via online gaming at