It is useful to consider now how CA can work as well as it does even though its inclusion of context, while better than that of approach 1, is still so impoverished. After all, all it includes is two token frequencies (e.g., 200 and 1000 for w1) rather than two type frequencies and their type-token distributions, let alone dispersion.

As I see it, CA works as well as it does, especially when used with pFYE, for several reasons, most of which are typically not recognized. First, as Ellis & Ferreira-Junior (2009) show, the correlation of pFYE with the above-mentioned ΔP measure of associative learning is high. This is so because CA approximates the type-token distributions of approaches 3 and 4 by including the corresponding token frequencies outside of c1 (200) and in c1 without w1 (1000) rather than ignoring them as raw frequencies do, and because, as a p-value, pFYE weighs observed percentages of co-occurrence more strongly as the overall n of a 2x2 table increases (recall Section 2.3). This logic can be visualized as in Figure 8, which represents the frequencies of words a to m in construction c as well as their attraction to c. According to approach 1, the value that reflects how important a is for the analysis of c is the horizontal line at the bottom, the line from the origin to the x-value (the frequency) of a in c. However, AMs add information (on the y-axis), so the value that reflects how important a is for the analysis of c becomes the line from the origin to a in the top right corner; this additional information is one reason why approach 2/CA often does better than approach 1.
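The sensitivity of pFYE to the overall n of a 2x2 table can be illustrated with a minimal sketch. It assumes a one-tailed Fisher-Yates exact test computed from the hypergeometric distribution and a simple difference of proportions as the ΔP-style measure. The cell values 200 and 1000 come from the example above; the co-occurrence frequency (80) and the remainder cell (100000) are hypothetical, as are the function names. The second table divides every cell by 10, so the proportions stay the same while n shrinks.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-tailed Fisher-Yates exact p-value for the 2x2 table [[a, b], [c, d]].

    a: w1 in c1, b: w1 elsewhere, c: other words in c1, d: everything else.
    Returns P(X >= a) under the hypergeometric null of no association.
    """
    row1 = a + b          # all tokens of w1
    col1 = a + c          # all tokens of c1
    n = a + b + c + d     # corpus size
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, col1)

def delta_p_construction_given_word(a, b, c, d):
    """Difference of proportions: P(c1 | w1) - P(c1 | not w1)."""
    return a / (a + b) - c / (c + d)

# Hypothetical tables: same proportions, tenfold difference in n.
large = (80, 200, 1000, 100000)
small = (8, 20, 100, 10000)

for label, table in (("large n", large), ("small n", small)):
    p = fisher_exact_greater(*table)
    dp = delta_p_construction_given_word(*table)
    print(f"{label}: pFYE = {p:.3e}, deltaP = {dp:.4f}")
```

In this sketch the difference of proportions is identical for both tables, whereas the p-value is smaller for the larger table, which is the sense in which pFYE weighs the same observed percentages more strongly as n increases.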

There is a second and theoretically more important reason why CA works, and this is concerned with a characteristic of language that is, with some exceptions, topicalized too little in the usage-based approach: the Zipfian distribution that linguistic elements within, say, syntactically-defined slots, exhibit. What makes CA work most of the time is that the 200 and 1000 elsewhere-uses in the above example of w1 will be Zipfian-distributed, which in turn means that, especially with high frequencies of a word in a construction, all other uses will be much rarer and thus not distort the data much.

Figure 8. The comparison of a frequency- vs an AM-based approach

Finally, there is an implication of Zipfian distributions that is little commented on in cognitive/usage-based linguistics: We know that the frequencies of words in constructional slots are Zipfian-distributed (Ellis & Ferreira-Junior 2009), we know that skewed low-variance distributions lead to better learning than balanced ones (Goldberg et al. 2004; Goldberg 2006), and we know that type frequency is related to productivity (Goldberg 2006: 99). Goldberg (2006: 89) speculates that this "may involve a type of cognitive anchoring," but I think another (though not incompatible) perspective is to realize that Zipfian distributions involve less uncertainty than random, uniform, or less Zipfian distributions: The more tokens are accounted for by fewer types, the lower the entropy of the distribution, as is exemplified informally in Figure 9.
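The relation between skew and entropy can be checked with a small sketch. It assumes Shannon entropy in bits and idealized (expected, non-integer) Zipfian frequencies proportional to 1/rank^exponent; the function names, the number of types (16), the number of tokens (1000), and the exponents are hypothetical choices for illustration and are not the data behind Figure 9.

```python
from math import log2

def shannon_entropy(freqs):
    """Shannon entropy (in bits) of a frequency distribution over types."""
    total = sum(freqs)
    return -sum((f / total) * log2(f / total) for f in freqs if f > 0)

def zipfian_freqs(n_types, n_tokens, exponent):
    """Expected frequencies proportional to 1 / rank**exponent.

    An exponent of 0 yields a uniform distribution; larger exponents
    concentrate more tokens on fewer types.
    """
    weights = [1 / rank**exponent for rank in range(1, n_types + 1)]
    norm = sum(weights)
    return [n_tokens * w / norm for w in weights]

# Hypothetical slot with 16 types and 1000 tokens.
for exponent in (0.0, 0.5, 1.0, 2.0):
    freqs = zipfian_freqs(16, 1000, exponent)
    h = shannon_entropy(freqs)
    top_share = freqs[0] / sum(freqs)
    print(f"exponent {exponent}: H = {h:.3f} bits, top type share = {top_share:.2%}")
```

Under these assumptions, entropy is highest for the uniform case (log2 of the number of types) and falls as the exponent grows, that is, as more tokens are accounted for by fewer types.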

Thus, the notion of entropy not only highlights the need to go beyond approaches 1 and 2, but also unites many findings in cognitive-linguistic theorizing under one umbrella. In fact, it unites cognitive-linguistic theorizing with recent approaches in psycholinguistics that study constructional choices on the basis of notions such as surprisal (an information-theoretic operationalization of 'surprise', cf. Jaeger & Snider 2008) and uniform information density (Frank & Jaeger 2008) or study how category learning and productivity arise from Hebbian learning from Zipfian input with low entropy (cf. Zeldes 2011).

Figure 9. Pseudo-random Zipfian distributions and their entropies
