Towards a new empirical perspective and its theoretical implications

A cline of co-occurrence complexity and its motivations/implications

So far this chapter has been concerned with documenting how CA is, contrary to Bybee's claims, a good tool for the analysis of co-occurrence data from corpora. However, it is now worth returning in more detail to two questions that were discussed only briefly above: (i) why exactly does CA provide the (relatively) good results that it does and (ii) what is the cognitive mechanism that it reflects/assumes? In what follows, I will discuss these issues in detail because a more elaborate treatment of them has profound implications for how (different kinds of) data inform cognitive-linguistic theory and establish connections to other theoretical approaches. To explore the answers to these questions and their implications, I will outline a cline of co-occurrence complexity for the study of corpus data and, as this cline is built up, discuss how each step of increased methodological complexity is motivated theoretically; ultimately, this build-up will result in what I think is a necessary clarification of what a usage-/exemplar-based approach entails both in terms of data and of theoretical notions such as construction.

Approach 1: Raw frequencies/percentages

As a first step on the co-occurrence cline, let's look at a raw frequency/percentage type of approach, which is represented in Figure 4: "w1", "w2", etc. and "c1" stand for 'word 1', 'word 2', etc. (e.g., give, tell, etc.) and 'construction 1' (e.g., the ditransitive) respectively.


Figure 4. Approach 1: Observed frequencies of words 1-x in construction 1

This information is often easy to obtain and can be useful in a variety of applications, as Bybee and others have shown. As argued above, however, this approach is also extremely restrictive in that it adopts a very limited view of the more complex reality of use. Among other things, it focuses on only one context, c1, and does not take uses of w1, w2, etc. outside of c1 into consideration, something which the next approach, AMs, does.

Approach 2: Association measures

As argued in detail above, AMs consider uses of w1, w2, ... outside of c1; cf. Figure 5. The bold figures 80, 60, and 40 here correspond to those in Figure 4; the italics will be explained below.


Figure 5. Approach 2: AMs for occurrences of words 1-3 (of x) in construction 1

Obviously, Figure 5 illustrates a more comprehensive approach than Figure 4: This is true in the trivial sense that all the information in Figure 4 is also present in Figure 5, plus more, namely the token frequencies of the words w1-3 outside of c1 and the frequency of c1. But this is also true in the sense that this is the CA approach that, as discussed above, proved superior in terms of explaining completion preferences, reading times, and learner uptake.
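To make the difference between the two approaches concrete, the 2×2 co-occurrence table underlying an AM for w1 and c1 can be sketched as follows. This is a minimal illustration using the log-likelihood ratio G² as the AM (the CA studies discussed above typically use the Fisher-Yates exact test instead); the frequencies 80, 200, and 1000 are those of the running example, while the corpus size of 100,000 is a purely hypothetical assumption.

```python
import math

def g2(o11, o12, o21, o22):
    """Log-likelihood ratio G2 for a 2x2 word-by-construction table."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    g = 0.0
    for obs, r, c in ((o11, 0, 0), (o12, 0, 1), (o21, 1, 0), (o22, 1, 1)):
        exp = rows[r] * cols[c] / n        # expected frequency under independence
        if obs > 0:
            g += obs * math.log(obs / exp)
    return 2 * g

# w1 in c1: 80; w1 elsewhere: 200; c1 with other words: 1000;
# the corpus size of 100,000 (hence 98,720 for neither-w1-nor-c1) is hypothetical
score = g2(80, 200, 1000, 98720)
```

Note how the score depends on all four cells, i.e., precisely on the information that approach 1 discards.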

It is probably fair to say that, in general, approach 2 is one of the more sophisticated ways in which co-occurrence data are explored in contemporary usage-based linguistics. However, while I have been defending just this AM approach against the even simpler approach of Figure 4, it is still only a caricature of what is necessary, as we will see in the next section.

Approach 3: Full cross-tabulation

Figure 6 shows the next step on the cline, a full cross-tabulation of words and their uses in contexts/constructions.

Again, this approach is more comprehensive than the preceding ones; it contains all their information, and more. This additional information is very relevant within usage-based theory and should, therefore, also figure prominently in usage-based analyses of data.

First, approach 3 provides crucial information on type frequencies that both previous approaches miss. Approach 1 only stated that w1 occurs in c1; approach 2 stated that w1 occurs in c1 but also elsewhere and that c1 occurs with w1 and also elsewhere. Approach 3, however, zooms in on the 200 elsewhere-uses of w1 and the 1000 elsewhere-uses of c1 (italicized in Figure 5) by revealing, for instance, that w1 occurs in 6 out of the 15 constructions; analogously for the 310 and 420 elsewhere-uses of w2 and w3 in 2 and all 15 constructions respectively, etc.


Figure 6. Approach 3: Cross-tabulation of words w1-20 and constructions c1-15. The row/column 'types' represents the number of constructions/words a word/construction is attested with. The row/column H represents the uncertainty/entropy of the token distributions.[1]
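The 'types' and H values of such a cross-tabulation are straightforward to compute. The following sketch does so for w1, using the token distribution over the 6 constructions it occurs in (80 in c1 plus 90, 45, 35, 25, and 5 in c2-6) that figures in the running example; the remaining rows and columns of Figure 6 would be handled analogously.

```python
import math

def entropy(freqs):
    """Shannon entropy (log base 2) of a token distribution."""
    total = sum(freqs)
    return -sum(f / total * math.log2(f / total) for f in freqs if f > 0)

# token distribution of w1 over the constructions it occurs in
w1 = {"c1": 80, "c2": 90, "c3": 45, "c4": 35, "c5": 25, "c6": 5}

types = len([f for f in w1.values() if f > 0])  # type frequency of w1: 6
h = entropy(list(w1.values()))                  # uncertainty of w1's distribution
```

The value of h falls between 0 (w1 attested in only one construction) and log2(6) (w1 spread perfectly evenly over its 6 constructions), which is exactly the information the H row/column of Figure 6 summarizes.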

This kind of type-frequency information is already important for many pertinent reasons. On the one hand, there are results showing that type frequencies are relevant to acquisition, and recent studies on a new AM that incorporates type frequencies, gravity, have yielded very promising results (cf. Daudaravicius & Marcinkeviciene 2004, Gries 2010a). However, there is an even more important theoretical motivation, namely how type frequencies tie in with psycholinguistic/cognitive-psychological theories. Consider, for instance, the so-called fan effect, which is "[s]imply put, the more things that are learned about a concept [the more factual associations fan out from the concept], the longer it takes to retrieve any one of those facts" (Radvansky 1999: 198).[2] While the analogy is admittedly crude, the first clause can be seen as involving the number of connections (i.e., a kind of type frequency) between, say, a construction and the range of words that can be used in it (or a word and the range of constructions it can be used in). Following this analogy, in a cognitive architecture such as Anderson's ACT-R theory, the strength of activation between a source of activation j and a fact i is dependent on the log of the fan: "activation [...] will decrease as a logarithmic function of the fan associated with the concept. [...] the strengths of associations decrease with fan because the probability of any fact, given the concept, decreases with fan" (Anderson & Reder 1999: 188). For the association of a word to constructions, this would mean that the strength of the word's associations will be affected by the number of constructions to which it is connected, and vice versa for the association of a construction to words, which shows that the number of types with which words/constructions occur is, contra approach 1, undoubtedly cognitively relevant. In fact, as I will discuss now, it is not just this type frequency that is important.
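The logarithmic decrease Anderson & Reder describe can be sketched with ACT-R's associative-strength equation S_ji = S − ln(fan_j); the baseline value S = 2 used below is an arbitrary illustrative constant, not a value from the literature.

```python
import math

def assoc_strength(fan, S=2.0):
    """ACT-R associative strength S_ji = S - ln(fan_j): the more facts
    (here: constructions) a concept (here: a word) is linked to, the
    weaker each individual association. S is an illustrative baseline."""
    return S - math.log(fan)

# a word attested in 1, 3, and 15 constructions respectively:
# each additional construction type weakens the word-construction links
strengths = [assoc_strength(f) for f in (1, 3, 15)]
```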

Second, approach 3 provides not just the type frequencies just discussed, but also the type-token distributions: Not only do we now know that w1 appears in c1 and in 5 other constructions, we also know with which (italicized) frequencies it does so (80 in c1, plus 90, 45, 35, 25, and 5 instances in c2-6); analogously for the other words and the other constructions. This raises an important issue which most usage-based theorizing discusses very little: Is there any reason to regard this level of resolution as relevant, especially given Bybee's (2010: 100f.) question, "[b]y what cognitive mechanism does a language user devalue a lexeme in a construction if it is of high frequency generally?" In approach 1, of course, the question of 'devaluing' does not arise because one does not have to consider where, other than in construction c1, word w1 occurs. However, by insisting that the distribution of a word w1 outside of the construction c1 is irrelevant (cf. p. 100) and that only the frequency of w in c is needed, Bybee and other proponents of approach 1 run into a huge problem. Not only have we seen above that type frequencies are already relevant to a truly cognitive approach, but Bybee (2010: 89) herself also approvingly states that "Goldberg 2006 goes on to argue that in category learning in general a centred, or low variance, category is easier to learn." This correctly emphasizes the importance of type-token distributions, but her own approach 1 does not incorporate the very type frequencies and type-token distributions which allow usage-based theorists to talk about 'centred, or low variance, categories' in the first place.

As another example of the importance of type-token distributions, consider Goldberg, Casenhiser & Sethuraman's (2004) learning experiment: Subjects heard the same number of novel verbs (type frequency: 5), but with two different distributions of 16 tokens. These different token distributions were a balanced condition of 4-4-4-2-2 (with an entropy of H = 2.25) and a skewed, lower-variance condition of 8-2-2-2-2 (H = 2). The more skewed distribution was learned significantly better, but proponents of a radical approach 1 cannot explain this very well since both conditions involved 16 tokens. Proponents of approach 3, on the other hand, can explain this result perfectly with reference to the lower entropy/uncertainty of the skewed distribution; in a similar vein, it is such type-token distributions that help explain the issue of preemption.
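The two entropy values reported for the experimental conditions are easy to verify; a minimal sketch with the token frequencies from the experiment:

```python
import math

def entropy(freqs):
    """Shannon entropy (log base 2) of a token distribution."""
    total = sum(freqs)
    return -sum(f / total * math.log2(f / total) for f in freqs if f > 0)

balanced = [4, 4, 4, 2, 2]  # 16 tokens over 5 verb types -> H = 2.25
skewed   = [8, 2, 2, 2, 2]  # same token and type frequencies -> H = 2.0
```

Both conditions are identical in token frequency (16) and type frequency (5); only the entropy of the type-token distribution distinguishes them, which is why approach 3, but not approach 1, can capture the learning difference.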

Similar examples of how such more comprehensive co-occurrence information is useful abound. The classics of Redington, Chater & Finch (1998) and Mintz, Newport & Bever (2002) are based on similar co-occurrence matrices (based on bigram frequencies, however), as is Latent Semantic Analysis. McDonald & Shillcock (2001: 295) demonstrate that:

Contextual Distinctiveness (CD), a corpus-derived word recognition summary measure of the frequency distribution of the contexts in which a word occurs [based on H, STG] [...] is a significantly better predictor of lexical decision latencies than occurrence frequency, suggesting that CD is the more psychologically relevant variable.

Recchia, Johns & Jones (2008: 271f.) summarize their study:

The results [...] suggest that lexical processing is optimized for precisely those words that are most likely to be required in any given situation. [...] context variability is potentially a more important variable than is frequency in word recognition and memory access.

Raymond & Brown (2012) find that word frequency plays no role for reduction processes once contextual co-occurrence factors are taken into consideration; Baayen (2010) discusses comprehensive evidence for the relevance of rich contextual and entropy-based measures. Thus, in addition to the many problems of Bybee's argumentation addressed above, there is a large number of theoretical approaches and empirical studies in corpus and psycholinguistics that powerfully converge in their support of a usage-based approach that invokes much more contextual information than the CA-type of approach 2, let alone approach 1: at the very least, we need type frequencies of the co-occurrence of words and constructions as well as their type-token distributions.

Approach 4: Dispersion of (co-)occurrence

In some sense, unfortunately, the two-dimensional cross-tabulation of Figure 6 is still not sufficient: What is missing is how widespread in language use a particular (co-)occurrence is, a notion that is known as dispersion in corpus linguistics (cf. Gries 2008). Essentially we need a three-dimensional approach in which cross-tabulations such as Figure 6 are obtained for a third dimension, namely one containing 'corpus parts,' which could correspond to registers/genres or any other potentially relevant distinction of usage events; cf. Figure 7.

Dispersion is relevant because frequent co-occurrences or high attraction values are more important when they are attested in many different registers, situations, or other types of usage events, which affects how associations between linguistic elements are discovered/learned:

Given a certain number of exposures to a stimulus, or a certain amount of training, learning is always better when exposures or training trials are distributed over several sessions than when they are massed into one session. This finding is extremely robust in many domains of human cognition. (Ambridge et al., 2006: 175)

Stefanowitsch & Gries (2003) find that the verbs fold and process are both relatively frequent in the imperative, occurring 16 out of 32 and 15 out of 44 times respectively, and are highly attracted to it (with collostruction values of 21 and 16.7, respectively). However, both verbs occurred in the imperative


Figure 7. Approach 4: Cross-tabulation of words w1-m and constructions c1-n in (here, 3) different slices/parts of a corpus

in only one of the 500 files of the corpus studied; their dispersion values DP (cf. Gries 2008) are > 0.99, which indicates their absolutely unrepresentative clumpiness in the corpus, which in turn means their relevance to the imperative should be downgraded, especially when compared to hang on, which is just as frequent in the imperative but occurs in many more corpus files. Thus, while frequency of (co-)occurrence is related to dispersion (on the whole, frequent items will be more dispersed, less frequent items will be more clumpy), this correlation is by no means absolute, and Gries (2010b) shows that dispersion is sometimes a better predictor of reaction times than frequency. Therefore, a cognitively realistic approach should include dispersion and even different word senses.
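The DP measure referred to here can be sketched as follows: for each corpus part, compare its expected share of a word's occurrences (the part's relative size) with the observed share, sum the absolute differences, and halve the result (cf. Gries 2008). The five-part corpus below is a hypothetical illustration, not the 500-file corpus of the study.

```python
def dp(occurrences, part_sizes):
    """Deviation of proportions (Gries 2008): 0 = perfectly even
    dispersion; values approaching 1 = extreme clumping in few parts."""
    expected = [s / sum(part_sizes) for s in part_sizes]   # each part's share of the corpus
    total = sum(occurrences)
    observed = [o / total for o in occurrences]            # each part's share of the word's tokens
    return sum(abs(e - o) for e, o in zip(expected, observed)) / 2

# hypothetical corpus of 5 equal-sized parts:
clumpy = dp([16, 0, 0, 0, 0], [1, 1, 1, 1, 1])  # all tokens in one part -> DP = 0.8
even   = dp([3, 3, 3, 3, 4], [1, 1, 1, 1, 1])   # spread across parts -> DP near 0
```

With 500 equal-sized files and all tokens in a single file, the same computation yields DP ≈ 0.998, matching the > 0.99 values reported for fold and process.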

  • [1] H, entropy, is a measure of uncertainty, or dispersion, for categorical data which quantifies how evenly distributed elements are across categories. It ranges from 0 (for perfectly skewed/predictable distributions such as {0, 0, 0, 100}) to log2 n (for perfectly equal/unpredictable distributions such as {25, 25, 25, 25}); cf. Gries (2009: 112f.). Data of this type are of course extremely hard to obtain (especially with a reasonable degree of precision), but see Roland, Dick & Elman (2007) for one recent attempt.
  • [2] I thank a reviewer for pointing out this connection.