Collostructional analysis: A brief overview

Perspective 1: CA and its goals

All of corpus linguistics is by definition based on frequencies, either on the question of whether something occurs (i.e., is a frequency n > 0?) or not (i.e., is n = 0?), or on the question of how often something occurs (how large is n?), which makes it a distributional discipline. Since linguists are usually interested less in frequencies per se than in structure, semantics/meaning, pragmatics/function, etc., corpus-linguistic work has to make one very fundamental assumption, namely that distributional characteristics of an element reveal many if not most of its structural, semantic, and pragmatic characteristics; cf. the following quote by Harris (1970: 785f.):

[i]f we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference of meaning correlates with difference of distribution.

A more widely-used quote to make the same point is Firth's (1957: 11) "[y]ou shall know a word by the company it keeps." Thus, corpus-linguistic studies of words have explored the elements with which the words in question co-occur, i.e., the lexical items and, to a much lesser degree, the grammatical patterns with which they co-occur: their collocations and their colligations. However, since some words' overall frequencies in corpora are so high that they are frequent nearly everywhere (e.g., function words), corpus linguists have developed measures that downgrade/penalize words whose high frequency around a word of interest w may reflect their overall high frequency more than a revealing association with w. Such measures are usually referred to as association measures (AMs) and are usually applied such that one

i. retrieves all instances of a word w;

ii. computes an AM score for every collocate of w (cf. Wiechmann 2008 or Pecina 2009 for overviews);

iii. ranks the collocates of w by that score;

iv. explores the top t collocates for functional patterns (where functional encompasses 'semantic', 'pragmatic', 'information-structural', ...).

Thus, the purpose of ranking words on the basis of such AMs is to produce a ranking that places at the top of the list those words that (i) have a relatively high frequency around w while (ii) not being too frequent/promiscuous around other words.
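Steps i to iii can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus is invented, the function name rank_collocates and the window size are hypothetical, and pointwise mutual information (PMI) is swapped in as one simple AM among the many surveyed in the overviews cited above; it is not the measure used in the rest of this section.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
TOKENS = ("the strong tea is in the cup and the strong tea is on the table "
          "but the dog saw the cat near the house").split()


def rank_collocates(tokens, node, window=2):
    """Rank the collocates of `node` by a simple PMI score (an assumed AM)."""
    n = len(tokens)
    freq = Counter(tokens)   # overall frequency of every word
    cooc = Counter()         # frequency of every word within the window around node
    # Step i: retrieve all instances of the node word and collect its neighbours.
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(n, i + window + 1)):
                if j != i:
                    cooc[tokens[j]] += 1
    # Step ii: one score per collocate; dividing by the collocate's overall
    # frequency penalizes words that are frequent everywhere.
    scores = {w: math.log2(obs * n / (freq[node] * freq[w]))
              for w, obs in cooc.items()}
    # Step iii: rank the collocates by that score, highest first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_collocates(TOKENS, "tea")
for word, score in ranking:
    print(f"{word:8s}{score:6.2f}")
```

In this toy corpus, the occurs around tea exactly as often as strong does, but it ends up at the bottom of the ranking because it is frequent everywhere. Step iv, the functional interpretation of the top t collocates, remains a manual task.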

Perspective 2: CA and its mathematics/computation

CA is the extension of AMs from lexical co-occurrence (a word w and its lexical collocates) to lexico-syntactic co-occurrence (a construction c and the x words w1, w2, ..., wx in a particular slot of c). Thus, like most AMs, CA is based on (usually) 2x2 tables of observed (co-)occurrence frequencies such as Table 1.

Table 1. Schematic frequency table of two elements A and B and their co-occurrence


Two main methods are distinguished. In the first, collexeme analysis (cf. Stefanowitsch & Gries 2003), A is a construction (e.g., the ditransitive NP V NP1 NP2), -A corresponds to all other constructions in the corpus (ideally on the same level of specificity), B is a word (e.g., give) occurring in a syntactically-defined slot of such constructions, and -B corresponds to all other words in that slot in the corpus. A collexeme analysis requires such a table for all x different types of B in the relevant slot of A. For example, Table 2 shows the frequency table of give and the ditransitive based on data from the ICE-GB. Each of these x tables is analyzed with an AM; as Stefanowitsch & Gries (2003: 217) point out, "[i]n principle, any of the measures proposed could be applied in the context of CA." Most applications of CA use the p-value of the Fisher-Yates exact test (pFYE) or, as a more easily interpretable alternative, the (usually) negative log10 of that p-value (cf. Gries, Hampe & Schönefeld 2005: 671f., n. 13).

Table 2. Observed frequencies of give and the ditransitive in the ICE-GB (expected frequencies in parentheses; from Stefanowitsch & Gries 2003)[1]

The authors give several reasons for choosing pFYE, two of which (cf. Pedersen 1996) I mention here; a third important one will be mentioned in Section 2.3.

i. exact tests do not make distributional assumptions that corpus data usually violate, such as normality and/or homogeneity of variances (cf. Gries & Stefanowitsch 2004: 101);

ii. because of the Zipfian distribution of words in a construction's slot, any AM one might want to use must be able to handle the small frequencies that characterize Zipfian distributions (Stefanowitsch & Gries 2003: 204) and at the same time not be anti-conservative.

For Table 2, the pFYE is a very small p-value (< 4.94e-324), i.e., a very large negative log10 of that p-value (> 323.3062), so the mutual attraction between give and the ditransitive is very strong. This measure is then computed for every verb type in the ditransitive so that the verbs can be ranked according to their attraction to the ditransitive. This entails that the p-values are mainly used "as an indicator of relative importance" (cf. Stefanowitsch & Gries 2003: 239, n. 6), and virtually all collostructional applications have focused only on the 20 to 30 most highly-ranked words and their semantic characteristics (although no particular number is required).
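To make the arithmetic concrete, the following Python sketch recomputes the expected frequency from note 1 and a collostruction strength for give in the ditransitive. Several things here are assumptions rather than part of the original analysis: the four cell counts are reconstructed from the marginal totals in note 1 (1,035; 1,160; 138,664) together with the 461 co-occurrences of give and the ditransitive, the function names are hypothetical, and the test is run one-tailed (attraction only), which may differ from the setup behind the published values. Exact integer arithmetic is used because the p-value is too small to be represented as a floating-point number.

```python
import math


def expected_frequency(row_total, col_total, n):
    """Expected cell frequency under independence: row total * column total / N."""
    return row_total * col_total / n


def neg_log10_fisher_upper(a, b, c, d):
    """Negative log10 of a one-tailed Fisher-Yates exact p-value (attraction).

    a = word in the construction,        b = word elsewhere,
    c = other words in the construction, d = other words elsewhere.
    """
    word_total, other_total, cx_total = a + b, c + d, a + c
    n = word_total + other_total
    # Count all tables with the same margins that are at least as extreme
    # as the observed one (k co-occurrences, for k >= a).
    tail = sum(math.comb(word_total, k) * math.comb(other_total, cx_total - k)
               for k in range(a, min(word_total, cx_total) + 1))
    total = math.comb(n, cx_total)
    # math.log10 accepts arbitrarily large integers, so nothing underflows.
    return math.log10(total) - math.log10(tail)


# Assumed cell counts (reconstructed from the margins, see the lead-in).
a, b, c, d = 461, 699, 574, 136_930

print(expected_frequency(a + b, a + c, a + b + c + d))
print(neg_log10_fisher_upper(a, b, c, d))
```

Because the computation stays in integers until the final logarithm, it returns a finite strength value where a direct floating-point p-value would underflow to zero.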

For the second method, distinctive collexeme analysis (cf. Gries & Stefanowitsch 2004a), the 2x2 table is set up differently: A corresponds to a construction (e.g., the ditransitive), -A corresponds to a functionally similar construction (e.g., the prepositional dative NP V NP PP_for/to), B corresponds to a word (e.g., give) occurring in syntactically-defined slots of A, and -B corresponds to all other words in the slots/the corpus; cf. Table 3.

Table 3. Observed frequencies of give and the ditransitive and the prepositional to-dative in the ICE-GB (expected frequencies in parentheses; from Gries & Stefanowitsch 2004)

                          Verb: give      Other verbs
ditransitive              461 (213)       574 (822)
prepositional dative      146 (394)       1,773 (1,525)

Again, this results in a very small pFYE (1.835954e-120), i.e., a very large negative log10 of that p-value (119.7361), indicating that give's preference for the ditransitive over the prepositional dative is strong. Again, one would compute this measure for all x verbs attested at least once in either the ditransitive or the prepositional to-dative, rank-order the x verbs according to their preference and strength of preference, and then inspect the, say, top t verbs for each construction.
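The computation for a single verb in a distinctive collexeme analysis can be sketched as follows. The sketch rests on assumptions that should be read as such: the count of 574 other verbs in the ditransitive is inferred from the ditransitive total of 1,035 in note 1 minus the 461 instances of give, the function name distinctive_collexeme is hypothetical, and the exact test is run one-tailed in the direction of the observed preference, which may or may not match how the reported value was obtained.

```python
import math
from fractions import Fraction


def distinctive_collexeme(a, b, c, d):
    """Preference of one word for construction 1 vs. construction 2.

    a, b = word / other words in construction 1,
    c, d = word / other words in construction 2.
    Returns (preferred construction, expected frequency of cell a, exact p).
    """
    n = a + b + c + d
    word_total, cx1_total = a + c, a + b
    exp_a = word_total * cx1_total / n
    preferred = 1 if a > exp_a else 2
    # Range of possible co-occurrence counts given the margins.
    lo = max(0, word_total + cx1_total - n)
    hi = min(word_total, cx1_total)
    # One-tailed sum in the direction of the observed preference.
    ks = range(a, hi + 1) if preferred == 1 else range(lo, a + 1)
    tail = sum(math.comb(word_total, k) * math.comb(n - word_total, cx1_total - k)
               for k in ks)
    p = Fraction(tail, math.comb(n, cx1_total))
    return preferred, exp_a, float(p)


# give in the ditransitive vs. the prepositional to-dative (574 is assumed).
preferred, exp_a, p = distinctive_collexeme(461, 574, 146, 1773)
print(preferred, round(exp_a), p, -math.log10(p))
```

Applied to every verb attested in either construction, the returned preference and the negative log10 of p would give the two rankings described above, one per construction.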

Other extensions of CA are available and have been used. One, multiple distinctive collexeme analysis, extends distinctive collexeme analysis to cases with more than two constructions (e.g., the will-future vs. the going-to future vs. the shall-future vs. present tense with future meaning). Another one, covarying collexeme analysis, computes measures for co-occurrence preferences within one construction (cf. Gries & Stefanowitsch 2004b).[2]

  • [1] The expected frequencies are computed as in every contingency table or in chi-square tests for independence. The expected frequency in each cell is the result of row total times column total divided by the sum of all frequencies in the table. For instance, 1,035 · 1,160 / 138,664 ≈ 8.66 ≈ 9.
  • [2] All of these CA methods (with different AMs) can be computed easily with an interactive R script available at <>.