Frequencies, probabilities, and association measures in usage-/exemplar-based linguistics

Some necessary clarifications[1]

Stefan Th. Gries

University of California, Santa Barbara

In the last few years, a particular quantitative approach to the syntax-lexis interface has been developed: collostructional analysis (CA). This approach applies association measures to co-occurrence data from corpora, from a usage-based/cognitive-linguistic perspective. In spite of some popularity, this approach has come under criticism from Bybee (2010), who faults the method for several perceived shortcomings and advocates the use of raw frequencies/percentages instead. This chapter has two main objectives. The first is to refute Bybee's criticism on theoretical and empirical grounds; the second and further-reaching one is to outline, on the basis of what frequency data really look like, a cline of analytical approaches and, ultimately, a new perspective on the notion of construction based on this cline.


Linguistics is a fundamentally divided discipline, as far as theoretical foundations and empirical methodology are concerned. On the one hand and with some simplification, there is the field of generative grammar with its assumptions of (i) a highly modular linguistic system within a highly modular cognitive system (ii) with considerable innate structure given the poverty of the stimulus, and (iii) a methodology largely based on made-up judgments of made-up (often context-free) sentences. On the other hand and with just as much simplification, there is the field of cognitive/functional linguistics with its emphasis on (i) domain-general mechanisms, (ii) pattern-learning based on statistical properties of the input, and (iii) an (increasing) reliance on various sorts of both experimental and observational data.

Over the last 25+ years, this latter field has amassed evidence calling into question the assumption of a highly modular linguistic system, a large amount of innate structure, and the reliability of the predominant kind of acceptability judgment data. First, there is now a lot of experimental evidence that shows how much aspects of syntax interact with, or are responsive to, e.g., phonology, semantics, or non-linguistic cognition. Second, many studies have now demonstrated that the supposedly poor input is rich in probabilistic structure, which makes many of the supposedly unlearnable things very learnable. Third, Labov and Levelt, among others, already showed in the early 1970s that the judgments that were adduced to support theoretical developments were far from uncontroversial and that better ways of gathering judgment data are desirable. Over the last few years, corpus data in particular have become one of the most frequently used alternative types of data.

This movement towards empirically more robust data is desirable. However, while (psycho)linguistic experimentation has a long history of methodological development and refinement, the situation is different for corpus data. While corpus-linguistic approaches have been around for quite a while, the methodological evolution of corpus linguistics is still a relatively young development, and many corpus-based studies lack the methodological sophistication of much of the experimental literature. This situation poses a bit of a challenge because, while a usage-based approach to language (an approach stipulating that the use of language affects the representation and processing of language) does not require usage data, the two are of course highly compatible. This makes the development of an appropriate corpus-linguistic toolbox an important goal for usage-based linguistics.

This chapter is concerned with a recent corpus-based approach to the syntax-lexis interface called collostructional analysis (CA), which was developed to apply recent developments in corpus linguistics to issues and questions in cognitive/usage-based linguistics. Most recently, however, this approach was criticized (Bybee 2010: Section 5.12) for several perceived shortcomings. The first part of this chapter constitutes a response to Bybee's claims, which result from a lack of recognition of the method's assumptions, goals, and published results. However, I will also discuss a variety of cognitive-linguistic and psycholinguistic notions which are of relevance to a much larger audience than just collostructional researchers and which speak to the relation between data and the theory supported or required by such data. Section 2 provides a brief explanation of the collostructional approach; while the approach is now reasonably widespread, this is necessary for the subsequent discussion. Section 3 presents the main claims made by Bybee, which I will then address in Section 4. Section 5 will develop a cline of co-occurrence complexity and discuss its theoretical motivations and implications with a variety of connections to psychological and psycholinguistic work.

[1] This chapter is a revised and extended version of a plenary talk I gave at the 6th International Conference on Construction Grammar in Prague. I thank the audience there, workshop participants and panel discussants at the Freiburg Institute of Advanced Studies, students of my doctoral seminar on psycholinguistics at UCSB, the audience of a Linguistics Colloquium talk at UC Berkeley, and (in alphabetical order) William Croft, Sandra C. Deshors, Adele E. Goldberg, Anatol Stefanowitsch, and Stefanie Wulff for feedback, input, and/or discussion. I also thank two anonymous reviewers and the editors of this volume for their comments. The usual disclaimers apply.
