I Methodological advances

Natalia Levshina and Kris Heylen

A radically data-driven Construction Grammar: Experiments with Dutch causative constructions

The need for objective data-driven semantic classes [1]

Constructions are commonly defined as pairings of form and function (Goldberg 1995, 2006). The meaning, understood here as the concept or conceptual structure associated with a construction, is a crucial aspect of the latter’s function. Although the meaning of a construction cannot always be reduced to the meaning of its components (e.g. Goldberg 1995), the semantic properties of its slot fillers can be used as a convenient heuristic to access the conventional uses of the construction in question. For instance, the central sense of the caused-motion construction (X CAUSES Y to MOVE Z, e.g. She threw a coin in the Trevi fountain) commonly involves a verb of directed physical action (threw), a movable physical object (a coin) and another physical object that can serve as a location (the Trevi fountain). Semantic classes of the slot fillers can be helpful in two ways. First, they can indicate the differences between the senses of one construction; second, they may reflect the division of “semantic labour” between two or more near-synonymous constructions.

Quantitative corpus-based studies of constructional semantics (including lexical semantics) frequently employ semantic classes. Table 1 lists some of the existing quantitative methods and approaches. They differ along two dimensions. The first is the research perspective: the researcher can either focus on the internal semantic structure of a construction, most commonly on its polysemy, or compare the distinctive features of functionally related constructions, for instance near-synonyms or “alternations”. In more traditional semantic terms, this distinction corresponds to the semasiological and the onomasiological perspective (Geeraerts, Grondelaers, and Bakema 1994). The second is whether the semantic classes are determined a priori and form the basis of the subsequent analyses, or are inferred a posteriori in order to interpret the results (e.g. lists of distinctive collexemes). Yet all of these methods involve semantic classes of the slot fillers, either exclusively or alongside other semantic features.

Table 1: Quantitative corpus-based methods in usage-based approaches to constructions (a selection)

| Perspective | A priori semantic classes | A posteriori semantic classes |
| --- | --- | --- |
| Semasiological perspective (polysemy) | Behavioural Profiles of a word's senses (Gries 2006); Multidimensional Scaling-based semantic maps of linguistic categories (Levshina 2011) | (standard) Collostructional Analysis (Stefanowitsch and Gries 2003) |
| Onomasiological perspective (near-synonymy) | regressing on functionally similar constructions (e.g. Heylen 2005; Bresnan et al. 2007); Behavioural Profiles of lexemes (Gries and Divjak 2006); Correspondence Analysis maps of constructional spaces (Levshina, Geeraerts, and Speelman 2013) | Distinctive Collexeme Analysis (Gries and Stefanowitsch 2004) |

However, the use of classifications is often problematic. If a researcher applies an ad hoc intuitive classification, (s)he runs the risk of missing some important distinctions or imposing irrelevant ones. To avoid this pitfall, many linguists apply ready-made classifications, such as the ones available in Levin (1993) or WordNet (Fellbaum 1998), which are based on more or less definite criteria or conventions. Still, this practice involves several conceptual and practical difficulties. First of all, ready-made conventional classifications are not available for many languages other than English. Second, the existing classifications tend to be incomplete, so that the researcher has to decide what to do with a large portion of the data that falls outside the classifications. In addition, many classifications, such as WordNet, are tree-like and contain several levels. In this situation, choosing the level of classification granularity (i.e., how deeply one should prune the classification tree) becomes an empirical problem. One of the goals of the present paper is to develop a strategy for finding the optimal level of granularity on the basis of objective quantitative criteria.
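The granularity problem can be made concrete with a small sketch. The Python snippet below cuts a toy classification tree at successive depths and counts the resulting classes. The paths and labels are invented for illustration and are not taken from WordNet or Levin (1993); `class_at_depth` and `classify` are hypothetical helper names, not part of any existing library.

```python
# Hypothetical root-to-leaf paths in a tree-like classification.
# All labels are invented for illustration only.
PATHS = {
    "dog": ["entity", "animate", "animal", "canine"],
    "cat": ["entity", "animate", "animal", "feline"],
    "teacher": ["entity", "animate", "human", "professional"],
    "coin": ["entity", "inanimate", "artefact", "money"],
    "fountain": ["entity", "inanimate", "artefact", "structure"],
}


def class_at_depth(path, depth):
    """Return the label at the given depth (0 = root).

    Paths shorter than the requested depth fall back to their deepest label.
    """
    return path[min(depth, len(path) - 1)]


def classify(paths, depth):
    """Map every word to its class label at the chosen depth."""
    return {word: class_at_depth(path, depth) for word, path in paths.items()}


# Each depth yields a different, equally "valid" classification of the same words.
for depth in range(4):
    classes = classify(PATHS, depth)
    print(depth, len(set(classes.values())), classes)
```

With these toy paths, a shallow cut lumps all words into one or two classes, while the deepest cut puts every word in a class of its own. Which cut is most useful cannot be read off the tree itself, which is why the choice has to be made on empirical grounds.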

The greatest problem, however, is that even the conventional classifications are largely introspective. More recently, there have been attempts to classify constructional slot fillers on the basis of large-scale corpus evidence. For instance, Gries and Stefanowitsch (2010) have attempted to classify constructional collexemes with the help of a set of contextual features found in the corpus. The classes were evaluated qualitatively, the main criterion being the semantic interpretability of the classes. Yet, if one adheres to the principles of empirical semantics (e.g. Geeraerts 2010a), it is at least as important to present objective quantitative evidence that the choice of classification is justified by the facts of usage.

In this paper, we propose a novel objective distributional approach based on large-scale corpus data and rich contextual information. The core of the approach is Semantic Vector Spaces (Lin 1998), a method widely used in computational models of language. We demonstrate how the method can be used to choose between hundreds of possible classifications, arriving at the optimal one in terms of parsimony and predictive power for every particular set of near-synonymous constructions. We illustrate how the method works on the “alternation” of Dutch causative constructions with the auxiliaries doen ‘do’ and laten ‘let’.
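As a rough illustration of the general idea behind a semantic vector space, the Python sketch below represents each noun by a vector of co-occurrence counts, compares nouns by cosine similarity, and groups them by average-linkage agglomerative clustering at several levels of granularity. It is a minimal sketch under stated assumptions: the counts and context features are invented, raw counts stand in for whatever weighting an actual model would use, and the clustering method is a stand-in rather than the procedure used in this study. The selection of a classification by parsimony and predictive power is not shown.

```python
import math
from itertools import combinations

# Invented co-occurrence counts: each noun is described by how often it
# occurs with four hypothetical context features (in the order given).
FEATURES = ["subj_of_eat", "subj_of_think", "obj_of_throw", "mod_by_heavy"]
COUNTS = {
    "dog": [8, 1, 0, 1],
    "cat": [7, 2, 0, 0],
    "teacher": [3, 9, 0, 0],
    "coin": [0, 0, 6, 2],
    "stone": [0, 0, 7, 8],
}


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(c1, c2, vectors):
    """Mean pairwise cosine similarity between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def cluster(vectors, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[word] for word in vectors]
    while len(clusters) > k:
        # Find the pair of clusters (i < j) that are most similar on average.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_similarity(
                clusters[ij[0]], clusters[ij[1]], vectors
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# The same vectors yield several candidate classifications of different granularity.
for k in (4, 3, 2):
    print(k, cluster(COUNTS, k))
```

Each value of `k` corresponds to one candidate classification of the slot fillers. In this toy setting the two-class solution separates animate from inanimate nouns, but nothing in the clustering itself says which `k` is best, so an external criterion is needed to choose among the candidates.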

The structure of the article is as follows. In the following section, we introduce the object of the case study, the causative constructions in Dutch. Section 3 presents the general principles of the distributional models of Semantic Vector Spaces, followed by a description of the data and specific models in section 4. Section 5 reports the results of our classification experiments. In section 6, we discuss these results from a constructionist perspective and suggest some steps for future research.

  • [1] This research project was partly funded by a grant from the Research Foundation of Flanders (FWO) (G.0330.08) awarded to Dirk Geeraerts and Dirk Speelman, the Quantitative Lexicology and Variational Linguistics Research Unit at the University of Leuven.
 