Semantic vector spaces
Semantic Vector Spaces (SVSs) have become the mainstay of modeling lexical semantics in Computational Linguistics over the last 20 years. Based on the hypothesis that semantically similar words tend to be used in similar contexts, these corpus-based approaches model the meaning of a word in terms of the contexts in which it appears. They have been applied to a wide variety of computational tasks - from Question Answering and Information Retrieval to automated essay scoring (Landauer and Dumais 1997) and the modeling of human behavior in psycholinguistic experiments (Lowe and McDonald 2000). SVSs were first developed during the so-called statistical turn in Natural Language Processing (NLP) in the 1990s, when NLP moved away from the then-prevalent rule-based approach. They addressed the need to model semantics in a bottom-up, automated fashion from large amounts of corpus data, rather than having to rely on the time-consuming manual construction of lexical resources. As such, this data-oriented development in Computational Linguistics was not unlike the empirical and statistical turn observable today in Theoretical Linguistics, and in Cognitive Linguistics and Construction Grammar in particular. We will argue that SVSs can also be useful in more theoretically oriented linguistic research in Construction Grammar. Thanks to their fully automatic, bottom-up analysis of the distribution of a word, SVS models are not only able to deal with enormous quantities of data; they also bypass the need for subjective human judgments and may bring to light patterns that escape the human eye.
The origin of SVSs can be traced back to a fundamental linguistic insight already expressed in the 1950s. Back then, a number of linguists and philosophers stressed the dependency, or even the identity, between the meaning of a word and its use. This view inspired John Rupert Firth's dictum that "you shall know a word by the company it keeps" (Firth 1957), Ludwig Wittgenstein's "the meaning of a word is its use in the language" (1953), and Zellig Harris's (1954) insight that semantically similar words are used in similar contexts - a view which is now often referred to as the distributional hypothesis. In the (mainly) British tradition of Corpus Linguistics this hypothesis was put into practice by investigating the collocational behavior of words and identifying their idiomatic usage. SVSs can be seen as an extension and generalisation of collocational analysis. Instead of identifying only a restricted number of significant collocations as input for further qualitative analysis, SVSs track a word's co-occurrences with all other words in the corpus, resulting in an overall collocational profile that is the input for further quantitative analysis. More specifically, the similarity of collocational profiles is measured mathematically. The hypothesis is that words with a similar collocational profile will be semantically related and can thus be grouped into semantic classes.
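The procedure just described can be sketched in a few lines of code. The sketch below is a deliberately minimal illustration, not any particular published model: it assumes a symmetric bag-of-words context window of two words and raw co-occurrence counts as the collocational profile, and it uses cosine similarity as the mathematical measure of profile similarity (one common choice; weighting schemes such as pointwise mutual information are frequent alternatives). The toy corpus and the window size are invented for illustration.

```python
from collections import Counter
from math import sqrt

def cooccurrence_profiles(sentences, window=2):
    """Build each word's collocational profile: counts of co-occurrence
    with all other words within a symmetric context window."""
    profiles = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            context = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            profiles.setdefault(word, Counter()).update(context)
    return profiles

def cosine(p, q):
    """Cosine similarity between two collocational profiles (0 to 1)."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy corpus: 'beer' and 'wine' occur in similar contexts, 'dog' does not.
corpus = [
    "he drank a glass of beer".split(),
    "she drank a glass of wine".split(),
    "the dog chased the cat".split(),
]
profiles = cooccurrence_profiles(corpus)
print(cosine(profiles["beer"], profiles["wine"]))  # 1.0 (identical contexts)
print(cosine(profiles["beer"], profiles["dog"]))   # 0.0 (no shared contexts)
```

On this toy corpus, "beer" and "wine" share exactly the context words "glass" and "of", so their profiles are maximally similar, while "beer" and "dog" share none; this is the distributional hypothesis in miniature, with real SVSs differing mainly in corpus scale, weighting, and dimensionality reduction.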