Distributional classes

The Semantic Vector Spaces were all constructed from the Twente Nieuws Corpus (380 million words, see Ordelman et al. 2007).[1] For the SVSs using syntactic dependency information, we used the version parsed with the Alpino dependency parser (van Noord 2006), whose dependency triples have been shown to be 90% accurate on the Twente Nieuws Corpus (Plank and van Noord 2010). Following our discussion in section 3.3, we built SVS models with 3 different types of context definition, with the slot fillers of the Dutch causative construction as target words:

  • 1) bag-of-words models for nouns and verbs;
  • 2) dependency-based models for nouns and verbs;
  • 3) subcategorisation frame models for verbs only.

For the nouns filling the Causer and Causee slots in the doen and laten constructions, we only constructed one bag-of-words SVS and one dependency-based SVS (subcategorisation frame models are not relevant for nouns). This choice was based on previous research (Heylen et al. 2008; Peirsman, Heylen, and Geeraerts 2008) in which these two models gave the best performance in finding tight semantic relations. The bag-of-words SVS had a relatively small context window of 5 words to the left and right of the target noun. The dependency-based SVS used the 8 dependency relations distinguished by the Alpino parser that a noun can be involved in. These relations are listed in Table 4 with examples (the target noun is in italics and the context feature resulting from the dependency relation is underlined).[2] In the first four relations (su, obj1, pc, advPP), the target noun is regarded as the dependent element and the governing verb is counted as a context feature. In the next three relations (pmPP, adj, app), the noun is the head and the dependent adjective and/or noun is counted as a context feature. Finally, the co-ordination relation is symmetric and always generates two target noun/context noun pairs.

Table 4: Syntactic dependency relations of nouns. The target noun is in italics and the context feature resulting from the dependency relation is underlined

Abbr. | Dependency relation | Example
su | subject | De baby slaapt. ‘The baby sleeps.’
obj1 | direct object | Hij eet een appel. ‘He eats an apple.’
pc | prepositional complement | Ze luistert naar de radio. ‘She listens to the radio.’
advPP | adverbial prepositional phrase | Hij woont in een dorp. ‘He lives in a village.’
pmPP | post-modifying prepositional phrase | het meisje met de jurk ‘the girl in the dress’
adj | adjective | de gelaarsde kat ‘the booted cat’
app | apposition | de koningin, een wijze vrouw ‘the queen, a wise woman’
cnj | co-ordination | de krekel en de mier ‘the cricket and the ant’
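As an illustration of the bag-of-words noun model described above, the Python sketch below collects co-occurrence counts in a window of 5 lemmas to the left and right of a target noun. It is only a minimal sketch under our own assumptions: the function name bow_contexts, the small stop-word set and the toy sentence are illustrative placeholders, not the implementation used in the study.

    from collections import Counter

    # Placeholder function-word list; the actual models excluded 122 function words.
    STOP_WORDS = {"de", "het", "een", "en", "van", "in"}

    def bow_contexts(lemmas, target, window=5):
        """Count context lemmas within +/- `window` positions of each occurrence of `target`."""
        counts = Counter()
        for i, lemma in enumerate(lemmas):
            if lemma != target:
                continue
            lo, hi = max(0, i - window), min(len(lemmas), i + window + 1)
            for j in range(lo, hi):
                if j != i and lemmas[j] not in STOP_WORDS:
                    counts[lemmas[j]] += 1
        return counts

    # Toy lemmatised sentence: "De baby slaapt in de wieg."
    sentence = ["de", "baby", "slapen", "in", "de", "wieg"]
    print(bow_contexts(sentence, "baby"))   # Counter({'slapen': 1, 'wieg': 1})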

Since previous research has proposed several possible classification criteria for the Effected Predicates (see section 2), we focused on exploring different Semantic Vector Spaces for verbs. Within each of the 3 context definition types, we therefore varied the specific number and type of context features. Within the bag-of-words models, we varied the size of the window around the target verb and constructed a first model with a relatively small window of 4 context words to the left and right, and a second model with a relatively large window of 15 words on either side. Within the dependency-based models, we varied the number of different dependency relations. The Alpino dependency parser allows us to distinguish 23 different dependency relations that a verb can engage in; they are listed in Table 5. A first model only takes into account context features based on the 3 bare NP arguments (su, obj1, obj2). A second model uses all 7 arguments to extract context features (su, obj1, obj2, pc, ld, ldPP, predc) and, finally, a third model takes into account all 13 dependency relations with a lexically full dependent (su, obj1, obj2, pc, ld, me, ldPP, adv, advPP, predm, predc, invomte, invte).[3]

Table 5: An overview of syntactic dependency relations of verbs. The target verb is in italics and the context feature resulting from the dependency relation is underlined

Abbr. | Relation | Example
su | subject | Het meisje slaapt. ‘The girl sleeps.’
sup | cataphoric subject | Het blijkt dat... ‘It appears that...’
obj1 | direct object | Hij eet een appel. ‘He eats an apple.’
pobj1 | cataphoric object | ...of ik het betreur dat... ‘...whether I regret it that...’
obj2 | indirect object | Ze geeft papa een kus. ‘She gives daddy a kiss.’
se | reflexive | Hij schaamt zich. Lit. ‘He shames himself (i.e., He is ashamed).’
svp | separable affix | Je lacht me uit. Lit. ‘You laugh me out (i.e., You're mocking me).’
pc | prepositional complement | Ze luistert naar de radio. ‘She listens to the radio.’
me | measure complement | Het kost 20 euro. ‘It costs 20 euros.’
ld | locative complement | Ze werkt thuis. ‘She works at home.’
ldPP | locative prepositional phrase | Ze rijdt naar huis. ‘She is driving home.’
adv | adverbial complement | Je zingt goed. ‘You sing well.’
advPP | adverbial prepositional phrase | Hij komt over 2 weken. ‘He's coming in two weeks' time.’
predm | predicative modifier | Hij kwam dronken thuis. ‘He came home drunk.’
predc | predicative complement | Dat smaakt lekker. ‘That tastes nice.’
ccl | complement clause | Hij zegt dat hij komt. ‘He says that he is coming.’
cclof | complement clause (choice) | Ze vraagt of je komt. ‘She asks whether you're coming.’
cvte | complement verb | Ze staat te praten. Lit. ‘She stands talking (i.e., She is talking).’
cvom | complement verb (goal) | We reizen om te leren. ‘We travel to learn.’
invaux | auxiliary verb | Ik kan lezen. ‘I can read.’
invte | semi-auxiliary verb | Hij ligt te slapen. Lit. ‘He lies sleeping (i.e., He is sleeping).’
invomte | semi-auxiliary (goal) | Ik probeer om te slapen. ‘I try to sleep.’
invaanhet | progressive marker | Ik ben aan het lezen. Lit. ‘I am at reading (i.e., I am reading).’
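Along the same lines, the dependency-based verb models count the dependents of a target verb, but only for a chosen subset of the relations in Table 5. The sketch below is a simplified illustration under our own assumptions: the (head, relation, dependent) triples, the function name dep_contexts and the relation sets merely mimic the Alpino output and the 3- and 7-relation models described above.

    from collections import Counter

    # Hypothetical (head_lemma, relation, dependent_lemma) triples standing in for the
    # Alpino parser output; the real features were extracted from the parsed corpus.
    TRIPLES = [
        ("eten", "su", "kind"),
        ("eten", "obj1", "appel"),
        ("luisteren", "pc", "radio"),
        ("geven", "obj2", "papa"),
    ]

    BARE_NP_ARGS = {"su", "obj1", "obj2"}                      # 3-relation model
    ALL_ARGS = BARE_NP_ARGS | {"pc", "ld", "ldPP", "predc"}    # 7-relation model

    def dep_contexts(triples, target_verb, relations):
        """Count the dependents of `target_verb` for the selected dependency relations."""
        counts = Counter()
        for head, rel, dep in triples:
            if head == target_verb and rel in relations:
                counts[f"{rel}:{dep}"] += 1
        return counts

    print(dep_contexts(TRIPLES, "eten", BARE_NP_ARGS))   # Counter({'su:kind': 1, 'obj1:appel': 1})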

Within the subcategorisation frame models, we varied both the syntactic positions that could be included in a subcategorisation frame and the amount of lexico-semantic information about the elements filling those positions. The syntactic positions are based on the same 23 dependency relations in Table 5. Again, we made 3 subdivisions for the types of relations: the 5 bare NP arguments (su, sup, se, obj1, obj2); all 9 arguments (su, sup, se, obj1, obj2, pc, ld, ldPP, predc); and all 23 dependency relations. The lexico-semantic information on the fillers of the syntactic positions could be of 4 types: (i) no lexico-semantic information; (ii) the specific preposition introducing the dependent in pc, ldPP and advPP; (iii) the semantic class of a dependent noun; (iv) both the specific preposition and the semantic noun class. For the semantic noun classes, we used the second-highest ancestor of the noun in the Dutch WordNet (Vossen 1998). This resulted in 11 semantic noun classes: animate being, object, situation, action, utterance, property, thought, part, group, place and time. If the noun was not present in the Dutch WordNet, we reverted to a syntax-only subcategorisation frame feature. If the noun belonged to more than one semantic class (because of polysemy), the most frequent class overall was used.
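The sketch below illustrates, under our own simplifying assumptions, how such subcategorisation frame features could be assembled. The frame representation, the function subcat_feature and the "pc-naar-object" feature syntax are hypothetical, and the tiny SEM_CLASS dictionary merely stands in for the Dutch WordNet lookup; unknown nouns fall back to a syntax-only feature, as in the study.

    # Toy stand-in for the Dutch WordNet lookup: map a noun lemma to one of the 11 coarse
    # semantic classes used in the study (only a few entries are given here for illustration).
    SEM_CLASS = {"radio": "object", "huis": "place", "meisje": "animate being"}

    def subcat_feature(frame, use_prep=False, use_sem_class=False):
        """
        Turn one clause's argument structure into a subcategorisation-frame feature string.
        `frame` is a list of (relation, preposition_or_None, noun_lemma_or_None) tuples.
        """
        parts = []
        for rel, prep, noun in sorted(frame, key=lambda t: t[0]):
            label = rel
            if use_prep and prep:
                label += "-" + prep
            if use_sem_class and noun in SEM_CLASS:
                label += "-" + SEM_CLASS[noun]      # nouns not in the lookup stay syntax-only
            parts.append(label)
        return "+".join(parts)

    # "Ze luistert naar de radio.": a subject plus a prepositional complement with 'naar'.
    frame = [("su", None, "ze"), ("pc", "naar", "radio")]
    print(subcat_feature(frame))                                      # pc+su
    print(subcat_feature(frame, use_prep=True))                       # pc-naar+su
    print(subcat_feature(frame, use_prep=True, use_sem_class=True))   # pc-naar-object+su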

An overview of the models can be found in Table 6, together with the abbreviations used in the rest of the article. For all models, the context features were restricted to the 4000 most frequent ones (excluding 122 function words). Both target and context words were processed at the lemma level (i.e., generalising over word-forms). In all models, the co-occurrence frequencies were weighted with pointwise mutual information (PMI). For all the SVS models, the similarity between target word vectors was measured with the cosine. This resulted in 16 similarity matrices for verbs and 2 similarity matrices for nouns.

Table 6: An overview of the models and classifications

Context definition | Causer and Causee (nouns): feature selection (no. clusters) | Effected Predicate (verbs): feature selection (no. clusters)
Bag of words | 5 words left and right: BOW5 (2-100) | 4 words left and right: BOW4 (5-100); 15 words left and right: BOW15 (5-100)
Dependency-based | 8 dependencies: DEPREL8 (2-100) | 3 dependencies: Vbarel (5-100); 7 dependencies: Varel (5-100); 13 dependencies: rVrel (5-100)
Subcat. frame, syntax only | (verbs only) | 5 dependencies: 5syn (5-100); 9 dependencies: 9syn (5-100); 23 dependencies: 23syn (5-100)
Subcat. frame, preposition information | (verbs only) | 9 dependencies: 9relprep (5-100); 23 dependencies: 23relprep (5-100)
Subcat. frame, sem. class information | (verbs only) | 5 dependencies: 5sclass (5-100); 9 dependencies: 9sclass (5-100); 23 dependencies: 23sclass (5-100)
Subcat. frame, prep. + sem. class information | (verbs only) | 5 dependencies: 5richsubcat (5-100); 9 dependencies: 9richsubcat (5-100); 23 dependencies: 23richsubcat (5-100)
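As an illustration of the weighting and similarity steps described just before Table 6, the NumPy sketch below re-weights a toy co-occurrence matrix with pointwise mutual information and then computes pairwise cosine similarities between the row vectors. The matrix and the function names are our own; the actual target-by-feature matrices are of course far larger.

    import numpy as np

    def pmi_weight(counts):
        """Re-weight a raw co-occurrence matrix with pointwise mutual information."""
        counts = np.asarray(counts, dtype=float)
        total = counts.sum()
        p_joint = counts / total
        p_target = counts.sum(axis=1, keepdims=True) / total
        p_context = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log2(p_joint / (p_target * p_context))
        pmi[~np.isfinite(pmi)] = 0.0   # cells with a zero count get weight 0
        return pmi

    def cosine_matrix(vectors):
        """Pairwise cosine similarities between the row vectors of a matrix."""
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        unit = vectors / np.where(norms == 0, 1.0, norms)
        return unit @ unit.T

    # Toy matrix: 3 target words by 3 context features.
    counts = [[10, 0, 3],
              [8, 1, 0],
              [0, 5, 6]]
    similarities = cosine_matrix(pmi_weight(counts))
    print(np.round(similarities, 2))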

The similarity matrices were the input for a hierarchical cluster analysis (e.g. Everitt et al. 2001) that groups the noun and verb lemmata into semantic classes. We experimented with different numbers of classes, ranging from 2 to 100 for the Causers and Causees (all numbers from 2 to 10 and then intervals of 5, totalling 27 different clusterings), and ranging from 5 to 100 for the Effected Predicates (intervals of 5, totalling 20 different clusterings). Together with the different context definitions and feature selection criteria, this gives 54 possible semantic classifications of the Causer and Causee nouns and 240 possible semantic classifications of the Effected Predicate verbs.
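The sketch below shows how such a series of cluster solutions could be obtained from a similarity matrix with SciPy. The use of average linkage is our own assumption, since the text only specifies hierarchical cluster analysis, and the 4-item similarity matrix is a toy example.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_from_similarities(sim, ks, method="average"):
        """Cluster items hierarchically from a similarity matrix and cut the tree at each k."""
        dist = 1.0 - np.asarray(sim, dtype=float)   # turn cosine similarities into distances
        np.fill_diagonal(dist, 0.0)
        # "average" linkage is an assumption here; the text does not specify the linkage method.
        tree = linkage(squareform(dist, checks=False), method=method)
        return {k: fcluster(tree, t=k, criterion="maxclust") for k in ks}

    # The cluster range used for the nouns: 2..10, then steps of 5 up to 100 (27 solutions).
    noun_ks = list(range(2, 11)) + list(range(15, 101, 5))
    assert len(noun_ks) == 27

    # Toy similarity matrix for 4 lemmata; with so few items we only cut at k = 2 and k = 3.
    sim = np.array([[1.0, 0.9, 0.2, 0.1],
                    [0.9, 1.0, 0.3, 0.2],
                    [0.2, 0.3, 1.0, 0.8],
                    [0.1, 0.2, 0.8, 1.0]])
    print(cluster_from_similarities(sim, ks=[2, 3]))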

  • [1] No separate SVSs were constructed for Belgian Dutch (Leuven News Corpus). As a consequence, some typically Belgian Dutch verbs or verb meanings might have been disregarded.
  • [2] For a full description of the Alpino parsing scheme, see van Noord (2006).
  • [3] Since dependency-based models, like bag-of-words models, only select lexically full words as context words (excluding function words), we only took those dependency relations into account where there is a lexically full dependent of the verb (a noun, adjective or other verb). This excludes relations like cataphoric object (pobj1) or reflexive (se), but also the clausal arguments where the dependent is not a single lexically full word.
 