Results of the classification experiments
In this section, we discuss the results of the classification experiments for every slot individually. In the last subsection, we also present the results for all three slots taken together.
Classification of the Causers
As mentioned in the previous section, we had two SVS models of the Causer nouns: the one with the lexical information only (BOW), and the one where the lexical information was enriched with the syntactic information about the eight dependency relations (DepRel8). For both models, we also tested different clustering solutions with the number of classes from 2 to 100. Figure 2 shows how the C index rapidly goes up from the very beginning, which indicates that the relevant semantic distinctions are captured by a relatively small number of classes. The syntactically enriched model performs much better than the simple bag-of-words model. This finding corroborates the results in Gries & Stefanowitsch 2010 (section 2.2), which also compared a bag-of-words model with a syntactically more precise one. The starting value for the syntactically enriched model with two classes only is already 0.69, and for six classes it is 0.80, which is considered to be good. The 100-cluster syntactically enriched solution has the highest value (C = 0.89), but its predictive power is not dramatically different from the more parsimonious classifications with a smaller number of classes.
Figure 2: The Causer: predictive power of two models, for different number of classes
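To make the C index concrete, the following minimal sketch computes it for a binary outcome from cluster membership alone. It rests on assumptions that are not taken from our experiments: the toy data are invented, the names `c_index` and `cluster_scores` are hypothetical, and the predicted probability of doen is approximated by the share of doen observations in each cluster, which may differ from the fitting procedure actually used. The C index is then the proportion of (doen, laten) pairs in which the doen observation receives the higher predicted probability, with ties counted as one half.

```python
from itertools import product


def cluster_scores(clusters, labels):
    """Score each observation by the share of 1-labels (doen) in its cluster.

    Assumption for illustration only: this in-cluster proportion stands in
    for the predicted probability of a model with cluster as sole predictor.
    """
    totals, ones = {}, {}
    for c, y in zip(clusters, labels):
        totals[c] = totals.get(c, 0) + 1
        ones[c] = ones.get(c, 0) + y
    return [ones[c] / totals[c] for c in clusters]


def c_index(scores, labels):
    """Concordance index for a binary outcome (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    concordant = 0.0
    for p, n in product(pos, neg):
        if p > n:
            concordant += 1.0
        elif p == n:
            concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Invented toy data: cluster label per observation, 1 = doen, 0 = laten.
clusters = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

scores = cluster_scores(clusters, labels)
c = c_index(scores, labels)
print(round(c, 3))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two constructions, which is why a coarse two-cluster solution can already score well above 0.5 if the clusters differ strongly in their share of doen.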
Let us consider the classification with 6 clusters, which is quite successful in discriminating between the doen and laten observations. Cluster 1 (the numbering is arbitrary) contained predominantly inanimate concrete and abstract nouns: cd ‘cd’, cijfer ‘digit’, herstel ‘recovery’, stem ‘voice’, aanslag ‘attack’, afwezigheid ‘absence’, resultaat ‘result’, etc. Cluster 2 contained mostly football- and music-related nouns denoting people and organisations: Feyenoord (a football club in the Netherlands), dirigent ‘conductor’, speler ‘player’, orkest ‘orchestra’, etc., although there were a few exceptions, such as beurs ‘stock exchange’, which comes from economy-related articles. Cluster 3 included some proper names of conductors, as well as common and proper nouns denoting political and other agents: Gergiev (a Russian conductor), Van Hecke (a Belgian politician), secretaris-generaal ‘General Secretary’, Harnoncourt (an Austrian conductor), etc. Cluster 4 contained many geographical names, which are frequently used in newspaper articles to refer to the government metonymically: Verenigde Staten ‘the US’, Amerika, Washington, etc. Cluster 5 included mostly common nouns that denote people in charge and organisations: regering ‘government’, minister ‘minister’, bedrijf ‘company’, trainer ‘trainer’. The sixth cluster contained only 7 nouns with very low collocability due to an extremely low frequency.
The majority of the observations that contain the Causers from Cluster 1 (inanimate entities) are instances of the construction with doen, whereas the nouns from Clusters 2-5 (people and organisations) occur more frequently with laten. Cluster 6 was too small for evaluation. The findings therefore support the results of the previous studies, in which the distinction between animate and inanimate Causers had very high predictive power in all multivariate analyses.