Methods

Dictionary graphs

To answer this question, dictionaries can be analyzed using graph theory. We have previously analyzed several English dictionaries (including Longman’s, Cambridge, Merriam-Webster and WordNet; [VIN 16]). In the present paper, we replicate and extend these results for the Wordsmyth Advanced Dictionary-Thesaurus (WADT, 70K words) as well as for three reduced versions of it [Wordsmyth Children’s Dictionary-Thesaurus (WCDT, 20K words), Wordsmyth Beginner’s Dictionary-Thesaurus (WBDT, 6K words), and Wordsmyth Illustrated Learner’s Dictionary (WILD, 4K words)], modified for specific purposes by using fewer and more frequent words [PAR 98]: http://www.wordsmyth.net.

Apart from proper names, all words used in a dictionary are defined. In a dictionary graph, there is a directional link from each defining word to each defined word. Our analyses of dictionary graphs have revealed a hidden structure in dictionaries, which has not been observed or reported previously (see Figure 5.1). If we remove recursively all those words that are defined but that do not go on to define any further word, every dictionary graph can be reduced by about 85% (the percentage reducibility becomes less, the smaller the dictionary) to a unique set of words (which we have called the Kernel, K), out of which all the words in the dictionary can be defined [BLO 08]. There is only one such Kernel in any dictionary, but the Kernel is not the smallest number of words, M, out of which all the words in the dictionary can be defined. That smallest number of words is the dictionary graph’s “minimum feedback vertex set” (see, e.g., [FOM 08, KAR 72, LAP 12], which we will call the dictionary’sMinSet (Minimal Grounding Set).

For the Wordsmyth dictionaries analyzed in this paper, the relative size of the Kernel varies from 17% of the whole dictionary in the largest dictionary (WADT) to 49% in the smallest (WILD). In all four dictionaries, the MinSet size is about a fifth of the Kernel (Table 5.1). Unlike the Kernel, however, the MinSet is not unique: There are a huge number of (overlapping) MinSets in every dictionary, each of the same minimal size, M. Each is a subset of the Kernel and any one of them, if it were grounded, would then ground the entire dictionary[1].

WADT

WCDT

WBDT

WILD

Total word meanings

73,158

20,129

6,038

4,244

First word meanings

43,363

11,626

4,456

3,159

Rest

36,196 (83%)

8,617 (74%)

3,254 (73%)

1,608 (51%)

Kernel

7,167 (17%)

3,009 (26%)

1,202 (27%)

1,551 (49%)

Satellites

1,999 (5%)

609 (5%)

258 (6%)

159 (5%)

Core

5,168 (12%)

2,400 (21%)

944 (21%)

1,392 (44%)

MinSets

1,335 (3%)

548 (4.7%)

234 (5.3%)

330 (10.4%)

Satellite-MinSets

569 (1.3%)

160 (1.4%)

67 (1.5%)

39 (1.2%)

Core-MinSets

766 (1.7%)

388 (3.3%)

167 (3.8%)

291 (9.2%)

Table 5.1. The Wordsmyth dictionaries. The Kernel’s Core (its biggest SCC) is always much larger than the Satellites (small SCCs), and each MinSet (part-Core, part- Satellites) is about a fifth of the size of the Kernel. [Wordsmyth Advanced Dictionary- Thesaurus (WADT, 70K words), Wordsmyth Children’s Dictionary-Thesaurus (WCDT, 20K words), Wordsmyth Beginner’s Dictionary-Thesaurus (WBDT, 6K words) and Wordsmyth Illustrated Learner’s Dictionary (WILD, 4K words)]; Parks et al. [PAR 98])

The Kernel, however, is not just a large number of overlapping MinSets. It has structure too. It consists of a large number of strongly connected components (SCCs). (A directed graph -- in which a directional link indicates that word W1 is part of the definition of word W2 -- is “strongly connected” if any word in the graph can be reached by a chain of definitional links from any other word in the graph.) Most of the SCCs of the Dictionary’s Kernel are small, but in every dictionary, we have analyzed so far that there also turns out to be one very large SCC, about half the size of the Kernel. We call this the Kernel’s Core (C)[2].

The Kernel itself is a self-contained dictionary, just as is the dictionary as a whole (D). K is a sub-dictionary of D. Every word in the Kernel can be fully defined using only words in the Kernel. The Core is likewise a self- contained dictionary; but, in all the full-size dictionaries of natural languages that we have so far examined,[3] the Kernel is not an SCC: Within the Core (but not the Kernel), every word can be reached by a chain of definitions from any other word in the Core.

The Kernel is a Grounding Set for the dictionary as a whole, but it is not a Minimal Grounding Set (MinSet), i.e. not the smallest number of words capable of defining all the rest of the dictionary. The Kernel’s Core is not only a MinSet for the dictionary as a whole: it is not even a Grounding Set at all. The words in the Core alone are not enough to define all the rest of the words in the dictionary, outside the Core.

In contrast, all the MinSets are, like the Core, contained entirely within the Kernel, but no MinSet is completely contained within the Core: each straddles the Core and the Satellites. Each MinSet can define all the rest of the words in the dictionary outside the MinSet, but no MinSet is an SCC: its words are not even connected (Table 5.2). Indeed, the MinSet cannot define any of the words within the MinSet: only the words outside the MinSet.

The MinSets of Dictionaries hence turn out empirically[4] to consist of two parts: words in the Core (C) (which is entirely within the Kernel) and words in the remainder of the Kernel (K); the Satellites (S). The MinSet S/C ratio varies across dictionaries, but, within a given dictionary, it is the same for all of its MinSets. The MinSets, the smallest subsets capable of defining all the rest of the dictionary, are hence part-Core and part-Satellite. The natural question, then, is: What, if anything, is the difference between the kinds of words that are in the various components of this hidden structure of the dictionary: the MinSets, the Core, the Satellites, the Kernel, and the rest of the dictionary outside the Kernel?

Schematic diagram of the hidden structure of the Wordsmyth dictionaries

Figure 5.1. Schematic diagram of the hidden structure of the Wordsmyth dictionaries: the Rest of the Dictionary, Kernel, Core and MinSets (only one shown). The words in these different structural components of the dictionary graph tend to differ in their psycholinguistic properties (Figures 5.2 and 5.3). Satellite words are more frequent than the words in the Rest of the Dictionary, more concrete and learned at a significantly younger age. The Core words, in turn, are even younger and more frequent than the Satellite words, but they are the Satellite words that are most concrete ones than the Core and the Rest of the Dictionary (Figure 5.2). (Note that the diagram is not drawn to scale: the relative size of the Kernel varies from 17% of the whole dictionary for the largest Wordsmyth dictionary (WADT) to 49% for the smallest (WILD))

Is a column entry (right) necessarily a row entry (below) ?

Dict

Kern

GS

SCC

Core

MinSet

Dictionary (D)

yes

yes

-

-

yes

-

Grounding Set (GS)

yes

yes

yes

-

-

yes

Strongly Connected Component (SCC)

-

-

-

yes

yes

-

Minimal Grounding Set (MinSet)

-

-

-

-

-

yes

Table 5.2. Table indicating which of the hidden structures are Dictionaries, Grounding Sets, Strongly Connected Components and Minimal Grounding Sets

  • [1] There may be something informative in an analogy between all the MinSets as potentialbases for all of semantic space and the infinite number of different sets of M linearlyindependent vectors that can serve as the potential bases for M-dimensional vector space.
  • [2] Formally, the Core is defined as the union of all the strongly connected components (SCCs)of the Kernel that do not receive any incoming definitional links from outside themselves. (Ingraph theoretic language: there is no incoming link into the Core, i.e. no definitional link froma word not in the Core to a word in the Core.) It turns out to be an empirical fact about all thefull-sized dictionaries we have analyzed so far, however, that their Core is itself always anSCC, and also by far the largest of the SCCs in the Kernel, the rest of which look like manysmall satellites surrounding one big planet (Figures 5.1 and 5.2).
  • [3] In some of the mini dictionaries generated in our online dictionary game, however, the Coreis not an SCC, but a disjoint union of SCCs (Figures 5.5 and 5.6).
  • [4] Most of the properties described here are empirically observed properties of dictionarygraphs, and not necessary properties of directed graphs in general.
 
Source
< Prev   CONTENTS   Source   Next >