A U-DOP approach to learning meaningful grammars
What would acquiring a grammar involve, if we use the constructionist starting point of form-meaning pairings as the basic building blocks of a language? First of all, the problem of learning the mapping has become bigger, as not only word-meaning mappings have to be learned, but also mappings between all other kinds of constructions and meanings. Secondly, the learner would have to have some mechanism for arriving at a set of schemas, productive and less so, that allow it to talk about novel events and understand novel utterances. Thirdly, the model would have to be incremental: we cannot expect a real language learner to wait until it has seen some number of utterances. As we will see, incrementality in fact does not make the problem bigger, but can function as a bootstrap for the learner.
The basic idea behind meaningful U-DOP, or p-DOP, is that a learner analyzes a sentence using meaningful, heterogeneously sized and possibly redundant hierarchical representations. It starts analyzing utterances with the fragments it knows already, and then maps the unanalyzed parts of the utterance with parts of the semantic situation that are unexpressed in the analysis, thereby filling the gaps in its knowledge. Next, it decomposes these analyses and adds them to its knowledge base.
In this experiment, we describe a learner that can deal with very simple meaning representations. For understanding more complex semantic operations, obviously more complex representations are needed. The goal, however, is not to develop an account of meaning, but rather to show how DOP can acquire symbolic structures (i.e., form-meaning pairings) of different size, complexity and abstractness. By doing so, we show how the concept of building up meaningful constructions works with an unsupervised Data-Oriented model and how the acquired representations are in line with the constructivist understanding of language. As such, we argue that DOP is a substrate-neutral learner (as long as the substrate can be structured in some graphical way), that can be thought of as domain-general.
How can we envisage the representation of an analysis in p-DOP to which the dataoriented learning algorithm applies? It has to contain hierarchical structures, meaning representations and representations of words. As a simple formalism for representing meaning, we use predicate logic, restricted to representations of predicates and entities. The semantic representations take the place of the syntactic categories of DOP as the contents of the nodes in the analyses. A node may contain a value of a variable (predicate or entity), denoted as P:WATCH for the predicate watch or e:John for the entity John, or denote an operation on its constituent parts. watch(e1, Mary), for instance, means that there is some entity that fills the slot position e1 that can be found somewhere else in the subtree. Fragments like these can be combined by logical composition. Suppose we have the fragments in Figure 6. We can compose these and the interpretation will be hit(John, Bill), as the slots with which the two small, word-like subtrees are combined, dictate which entity is the first argument (e1) and which the second (e2). Not all (legal) fragments can be composed: the subtrees in Figure 8 cannot be composed, as there is no slot in the subtree to the left to fit p:hit to: both open positions are of the category E.
Figure 6: Legal composition of three subtrees. The empty nodes (visualized with • as the node label) are necessary to preserve binarity, but do not have any analytical significance. The character before the colon in the top-node of a subtree represents the semantic category of the subtree
Figure 7: The composition of Figure 6, represented in box diagrammes. Note that the two poles of the construction correspond to Verhagen's (2009) conception of the construction as a symbolic assembly
Figure 8: Illegal composition of two subtrees
Figure 9: The starting knowledge of the model in the example
Let us look at the learning procedure taking us to an inventory of such subtrees step by step. First, the learner analyzes a sentence using its inventory of grammatical patterns. Initially, this inventory is empty, but as the learner processes more utterances, it will become bigger. Suppose for the sake of the argument that the learner knows that the word John means that there is an entity called John, or E :JOHN. This can be represented as a simple subtree, connecting a node e:John with a node John. Suppose the learner also knows that the word Mary refers to an entity called Mary.
The learner hears an utterance like John watches Mary, uttered in the context of John watching Mary and Mary walking. What the learner will try to do, then, is analyze this utterance using its known constructions, i.c. the wordmeaning pairing for John and for Mary. If the learner does not know how to analyze the utterance, it can use a meaningless binary subtree (with a very small probability) to combine any two fragments. Analyzing John watches Mary we arrive at two analyses, as follows.
In both parse trees, we arrive at an interpretation that involves at least an entity called John and one called Mary. When trying to understand what situation the utterance refers to, the learner can now exclude that of Mary walking, as it has seen a known fragment that refers to John, who is not a participant in the situation of Mary walking. The learner will then try to complete the analysis by adding parts of the semantic representation to the analyses. Because the learner does not know to what nodes of the analyses these representations belong, it will try all possible combinations and store them in its memory. The distribution of unseen meanings is guided by the constraint that the same fragment of meaning does not occur twice in the utterance. This constraint can be seen as a form of a mutual exclusivity constraint or as more simple Gricean pragmatic reasoning (i.c. balancing the maxims of quality and quantity).
To what observed situations can the partial analyses in Figure 10 apply? Recall that two situations take place, viz. John watching Mary and Mary walking. The second can be excluded, as John, who was found to be referred to in both analyses, is no part of it. This leaves us the first one, and the model will try to complete its analyses using parts of that meaning. If there were more situations compatible with the inferred meaning in an analysis, all of them would be used in the add-unseen-meaning step to complete that analysis.
Figure 10: Two analyses of John watches Mary
In our example, we have two analyses, both of which can be mapped to the situation watch(John, Mary). In both cases, the learner has found the meaning e: John and e:Mary and hence it is missing the part watch(e1, e2). This partial semantic representation can be further decomposed into p(e1, e2) and p:watch (in general: the models tries all decompositions of the unobserved parts of the meaning) and these partial representations can be distributed over the nodes in the tree that were unanalyzed according to the constraint discussed before. This is effectively a U-DOP approach to semantic representations: we take all parts of all missing semantic representations, all nodes in the tree analyses of an utterance and map them to each other, letting the statistics decide which ones would be relevant for analyzing novel utterances. For the first analysis of John watches Mary and the representations we found missing, we can add them to the partial analysis in the ways given in Figure 11.
Using these completed representations, we can update our inventory of subtrees that can be reused in analyzing new utterances. We define DOP’s decomposition operation here to take all connected subgraphs of the parse tree that have both a root node with a semantic representation, and only leaf nodes with either semantic or phonological representations in them. That is, the two subtrees in Figure 13 are excluded on these grounds (not a meaningful root, not all leaf nodes are meaningful), whereas all legal subtrees of the third completed analysis in Figure 11 are given in Figure 12. When decomposing a tree, leaf nodes retain only the type and its index, so that an indexed slot comes into existence. The non-trivial part of the decomposition operation is that a tree can be decomposed only at nodes where a meaning representation is found; nodes lacking these are taken to be ‘internal’ (signified with a dot ‘•’) and are not used in the interpretation process in any way, but are retained only to maintain (the positive computational properties of) binarity.
Figure 11: All additions of unseen meaning to the first analysis in Figure 10
Among these decompositions, we can find various structures that remind us of constructions as we know them. There are words, fully open patterns, ones with only lexical items as leaf nodes and everything in between. All of these have the same ingredients: a graph structuring the part-whole relations with phonological or semantic structure as the content of the nodes. Our procedure, of trying to parse an utterance, adding the unseen parts of the meaning to unanalyzed parts of the utterance and decomposing that thus provides a learner with a way of discovering the patterns that are useful for understanding novel language material.
However, in a more realistic setting, there may be many analyses that are incorrect and thus harm the model if their parts are added to the inventory of fragments. Do we add all of these to the knowledge base for interpreting the next utterance too? In fact, we don’t have to, as DOP provides us with a good mechanism to separate the wheat from the chaff, namely the probabilities of the analyses.
As in DOP, the most likely analysis of an utterance is the analysis that has the highest probability mass summed over all of its derivations. The probability mass of a derivation, then, is given by the joint probability of all the subtrees used. Since the model has to add unseen rules, we have to reserve some small probability for these. We do so by smoothing the probability of the seen rules, so that some probability mass becomes available for the unseen components.
Figure 12: Legal subtrees of the third analysis in Figure 11
Figure 13: Illegal subtrees of the third analysis in Figure 11
We weight all parse trees by their probability before decomposing them, so that the subtrees from the more likely analyses will be more likely to be reused in subsequent analyses. The subtrees in Figure 12 thus are not added to the inventory of patterns with a frequency of one, but with a frequency of one times the probability of the derivation. After all subtrees are added, we can recalculate the probability of every subtree in the same way as we did with DOP, viz. by dividing its (weighted) frequency by the summed (weighted) frequencies of all subtrees sharing the exact content of the root node, i.e. the semantic representation of it.
Expanding the DOP-idea to another representation, we can see what Data- Oriented Parsing can and cannot do, as well as what the assumptions of the model are. The model is not a category learner: semantic structures and their combinatoriality, and sound structures are assumed as given information to the model. The power of DOP is then to recognize productive, complex patterns in the structured data given these assumptions. To this end, the model tries all possible structures (the unsupervised component) as guided by the context, but uses this information judiciously by weighing it according to the model’s prior knowledge. The main assumptions of the model are in the construction of likely analyses, which have been argued to be driven by pragmatic processes (match with the situation and a mutual exclusivity bias), and the deconstruction of complex wholes, where slots are assumed to be necessary meaningful. This latter constraint is motivated by a learner’s desire to communicate: if a slot of a subtree specifies some meaning (an entity or a predicate) co-indexed with the subtree’s global interpretation pattern in its root node, the learner can use it productively for generating and understanding novel utterances. If no such constraint is present, the pattern cannot be used, and the learner will not bother to extract it.
-  More precisely, we used Simple Good-Turing smoothing (Gale and Sampson 1995).