Methodology
This section describes the method used to obtain the kind of grouping mentioned above. Messages can be organized along various dimensions and from various viewpoints: conceptual relations (taxonomic, i.e. set inclusion, causal, temporal, etc.), rhetorical relations (concession, disagreement), etc. We focus here only on the former, assuming that messages can, at least to some extent, be organized via the semantics[1] of their constituent elements.
Put differently, in order to reveal the relative proximity or relation among a set of messages, we may consider the similarity of some of their constituent elements. Summing similarity values is a typical component of a vector space model and has been well described by [WID 04, MAN 08]. We need to be careful with “similarity”, though: similarity between words does not guarantee “relatedness”, even though it may be one of its preconditions. Indeed, many researchers have used this feature for sentence similarity detection [BUL 07, TUR 06], but most of them based their analysis on the surface form. This may lead to erroneous results, because similar meanings can be expressed via very different syntactic categories (e.g. “use for” vs. “instrument”, “have” vs. “her”). Likewise, a given form or linguistic resource, say the possessive adjective, may encode very different meanings. Compare his car, his father and his toe, which express quite different relations: ownership, family relationship and inalienable part of the human body.
What we present here is very preliminary work. Hence, our method is designed to address only very simple cases: two-place predicates, i.e. sentences composed of two nouns (a subject and an object) and a (linking) predicate. Given a set of inputs of this kind, our program determines their proximity regardless of their surface forms. The sentences are clustered on the basis of the semantic similarity between their constituent words. This yields a tree whose nodes are categories (whose type should ideally be expressed explicitly, e.g. food, color, etc.) and whose leaves are the messages or propositions given as input.
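The idea of comparing two-place predicates word by word can be sketched as follows. This is only an illustration under stated assumptions, not the program described in this chapter: the `Triple` type, the toy category table and the scores 1.0, 0.5 and 0.0 are all hypothetical, and the table stands in for a real lexical resource. The example sentences are invented for the illustration and are not those of Figure 7.4(a).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A two-place predicate: subject, (linking) predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical toy lexicon (word -> category); a stand-in for a real resource.
TOY_CATEGORY = {
    "hen": "food",
    "egg": "food",
    "forest": "place",
    "den": "place",
}


def word_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical words, 0.5 for words sharing a toy category, else 0.0."""
    if a == b:
        return 1.0
    cat_a, cat_b = TOY_CATEGORY.get(a), TOY_CATEGORY.get(b)
    return 0.5 if cat_a is not None and cat_a == cat_b else 0.0


def triple_similarity(s: Triple, t: Triple) -> float:
    """Average the similarities of the words that play the same role."""
    scores = [
        word_similarity(s.subject, t.subject),
        word_similarity(s.predicate, t.predicate),
        word_similarity(s.obj, t.obj),
    ]
    return sum(scores) / len(scores)


eats_hen = Triple("fox", "eat", "hen")
eats_egg = Triple("fox", "eat", "egg")
lives_forest = Triple("fox", "live", "forest")

print(triple_similarity(eats_hen, eats_egg))
print(triple_similarity(eats_hen, lives_forest))
```

With this toy table, the two "eat" sentences come out closer to each other than either does to the "live" sentence, because their objects share a category even though the words differ.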
In the following sections, we explain our approach in more detail, using the inputs shown in Figure 7.4(a) as an illustration. The goal is to cluster these messages by topic in order to create a kind of outline or topical tree. Indeed, {1, 4} address physical features (appearance), {2, 6} provide spatial information, i.e. the place where foxes live or hide (habitat), while {3, 5, 7} deal with their habits. This last category can be split into two subtopics, in our case “theft” {3} and “consumption” {5, 7}. The result of this analysis can be displayed in the form of a tree (Figure 7.4(b)).[2]

Figure 7.4(a) Conceptual input (messages)

Figure 7.4(b) Clustered output (topic tree)
To achieve this result, we have defined an algorithm that carries out the steps listed in Table 7.2. We describe and explain them in more depth in the following sections. Note that what we called a message above is from now on called a sentence, which is processed by a parser.
1) Determine the role of words, i.e. perform a syntactic analysis;
2) Find potential seed words;
3) Align words playing the same role in different sentences;
4) Determine the semantic proximity between the aligned words;
5) Determine the similarity between sentences;
6) Group sentences according to their semantic affinity (similarity).
Table 7.2. Main steps for topic clustering
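To make the last step of Table 7.2 more concrete, the following is a minimal sketch of how sentences could be grouped into a tree once their pairwise similarities (step 5) are known. It assumes average-link agglomerative clustering, a standard technique chosen here purely for illustration and not necessarily the one used in this chapter. The sentence identifiers and the similarity values in `TOY_SIM` are hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise sentence similarities (the output of step 5).
# Pairs that are not listed get the default score of 0.1.
TOY_SIM = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"c", "d"}): 0.8,
}


def sentence_similarity(x: str, y: str) -> float:
    """Look up the (assumed) similarity of two sentence identifiers."""
    return TOY_SIM.get(frozenset({x, y}), 0.1)


def leaves(node):
    """Return the sentence identifiers found under a tree node."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def average_link(x, y, sim):
    """Average similarity over all pairs of leaves of two nodes."""
    pairs = [(a, b) for a in leaves(x) for b in leaves(y)]
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def cluster(sentences, sim):
    """Repeatedly merge the two most similar nodes until one tree remains."""
    nodes = list(sentences)
    while len(nodes) > 1:
        i, j = max(
            combinations(range(len(nodes)), 2),
            key=lambda p: average_link(nodes[p[0]], nodes[p[1]], sim),
        )
        merged = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]


tree = cluster(["a", "b", "c", "d"], sentence_similarity)
print(tree)  # (('a', 'b'), ('c', 'd'))
```

The result is a binary tree whose leaves are the input sentences. Labelling the internal nodes with a category name (e.g. food, habitat) is a separate problem that this sketch does not address.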
- [1] Of course, the term semantics can mean many things (association, shared elements between a set of words, etc.), and which of them an author is referring to needs to be made explicit.
- [2] Note that generally we can come up with more than one tree: any set of data allows for multiple analyses (depending on the point of view) and multiple rhetorical effects.