# Sequential data mining for stylistic analysis

Sequential data mining is a data mining subdomain introduced by Agrawal et al. [AGR 93], which is concerned with finding interesting characteristics and patterns in sequential databases. Sequential rule mining is one of the most important sequential data mining techniques and is used to extract rules describing a set of sequences. In what follows, for the sake of clarity, we limit our definitions and notation to those necessary to understand our experiment.

Considering a set of literals called items, denoted by $I = \{i_1, \ldots, i_n\}$, an itemset is a set of items $X \subseteq I$. A sequence $S$ (single-item sequence) is an ordered list of items, denoted by $S = \langle i_1, \ldots, i_n \rangle$, where $i_1$ to $i_n$ are items.

| Sequence ID | Sequence |
|---|---|
| 1 | < a, b, d, e > |
| 2 | < b, c, e > |
| 3 | < a, b, d, e > |

Table 8.1. Sequence database SDB

A sequence database SDB is a set of tuples (id, S), where id is the sequence identifier and S is a sequence. Interesting characteristics can be extracted from such databases using sequential rule and pattern mining. A sequential rule $R : X \rightarrow Y$ is defined as a relationship between two itemsets $X$ and $Y$ such that $X \cap Y = \emptyset$. This rule can be interpreted as follows: if the itemset $X$ occurs in a sequence, the itemset $Y$ will occur afterward in the same sequence. Several algorithms have been developed to extract this type of rule efficiently, such as the one proposed by Fournier-Viger and Tseng [FOU 11]. For example, if we run this algorithm on the SDB containing the three sequences presented in Table 8.1, we obtain sequential rules such as “a → d, e” with a support equal to 2, which means that this rule is respected by two sequences in the SDB (i.e. there are two sequences of the SDB in which we find the item a, and also find d and e afterward in the same sequence).
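To make the notion of support concrete, the following minimal sketch counts the sequences of Table 8.1 in which a given rule holds. It is an illustration only, not the mining algorithm of [FOU 11]: the function names `rule_occurs` and `support` are hypothetical, and the check assumes the interpretation given above, namely that every item of the antecedent appears before every item of the consequent in the same sequence.

```python
# Sequence database from Table 8.1: sequence ID -> ordered list of items.
SDB = {
    1: ["a", "b", "d", "e"],
    2: ["b", "c", "e"],
    3: ["a", "b", "d", "e"],
}


def rule_occurs(sequence, antecedent, consequent):
    """Return True if all antecedent items occur before all consequent items.

    Assumption: the sequence can be cut at some position so that the
    antecedent is contained in the left part and the consequent in the right.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    for split in range(1, len(sequence)):
        before, after = set(sequence[:split]), set(sequence[split:])
        if antecedent <= before and consequent <= after:
            return True
    return False


def support(sdb, antecedent, consequent):
    """Count the sequences of the database in which the rule holds."""
    return sum(
        rule_occurs(sequence, antecedent, consequent)
        for sequence in sdb.values()
    )


# The rule a -> d, e holds in sequences 1 and 3, but not in sequence 2.
print(support(SDB, {"a"}, {"d", "e"}))  # 2
```

A brute-force check such as this one is sufficient for a three-sequence example; dedicated rule mining algorithms avoid enumerating and testing every candidate rule in this way.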

In our study, the text is first segmented into a set of sentences, and each sentence is then mapped into two sequences: one for the function words appearing in order in that sentence, and another for the Part-Of-Speech tags resulting from its syntactic analysis. For example, the sentence “J’aime ma maison où j’ai grandi.” will be mapped to < je, ma, où, je > as a sequence of French function words, and to < PRO:PER, VER:pres, DET:POS, NOM, PRO:REL, PRO:PER, VER:pres, VER:pper, SENT > as a sequence of Part-Of-Speech tags. “je → où”, “ma → où, je” and “DET:POS, NOM → SENT” are examples of sequential rules respected by these sequences. The whole text will thus produce two sequential databases, one for the function words and another for the Part-Of-Speech tags. The rules extracted in our study capture, for instance, the cadence that authors follow when using function words in their writing. This gives us more explanatory insight into the syntactic writing style of a given author than frequencies of function words or Part-Of-Speech n-grams could offer.
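The mapping from a sentence to its function-word sequence can be sketched as follows. This is only an illustration under strong assumptions: `FUNCTION_WORDS` and `ELISIONS` are tiny hypothetical lists chosen to cover the example sentence, and the regular-expression tokenizer stands in for whatever segmentation tool is actually used. The Part-Of-Speech sequence would be produced by a tagger and is not reproduced here.

```python
import re

# Hypothetical, deliberately tiny list of French function words.
FUNCTION_WORDS = {"je", "ma", "où"}

# Hypothetical table expanding elided forms to their full form.
ELISIONS = {"j'": "je"}


def function_word_sequence(sentence):
    """Return the function words of a sentence, in order of appearance."""
    # Keep a trailing apostrophe attached to the token so that elided
    # clitics such as "j'" can be recognised and expanded.
    tokens = re.findall(r"\w+['’]?", sentence.lower())
    sequence = []
    for token in tokens:
        token = token.replace("’", "'")   # normalise typographic apostrophes
        word = ELISIONS.get(token, token)
        if word in FUNCTION_WORDS:
            sequence.append(word)
    return sequence


print(function_word_sequence("J’aime ma maison où j’ai grandi."))
# ['je', 'ma', 'où', 'je']
```

Applying such a function to every sentence of a text yields one sequence per sentence, that is, the function-word sequential database from which rules are then mined.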