Classification scheme

In the current approach, each text was segmented into a set of sentences (sequences) by splitting at the punctuation marks in the set {‘.’, ‘!’, ‘?’, ‘:’, ‘...’}. The corpus was then Part-Of-Speech tagged and function words were extracted. The algorithm described in Fournier-Viger and Tseng [FOU 11] was then used to extract sequential and association rules over the function words and the Part-Of-Speech tag sequences from each text. These rules capture not only sequential information from the data but also structural information, because a text characterized by long sentences yields higher rule frequencies.
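The segmentation and function-word extraction steps can be sketched as follows. This is only an illustration under stated assumptions: the function-word list `FUNCTION_WORDS` is a tiny hypothetical stand-in (the list actually used is not given here), the tokenizer is a simple regular expression, and Part-Of-Speech tagging and rule mining are left out.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; the list used in the experiments is not specified here).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in",
                  "that", "it", "is", "but", "was"}

# Sentence delimiters named in the text: '.', '!', '?', ':', '...'.
# A run of these characters counts as one boundary, which also covers '...'.
_SENTENCE_BOUNDARY = re.compile(r"[.!?:]+")
_TOKEN = re.compile(r"[A-Za-z']+")


def split_sentences(text):
    """Split a text into non-empty sentences at the delimiter set."""
    parts = _SENTENCE_BOUNDARY.split(text)
    return [p.strip() for p in parts if p.strip()]


def function_word_sequences(text):
    """Return one ordered list of function words per sentence."""
    sequences = []
    for sentence in split_sentences(text):
        tokens = [t.lower() for t in _TOKEN.findall(sentence)]
        seq = [t for t in tokens if t in FUNCTION_WORDS]
        if seq:  # skip sentences that contain no function word
            sequences.append(seq)
    return sequences


sample = "It was late... The cat sat in the hall: a dog barked! Was that it?"
sentences = split_sentences(sample)
sequences = function_word_sequences(sample)
```

Each inner list in `sequences` is one sequence over which sequential rules could then be mined.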

Each text is then represented as a vector $R_K$ of frequencies of occurrence of rules, such that $R_K = \{r_1, r_2, \ldots, r_K\}$ is the set of the top-K rules in terms of support in the training set, ordered by decreasing normalized frequency of occurrence. Each text is also represented by a vector of normalized frequencies of occurrence of function words and of Part-Of-Speech tag 3-grams. Each frequency vector was normalized by the size of the text it represents. Our aim is, first, to compare the classification performance of the top-K function word sequential rules (SR) with that of the function word frequencies and, second, to compare the classification performance of the top-K Part-Of-Speech tag sequential rules with that of the 3-gram frequencies.
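A minimal sketch of this representation is given below. It rests on several simplifying assumptions that are not taken from the source: a rule is a hypothetical pair (antecedent, consequent) of single items, a rule matches a sequence when the antecedent occurs before the consequent, support is the number of matching training sequences, and the size of a text is the number of items in its sequences. The rule-mining algorithm of [FOU 11] itself is not reproduced.

```python
from collections import Counter


def rule_matches(rule, sequence):
    """True if the antecedent item occurs before the consequent item."""
    antecedent, consequent = rule
    if antecedent not in sequence:
        return False
    first = sequence.index(antecedent)
    return consequent in sequence[first + 1:]


def top_k_rules(candidate_rules, training_texts, k):
    """Rank candidate rules by support in the training set, keep the top k."""
    support = Counter()
    for text in training_texts:
        for seq in text:
            for rule in candidate_rules:
                if rule_matches(rule, seq):
                    support[rule] += 1
    # sorted() is stable, so ties keep the order of candidate_rules.
    ranked = sorted(candidate_rules, key=lambda r: -support[r])
    return ranked[:k]


def rule_vector(text, rules):
    """Rule frequencies for one text, normalized by the text size."""
    size = sum(len(seq) for seq in text)  # assumption: size = item count
    if size == 0:
        return [0.0] * len(rules)
    return [sum(rule_matches(r, seq) for seq in text) / size for r in rules]


# Two toy texts, each a list of function-word sequences (one per sentence).
text_a = [["it", "was"], ["the", "in", "the"], ["was", "that", "it"]]
text_b = [["the", "of", "the"], ["it", "was", "the"]]
candidates = [("it", "was"), ("the", "the"), ("was", "it"), ("of", "in")]

rules = top_k_rules(candidates, [text_a, text_b], k=3)
vector_a = rule_vector(text_a, rules)
vector_b = rule_vector(text_b, rules)
```

The same `rule_vector` shape applies to Part-Of-Speech tag sequences; only the items in the sequences change.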

Given the classification scheme described above, we used an SVM classifier to derive a discriminative linear model from our data. To obtain a reasonable estimate of the expected generalization performance, we used 5-fold cross-validation. The dataset was split into five equal subsets; classification was run five times, each time training on four subsets and holding out the remaining one for testing. The overall classification performance is taken as the average performance over these five runs. To evaluate the attribution performance, we used the common measures for supervised classification: precision (P), recall (R) and the F-measure $F_\beta$, where TP stands for true positive, TN for true negative, FP for false positive and FN for false negative:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}$$

We consider that precision and recall have the same weight, and hence we set $\beta$ equal to 1.
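The evaluation protocol can be illustrated with a short sketch. It assumes the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and F = (1 + β²)PR / (β²P + R). The helper names and the per-fold counts are hypothetical, the folds are contiguous (no shuffling or stratification, since neither is stated), and SVM training is omitted.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; each fold is the test set once."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F-measure from raw counts (0.0 if undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


folds = list(k_fold_indices(10, k=5))

# Hypothetical (TP, FP, FN) counts for five test folds, for illustration only.
fold_counts = [(8, 2, 4), (9, 1, 3), (7, 3, 5), (8, 2, 4), (10, 0, 2)]
fold_scores = [precision_recall_f(tp, fp, fn) for tp, fp, fn in fold_counts]

# Overall performance: average of each measure over the five runs.
mean_p = sum(s[0] for s in fold_scores) / len(fold_scores)
mean_r = sum(s[1] for s in fold_scores) / len(fold_scores)
mean_f = sum(s[2] for s in fold_scores) / len(fold_scores)
```

With `beta=1.0` the F-measure is the harmonic mean of precision and recall, which matches the equal weighting chosen above.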
