Introduction and motivation
Authorship attribution is the task of identifying the author of a given document. The authorship attribution problem can typically be formulated as follows: given a set of candidate authors for whom samples of written text are available, the task is to assign a text of unknown authorship to one of these candidate authors [STA 09].
Chapter written by Mohamed Amine Boukhaled and Jean-Gabriel Ganascia.
This problem has been addressed mainly as a problem of multi-class discrimination, or as a text categorization task [SEB 02]. Text categorization is a useful way to organize large document collections. Authorship attribution, as a subtask of text categorization, assumes that the categorization scheme is based on the authorial information extracted from the documents. Authorship attribution is a relatively old research field. A first scientific approach to the problem was proposed in the late 19th century, in the work of Mendenhall in 1887, who studied the authorship of texts attributed to Bacon, Marlowe and Shakespeare. More recently, the problem of authorship attribution has gained importance owing to new applications in forensic analysis and humanities scholarship [STA 09].
To achieve high authorship attribution accuracy, we should use features that are most likely to be independent of the topic of the text. Researchers broadly agree that function words are the most reliable indicator of authorship. There are two main reasons for using function words in lieu of other markers. First, because of their high frequency in written text, function words are very difficult to control consciously, which minimizes the risk of false attribution. Second, function words, unlike content words, are largely independent of the topic or genre of the text, and hence we should not expect to find great differences in their frequencies across texts written by the same author on different topics [CHU 07]. Part-Of-Speech-based markers have also been shown to be very effective because they partly share the advantages of function words [STA 09].
Although function word-based markers are state-of-the-art, they rely on the bag-of-words assumption, which stipulates that a text is a set of independent tokens. This approach completely ignores the syntactic structure and the latent sequential information of the text. The same is partly true of Part-Of-Speech n-grams, since they rest on the underlying assumption that a text is a set of independent n-token segments. De Roeck et al. [DER 04] have shown that frequent words, including function words, are not distributed homogeneously over a text, which is evidence that the bag-of-words assumption is invalid. Indeed, critiques within the field of authorship attribution charge that many works are based on invalid assumptions [RUD 97] and that researchers focus on attribution techniques rather than on devising new style markers that are more precise and rest on weaker assumptions.
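To make the limitation concrete, the following minimal Python sketch shows that a function word frequency profile is unchanged when the tokens of a text are reordered, whereas even the simplest sequential feature (adjacent token pairs) is not. The function word list, the sample sentence and the helper names `function_word_profile` and `bigrams` are assumptions introduced for illustration only; they are not the feature set or the corpus used in this chapter.

```python
from collections import Counter

# Hypothetical, very small function word list (illustration only).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a")


def function_word_profile(tokens):
    """Relative frequency of each listed function word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in FUNCTION_WORDS]


def bigrams(tokens):
    """Ordered pairs of adjacent tokens, the simplest sequential feature."""
    return list(zip(tokens, tokens[1:]))


# Made-up sample text, tokenized naively on whitespace.
tokens = "the name of the rose and the sign of the four".split()
# Same tokens in reverse order: word order is destroyed, counts are not.
reordered = list(reversed(tokens))

# The bag-of-words profile cannot tell the two orderings apart...
same_profile = function_word_profile(tokens) == function_word_profile(reordered)
# ...while the multiset of adjacent pairs differs between them.
same_bigrams = Counter(bigrams(tokens)) == Counter(bigrams(reordered))

print("identical function word profiles:", same_profile)
print("identical bigram multisets:", same_bigrams)
```

Any marker computed from token counts alone is blind to reordering in this way, which is why sequence-aware markers may carry additional stylistic information.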
In an effort to develop more complex yet computationally feasible stylistic features that are more linguistically motivated, Hoover [HOO 03] pointed out that exploiting the sequential information present in the text could be a promising line of work. He showed that frequent word sequences and collocations can be used with high reliability for stylistic attribution. In another study, Quiniou et al. [QUI 12] demonstrated the value of sequential data mining methods for the stylistic analysis of large texts. They claimed that relevant and understandable patterns that may be characteristic of a specific type of text can be extracted using sequential data mining techniques.
Following this line of thought, we study the problem of authorship attribution in classic French literature. Our aim is to evaluate the effectiveness of style markers extracted using sequential data mining techniques for authorship attribution. In this contribution, we focus on extracting style markers using sequential rule mining. We compare the results given by these new style markers with those of state-of-the-art features such as function word frequencies and Part-Of-Speech n-grams, and we assess whether this type of marker is sufficient for accurate identification of authors.
The rest of the chapter is organized as follows. In section 8.2, we give a theoretical overview of the computational authorship attribution process. Then, in section 8.3, we present our working hypothesis and its corresponding stylistic markers. In section 8.4, we cast the sequential data mining problem in our context and explain how the sequential rule-based style markers are extracted. The experimental evaluation settings are presented in section 8.5, in which we describe the dataset used in the experiment and then present the classification scheme and algorithm employed. The results and their discussion are presented in section 8.6. Finally, section 8.7 concludes the chapter.