Home arrow Language & Literature arrow COGNITIVE APPROACH TO NATURAL LANGUAGE PROCESSING

Experimental setup

This section presents the experimental setup of our approach. We first describe the dataset, and then present the classification scheme and the algorithm used in the experiment. The results and discussion are presented in the next section.

Dataset

To test the effectiveness of sequential rules over part-of-speech tags and function words for authorship attribution, we used texts written by Balzac, Dumas, France, Gautier, Hugo, Maupassant, Proust, Sand, Sue and Zola. This choice was motivated by our special interest in the classic French literature of the 19th century, and by the availability of electronic texts from these authors on the Project Gutenberg website[1] and in the Gallica electronic library[2]. We also wanted to cover the most important writing styles and trends of this period. For each of the 10 authors, we collected four novels, for a total of 40 novels.

The next step was to divide these novels into smaller pieces of text in order to have enough data instances to train the attribution algorithm. Researchers working on authorship attribution in literary data have used different dividing strategies. For example, Hoover [HOO 03] took just the first 10,000 words of each novel as a single text, while Argamon and Levitan [ARG 05] treated each chapter of each book as a separate text. In our experiment, we chose to slice the novels into pieces whose size is that of the smallest novel in the collection, measured in number of sentences. This choice respects the condition proposed by Eder [EDE 13], which specifies the smallest reasonable text size needed to achieve good attribution. More information about the dataset is presented in Table 8.2.
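The slicing strategy described above can be sketched in Python as follows. This is only an illustrative sketch, not the procedure actually used in the experiment: the function names `split_sentences` and `slice_corpus`, the naive regular-expression sentence splitter, and the decision to drop a trailing piece shorter than the chunk size are all assumptions, since the text does not specify these details.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def slice_corpus(novels, drop_remainder=True):
    """Slice every novel into pieces of equal size, measured in sentences.

    novels: dict mapping (author, title) -> raw text of the novel.
    The piece size is the sentence count of the smallest novel.
    Returns (chunk_size, list of (author, title, list_of_sentences)).
    """
    sentences = {key: split_sentences(text) for key, text in novels.items()}
    chunk_size = min(len(s) for s in sentences.values())
    if chunk_size == 0:
        raise ValueError("at least one novel contains no sentences")

    chunks = []
    for (author, title), sents in sentences.items():
        for start in range(0, len(sents), chunk_size):
            chunk = sents[start:start + chunk_size]
            # Assumption: a trailing piece shorter than chunk_size is discarded.
            if drop_remainder and len(chunk) < chunk_size:
                continue
            chunks.append((author, title, chunk))
    return chunk_size, chunks


if __name__ == "__main__":
    # Toy stand-ins for full novels (hypothetical data).
    toy = {
        ("A", "novel 1"): "One. Two. Three. Four. Five.",
        ("B", "novel 2"): "Uno. Dos.",
    }
    size, pieces = slice_corpus(toy)
    print(size, len(pieces))
```

Because the piece size is fixed in sentences rather than words, the resulting texts can differ in word count from one author to another, depending on typical sentence length.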

Author name            Number of words   Number of texts
Balzac, Honore de               548778                20
Dumas, Alexandre                320263                26
France, Anatole                 218499                21
Gautier, Theophile              325849                19
Hugo, Victor                    584502                39
Maupassant, Guy de              186598                20
Proust, Marcel                  700748                38
Sand, George                    560365                51
Sue, Eugene                    1076843                60
Zola, Emile                     581613                67

Table 8.2. Statistics for the dataset used in our experiment
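As a quick check on the figures in Table 8.2, the short Python sketch below totals the two numeric columns and derives the average number of words per text for each author. The variable names are illustrative, and the derived averages are our own arithmetic on the table rather than values reported in the source.

```python
# Rows copied from Table 8.2: (author, number of words, number of texts).
TABLE_8_2 = [
    ("Balzac", 548778, 20),
    ("Dumas", 320263, 26),
    ("France", 218499, 21),
    ("Gautier", 325849, 19),
    ("Hugo", 584502, 39),
    ("Maupassant", 186598, 20),
    ("Proust", 700748, 38),
    ("Sand", 560365, 51),
    ("Sue", 1076843, 60),
    ("Zola", 581613, 67),
]

total_words = sum(words for _, words, _ in TABLE_8_2)
total_texts = sum(texts for _, _, texts in TABLE_8_2)

# Average words per text, per author (derived, not reported in the source).
words_per_text = {author: words / texts for author, words, texts in TABLE_8_2}

if __name__ == "__main__":
    for author, avg in words_per_text.items():
        print(f"{author:<12}{avg:>10.0f} words per text")
    print(f"Total: {total_words} words in {total_texts} texts")
```

The totals come to 5,104,058 words across 361 texts. The number of texts per author ranges from 19 to 67, so the classes are not balanced, which may be worth keeping in mind when reading per-author results.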

[1] http://www.gutenberg.org/
[2] http://gallica.bnf.fr/
 