Algorithm efficiency comparison

The corpora

To compare the association lists with the LSA lists, we prepared three distinct corpora on which to train the algorithm. The first consists of 51,574 press notes from the Polish Press Agency and contains over 2,900,000 words. This corpus describes reality very broadly, but it can be seen as restricted to a more formal subset of the language. It will be referred to as PAP.

The second corpus is a fragment of the National Corpus of Polish [PRZ 11], with 3,363 separate documents totaling over 860,000 words. This corpus is representative of the vocabulary of the language; however, its texts are relatively random, in the sense that they are not thematically grouped and do not follow any deeper semantic structure. It will be referred to as NCP.

The last corpus is composed of 10 short stories and one novel, Lalka (The Doll), by Bolesław Prus, a late 19th-century novelist whose Polish is close to the language used today. The texts are split into 10,346 paragraphs totaling over 300,000 words. The rationale behind this corpus was to try to model some historically deeply rooted semantic associations with such basic notions as dom (house, home). It will be referred to as PRUS.

All corpora were lemmatized using a dictionary-based approach [KOR 12].
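As a rough illustration of what a dictionary-based lemmatization step involves, the sketch below maps each token to its base form through a lookup table and leaves unknown tokens unchanged. It is not the method of [KOR 12]: the tiny dictionary, the tokenizer, and the names LEMMA_DICT and lemmatize are hypothetical and chosen only for this example.

```python
import re

# Hypothetical toy dictionary: inflected Polish form -> lemma.
# A real dictionary-based lemmatizer would load a full morphological lexicon.
LEMMA_DICT = {
    "domu": "dom",
    "domem": "dom",
    "domy": "dom",
    "lalki": "lalka",
    "lalkę": "lalka",
}

# Simplifying assumption: a token is any run of word characters.
TOKEN_RE = re.compile(r"\w+")


def lemmatize(text, lemma_dict=LEMMA_DICT):
    """Lowercase the text, tokenize it, and replace each token by its lemma.

    Tokens missing from the dictionary are returned unchanged.
    """
    tokens = TOKEN_RE.findall(text.lower())
    return [lemma_dict.get(token, token) for token in tokens]


if __name__ == "__main__":
    print(lemmatize("Lalki w domu, domy i lalkę."))
```

A plain lookup like this cannot resolve ambiguous forms that belong to more than one lemma, so a practical lemmatizer needs additional disambiguation beyond the table.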
