Data

The analysis presented in this paper is based on usage data from historical and modern corpora (summarised in Table 1) and from attestations from broadcast media and the web.

Table 1: The corpora used in this investigation (arranged chronologically by start date)

Corpus | Written or spoken | Contents | Total tokens | Period
Early Modern Dutch Corpus (EMDC) | written | diaries, drama, prose (fiction, academic, non-academic) | c. 300,000 | 16th-19th century
Eindhoven | written (83.3%), spoken (16.7%) | journalism, popular science, fiction, speech | c. 720,000 | 1960-1973
Corpus Gesproken Nederlands (CGN) | spoken | conversations, broadcasts, lectures, speeches; read-aloud texts | c. 9 million | 1991-2003
INL 27 Miljoen Woorden Krantencorpus (INL 27 Mil.) | written | journalism (NRC Handelsblad) | c. 27 million | 1994-1995

The Early Modern Dutch Corpus (EMDC) was compiled especially for the project of which the research reported here forms a part. The aim was to produce a balanced corpus of written language use covering a variety of text types, from relatively informal egodocuments to formal academic prose. To this end, a corpus was compiled from texts held by the online Digitale Bibliotheek voor de Nederlandse Letteren (DBNL); the corpus contains three sub-corpora of 100,000 tokens each: the Gouden Eeuw ‘golden age’ (16th and 17th centuries combined, following the distinction made in the DBNL), the 18th century and the 19th century. Each of these sub-corpora consists of five genre-defined sub-corpora of 20,000 tokens each: diaries, drama, fictional prose, academic prose and non-academic prose. The genres and registers covered were chosen so as to exploit the various text types covered within the DBNL as far as possible.
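The balanced design of the EMDC can be laid out as a simple period-by-genre grid. The following is only an illustrative sketch of the token arithmetic: the names `PERIODS`, `GENRES`, `TOKENS_PER_CELL` and `emdc_design` are hypothetical and are not part of the corpus or its tooling, and the figures are the approximate targets stated above.

```python
# Illustrative sketch of the EMDC sampling design described in the text.
# All names here are hypothetical; the figures are the stated targets.

PERIODS = ["Gouden Eeuw (16th-17th c.)", "18th century", "19th century"]
GENRES = [
    "diaries",
    "drama",
    "fictional prose",
    "academic prose",
    "non-academic prose",
]
TOKENS_PER_CELL = 20_000  # target size of each genre-defined sub-corpus


def emdc_design():
    """Return a mapping from (period, genre) to the target token count."""
    return {(p, g): TOKENS_PER_CELL for p in PERIODS for g in GENRES}


design = emdc_design()

# Tokens per period sub-corpus: 5 genres x 20,000 tokens
tokens_per_period = {
    p: sum(n for (period, _), n in design.items() if period == p)
    for p in PERIODS
}

# Whole corpus: 3 periods x 100,000 tokens
total_tokens = sum(design.values())
```

Under these assumptions each period sub-corpus sums to 100,000 tokens and the corpus as a whole to 300,000, matching the figure given in Table 1.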

The orthography of the corpus examples, both historical and modern, is reproduced unaltered throughout this chapter.

 