Types and Tokens
We use word type to refer to any unique word, a string of letters delimited by spaces or punctuation marks, and word token to refer to each occurrence of a word type in the transcript, however often that type recurs. For each transcript, text length is measured by the number of word tokens, while vocabulary size is measured by the number of word types.1 In 1982, the 20 interviews consisted of 17,707 types and 72,560 tokens; in 2017, the 20 transcripts contained 17,134 types and 90,414 tokens. Of the roughly 17,000 word types in each recording year, more than half (11,688 in 1982 and 11,337 in 2017) occurred only once, underscoring that many of the words speakers use are indeed quite rare. In our corpus, however, because the same interview questions are used for all speakers and across both time periods, there is considerable overlap in the topics spoken about, e.g., hobbies, favourite books and films, making Spätzle ‘Swabian egg noodles’ and Maultaschen ‘Swabian ravioli’, and local activities and festivals.
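The type, token, and hapax (once-occurring type) counts above can be illustrated with a minimal sketch; the tokenisation rule here (runs of letters delimited by spaces or punctuation) follows the definition in the text, while the function name and the sample sentence are our own for illustration.

```python
import re
from collections import Counter

def type_token_counts(text):
    """Count word tokens, word types, and hapax legomena in a transcript.

    A token is any run of letters delimited by spaces or punctuation;
    a type is a unique token string (case-folded).
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letter runs, incl. umlauts
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapaxes": hapaxes}

# Hypothetical mini-transcript for illustration
sample = "making Spätzle and Maultaschen, and making Spätzle again"
print(type_token_counts(sample))
# → {'tokens': 8, 'types': 5, 'hapaxes': 2}
```

Applied to a full transcript, `tokens` gives the text length and `types` the vocabulary size in the sense used above.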
The most straightforward measure for investigating differences in word use between texts is the size of the vocabulary (Baayen 2001). However, vocabulary size is dependent on text length, which, for the present study, is the length of the interview. Quite naturally, the longer the interview, the greater the opportunity for the speaker to utter a new word. Simple ways to sidestep this problem are to either base the analysis on a comparison of texts that are the same length or to plot interpolated vocabulary growth curves side-by-side for texts of differing lengths (Baayen 2001, 2008). Due to the nature of our spontaneously spoken sociolinguistic interviews, we chose the latter approach: vocabulary growth curves are constructed by counting the number of tokens at equally spaced measurement points throughout the text (referred to as token time) and graphing the corresponding count of word types. Such a curve depicts how the vocabulary increases throughout the text: it is typically quite steep at first and then flattens, as more and more of the incoming tokens belong to word types that have already been encountered. By plotting two vocabulary growth curves side-by-side, core properties of the different dynamics between types and tokens become available for visual inspection and statistical evaluation.
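The construction of a vocabulary growth curve can be sketched as follows; the function name and the number of measurement points are illustrative choices, not taken from the study.

```python
def vocabulary_growth(tokens, n_points=10):
    """Sample the cumulative type count at n_points equally spaced
    positions in token time, yielding (tokens_seen, types_seen) pairs."""
    seen = set()
    types_at = []                      # cumulative type count after each token
    for tok in tokens:
        seen.add(tok)
        types_at.append(len(seen))
    step = len(tokens) / n_points
    return [(round(step * (i + 1)), types_at[round(step * (i + 1)) - 1])
            for i in range(n_points)]

# Toy token sequence: the curve rises quickly, then flattens
print(vocabulary_growth(list("aababcabcd"), n_points=5))
# → [(2, 1), (4, 2), (6, 3), (8, 3), (10, 4)]
```

Plotting these pairs for two texts side-by-side gives the interpolated growth curves described above.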
For the analysis of frequency of use as one progresses through a text or corpus, statistical methods based on the urn model (Johnson and Kotz 1977) have the disadvantage that they build on the assumption that words are used independently in text. As shown by Baayen (1996), topical cohesion in discourse can lead to a substantial divergence between model prediction and actual vocabulary development. In the present study, we therefore opted for a randomisation-based method. To avoid artefactually inflating vocabulary growth estimates, as would result from within-text randomisation and its concomitant destruction of topical structure, we randomised the order of complete interviews instead. This is a natural choice, as the set of interviews does not have any intrinsic order and is not governed by an overall cohesive narrative. For a given analysis, we permuted the sequence of entire interviews 50 times. For each of the 50 permutations, we calculated the vocabulary size at equally spaced measurement points, called text chunks (due to the varying lengths of the interviews, we used 100 text chunks for dialect and 200 for the standard language). For each text chunk, we applied the Wilcoxon test to evaluate whether vocabulary sizes at a given token time differed significantly between 1982 and 2017. We also added outer polygons to the permutation-based vocabulary sizes to provide non-parametric confidence intervals indicating the uncertainty regarding vocabulary size. The following section presents our analysis and the results.
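The permutation scheme just described can be sketched as below: whole interviews (not individual words) are shuffled, vocabulary size is recorded at equally spaced chunks for each permutation, and the per-chunk minima and maxima yield a non-parametric envelope like the outer polygons mentioned above. Function names, the seed, and the chunk count are illustrative assumptions; the per-chunk Wilcoxon comparison between years would then be run on the resulting per-chunk size distributions (e.g. with `scipy.stats.wilcoxon`).

```python
import random

def permuted_growth(interviews, n_perm=50, n_chunks=10, seed=1):
    """For n_perm random orderings of whole interviews (each a list of
    tokens), record vocabulary size at n_chunks equally spaced points in
    token time. Returns one list of sizes per chunk, across permutations."""
    rng = random.Random(seed)
    sizes = [[] for _ in range(n_chunks)]
    for _ in range(n_perm):
        order = interviews[:]
        rng.shuffle(order)             # permute interview order, not words
        tokens = [t for iv in order for t in iv]
        step = len(tokens) / n_chunks
        seen, growth = set(), []
        for tok in tokens:
            seen.add(tok)
            growth.append(len(seen))
        for i in range(n_chunks):
            sizes[i].append(growth[round(step * (i + 1)) - 1])
    return sizes

def envelope(sizes):
    """Per-chunk (min, max) across permutations: a non-parametric band."""
    return [(min(s), max(s)) for s in sizes]
```

Because only the order of interviews varies, the final chunk always reflects the full corpus vocabulary; the spread at intermediate chunks shows how much vocabulary growth depends on interview order.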