Human performance measures
This study proposes that language models can be benchmarked by item-level performance on three datasets that are openly available in online databases. Predictability was taken from the Potsdam Sentence Corpus, first published by Kliegl et al. [KLI 04]. Its 144 sentences consist of 1,138 tokens, available in Appendix A of [DAM 09], and the logit-transformed CCP measures of word predictability were retrieved from Ralf Engbert’s homepage [ENG 05]. For instance, in the sentence “Manchmal sagen Opfer vor Gericht nicht die volle Wahrheit” [Sometimes victims do not tell the whole truth in court.], the last word has a CCP of 1. N400 amplitudes were taken from the 343 open-class words published in Dambacher and Kliegl [DAM 07]. These are available from the Potsdam Mind Research Repository. The EEG data published there are based on a previous study (see [DAM 06] for method details). To quantify the N400, the voltage at 10 centroparietal electrodes was averaged from 300 to 500 ms after word presentation, across up to 48 artifact-free participants. SFDs are based on the same 343 words from Dambacher and Kliegl [DAM 07] and are available from the same source. A word's data were included only when the word was fixated exactly once, and these SFDs ranged from 50 to 750 ms. The SFDs were averaged across up to 125 native speakers of German [DAM 07].
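The item-level aggregation described above for the N400 and the SFD can be illustrated with a minimal Python sketch. It is not the original analysis code: the function names, the nested-list data layout, and the decision to treat the 50 to 750 ms range as an inclusion criterion (rather than as the observed range) are assumptions made for illustration only.

```python
from statistics import mean

N400_WINDOW_MS = (300, 500)  # time window stated in the text
SFD_RANGE_MS = (50, 750)     # range stated in the text; used as a filter by assumption


def n400_amplitude(epoch, times_ms, window=N400_WINDOW_MS):
    """Mean voltage over electrodes and over samples inside the window.

    epoch: one list of voltages per electrode (hypothetical layout)
    times_ms: sample times in ms relative to word presentation
    """
    lo, hi = window
    idx = [i for i, t in enumerate(times_ms) if lo <= t <= hi]
    if not idx:
        raise ValueError("no samples inside the N400 window")
    # Average over time within each electrode, then over electrodes.
    return mean(mean(chan[i] for i in idx) for chan in epoch)


def item_n400(participant_epochs, times_ms):
    """Average the N400 amplitude for one word across participants."""
    return mean(n400_amplitude(e, times_ms) for e in participant_epochs)


def item_sfd(fixations, sfd_range=SFD_RANGE_MS):
    """Mean single-fixation duration for one word, or None if no data.

    fixations: one list of fixation durations (ms) per reader. A reader
    contributes only if the word was fixated exactly once and that
    duration lies inside sfd_range.
    """
    lo, hi = sfd_range
    single = [f[0] for f in fixations if len(f) == 1 and lo <= f[0] <= hi]
    return mean(single) if single else None
```

Averaging within each electrode first and then across electrodes gives the same result as a single grand mean here, because every electrode contributes the same number of samples. In this sketch, readers who fixated a word more than once simply drop out of that word's SFD average.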