Three flavors of language models
Language models are based on a probabilistic description of language phenomena. Probabilities are used to pick the most fluent of several alternatives, e.g. in machine translation or speech recognition. Word n-gram models are defined by a Markov chain of order n-1, where the probability of the following word depends only on the previous n-1 words. In statistical models, the probability distribution over the vocabulary, given a history of n-1 words, is estimated from n-gram counts in (large) natural language corpora. There exists a range of n-gram language models, which differ in the way they handle unseen events and perform probability smoothing (see, e.g., Chapter 3 in [MAN 99]). Here, we use a Kneser-Ney [KNE 95] 5-gram model. For each word in the sequence, the language model computes a probability p in ]0; 1[. We use the logarithm log(p) of this probability as a predictor. We used all words in their full form, i.e. we did not filter for specific word classes and did not perform lemmatization. N-gram language models are known to model local syntactic structure very well. Since n-gram models use only the most recent history for predicting the next token, they fail to account for long-range phenomena and semantic coherence (see [BIE 12]).
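As an illustration of how a per-word predictor log(p) is obtained from n-gram counts, consider the following minimal sketch. It is not the Kneser-Ney 5-gram model used here: for brevity it assumes simple add-k smoothing and a toy bigram setting, and all function names, the sentence-start symbol and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def train_ngram_counts(tokens, n):
    """Count n-grams and their (n-1)-word histories in a token list."""
    padded = ["<s>"] * (n - 1) + tokens  # pad so the first word has a history
    ngrams, histories = Counter(), Counter()
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        ngrams[history + (padded[i],)] += 1
        histories[history] += 1
    return ngrams, histories

def word_log_probs(tokens, n, ngrams, histories, vocab_size, k=1.0):
    """Return log p(w_i | previous n-1 words) for every word, add-k smoothed.

    Add-k smoothing is an assumption made for brevity; it stands in for
    the Kneser-Ney smoothing named in the text.
    """
    padded = ["<s>"] * (n - 1) + tokens
    result = []
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        count = ngrams[history + (padded[i],)]
        p = (count + k) / (histories[history] + k * vocab_size)
        result.append(math.log(p))
    return result

# Hypothetical toy background corpus and test sentence.
corpus = "the cat sat on the mat the cat ate".split()
vocab_size = len(set(corpus))
ngrams, histories = train_ngram_counts(corpus, n=2)
scores = word_log_probs("the cat sat".split(), 2, ngrams, histories, vocab_size)
```

Each entry of scores is the log-probability of one word given its preceding word. Because smoothing keeps every probability strictly between 0 and 1, all predictor values are finite and negative.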
Latent Dirichlet Allocation (LDA) topic models [BLE 03] are generative probabilistic models representing documents as a mixture of a fixed number of N topics, which are defined as unigram probability distributions over the vocabulary. Topic distributions are inferred through a sampling process such as Gibbs sampling. Words frequently co-occurring in the same documents receive a high probability in the same topics. When sampling the topic distribution for a sequence of text, each word is randomly assigned to a topic according to the document-topic distribution and the topic-word distribution. We use Phan and Nguyen’s [PHA 07] GibbsLDA implementation for training an LDA model with 200 topics (default values for α = 0.25 and β = 0.001) on a background corpus. Words occurring in too many documents (a.k.a. stopwords) or too few documents (mistyped or rare words) were removed from the LDA vocabulary. We then retain the per-document topic distribution p(z | d) and the per-topic word distribution p(w | z), where z is the latent variable representing the topic, d refers to a
full document during training (during testing, d refers to the history of the current sentence), and w is a word. In contrast to our earlier approach using only the top three topics [BIE 15], we here compute the probability of the current word w given its history d as a mixture of its topical components: p(w | d) = Σ_z p(w | z) p(z | d), where the sum runs over all topics z. We hypothesize that topic models account for
some long-range semantic aspects missing in n-gram models. While Bayesian topic models are probably the most widespread approach to semantics in psychology (e.g. [GRI 07]), latent semantic analysis (LSA) [LAN 97] is not applicable in our setting: we rely on the capability of LDA to account for as yet unseen documents, whereas LSA assumes a fixed vocabulary, and it is not trivial to fold new documents into LSA’s fixed document space.
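The mixture of topical components described above can be written out in a few lines. The sketch below assumes that the two retained distributions are already available as plain Python structures. The function name, the two-topic toy model and its probabilities are hypothetical and serve only to make the computation concrete.

```python
def word_given_history(word, p_w_given_z, p_z_given_d):
    """p(w | d): sum over all topics z of p(w | z) * p(z | d)."""
    return sum(
        topic_words.get(word, 0.0) * p_z
        for topic_words, p_z in zip(p_w_given_z, p_z_given_d)
    )

# Hypothetical toy model with two topics over a three-word vocabulary.
p_w_given_z = [
    {"goal": 0.5, "match": 0.4, "vote": 0.1},  # topic 0: sports-like
    {"goal": 0.1, "match": 0.1, "vote": 0.8},  # topic 1: politics-like
]
# Topic distribution inferred for the history d of the current sentence.
p_z_given_d = [0.75, 0.25]

p_goal = word_given_history("goal", p_w_given_z, p_z_given_d)
```

With these toy values, the word "goal" receives 0.5 * 0.75 + 0.1 * 0.25 = 0.4. Restricting the sum to the three most probable topics would correspond to the earlier top-three approach mentioned above.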
While Jeff Elman’s [ELM 90] seminal work suggested early on that semantic and also syntactic structure automatically emerges from a set of simple recurrent units, such an approach received little attention in language modeling for a long time, but is currently of interest to many computational studies. In brief, such Neural Network Language Models are based on optimizing the probability of the occurrence of a word, given its history, using neural units linking back to themselves, much like the neurons in the CA3 region of the human hippocampus [MAR 71, NOR 03]. The task of language modeling using neural networks was first introduced by Bengio et al. [BEN 03] and received only little attention at that point because of computational challenges regarding space and time complexity. Due to recent advances in the field of neural networks (for an overview, see [MIK 12]), neural language models have gained more popularity, particularly because of the so-called neural word embeddings they yield as a side product. The language model implementation we use in this work is a recurrent neural network architecture similar to the one used by Mikolov’s Word2Vec toolkit [MIK 13]. We trained a model with 400 hidden layers and hierarchical softmax. For testing, we used the complete history of a sentence up to the current word.
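To make the idea of units linking back to themselves concrete, the following sketch shows the forward pass of a simple recurrent language model that scores each word given the complete sentence history so far. It is not the implementation used in this work: the weights are random and untrained, a plain softmax replaces the hierarchical softmax, tanh is assumed as the activation function, and the class name, vocabulary and layer size are hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class SimpleRecurrentLM:
    """Untrained Elman-style network; illustrates the forward pass only."""

    def __init__(self, vocab, hidden_size, seed=0):
        rng = random.Random(seed)
        self.index = {w: i for i, w in enumerate(vocab)}
        self.hidden_size = hidden_size
        self.w_in = random_matrix(hidden_size, len(vocab), rng)     # input -> hidden
        self.w_rec = random_matrix(hidden_size, hidden_size, rng)   # hidden -> hidden
        self.w_out = random_matrix(len(vocab), hidden_size, rng)    # hidden -> output

    def sentence_log_probs(self, tokens):
        """Return log p(w_i | w_1 .. w_{i-1}) for each word of a sentence."""
        hidden = [0.0] * self.hidden_size  # empty history at sentence start
        log_probs = []
        for token in tokens:
            probs = softmax(matvec(self.w_out, hidden))
            log_probs.append(math.log(probs[self.index[token]]))
            # Fold the current word into the history: the input is one-hot,
            # so its contribution is a single column of w_in.
            col = self.index[token]
            recurrent = matvec(self.w_rec, hidden)
            hidden = [
                math.tanh(self.w_in[r][col] + recurrent[r])
                for r in range(self.hidden_size)
            ]
        return log_probs

vocab = ["the", "cat", "sat", "mat"]
model = SimpleRecurrentLM(vocab, hidden_size=8)
scores = model.sentence_log_probs(["the", "cat", "sat"])
```

Because the hidden state is carried from word to word, each prediction depends on the entire sentence prefix rather than on a fixed window of n-1 words. With an empty history and untrained weights, the first word simply receives the uniform probability 1/|V|.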