PREDICTING ACCEPTABILITY WITH DIFFERENT DNN MODELS

In addition to lstm and tdlm, Lau et al. (2020) experiment with three transformer language models: gpt2, bert, and xlnet (Yang et al., 2019). These models are equipped with large pre-trained lexical embeddings, and they apply multiple self-attention heads to all input words. As we noted in previous chapters, bert processes input strings without regard to sequence, in a massively parallel way, which permits it to efficiently identify large numbers of co-occurrence dependency patterns among the words of a string.

lstm and gpt2 are unidirectional, and so they can be used to compute the probability of a sentence left to right, according to the formula

P(s) = \prod_{i=1}^{|s|} P(w_i \mid w_1, \ldots, w_{i-1})
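To make the unidirectional case concrete, here is a minimal sketch of computing this left-to-right log probability with an off-the-shelf causal language model through the Hugging Face transformers library. It illustrates the formula only; it is not the configuration that Lau et al. (2020) actually used.

```python
# Sketch: left-to-right sentence log probability with a causal LM
# (illustrative only; not the exact setup of Lau et al. (2020)).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Return sum_i log P(w_i | w_1 ... w_{i-1}) over the sentence's tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean cross-entropy
        # over the ids.size(1) - 1 predicted (shifted) tokens.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

print(sentence_logprob("The cat sat on the mat."))
```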

Figure 4.9 Lau’s TLS regression for no context vs real context ratings.

bert is bidirectional, and it predicts each word from both its left and its right context. It requires the formula

\hat{P}(s) = \prod_{i=1}^{|s|} P(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{|s|})

This equation does not yield true probabilities, as its values do not sum to 1 (normalising them into genuine probabilities is intractable). Instead, the values provide confidence scores of likelihood. xlnet can be applied either unidirectionally or bidirectionally.
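To make the bidirectional case concrete, the following sketch computes a pseudo-log-likelihood with a masked language model, masking one position at a time. The model name and the per-token masking loop are assumptions for illustration; this is not Lau et al. (2020)'s implementation.

```python
# Sketch: pseudo-log-likelihood from a masked LM, masking one position at a time.
# Illustrative only; the resulting values are confidence scores, not probabilities.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def pseudo_logprob(sentence: str) -> float:
    """Return sum_i log P(w_i | left and right context), one mask per position."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

print(pseudo_logprob("The cat sat on the mat."))
```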

Figure 4.10 Lau’s TLS regression for random context vs no context ratings.

Table 4.2 gives the details of the models that Lau et al. (2020) test on their annotated context and non-context sets.

They use Lau et al. (2015) and Lau, Clark, and Lappin (2017)’s three scoring functions, and they add an additional one, PenLP, which incorporates a length penalty. Its parameter α = 0.8 is set experimentally, on the basis of work in machine translation. Lau et al. (2020)’s scoring functions are shown in Table 4.3.

Lau et al. (2020) compute two human performance estimates to serve as upper bounds on the accuracy of a model. The first, ub1, is Lau et al. (2015) and Lau, Clark, and Lappin (2017)’s one-vs-rest annotator correlation, discussed in Chapter 3. They select a random annotator’s rating and compare it to the mean rating of the rest, using Pearson’s r. They repeat this procedure for 1000 trials to get a robust estimate of the mean correlation. ub2 is a half-vs-half annotator correlation, where for

TABLE 4.2 Model Architectures, Parameters, Size, and Training Corpora

| Model   | Architecture | Encoding | #Param. | Casing  | Size  | Tokenisation  | Corpora                                             |
| lstm    | RNN          | Unidir.  | 60M     | Uncased | 0.2GB | Word          | Wikipedia                                           |
| tdlm    | RNN          | Unidir.  | 80M     | Uncased | 0.2GB | Word          | Wikipedia                                           |
| gpt2    | Transformer  | Unidir.  | 340M    | Cased   | 40GB  | BPE           | WebText                                             |
| bertcs  | Transformer  | Bidir.   | 340M    | Cased   | 13GB  | WordPiece     | Wikipedia, BookCorpus                               |
| bertucs | Transformer  | Bidir.   | 340M    | Uncased | 13GB  | WordPiece     | Wikipedia, BookCorpus                               |
| xlnet   | Transformer  | Hybrid   | 340M    | Cased   | 120GB | SentencePiece | Wikipedia, BookCorpus, Giga5, ClueWeb, Common Crawl |

TABLE 4.3 Lau et al. (2020)’s Sentence Acceptability Scoring Functions

| Scoring Function | Equation                                                   |
| LogProb          | \log P(s)                                                  |
| Mean LP          | \log P(s) / \lvert s \rvert                                |
| PenLP            | \log P(s) / ((5 + \lvert s \rvert) / (5 + 1))^{\alpha}     |
| NormLP           | -\log P(s) / \log P_u(s)                                   |
| SLOR             | (\log P(s) - \log P_u(s)) / \lvert s \rvert                |

P(s) is the sentence probability, computed using either the uni-prob or the bi-prob formula, depending on the model; P_u(s) is the sentence probability estimated by a unigram language model; and α = 0.8.
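A minimal sketch of these scoring functions, assuming that log P(s), log P_u(s), and the sentence length |s| have already been computed; the function names are mine, and the bodies follow the definitions in Table 4.3.

```python
ALPHA = 0.8  # length-penalty exponent reported by Lau et al. (2020)

def log_prob(logp: float) -> float:
    return logp

def mean_lp(logp: float, length: int) -> float:
    return logp / length

def pen_lp(logp: float, length: int, alpha: float = ALPHA) -> float:
    # MT-style length penalty: ((5 + |s|) / (5 + 1)) ** alpha
    return logp / (((5 + length) / 6) ** alpha)

def norm_lp(logp: float, logp_unigram: float) -> float:
    # normalise by the unigram log probability (both values are negative)
    return -logp / logp_unigram

def slor(logp: float, logp_unigram: float, length: int) -> float:
    return (logp - logp_unigram) / length
```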

Figure 4.11 Lau’s TLS regression for no context vs random context ratings.

each sentence they randomly split the annotators into two groups, and compare the mean ratings between the groups.
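The following sketch reconstructs the two upper bound estimates for a ratings matrix with one row per sentence and one column per annotator. The exact sampling scheme (for instance, whether a single annotator is drawn per sentence or per trial) is an assumption made for the example.

```python
# Sketch of the simulated human upper bounds ub1 (one-vs-rest) and ub2
# (half-vs-half). `ratings` is a (num_sentences x num_annotators) array.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def ub1(ratings: np.ndarray, trials: int = 1000) -> float:
    corrs = []
    for _ in range(trials):
        singles, rests = [], []
        for row in ratings:
            a = rng.integers(len(row))           # one random annotator per sentence
            singles.append(row[a])
            rests.append(np.delete(row, a).mean())
        corrs.append(pearsonr(singles, rests)[0])
    return float(np.mean(corrs))

def ub2(ratings: np.ndarray) -> float:
    half_a, half_b = [], []
    for row in ratings:
        perm = rng.permutation(len(row))         # random split into two groups
        mid = len(row) // 2
        half_a.append(row[perm[:mid]].mean())
        half_b.append(row[perm[mid:]].mean())
    return float(pearsonr(half_a, half_b)[0])
```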

Lau et al. (2020) present model performance for the annotation sets in which outlier ratings (more than two standard deviations from the mean) have been removed. This filtering does not significantly affect the model accuracy scores, but it does increase the simulated human upper bound correlations. For completeness, they present the upper bound correlations for both the outlier-filtered (ub1, ub2) and the outlier-unfiltered test sets.
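For concreteness, one way the outlier filtering could be implemented is sketched below; treating the two standard deviation threshold as relative to each sentence's own mean rating is an assumption.

```python
# Sketch: drop per-sentence ratings more than 2 standard deviations from the mean.
import numpy as np

def filter_outliers(per_sentence_ratings):
    """per_sentence_ratings: list of 1-D arrays of annotator ratings."""
    filtered = []
    for r in per_sentence_ratings:
        mu, sd = r.mean(), r.std()
        filtered.append(r[np.abs(r - mu) <= 2 * sd] if sd > 0 else r)
    return filtered
```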

The results of their model prediction experiments are given in Tables 4.4-4.6, for null (h0), real (h+), and random (h-) contexts, respectively. M+ indicates that the model M was tested with context input, and M0 that M was tested without context input at test time. bert is subscripted to show whether it was tested on text with or without uppercase spelling preserved, and xlnet is indexed for unidirectional or bidirectional design.
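The distinction between M+ and M0 can be made concrete by extending the earlier left-to-right sketch: the context is prepended to the model input, but only the sentence's own tokens are scored. This is an illustration of the idea, not Lau et al. (2020)'s code.

```python
# Sketch: score a sentence with (M+) or without (M0) a document context prefix.
# Only the sentence tokens contribute to the score.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def logprob_in_context(sentence: str, context: str = "") -> float:
    ctx_ids = tokenizer(context).input_ids if context else []
    sent_ids = tokenizer(sentence).input_ids
    ids = torch.tensor([ctx_ids + sent_ids])
    with torch.no_grad():
        logps = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    total = 0.0
    # start at the first sentence token (or the second token when there is no
    # context, since the very first token has no prediction)
    for pos in range(max(len(ctx_ids), 1), ids.size(1)):
        # logps[pos - 1] is the distribution over the token at position pos
        total += logps[pos - 1, ids[0, pos]].item()
    return total

m0 = logprob_in_context("The cat sat on the mat.")
m_plus = logprob_in_context("The cat sat on the mat.", context="My aunt has a cat.")
```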

TABLE 4.4 Lau et al. (2020)’s Model Performance for Null Contexts

| Rtg | Encod. | Model | LogProb | Mean LP | PenLP | NormLP | SLOR |
|     |        |       | 0.29    | 0.42    | 0.42  | 0.52   | 0.53 |
|     |        |       | 0.30    | 0.49    | 0.45  | 0.61   | 0.63 |
|     |        |       | 0.30    | 0.49    | 0.45  | 0.60   | 0.61 |
|     |        |       | 0.30    | 0.50    | 0.45  | 0.59   | 0.60 |
|     |        |       | 0.33    | 0.34    | 0.56  | 0.38   | 0.38 |
|     |        |       | 0.38    | 0.59    | 0.58  | 0.63   | 0.60 |
|     |        |       | 0.31    | 0.42    | 0.51  | 0.51   | 0.52 |
|     |        |       | 0.36    | 0.56    | 0.55  | 0.61   | 0.61 |
|     |        |       | 0.51    | 0.54    | 0.63  | 0.55   | 0.53 |
|     |        |       | 0.53    | 0.63    | 0.67  | 0.64   | 0.60 |
|     |        |       | 0.59    | 0.63    | 0.70  | 0.63   | 0.60 |
|     |        |       | 0.60    | 0.68    | 0.72  | 0.67   | 0.63 |
|     |        |       | 0.52    | 0.51    | 0.66  | 0.53   | 0.53 |
|     |        |       | 0.57    | 0.65    | 0.73  | 0.66   | 0.65 |

ub1 (outlier filtered / unfiltered): 0.75 / 0.66
ub2 (outlier filtered / unfiltered): 0.92 / 0.88

TABLE 4.5 Lau et al. (2020)’s Model Performance for Real Contexts

| Rtg | Encod. | Model | LogProb | Mean LP | PenLP | NormLP | SLOR |
|     |        |       | 0.29    | 0.44    | 0.43  | 0.52   | 0.52 |
|     |        |       | 0.31    | 0.51    | 0.46  | 0.62   | 0.62 |
|     |        |       | 0.30    | 0.50    | 0.45  | 0.59   | 0.59 |
|     |        |       | 0.30    | 0.50    | 0.46  | 0.58   | 0.58 |
|     |        |       | 0.32    | 0.33    | 0.56  | 0.36   | 0.37 |
|     |        |       | 0.38    | 0.60    | 0.59  | 0.63   | 0.60 |
|     |        |       | 0.30    | 0.42    | 0.50  | 0.49   | 0.51 |
|     |        |       | 0.35    | 0.56    | 0.55  | 0.60   | 0.61 |
|     |        |       | 0.49    | 0.53    | 0.62  | 0.54   | 0.51 |
|     |        |       | 0.52    | 0.63    | 0.66  | 0.63   | 0.58 |
|     |        |       | 0.58    | 0.63    | 0.70  | 0.63   | 0.60 |
|     |        |       | 0.60    | 0.68    | 0.73  | 0.67   | 0.63 |
|     |        |       | 0.51    | 0.50    | 0.65  | 0.52   | 0.53 |
|     |        |       | 0.57    | 0.65    | 0.74  | 0.65   | 0.65 |

ub1 (outlier filtered / unfiltered): 0.73 / 0.66
ub2 (outlier filtered / unfiltered): 0.92 / 0.89

TABLE 4.6 Lau et al. (2020)’s Model Performance for Random Contexts

| Rtg | Encod. | Model | LogProb | Mean LP | PenLP | NormLP | SLOR |
|     |        |       | 0.28    | 0.44    | 0.43  | 0.50   | 0.50 |
|     |        |       | 0.27    | 0.41    | 0.40  | 0.47   | 0.47 |
|     |        |       | 0.29    | 0.52    | 0.46  | 0.59   | 0.58 |
|     |        |       | 0.28    | 0.49    | 0.44  | 0.56   | 0.55 |
|     |        |       | 0.32    | 0.34    | 0.55  | 0.35   | 0.35 |
|     |        |       | 0.30    | 0.42    | 0.51  | 0.44   | 0.41 |
|     |        |       | 0.30    | 0.44    | 0.51  | 0.49   | 0.49 |
|     |        |       | 0.29    | 0.40    | 0.49  | 0.46   | 0.46 |
|     |        |       | 0.48    | 0.53    | 0.62  | 0.53   | 0.49 |
|     |        |       | 0.49    | 0.52    | 0.61  | 0.51   | 0.47 |
|     |        |       | 0.56    | 0.61    | 0.68  | 0.60   | 0.56 |
|     |        |       | 0.56    | 0.58    | 0.66  | 0.57   | 0.53 |
|     |        |       | 0.49    | 0.48    | 0.62  | 0.49   | 0.48 |
|     |        |       | 0.50    | 0.51    | 0.64  | 0.51   | 0.50 |

ub1 (outlier filtered / unfiltered): 0.75 / 0.68
ub2 (outlier filtered / unfiltered): 0.92 / 0.88

The bidirectional models significantly outperform the unidirectional models across all three context types when PenLP, rather than SLOR, is the scoring function. This suggests that large lexical embeddings and bidirectional context training render normalisation by word frequency unnecessary. Model architecture, rather than size, is the decisive factor governing performance. bert and xlnet approach estimated individual human performance, as specified by ub1, on the sentence acceptability prediction task for the three context sets. They surpass the unfiltered ub1 estimate on the null and real context sets.

One might suggest that round-trip MT introduces a systematic bias into the types of infelicities that appear in the Lau et al. (2020) test sets, which could influence the performance of their models. To control for such a possible bias, they test the transformers, with PenLP, on the test set of AMT-annotated Adger examples discussed in Chapter 3. The three transformer scores, with PenLP, are gpt2 = 0.45, bertcs = 0.53, and xlnetbi = 0.58. While these scores are lower than those for the round-trip MT test sets, they indicate a strong correlation with human judgements. It is important to note that they are achieved for an out-of-domain task. The models are trained on naturally occurring text, but they are tested on artificially constructed examples. As we observed in Chapter 3, the linguists’ examples are, in general, much shorter than the sentences in the models’ training corpora. In fact they are, on average, less than seven words long. The difference between the training and test corpora is a significant factor in determining a model’s performance on the sentence acceptability task in this case.

SUMMARY AND CONCLUSIONS

In this chapter, I have looked at recent work on the sentence acceptability task in which sentences are crowdsource-annotated both out of context and embedded in different types of document contexts. The first set of experiments compared null to real document contexts, and they tested two types of LSTM LM on the prediction task. One is a simple LSTM, while the other incorporated a topic model. The latter conditioned the prediction of a word in a sequence on both the topic of the sentence and the preceding words. The topic-model-enhanced LSTM outperformed the simple LSTM, and the addition of a document context prefix to the test sentence input improved the correlations for both types of LSTM.

Linear regression on the two annotation sets revealed a puzzling compression effect, in which ratings for sentences assessed in context are raised at the bottom end of the scale, but lowered at the higher extreme. This effect was also observed in the unrelated task of rating paraphrase candidates in and out of context. Predicting acceptability for mean in-context ratings is more difficult than for the out-of-context case. This seems to be due to the fact that the judgements are pushed closer together, towards the centre of the rating scale, rendering them less separable.

The second set of experiments that I discussed annotated sentences for null, real, and random contexts, providing three distinct test sets. Detailed analysis of these datasets using both linear and total least squares regression shows that the compression effect observed in the earlier work is a real property of the data. Testing the datasets for the statistical significance of this effect indicates that both cognitive load and discourse coherence are involved in the in-context ratings. Processing context information induces a cognitive load for humans, which creates a compression effect on the distribution of acceptability ratings. This effect is present in both the real and the random context sets. If the context is relevant to the sentence, a discourse coherence effect uniformly boosts sentence acceptability. This factor is present only in the real context set.

The second set of experiments tested a variety of DNN LMs, which included the lstm and tdlm used in the first experiment, and three transformer models: gpt2, bert, and xlnet. The roles of case in spelling and of document context input at test time were also considered. The bidirectional transformers outperformed the unidirectional models on the sentence acceptability prediction task. The best bidirectional models approached estimated individual human performance on this task. These models did almost as well for real context ratings as for null context judgements. Random contexts reduced the models’ performance more significantly, but even in this case the level of correlation with mean human ratings was robust.

The transformer models were tested on AMT-annotated Adger sentences to control for the possibility of MT-induced bias. While their scores for this set were lower than for the annotated round-trip MT Wikipedia test sets, they remained robust and significant, particularly in view of the fact that this was a strongly out-of-domain experiment. Bidirectional transformers offer promising models for performing complex NLP tasks that require substantial amounts of syntactic and semantic knowledge.

 