Machine Learning and the Sentence Acceptability Task
GRADIENCE IN SENTENCE ACCEPTABILITY
The use of neural language models in NLP tasks that require syntactic knowledge raises the question of the relationship between grammaticality and probability. A. Clark and Lappin (2011) and Lau, Clark, and Lappin (2017) show that grammaticality cannot be directly reduced to probability. Such a reduction requires specifying a threshold probability value κ such that only sentences with a probability ≥ κ are grammatical. Any such threshold entails that the set of grammatical sentences is finite, which is not the case for any natural language: because the probabilities that a model assigns sum to at most 1, no more than ⌊1/κ⌋ sentences can reach the threshold. Assume, for example, that given a language model M for a set of sentences S (finite or infinite), only sentences in S with probability 0.1 or higher in M are grammatical. The probability distribution that M generates allows only for a finite subset S′ ⊆ S of sentences such that, for any s′ ∈ S′, P_M(s′) ≥ 0.1 (in this case |S′| ≤ 10). Clearly this result holds for any choice of M, S, and threshold probability value.
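The cardinality bound in this argument can be illustrated with a small sketch. Since the probabilities a model assigns sum to at most 1, no more than ⌊1/κ⌋ sentences can receive probability at or above κ. The toy distribution below uses invented probability values, not the output of any actual language model:

```python
import math

# Threshold probability: only sentences with P(s) >= KAPPA would count
# as grammatical under the (untenable) reduction discussed in the text.
KAPPA = 0.1

# A toy probability distribution over a sentence set (hypothetical values
# for illustration). The infinitely many remaining sentences of the
# language would share the residual probability mass, each receiving a
# value far below the threshold.
toy_distribution = [0.35, 0.2, 0.15, 0.1, 0.08, 0.05, 0.04, 0.02, 0.01]

above_threshold = [p for p in toy_distribution if p >= KAPPA]

# Because the probabilities sum to at most 1, at most floor(1/KAPPA)
# sentences can clear the threshold -- here, at most 10.
assert len(above_threshold) <= math.floor(1 / KAPPA)
print(len(above_threshold))  # 4 sentences reach the 0.1 threshold
```

Whatever threshold is chosen, the set of sentences clearing it remains finite, which is the core of the argument against the reduction.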
Grammaticality is a theoretical property, which is not directly accessible to observation. Speakers’ acceptability judgements can be observed and measured. These judgements provide the primary data for most linguistic theories. An adequate theory of syntactic knowledge must be able to account for the observed data of acceptability judgements. The experimental work that I discuss in this chapter, and the following one, measures and predicts speakers’ judgements on the acceptability of sentences. These are elicited through AMT crowd-sourcing experiments in which annotators rate sentences in an AMT Human Intelligence Task (HIT) for naturalness. Sentence acceptability provides evidence for the grammatical status of a sentence.
The same argument that A. Clark and Lappin (2011) and Lau, Clark, and Lappin (2017) use to show that grammaticality cannot be reduced directly to a probability value threshold applies to acceptability. However, it is possible to construct models in which probability distributions provide the basis for predicting relative acceptability, and through it, degree of grammaticality. I will explore these models in the next two chapters.
Lau, Clark, and Lappin (2014, 2015, 2017) (LCL) present extensive experimental evidence for gradience in human sentence acceptability judgements. They show that crowd-sourced judgements on round-trip machine translated sentences from the British National Corpus (BNC) exhibit both aggregate (mean) and individual gradience. They use Google Translate to map the sentences in their test sets into four target languages (Norwegian, Spanish, Chinese, and Japanese) and then back into English. The purpose of round-trip MT is to introduce a wide variety of infelicities into some of the sentences, ensuring variation in acceptability judgements across the examples of the set. Each HIT contains one original, non-translated sentence, which is used to control for annotator fluency. LCL test three modes of presentation for their HITs: binary, four categories of naturalness, and a sliding scale with 100 underlying points. They find a high Pearson correlation (>0.92) between the three modes of presentation, and so they adopt the four category HIT format for subsequent experiments. Fig 3.1 displays a four category sentence acceptability rating HIT.
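The agreement between presentation modes can be measured with a standard Pearson correlation over per-sentence mean ratings, each mode rescaled to a common interval. The sketch below uses invented mean ratings for two modes, not LCL's experimental data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical mean ratings for ten sentences under two presentation
# modes, each rescaled to [0, 1]: four category (ratings 1-4 mapped to
# 0-1) and slider (0-100 mapped to 0-1). Invented values for illustration.
four_cat = [0.00, 0.33, 0.33, 0.67, 0.67, 0.67, 1.00, 1.00, 0.33, 1.00]
slider = [0.05, 0.30, 0.40, 0.60, 0.70, 0.65, 0.95, 0.90, 0.25, 0.85]

r = pearson(four_cat, slider)
print(f"Pearson r = {r:.3f}")
```

A correlation near 1 between two modes indicates that they rank sentences in essentially the same way, which is what licenses settling on a single presentation format.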
Figs 3.2 and 3.3 give the histograms of mean acceptability ratings in four category and slider modes of presentation, respectively. Figs 3.4 and
3.5 show the histograms for individual four category and slider ratings, respectively.
Lau et al. (2015) and Lau, Clark, and Lappin (2017) also demonstrate that crowd-sourced judgements on linguists’ examples from Adger (2003), in which semantic/pragmatic anomaly has been filtered out, display the same sort of gradience for both mean and individual acceptability ratings. Figs 3.6 and 3.7 display the mean acceptability ratings (with four category presentation) for the good and the starred sentences
Figure 3.1 Four category AMT sentence acceptability HIT.
in Adger (2003), while Figs 3.8 and 3.9 show the individual ratings for these examples.
One might think that humans have a tendency to treat all classifiers as gradient. In fact, this is not the case. Lau et al. (2015) and Lau, Clark, and Lappin (2017) experiment with non-linguistic classifiers. They show that while AMT annotators judge body weight as gradient in drawings of human figures (Fig 3.10), their judgements of even vs odd natural numbers are sharply binary (Fig 3.11).
Figure 3.2 Four category mean acceptability ratings for BNC sentences.
Figure 3.3 Slider mean acceptability ratings for BNC sentences.
Figure 3.4 Four category individual acceptability ratings for BNC sentences.
Figure 3.5 Slider individual acceptability ratings for BNC sentences.
Figure 3.6 Mean acceptability ratings for good filtered Adger (2003) sentences.
Figure 3.7 Mean acceptability ratings for starred filtered Adger (2003) sentences.
Figure 3.8 Individual acceptability ratings for good filtered Adger (2003) sentences.