AN EXAMPLE APPLICATION

Let’s illustrate the use of DNNs in an NLP task, with some recent work on semantic paraphrase. Bizzoni and Lappin (2017) construct a composite neural network to classify sets of sentences for paraphrase proximity. They develop a corpus of 250 sets of five sentences, where each set contains a reference sentence and four paraphrase candidates. They rate each of the four candidates on a five-point scale for paraphrase proximity to the reference sentence. Every group of five sentences illustrates (possibly different) graduated degrees of the paraphrase relation to the reference sentence. Their rating labels correspond to the following categories: (1) two sentences are completely unrelated; (2) two sentences are semantically related, but they are not paraphrases; (3) two sentences are weak paraphrases; (4) two sentences are strong paraphrases; (5) two sentences are (type) identical.

Here are two examples of these rating labels.

  • Reference Sentence: A woman feeds a cat.
    — A woman kicks a cat. Score: 2
    — A person feeds an animal. Score: 3
    — A woman is feeding a cat. Score: 4
    — A woman feeds a cat. Score: 5
  • Reference Sentence: I have a black hat.
    — Larry teaches plants to grow. Score: 1
    — I have a red hat. Score: 2
    — My hat is night black; pitch black. Score: 3
    — My hat’s color is black. Score: 4

This annotation scheme sustains graded semantic similarity assessment, while also allowing for binary classification of a pair of sentences, scored independently of the other pairs in the reference set. The scores of two paraphrase candidates represent relative proximity to the reference sentence.
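To make this dual use of the annotation concrete, here is a minimal Python sketch, using a few of the pairs from the examples above, of how the graded 1-5 scores also yield binary paraphrase labels. The threshold of 2, above which a pair counts as a paraphrase, is the one Bizzoni and Lappin apply for binary classification; the data structure itself is simply an illustrative choice.

```python
# Minimal sketch: scored sentence pairs and the binary labels derived from
# the graded 1-5 annotation. Pairs are taken from the examples above; the
# threshold of 2 follows Bizzoni and Lappin (2017).

scored_pairs = [
    ("A woman feeds a cat.", "A woman kicks a cat.", 2),
    ("A woman feeds a cat.", "A person feeds an animal.", 3),
    ("A woman feeds a cat.", "A woman is feeding a cat.", 4),
    ("A woman feeds a cat.", "A woman feeds a cat.", 5),
    ("I have a black hat.", "Larry teaches plants to grow.", 1),
]

PARAPHRASE_THRESHOLD = 2  # scores above 2 count as paraphrases

def binary_label(score: int) -> int:
    """Map a graded 1-5 score to a binary paraphrase label."""
    return int(score > PARAPHRASE_THRESHOLD)

for reference, candidate, score in scored_pairs:
    print(f"score {score} -> binary {binary_label(score)}: {candidate}")
```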

Bizzoni and Lappin (2017) train their classifier DNN for both binary and gradient classification of pairs of sentences for paraphrase. They train it on 701 pairs of sentences from the corpus, and they test it on 239 pairs.

The paraphrase classifier consists of three main components:

  • (i) two encoders, one for each of the sentences in a reference sentence-candidate pair, each consisting of a CNN, a max pooling layer, and an LSTM,
  • (ii) a merge layer that concatenates the sentence vectors which the encoders produce into a single vector, and
  • (iii) several dense, fully connected layers that apply sigmoid functions to generate a softmax distribution for the paraphrase classification relation between the two input sentences.

The DNN uses the pre-trained lexical embeddings of Word2Vec (Mikolov et al., 2013). The CNN of the encoder identifies relevant features of an input sentence for the classification task. The max pooling layer reduces the dimensions of the vector that the CNN generates. The LSTM uses the sequential structure of the sentence vector to highlight features needed for the task, and to further reduce the dimensionality of the input vector. The LSTM produces a vector that is passed to two fully connected layers, the first one with a 0.5 dropout rate: output from half the neurons of this layer, randomly selected, is discarded during training to avoid overfitting. The structure of the paraphrase encoder is shown in Fig. 1.11, and the composite classifier DNN is displayed in Fig. 1.12.

Figure 1.11 Paraphrase Encoder. From Bizzoni and Lappin (2017).
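As a rough illustration of this encoder pipeline, the following Keras sketch stacks an embedding layer, a CNN, max pooling, an LSTM, and two fully connected layers with a 0.5 dropout rate on the first. All layer sizes, filter widths, activations, and sequence dimensions are placeholder values rather than the hyperparameters of Bizzoni and Lappin (2017), and the embedding layer is randomly initialised here, whereas their model uses pre-trained Word2Vec vectors.

```python
# Minimal sketch of a sentence encoder in the style of Figure 1.11:
# embedding -> CNN -> max pooling -> LSTM -> two dense layers with dropout.
# All sizes below are illustrative placeholders, not the hyperparameters
# of Bizzoni and Lappin (2017).
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10000  # placeholder vocabulary size
MAX_LEN = 30        # placeholder maximum sentence length, in tokens
EMBED_DIM = 300     # dimensionality of Word2Vec embeddings

def build_encoder(name: str = "paraphrase_encoder") -> tf.keras.Model:
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    # In the paper this layer holds pre-trained Word2Vec embeddings; it is
    # randomly initialised here so that the sketch runs on its own.
    embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)
    # CNN: pick out features of the input sentence relevant to the task.
    features = layers.Conv1D(filters=64, kernel_size=3, activation="relu")(embedded)
    # Max pooling: reduce the dimensions of the CNN's feature maps.
    pooled = layers.MaxPooling1D(pool_size=2)(features)
    # LSTM: use the sequential structure to produce a compact sentence vector.
    sentence_vec = layers.LSTM(64)(pooled)
    # Two fully connected layers, with a 0.5 dropout rate on the first.
    hidden = layers.Dense(64, activation="sigmoid")(sentence_vec)
    dropped = layers.Dropout(0.5)(hidden)
    encoded = layers.Dense(32, activation="sigmoid")(dropped)
    return tf.keras.Model(tokens, encoded, name=name)

encoder = build_encoder()
encoder.summary()
```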


Figure 1.12 Paraphrase Classifier. From Bizzoni and Lappin (2017).
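In the same spirit, here is a rough sketch of the composite classifier of Figure 1.12: two encoders (compressed here to embedding, CNN, pooling, and LSTM, so that the block stands alone), a merge layer that concatenates their sentence vectors, and dense sigmoid layers feeding a softmax over the five paraphrase scores. The layer sizes, the five-way output head, and the loss are assumptions for illustration, not the original implementation.

```python
# Minimal sketch of the composite classifier in the style of Figure 1.12:
# two sentence encoders, a merge layer that concatenates their outputs, and
# dense layers feeding a softmax over the five paraphrase scores. All sizes
# and activations are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 10000, 30, 300  # placeholder dimensions

def build_encoder(name: str) -> tf.keras.Model:
    # Compact stand-in for the encoder of Figure 1.11 (CNN -> pooling -> LSTM).
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)
    x = layers.Conv1D(64, 3, activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.LSTM(64)(x)
    return tf.keras.Model(tokens, x, name=name)

reference = layers.Input(shape=(MAX_LEN,), dtype="int32", name="reference")
candidate = layers.Input(shape=(MAX_LEN,), dtype="int32", name="candidate")

# (i) one encoder per sentence in the reference-candidate pair
ref_vec = build_encoder("reference_encoder")(reference)
cand_vec = build_encoder("candidate_encoder")(candidate)

# (ii) merge layer: concatenate the two sentence vectors
merged = layers.Concatenate()([ref_vec, cand_vec])

# (iii) dense sigmoid layers and a softmax over the five paraphrase scores
h = layers.Dense(64, activation="sigmoid")(merged)
h = layers.Dense(32, activation="sigmoid")(h)
scores = layers.Dense(5, activation="softmax")(h)

classifier = tf.keras.Model([reference, candidate], scores)
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
classifier.summary()
```

Training such a model on integer score labels (shifted to the range 0-4) is one way to realise the gradient classifier; a two-way output head over the same merged representation would give a binary variant.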

Bizzoni and Lappin (2017) assess the accuracy of both binary and gradient classification on the basis of their annotation of the test set sentence pairs on a five-point scale. The binary classifier takes a softmax prediction of a score above a threshold of 2 as an instance of paraphrase. The gradient classifier predicts a paraphrase score from the scale through the softmax probability distribution over this relation. They use the Pearson coefficient to evaluate the correlation between the classifier’s scores and the ground truth annotations.[1]
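The two evaluation modes can be sketched as follows, under the assumption that the classifier emits a softmax distribution over the five scores as above. The predictions below are invented numbers; reading the graded prediction off the distribution with argmax is one natural choice (an expected score over the distribution would be an alternative), and the binary decision applies the threshold of 2.

```python
# Sketch of the binary and gradient evaluation modes, with invented outputs.
import numpy as np
from scipy.stats import pearsonr

# Softmax distributions over the scores 1-5 for three test pairs (made up).
softmax_out = np.array([
    [0.05, 0.10, 0.15, 0.50, 0.20],
    [0.60, 0.25, 0.10, 0.03, 0.02],
    [0.05, 0.15, 0.55, 0.15, 0.10],
])
gold_scores = np.array([4, 1, 2])  # annotated scores for the same pairs

# Gradient classification: take a predicted score from the softmax
# distribution over the five score classes (argmax, mapped back to 1-5).
pred_scores = softmax_out.argmax(axis=1) + 1

# Binary classification: a predicted score above 2 counts as a paraphrase.
pred_binary = pred_scores > 2
gold_binary = gold_scores > 2
binary_accuracy = (pred_binary == gold_binary).mean()

# Pearson correlation between predicted and gold graded scores.
r, _ = pearsonr(pred_scores, gold_scores)

print(f"binary accuracy: {binary_accuracy:.2f}, Pearson r: {r:.2f}")
```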

TABLE 1.1 Binary Accuracy and Gradient Correlation for the Paraphrase Classifier

Fold (k)   Binary Accuracy (%)   Pearson Correlation
1          70.10                 0.51
2          67.01                 0.63
3          79.38                 0.59
4          73.20                 0.62
5          67.01                 0.61
6          72.92                 0.72
7          66.67                 0.59
8          75.79                 0.67
9          64.21                 0.54
10         73.68                 0.67
Average    71                    0.61

They apply ten-fold cross-validation to test the robustness of accuracy and correlation. This involves successively partitioning the corpus into training and test components over ten different splits, in order to ensure the robustness of the test results. The accuracy and correlation scores for the paraphrase classifier are given in Table 1.1. Given the small size of the corpus, these are encouraging results for the classifier.
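The partitioning procedure itself can be sketched with scikit-learn's KFold, as below. The corpus size is taken from the reported 701 + 239 split, and train_and_score is a placeholder stub standing in for training and evaluating the paraphrase classifier on each of the ten splits.

```python
# Sketch of ten-fold cross-validation over the sentence-pair corpus.
# The corpus and the train/score step are placeholders: in the actual
# experiments the paraphrase classifier is trained and evaluated on each
# of the ten splits, and accuracy / Pearson r are averaged.
import numpy as np
from sklearn.model_selection import KFold

n_pairs = 701 + 239          # pairs in the reported train/test split,
pair_indices = np.arange(n_pairs)  # treated here as the full corpus

rng = np.random.default_rng(0)

def train_and_score(train_idx, test_idx):
    """Placeholder stub: train the classifier on the pairs indexed by
    train_idx, evaluate on test_idx, and return (accuracy, Pearson r)."""
    return rng.uniform(0.6, 0.8), rng.uniform(0.5, 0.7)

accuracies, correlations = [], []
splitter = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(splitter.split(pair_indices), 1):
    acc, r = train_and_score(train_idx, test_idx)
    accuracies.append(acc)
    correlations.append(r)
    print(f"fold {fold}: accuracy={acc:.2f}, Pearson r={r:.2f}")

print(f"average accuracy={np.mean(accuracies):.2f}, "
      f"average Pearson r={np.mean(correlations):.2f}")
```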

SUMMARY AND CONCLUSIONS

In this chapter we have briefly looked at the main types of DNN that are currently driving deep learning applications. We started with generic feed forward networks, and considered how back propagation, with gradient descent as an error reduction procedure, is used to train DNNs against a ground truth standard. We have seen how RNNs, enriched with long short-term memory through filtering and update functions, recognise patterns in sequences. CNNs use convolution and max pooling to progressively reduce the dimensions of feature vectors, and to produce successively more abstract feature maps, corresponding to higher level properties of input data.

We observed that an attention layer was introduced into Seq2Seq systems to allow an LSTM encoder to track and adjust the associations among the elements of its context vectors and the components of the output vector of the decoder LSTM. Transformers dispense with recurrent processing entirely, and rely on multi-head attention units to drive their feed forward mechanisms. This permits them to achieve greater accuracy in learning, but it requires much larger-scale pre-training on corpora for lexical embeddings. Transformers are converging on task-general architectures that can be applied across a variety of applications, through either few-shot learning (GPT-3) or fine-tuning (BERT), for specific tasks.

We then considered Bizzoni and Lappin's (2017) composite DL architecture for paraphrase assessment. This system combines LSTMs and CNNs in one encoder, and two such encoders in a classifier. It illustrates the use of different DNN elements within a single processing system in order to perform a challenging semantic task.

DNNs have become increasingly powerful through the use of multi-head attention and large-scale pre-trained embeddings. This has facilitated transfer learning, where a DNN trained for one task can be easily adapted to others with the addition of fine-tuning layers. Through attention-driven architecture and pre-trained embeddings, DNNs have come closer to domain general learning procedures that achieve a high level of performance across several domains, with limited task-specific training.

Over the past 15 years DNNs have moved from a niche technique of machine learning to the leading framework for work in AI. DL has achieved rapid progress across a wide range of AI tasks, approaching, and in some cases surpassing, human performance on these tasks. It has generated significant advances in several areas of NLP in which more traditional, symbolic methods have not yielded robust wide-coverage systems after many years of work. These results are of cognitive interest to the extent that they show how it is, in principle, possible to effectively acquire certain types of linguistic knowledge through largely domain general learning devices.

[1] The Pearson correlation coefficient (also known as Pearson’s r) is a statistical measure of the linear correlation between two random variables X and Y. A positive value indicates the degree of positive correlation and a negative value, the extent of a negative correlation. It is specified by the formula

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \, \sigma_Y}$$

where $\rho$ is the Pearson correlation between X and Y, $\mathrm{cov}$ is their covariance, and $\sigma_X$ and $\sigma_Y$ are their respective standard deviations. An alternative measure, the Spearman correlation, assesses the correlation between the ranked values of X and Y. These two statistical metrics may yield distinct correlation values for the same pairs of random variables. However, in all of my joint work on DNNs predicting mean human judgements which I discuss here, we have found that Pearson and Spearman correlations converge on approximately the same values. Therefore, we report only Pearson correlations.