ELEMENTS OF DEEP LEARNING
DNNs learn an approximation of a function f(x) = y which maps input data x to an output value y. These values generally involve classifying an object relative to a set of possible categories or determining the conditional probability of an event, given a sequence of preceding occurrences. Deep Feed Forward Networks consist of
- (i) an input layer where data are entered,
- (ii) one or more hidden layers, in which units (neurons) compute the weights for components of the data, and
- (iii) an output layer that generates a value for the function.
DNNs can, in principle, approximate any function and, in particular, non-linear functions. Sigmoid functions are commonly used to determine the activation threshold for a neuron. The graph in Fig 1.1 represents a set of values that a sigmoid function specifies. The architecture of a feed forward network is shown in Fig 1.2.

Figure 1.1 Sigmoid function.

Figure 1.2 Feed Forward Network.
From Tushar Gupta, “Deep Learning: Feedforward Neural Network”, Towards Data Science, January 5, 2017.
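As a concrete illustration, here is a minimal numpy sketch of a forward pass through such a network, with one hidden layer and sigmoid activations throughout; the dimensions, random weights, and input values are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: squashes any real value into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W1, b1, W2, b2):
    # Hidden layer: each unit computes a weighted sum of the inputs,
    # passed through the sigmoid activation.
    h = sigmoid(W1 @ x + b1)
    # Output layer: generates the network's value for f(x).
    return sigmoid(W2 @ h + b2)

# Illustrative dimensions: 3 input features, 4 hidden units, 1 output value.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

print(feed_forward(np.array([0.5, -1.2, 3.0]), W1, b1, W2, b2))
```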
Training a DNN involves comparing its predicted function value to the ground truth of its training data. Its error rate is reduced in cycles (epochs) through back propagation. This process involves computing the gradient of a loss (error) function and proceeding down the slope, by specified increments, to an estimated optimal level, determined by stochastic gradient descent.
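The following is a minimal numpy sketch of this cycle for a single sigmoid unit trained by stochastic gradient descent; the toy data, learning rate, and number of epochs are illustrative, and the update uses the standard gradient of the cross-entropy loss discussed in the next paragraph.

```python
import numpy as np

def sgd_epoch(w, b, X, y, lr=0.1):
    # One training cycle (epoch): update the parameters after each example,
    # stepping down the slope of the loss by increments of size lr.
    for x_i, y_i in zip(X, y):
        pred = 1.0 / (1.0 + np.exp(-(w @ x_i + b)))  # sigmoid prediction
        error = pred - y_i                           # gradient of the loss at the pre-activation
        w -= lr * error * x_i                        # adjust the weights
        b -= lr * error                              # adjust the bias
    return w, b

# Illustrative training data: two features, binary ground-truth labels.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w, b = np.zeros(2), 0.0
for epoch in range(100):
    w, b = sgd_epoch(w, b, X, y)
print(w, b)
```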
Cross Entropy is a function that measures the difference between two probability distributions P and Q through the formula:

$$H(P, Q) = -\mathbb{E}_{x \sim P}[\log Q(x)] = -\sum_{x} P(x) \log Q(x)$$

H(P, Q) is the cross entropy of the distribution Q relative to the distribution P, and $-\mathbb{E}_{x \sim P}[\log Q(x)]$ is the negative of the expected value, for x given P, of the natural logarithm of Q(x). Cross entropy is widely used as a loss function for gradient descent in training DNNs. At each epoch in the training process cross entropy is computed, and the values of the weights assigned to the hidden units are adjusted to reduce error along the slope identified by gradient descent. Training is concluded when the distance between the network's predicted distribution and the distribution projected from the training data reaches an estimated optimal minimum.
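A minimal numpy sketch of this computation for discrete distributions; the distributions below (a one-hot ground-truth label P and a model prediction Q) are illustrative.

```python
import numpy as np

def cross_entropy(p, q):
    # H(P, Q) = -sum over x of P(x) * log Q(x), with the natural logarithm.
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

p = [0.0, 1.0, 0.0]          # ground-truth distribution (one-hot category label)
q = [0.1, 0.7, 0.2]          # the network's predicted distribution
print(cross_entropy(p, q))   # smaller values mean Q is closer to P
```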
In many cases the hidden layers of a DNN will produce a set of non-normalised probability scores for the different states of a random variable corresponding to a category judgement, or the likelihood of an event. The softmax function maps the vector of these scores into a normalised probability distribution whose values sum to 1. The function is defined as follows:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \ldots, K$$
This function applies the exponential function to each input value, and normalises each result by dividing it by the sum of the exponentials for all the inputs, to ensure that the output values sum to 1. The softmax function is widely used in the output layer of a DNN to generate a probability distribution for a classifier, or for a probability model.
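A minimal numpy sketch of the softmax function; the input scores are illustrative, and the maximum is subtracted before exponentiation, a standard step for numerical stability that does not change the result.

```python
import numpy as np

def softmax(z):
    # Exponentiate each score (shifted by the maximum for numerical stability) ...
    exps = np.exp(z - np.max(z))
    # ... and divide by the sum of the exponentials, so the outputs sum to 1.
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])   # un-normalised scores for three categories
print(softmax(scores))               # approximately [0.659, 0.242, 0.099]
print(softmax(scores).sum())         # 1.0
```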
Words are represented in a DNN by vectors of real numbers. Each element of the vector expresses a distributional feature of the word. These features are the dimensions of the vectors, and they encode the word's co-occurrence patterns with other words in a training corpus. Word embeddings are generally compressed into low dimensional vectors (200-300 dimensions) that express similarity and proximity relations among the words in the vocabulary of a DNN model. These models frequently use large pre-trained word embeddings, like word2vec (Mikolov, Kombrink, Deoras, Burget, & Černocký, 2011) and GloVe (Pennington, Socher, & Manning, 2014), compiled from millions of words of text.
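To illustrate how such vectors express similarity relations, here is a minimal numpy sketch using cosine similarity; the three-dimensional vectors stand in for real pre-trained embeddings of 200-300 dimensions, and their values are invented for the example.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # values near 1 indicate distributionally similar words.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Tiny invented embeddings; real word2vec or GloVe vectors have 200-300 dimensions.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # relatively high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # relatively low
```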
In supervised learning a DNN is trained on data annotated with the features that it is learning to predict. For example, if the DNN is learning to identify the objects that appear in graphic images, then its training data may consist of large numbers of labelled images of the objects that it is intended to recognise in photographs. In unsupervised learning the training data are not labelled.[1] A generative neural language model may be trained on large quantities of raw text. It will generate the most likely word in a sequence, given the previous words, on the basis of the probability distribution over words, and sequences of words, that it estimates from the unlabelled training corpus.
TYPES OF DEEP NEURAL NETWORKS
Feed Forward Neural Networks take data encoded in vectors of fixed size as input, and they yield output vectors of fixed size. Recurrent Neural Networks (RNNs) (Elman, 1990) apply to sequences of input vectors, producing a string of output vectors. They retain information from previous processing phases in a sequence, and so they have a memory over the span of the input. RNNs are particularly well suited to processing natural language, whose units of sound and text are structured as ordered strings. Fig 1.3 shows the architecture of an RNN.

Figure 1.3 Recurrent Neural Network
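A minimal numpy sketch of the recurrence in a simple (Elman-style) RNN: each step combines the current input vector with the previous hidden state, which is how the network retains a memory over the span of the input. Dimensions and weights are illustrative.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state mixes the current input with the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Illustrative dimensions: 4-dimensional input vectors, 3 hidden units.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                        # initial hidden state
for x_t in rng.normal(size=(5, 4)):    # a sequence of five input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)                               # carries information from the whole sequence
```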
Simple RNNs preserve information from previous states, but they do not effectively control this information. They have difficulties representing long-distance dependencies between elements of a sequence. A
Long Short-Term Memory network (Hochreiter & Schmidhuber, 1997) is a type of RNN whose units contain three types of information gates, composed of sigmoid and hyperbolic tangent (tanh) functions (a minimal code sketch of these gates follows Fig 1.4 below).[2]
- (i) The forgetting gate determines which part of the information received from preceding units is discarded;
- (ii) the input gate updates the retained information with the features of a new element of the input sequence; and
- (iii) the output gate defines the vector which is passed to the next unit in the network.
Fig 1.4 displays the architecture of an LSTM.

Figure 1.4 LSTM.
From Christopher Olah’s blog Understanding LSTM Networks, August 27, 2015.
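The sketch below shows the three gates in numpy for a single LSTM step; the parameter shapes and random values are illustrative, and the biases are initialised to zero for simplicity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold parameters for the forget (f), input (i), candidate (g)
    # and output (o) computations, applied to the input and previous hidden state.
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W["f"] @ z + b["f"])   # (i) forget gate: what to discard from the cell state
    i = sigmoid(W["i"] @ z + b["i"])   # (ii) input gate: how much new information to retain
    g = np.tanh(W["g"] @ z + b["g"])   # candidate update built from the new input element
    o = sigmoid(W["o"] @ z + b["o"])   # (iii) output gate: what to pass to the next unit
    c = f * c_prev + i * g             # updated cell state (the long-term memory)
    h = o * np.tanh(c)                 # new hidden state passed to the next unit
    return h, c

# Illustrative dimensions: 4-dimensional inputs, 3 hidden units (7 concatenated).
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 7)) for k in "figo"}
b = {k: np.zeros(3) for k in "figo"}

h, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```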
In a convolutional neural network (CNN, LeCun, Kavukcuoglu, & Farabet, 2010) input is fed to a convolutional layer, which extracts a feature map from this data. A pooling layer compresses the map by reducing its dimensions, and rendering it invariant to small changes in input (noise filtering). Successive convolutional + pooling layers construct progressively higher level representations from the feature maps received from preceding levels of the network. The output feature map is passed to one or more fully connected layers, which transform the map into a feature vector. A softmax function maps this vector into a probability distribution over the states of a category variable. Fig 1.5 illustrates the structure of a CNN.[3]

Figure 1.5 CNN.
From Sumit Saha, “A Comprehensive Guide to Convolutional Neural Networks—the ELI5 Way”, Towards Data Science, December 15, 2018.
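A minimal PyTorch sketch of this pipeline, assuming PyTorch is available: two convolution + pooling blocks (with standard ReLU activations, which the text does not discuss), a fully connected layer, and a softmax over categories. The layer sizes, the ten categories, and the 28x28 single-channel input are illustrative.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolutional layer: extracts a feature map
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling layer: compresses the map
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level features from the previous map
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                 # feature map -> feature vector
    nn.Linear(32 * 7 * 7, 10),                    # fully connected layer: scores for 10 categories
    nn.Softmax(dim=1),                            # probability distribution over the categories
)

x = torch.randn(1, 1, 28, 28)    # one illustrative 28x28 grey-scale image
probs = model(x)
print(probs.shape, probs.sum())  # torch.Size([1, 10]); probabilities sum to 1
```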
Attention was developed to solve a problem in seq2seq neural machine translation, which uses an encoder-decoder architecture. In earlier versions of this architecture an RNN (or LSTM) encoded an input sequence as a single context vector, which a decoder RNN (LSTM) mapped to a target language sequence. Information from the previous hidden states of the encoder is lost, and all the words in the encoder’s output vector are given equal weight when it is passed to the decoder.
Bahdanau, Cho, and Bengio (2015) introduce an attention layer that computes relative weights for each of the words in the input sequence, and these are combined with the context vector. The attention mechanism significantly improves the accuracy of seq2seq word alignment. It learns the relative importance of elements in the input in determining correspondences to elements in the output sequence. Self-attention identifies relations among the elements of the same sequence, which enhances the capacity of the DNN to recognise long-distance dependency patterns in that sequence. Fig 1.6 displays an attention layer interposed between an encoder (in this case, a bidirectional RNN) and a decoder (a unidirectional RNN). This component assigns different weights to the elements of the context vector, which the encoder produces as the input to the decoder.

Figure 1.6 Encoder-Decoder System with Attention. From Bahdanau et al. (2015).
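A minimal numpy sketch of the attention computation: encoder hidden states are scored against the current decoder state, softmax turns the scores into relative weights, and the weighted sum forms the context vector passed to the decoder. For simplicity the sketch uses dot-product scoring rather than the additive scoring function of Bahdanau et al. (2015), and all states are random illustrative values.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def attend(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state   # one score per input word
    weights = softmax(scores)                 # relative weights for the input words
    context = weights @ encoder_states        # weighted combination of encoder states
    return weights, context

# Illustrative: a 5-word input sequence with 4-dimensional hidden states.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))
decoder_state = rng.normal(size=4)

weights, context = attend(decoder_state, encoder_states)
print(weights)   # sums to 1; larger weights mark the input words attended to
print(context)   # the vector the decoder actually receives at this step
```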
Fig 1.7 is an attention alignment map for an English-French machine translation system. The number of squares and the degree of shading for word pairs, along the diagonal line of the map, indicate the relative attention allocated to these pairs.
Transformers (Vaswani et al., 2017) dispense with recurrent networks and convolution. Instead they construct both encoders and decoders out of stacks of layers that consist of multi-head attention units which provide input to a feed forward network. These layers process input sequences simultaneously, in parallel, independently of sequential order.

Figure 1.7 Attention Alignment for Machine Translation. From Bahdanau et al. (2015).
However, the relative positions of the elements of a sequence are represented as additional information, which a transformer can exploit. The attention-driven design of transformers has allowed them to achieve significant improvements over LSTMs and CNNs across a wide range of tasks. Fig 1.8 shows the architecture of a transformer with multi-head attention and feed forward layers.
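A minimal numpy sketch of the sinusoidal positional encodings of Vaswani et al. (2017), which supply the positional information mentioned above by adding a distinct position-dependent vector to each token embedding; the sequence length and model dimension are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Each position gets sines and cosines of different frequencies, so positions
    # remain distinguishable even though the layers process the sequence in parallel.
    positions = np.arange(seq_len)[:, None]       # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]            # shape (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions
    return encoding

# Illustrative: a 10-token sequence with 16-dimensional embeddings.
print(positional_encoding(10, 16).shape)   # (10, 16)
```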
Transformers are pre-trained on large amounts of text to produce extensive lexical embeddings. Many, like OpenAI GPT (Radford, Narasimhan, Salimans, & Sutskever, 2018), have a unidirectional architecture. GPT-2 (Solaiman et al., 2019) is a large transformer-based language model that OpenAI released in 2019. It is pre-trained on billions of words of text and has 1.5 billion parameters, which support large-scale embeddings.

Figure 1.8 Architecture of a Transformer.
It is unidirectional in that the likelihood of a word in a sentence is conditioned on the words that precede it in the sequence. The probability of the sentence is the product of these conditional probabilities, computed with the following equation.
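In standard notation, for a sentence consisting of the words $w_1, \ldots, w_n$, this is the chain-rule factorisation

$$P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}).$$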
In 2020 OpenAI released GPT-3 (Brown et al., 2020). It retains GPT-2's architecture, but it is greatly increased in size, pre-trained with 175 billion parameters. It has been tested on a variety of NLP applications for its ability to learn with limited exposure to examples of the task being solved. Brown et al. (2020) report that GPT-3 shows promising results for zero-, one-, and few-shot learning. In the latter case the model is exposed to 10-100 training examples. This is an important result because it shows that a large-scale pre-trained transformer can learn to perform some NLP tasks with very limited training. However, smaller transformer models that are fine-tuned on task-specific data still outperform GPT-3 on most of these tasks, even with few-shot learning. The notable exception is the generation of news reports. The rate at which human judges succeeded in identifying GPT-3 generated text as artificially produced was only slightly above chance. It is also important to keep in mind that GPT-3 achieves gains in fast learning for new tasks at the cost of massive training on large corpora.
BERT is a bidirectional transformer trained to predict a masked token from both its left and right contexts (effectively it predicts the word in a blank between two contexts). It uses the same generic pre-trained parameters for each task to which it is applied, and it is then fine-tuned for a particular task. BERT's architecture is shown in Fig 1.9, and its training regimen is indicated in Fig 1.10 (Devlin, Chang, Lee, & Toutanova, 2019).
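As a brief illustration of masked-token prediction, the sketch below assumes the Hugging Face transformers library and the pre-trained bert-base-uncased model are available; the example sentence is invented.

```python
from transformers import pipeline

# A fill-mask pipeline wraps a pre-trained BERT model for masked-token prediction.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from both its left and right contexts.
for prediction in fill_mask("The cat [MASK] on the mat."):
    print(prediction["token_str"], round(prediction["score"], 3))
```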

Figure 1.9 BERT.
From Devlin et al. (2019).

Figure 1.10 Training BERT.
From Rani Horev, “BERT: State of the Art Language Model for NLP”, Lyrn.AI, November 7, 2018.
- [1] See A. Clark and Lappin (2010) for a detailed discussion of supervised and unsupervised learning in NLP.
- [2] Where a logistic sigmoid function is a sigmoid function that generates values from 0 to 1, which can be interpreted as probabilities, tanh is a rescaled logistic sigmoid function that returns outputs from −1 to 1. A logistic sigmoid function is defined by the equation $\sigma(x) = \frac{1}{1 + e^{-x}}$. A tanh function is defined by the equation $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$.
- [3] This diagram of a CNN appears in several publications, and on a number of websites. I have not managed to locate the original source for it.