# Gaussian Mixture Model

Probability density functions are often estimated by modelling them as Gaussians. Only two parameters are required to define a Gaussian model: *μ*, the mean of the distribution, and *σ*, its standard deviation.

Univariate Gaussian Distribution:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Multivariate Gaussian Distribution:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

A mixture of Gaussians can model a distribution of data accurately. Mixtures require relatively few parameters and can be learned from relatively small datasets. A Gaussian mixture with *k* Gaussians $\mathcal{N}_i$, each with its own mean $\mu_i$, standard deviation $\sigma_i$ and a weight $w_i$ specifying the relative importance of the distribution $\mathcal{N}_i$, can be defined as

$$p(x) = \sum_{i=1}^{k} w_i\,\mathcal{N}(x \mid \mu_i, \sigma_i^2), \qquad \sum_{i=1}^{k} w_i = 1$$
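As a minimal sketch of evaluating such a mixture density (the component weights, means and standard deviations below are illustrative, not taken from the text):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def gmm_pdf(x, weights, means, sigmas):
    """Density of a k-component Gaussian mixture at point(s) x."""
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, means, sigmas))

# Example: a two-component mixture (weights sum to 1)
weights = [0.3, 0.7]
means = [-1.0, 2.0]
sigmas = [0.5, 1.0]
density = gmm_pdf(0.0, weights, means, sigmas)
```

Because the weights sum to one and each component integrates to one, the mixture itself remains a valid probability density.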

Gaussian mixtures are used to estimate the underlying class probability distribution by automatically learning complex object motion trajectories. Object trajectories include video tracker output and sign language measurements gathered from gloves wired with sensors. A Principal Component Analysis (PCA) representation of these trajectories is built; this representation for each class is segmented and then fitted with a GMM. The recognition system is made more robust and noise tolerant by making the training set diverse, which results in a complex Probability Density Function (PDF).

The Expectation Maximization (EM) algorithm is used to fit mixtures of Gaussians to the data. It is a two-step iterative estimation procedure, consisting of the E-step and the M-step. The means, variances and mixing coefficients are initialized and the initial log likelihood is evaluated.

1. E-step: The responsibilities are evaluated given the current parameter estimates.
2. M-step: The parameters are re-estimated using the current responsibilities.

The log likelihood of the training data is defined as follows:

$$\ln p(X \mid w, \mu, \sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{i=1}^{k} w_i\,\mathcal{N}(x_n \mid \mu_i, \sigma_i^2) \right\}$$

The log likelihood is evaluated after every iteration, and the iterations repeat until the convergence criterion is satisfied. The EM algorithm converges monotonically and finds a local maximum. Data from the training set is pruned, merged and split using a model splitting process to automatically estimate the number of modes, taken as twice the maximum number of sub-trajectories across all the trajectories of the class. The product of a mode's mixing weight and the number of samples determines the number of input samples required for effective estimation of the mode parameters. Once the GMM has been trained, new trajectories are classified by computing their log likelihood: a trajectory is assigned to the class whose GMM yields the maximal likelihood [21].
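The E-step/M-step loop above can be sketched for a one-dimensional mixture as follows. This is an illustrative implementation, not the paper's: the quantile-based initialization and fixed iteration count are assumptions for the sketch.

```python
import numpy as np

def fit_gmm_em(x, k, n_iter=100):
    """Fit a 1-D Gaussian mixture with EM (illustrative sketch)."""
    n = len(x)
    # Initialize means, variances and mixing coefficients (quantile init is an assumption)
    mu = np.quantile(x, (np.arange(k) + 1) / (k + 1))
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities given the current parameter estimates
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters using the responsibilities
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / n
    return w, mu, var

# Example: two well-separated clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(4, 1, 500)])
w, mu, var = fit_gmm_em(x, k=2)
```

In practice the loop would terminate on a log-likelihood convergence criterion rather than a fixed iteration count, matching the stopping rule described above.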

A multimodal framework combines both manual and non-manual signs. Since it is challenging to extract a good feature set from a multimodal system, a classifier is required to pick only certain class samples. HMM-GMM models have been popular for recognizing gestures. The HMM is used as a temporal classifier and models the sequential dependencies in the data, while the output density of each state is given by a GMM defined for that state.

The HMM is then trained for each sign gesture sequence. The output probabilities are re-estimated using the Baum-Welch algorithm, and the Viterbi algorithm is used to perform recognition. The HMM-GMM approach mandates a systematic approach to tuning the parameters by varying both the number of hidden states and the number of Gaussian mixture components. The role played by facial expressions has been understated in the field of SLR: the complexity of interpreting and understanding facial-expression features has led to their being disregarded. However, [21] argues that both facial expressions and hand gestures are essential to delivering information about a gesture.

While dealing with large-vocabulary SLR, the main challenge is the presence of a large search space. The proposed solution is a fuzzy decision tree with heterogeneous classifiers. A divide-and-conquer method is used to hierarchically classify sign language attributes using multiple classifiers at each stage [7]. Unsuitable candidates are eliminated using a one- or two-handed classifier based on a GMM. A Finite State Machine (FSM) based method is used as the hand shape classifier. Finally, a Self-organizing Feature Maps/Hidden Markov Model (SOFM/HMM) classifier is used to tackle signer-independence difficulties: the challenges posed by effective extraction of common features from different signers, as well as model convergence difficulties.

A GMM can determine whether a gesture is demonstrated using one or two hands. The GMM concludes this by observing the left hand, which stays motionless for most one-handed signs, resulting in very stable data; the most stable frame is then obtained. In the case of a two-handed sign, all the frames in which the left hand is in motion are included as part of the training data. A motionless left hand allows the GMM to conclude that it is a one-handed sign, and the training data is extracted accordingly. The training data uses only the position and orientation information of the left hand for classification. Once the candidate words are classified under their corresponding one- or two-handed classes, these are passed to the hand shape classifier.
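The text does not give the exact decision rule, but the underlying idea, that a motionless left hand signals a one-handed sign, can be sketched with a simple motion threshold (the threshold value and the displacement heuristic below are illustrative; the paper fits a GMM to the left-hand data instead):

```python
import numpy as np

def is_one_handed(left_hand_positions, motion_threshold=0.05):
    """Heuristic sketch: treat a sign as one-handed if the left hand's
    mean frame-to-frame displacement stays below a threshold.
    (Threshold and rule are illustrative assumptions.)"""
    positions = np.asarray(left_hand_positions, dtype=float)
    # Mean frame-to-frame displacement of the left hand across the clip
    displacements = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    return bool(displacements.mean() < motion_threshold)

# A nearly motionless left hand suggests a one-handed sign
still = [[0.50, 0.50], [0.50, 0.51], [0.51, 0.50], [0.50, 0.50]]
# A moving left hand suggests a two-handed sign
moving = [[0.1, 0.1], [0.3, 0.2], [0.5, 0.4], [0.7, 0.6]]
```

A GMM-based version would fit the left-hand position and orientation features and threshold on likelihood stability rather than raw displacement.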

# Neural Networks

ANNs are inspired by the biological neuron. CNNs are a variant of ANNs, influenced by the functioning of the human brain's visual cortex. The neurons in a CNN are modeled on biological nerve cells, each connected to a receptive field: a local region of the visual field. Discrete convolutions are performed on the image with filter values as trainable weights to accomplish this connectivity. Feature maps are formed by applying multiple filters to each channel, combined with the activation functions of the neurons [3].

One of the biggest advances is the use of Neural Networks in place of Gaussian Mixture Models (GMMs) [34, 27, 10]. GMMs are typically used to represent the relationship between an HMM state and the input features. Given enough mixture components, a GMM can model a probability distribution to any required level of accuracy. However, GMMs pose some serious drawbacks. Firstly, they have proven to be statistically inefficient for modelling data that lies near a non-linear region of the data space [10]. Secondly, the number of parameters can increase drastically even if the feature dimensionality increases by a small amount, which can degrade performance at higher dimensions. Neural Networks can provide a solution to both of these challenges posed by GMMs [22].

Vision-based analysis for Sign Language Recognition captures signs from a video camera. The camera captures signs acted out with a coloured glove. The system acquires the images, pre-processes them, extracts features and finally performs gesture recognition using Recurrent Neural Networks (RNNs). RNNs provide a considerable advantage over feedforward networks, especially for applications that demand temporal processing [26].

RNNs are a class of Artificial Neural Network (ANN) in which the connections between nodes form a directed graph along a temporal sequence. RNNs have feedback connections to other layers of the network and to themselves. This gives the network a form of local memory, which allows it to store patterns and sequences and present them to the network more than once. The input pattern is forwarded along the network while the recurrent activations are propagated back to the context layer.

Two well-known architectures for recurrent networks are the Elman and Jordan networks.

The Elman network is a three-layer network with context units. The hidden layer is a recurrent layer containing, in addition to the standard feedforward connections, recurrent links from all of its nodes back to itself. The Jordan network is akin to the Elman network in most aspects; the major difference is that, instead of the hidden layer feeding back into itself, the Jordan network contains a feedback loop from the output layer to the hidden layer. This makes the network's behaviour stable and suitable for gesture recognition. The processing nodes use the sigmoid activation function.
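A minimal sketch of an Elman-style forward pass, in which the previous hidden state serves as the context units (the layer sizes and random weights below are illustrative, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_forward(inputs, W_in, W_rec, W_out):
    """Run an Elman network over a sequence: the hidden state from the
    previous step acts as the context units feeding back into the hidden layer."""
    h = np.zeros(W_rec.shape[0])           # context starts at zero
    outputs = []
    for x in inputs:
        h = sigmoid(W_in @ x + W_rec @ h)  # hidden = f(input + context)
        outputs.append(sigmoid(W_out @ h))
    return outputs

# Illustrative dimensions: 3 input units, 4 hidden units, 2 output units
rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 3))
W_rec = rng.normal(size=(4, 4))
W_out = rng.normal(size=(2, 4))
seq = [rng.normal(size=3) for _ in range(5)]
ys = elman_forward(seq, W_in, W_rec, W_out)
```

A Jordan variant would replace `W_rec @ h` with a feedback term computed from the previous output instead of the previous hidden state.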

# Convolutional Neural Network - Hidden Markov Model

Hidden Markov Models have been a standard pattern recognition tool, since they can infer sequences of hidden states from time-varying signals. Although the HMM dominates the field of automatic speech recognition, it remains rather unpopular in the field of Computer Vision. This can be attributed to the poor image modelling capabilities of the GMMs generally used within HMM frameworks, as compared to CNNs.

This chapter focuses on an approach which integrates CNNs in the HMM framework [18].

- While embedding a deep CNN into an HMM, the outputs of the CNN are treated as true Bayesian posteriors.
- Hidden states of sign words are used as the underlying targets to train the CNN in a top-down approach. [1]

In most CNN approaches, the output is evaluated based on its correlation with the ground truth, often ignoring temporal structure. Frame-level labels are often a prerequisite for CNNs, yet it is difficult to annotate datasets with frame-level labels when dealing with real-time footage, video representations or real-life datasets.

Video input is given as a sequence of images, for which the model learns the unobserved sequence of words that best fits the corresponding signs. This sequence is found using Bayes' theorem, which picks the class with the maximum posterior probability. The number of hidden states in the HMM is established, which is then used to model the sign words.

A dynamic programming-based tracking approach is applied to the images as part of preprocessing, before they are input to the CNN. Image preprocessing is used to track the right hand across a set of frames, since the right hand plays the dominant role in signing. Distortion in the video is compensated for by enlarging the crop size. The images are processed pixel-wise and the average of all images in the training data is subtracted.

The video is input as a contiguous stream of frames or images $x_1^T = x_1, \ldots, x_T$. Automatic SLR tries to learn the sequence of $N$ unknown sign words $w_1^N$ which best fits $x_1^T$. Sign words are assumed to occur in a monotonic fashion, unlike translation from sign language to spoken language, where rearrangements are necessary. The best-fit sequence is found using the Bayes decision rule, and the objective is to maximize the posterior probability $p(w_1^N \mid x_1^T)$:

$$x_1^T \rightarrow [w_1^N]_{\text{opt}} = \arg\max_{w_1^N} \left\{ p(w_1^N \mid x_1^T) \right\}$$

The true class posterior probability, when modelled by generative models, is decomposed into a combination of two knowledge sources: a product of a language model and a visual model. The problem is modelled using an HMM, a stochastic FSA, to account for the temporal variation of the input. A first-order Markov assumption is made and the Viterbi algorithm is used to maximize the posterior probability using the following equation:

$$[w_1^N]_{\text{opt}} = \arg\max_{w_1^N} \left\{ p(w_1^N)\, \max_{s_1^T} \prod_{t=1}^{T} p(x_t \mid s_t, w_1^N)\, p(s_t \mid s_{t-1}, w_1^N) \right\}$$
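The Viterbi maximization over state sequences can be sketched in log-space dynamic programming as follows (the toy transition and emission probabilities are illustrative, not from the text):

```python
import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """Find the most likely hidden-state path.
    log_emis: (T, S) log p(x_t | s); log_trans: (S, S) log p(s_t | s_{t-1});
    log_init: (S,) log prior over the initial state."""
    T, S = log_emis.shape
    score = log_init + log_emis[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # score of every prev -> cur transition
        back[t] = cand.argmax(axis=0)          # best predecessor for each state
        score = cand.max(axis=0) + log_emis[t]
    # Backtrace the best path from the final best state
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 2 states; emissions favour state 0, then 0, then 1
log_emis = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
log_init = np.log(np.array([0.5, 0.5]))
best = viterbi(log_emis, log_trans, log_init)  # -> [0, 0, 1]
```

In the full recognizer the same recursion runs over the pooled sign-word state space, with the language model score folded into the transitions.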

The output probability of the HMM, $p(x_t \mid s_t, w_1^N)$, is modelled by the CNN, which has been shown to model images better than generative models such as GMMs. Being a discriminative model, the CNN models the posterior probability $p(s \mid x)$; it is used in a hybrid approach to model the posterior probability of a hidden state $s$ given an input $x$.

A pooled state transition model, $p(s_t \mid s_{t-1})$, defining the HMM transitions in a Bakis structure (a left-to-right structure with forward transitions, loops and skips across at most one state, in which two subsequent states share the same class probabilities), is applied across all sign words. The HMM models the garbage class as an ergodic state (a state that is aperiodic and positive-recurrent) with independent transition probabilities, making the model more flexible and the insertion of the garbage class between sign words easier.

The architecture of the CNN employed incorporates three classifying layers. The network includes two intermediary auxiliary classifiers besides the final classifier. This encourages discrimination in lower stages of the network. The total loss is inclusive of the loss from these auxiliary classifiers. Each classifier is preceded by a dropout layer and all non-linearities are rectified linear units (ReLU).

A frame-state alignment is required for the CNN training phase. A training and validation set is generated using this alignment to evaluate the per-frame accuracy and stop training at a good point, generally before the last few iterations. Once the CNN is trained, the model that achieves the maximum accuracy on the automatic validation set is chosen, and all three classifiers are used to estimate the iteration that performs best. The hybrid CNN-HMM applies a normalized exponential (softmax) function whose resulting posteriors are used in the HMM as observation probabilities. In the tandem CNN-HMM approach, the activations from the last layer before the softmax that yields the highest accuracy on the validation data are employed. Features are extracted from both the train and the test datasets for the tandem system, because an HMM-GMM system is retrained on these features.
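A common detail in hybrid NN/HMM systems, not spelled out in the text, is converting the softmax posteriors into scaled likelihoods by dividing by the state priors, since $p(x \mid s) \propto p(s \mid x)/p(s)$. A sketch under that assumption (the logits and priors below are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_log_likelihoods(logits, state_priors):
    """Hybrid NN/HMM trick: p(x|s) is proportional to p(s|x) / p(s), so the
    HMM observation scores are the log posteriors minus the log priors.
    (The prior division is standard hybrid practice, assumed here.)"""
    log_post = np.log(softmax(logits))
    return log_post - np.log(state_priors)

# Toy frame: 3 HMM states, CNN logits and empirical state priors
logits = np.array([2.0, 0.5, -1.0])
priors = np.array([0.5, 0.3, 0.2])
obs_scores = scaled_log_likelihoods(logits, priors)
```

The tandem system skips this step entirely: it takes the pre-softmax activations as features and retrains an HMM-GMM on them.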

The HMM is based on RASR, a freely available state-of-the-art open-source speech recognition system [10]. System performance is measured in Word Error Rate (WER), which is based on the Levenshtein alignment: the minimum number of insertions, deletions and substitutions needed to transform the hypothesis sentence into the reference sentence, divided by the length of the reference.
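The WER computation can be sketched directly from that definition (the example sentences are illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER via Levenshtein alignment: (substitutions + insertions + deletions)
    between the word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("is") and one insertion ("smith") against a 4-word reference
wer = word_error_rate("my name is john", "my name john smith")  # -> 0.5
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions relative to a short reference.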

- [1] The hybrid CNN-HMM system is trained end to end, by optimizing the weights considering just the video input and the gloss output.