PROPOSED DEEPFAKE VIDEO DETECTION FRAMEWORK
Face DeepFake detection can be framed as a two-class classification problem in which the input video is flagged as either real or fake. Figure 4.4 shows the overall schematic of the developed DeepFake detection technique. The proposed technique is composed of three stages: face detection in the given frames, feature extraction from the detected face regions, and feature classification. In this study, a cascade face detector is used to detect faces. Specifically, the cascade detector employs the Viola-Jones algorithm, which detects faces, noses, eyes, and mouths. The detected face regions are then cropped and resized to 224 × 224 to match the input size of the ResNet-50 model. ResNet-50 is a deep CNN architecture with 50 layers, most of which are convolutional layers and a few of which are pooling layers. The FC layer (fc1000) of the ResNet-50 model is used to extract deep features from the cropped face regions. The extracted features are 1,000-dimensional and are passed to the sequence input layer of the LSTM classifier. A biLSTM layer follows the sequence input layer, and FC, softmax, and classification layers then follow in turn to detect the fake faces.
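The three stages can be sketched as a small pipeline. This is a minimal illustration and not the implementation used in the study: the detector, feature extractor, and sequence classifier are passed in as placeholder callables (standing in for the Viola-Jones detector, the ResNet-50 FC features, and the biLSTM classifier), frames are plain nested lists, and the nearest-neighbour resize is an assumption made to keep the example self-contained.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = List[List[float]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (x, y, width, height) of a detected face


def crop_and_resize(frame: Frame, box: Box, size: int = 224) -> Frame:
    """Crop the face box and resize it to size x size (nearest neighbour)."""
    x, y, w, h = box
    return [
        [frame[y + (r * h) // size][x + (c * w) // size] for c in range(size)]
        for r in range(size)
    ]


def classify_video(
    frames: Sequence[Frame],
    detect_face: Callable[[Frame], Optional[Box]],
    extract_features: Callable[[Frame], List[float]],
    classify_sequence: Callable[[List[List[float]]], str],
) -> str:
    """Run the three stages on a video and return the classifier's label."""
    sequence = []
    for frame in frames:
        box = detect_face(frame)                  # stage 1: face detection
        if box is None:
            continue                              # skip frames without a face
        face = crop_and_resize(frame, box)        # 224 x 224 input for the CNN
        sequence.append(extract_features(face))   # stage 2: deep features
    return classify_sequence(sequence)            # stage 3: sequence classifier
```

Keeping the stages as separate callables mirrors the schematic: each component can be replaced without touching the others.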
In the following, we detail the CNNs, LSTM, and ResNet.
FIGURE 4.4 Proposed DeepFake video detection framework.
Convolutional Neural Networks (CNNs)
A CNN is usually built from several types of layers, which are stacked consecutively to form different architectures for different tasks. These may be convolution, pooling, normalisation, and FC layers. The convolution layers produce feature maps from the input images. Let $X_i^{l-1}$ be the features extracted from the previous layer, $b_j$ be the trainable bias operated to avoid overfitting, and $k$ be the learnable kernels. The output feature map is evaluated as follows:
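In the notation commonly used for this layer, the output feature map can be written as below. The layer index $l$ and the map indices $i$ and $j$ are assumptions here, since the text names only the features, the bias, the kernels, the map selection, and the activation:

```latex
\[
X_j^{l} = f\!\left( \sum_{i \in M_j} X_i^{l-1} * k_{ij}^{l} + b_j^{l} \right)
\]
```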
where $M_j$ denotes the selection of input maps and $f(\cdot)$ is the activation function. The pooling layer down-samples the feature maps passed on by the convolution layer. In the literature, different pooling techniques such as mean and max pooling are used. Pooling layers reduce the number of computational nodes and help to prevent overfitting in the CNN structure. Pooling is defined as follows:
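A minimal form consistent with the down-sampling operator described next, assuming the same layer index $l$ and map index $j$ as in the convolution step, is:

```latex
\[
X_j^{l} = \operatorname{down}\!\left( X_j^{l-1} \right)
\]
```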
where the down(.) function performs the down-sampling operation. Note that down-sampling provides a summary of local features that are then used in the following layers. FC layers are connected to all activations in the preceding layer and provide discriminative features for classifying the input frame into different classes. The activations of the FC layers are calculated by matrix multiplication followed by the addition of a bias. CNN training is carried out with the optimisation scheme in Equation 4.3. For neural networks, adaptive moment estimation (ADAM) and stochastic gradient descent with momentum (SGDM) are two well-established training methods. In the SGDM method, the weights are updated regularly for each training set to reach the target as early as possible:
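One common way to write this update, assumed here rather than taken from the text, introduces a velocity term $V$ and a momentum coefficient $\mu$ (both hypothetical symbols) alongside the quantities defined next:

```latex
\[
V_{t+1} = \mu V_t - \alpha \nabla L(W_t)
\]
```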
where $W$, $\alpha$, and $L$ denote the weights, the learning rate, and the loss function, respectively. During CNN training, the new weights are computed as follows:
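In the usual momentum formulation, the new weights are the old weights plus the updated velocity, $W_{t+1} = W_t + V_{t+1}$. The sketch below assumes that formulation; the function names, the toy quadratic loss, and the default values of `lr` and `momentum` are illustrative and are not taken from the study.

```python
def sgdm_step(weights, velocity, grads, lr=0.01, momentum=0.9):
    """One SGDM update: V <- momentum * V - lr * grad, then W <- W + V."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def loss_grad(weights):
    """Gradient of the toy loss L(W) = sum(w^2), used only for illustration."""
    return [2.0 * w for w in weights]


# Repeated updates drive the weights towards the minimum of the toy loss at 0.
weights, velocity = [1.0, -2.0], [0.0, 0.0]
for _ in range(400):
    weights, velocity = sgdm_step(weights, velocity, loss_grad(weights))
```

The velocity accumulates past gradients, so consecutive steps in the same direction reinforce each other.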
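The ADAM optimiser keeps running averages of both the gradients (first moment) and the squared gradients (second moment) and uses them to adapt the learning rate of each parameter at every iteration. In this respect it extends the RMSProp method, which adapts the learning rate using the running average of the squared gradients only.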
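A single ADAM step for one scalar weight can be sketched as follows. This is the standard published update rule with its usual default hyperparameters, not settings reported in this study; the function name `adam_step` is hypothetical.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a scalar weight w with gradient g at step t >= 1."""
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients (first moment)
    v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly `lr` regardless of the gradient's magnitude.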
Long Short-Term Memory (LSTM)
The LSTM is a special type of RNN [68-70]. It is usually considered far more robust than feed-forward neural networks and standard RNNs because of the memory blocks and recurrent connections in its recurrent hidden layer. The LSTM is very effective in classification and regression problems [69,70]. The memory blocks of the LSTM contain self-connected memory cells, which store the temporal state of the network at every time step. Information enters the memory cells through an input and passes from there to the other units through the gates. A forget gate is employed to scale the internal state of the cell before it is added back to the memory cell as an input through the cell's self-recurrent connection; where necessary, it resets or forgets the cell's memory. The forget gate is controlled by a one-layer neural network with an activation function, defined as follows:
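Assuming a peephole-style gate that also sees the previous cell memory, and writing the gate output as $f_t$ (an assumed symbol), one form consistent with the quantities defined next is:

```latex
\[
f_t = \sigma\!\left( W \left[ x_t,\ h_{t-1},\ C_{t-1} \right] + b_f \right)
\]
```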
where $C_{t-1}$, $h_{t-1}$, $x_t$, and $b_f$ are the previous LSTM block memory, the output of the previous block, the input sequence, and the bias vector, respectively. The logistic sigmoid function and the weight vector assigned to each input are denoted by $\sigma$ and $W$, respectively. The output of this activation is applied to the previous memory block by element-wise multiplication, which determines the effect of the previous memory on the current LSTM block. If the values of the activation output vector are close to zero, the previous memory is forgotten.
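The forgetting behaviour described above can be illustrated with scalar values. This is a minimal sketch, assuming a gate of the form sigmoid(W · [x_t, h_{t-1}, C_{t-1}] + b_f); the function names and the numeric weights and biases are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(x_t, h_prev, c_prev, weights, bias):
    """Forget-gate activation for scalar input, previous output, and previous memory."""
    w_x, w_h, w_c = weights
    return sigmoid(w_x * x_t + w_h * h_prev + w_c * c_prev + bias)


def scaled_memory(f_t, c_prev):
    """Scale the previous memory by the gate activation (element-wise in the vector case)."""
    return f_t * c_prev


c_prev = 2.0
# A strongly negative pre-activation gives f_t near 0, so the memory is forgotten.
forgotten = scaled_memory(forget_gate(0.5, 0.1, c_prev, (1.0, 1.0, 1.0), -20.0), c_prev)
# A strongly positive pre-activation gives f_t near 1, so the memory is kept.
kept = scaled_memory(forget_gate(0.5, 0.1, c_prev, (1.0, 1.0, 1.0), 20.0), c_prev)
```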
In the input gate, a simple neural network produces a new memory by taking into account the impact of the preceding memory block and applying the tanh activation function. The related process is as follows:
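One formulation consistent with the symbols defined next is given below. It assumes the same concatenated input as the forget gate, and the symbols $f_t$ for the forget-gate output, $C_t$ for the new memory, and $b_c$ for the candidate-memory bias are assumptions:

```latex
\[
i_t = \sigma\!\left( W \left[ x_t,\ h_{t-1},\ C_{t-1} \right] + b_i \right)
\]
\[
C_t = f_t \cdot C_{t-1} + i_t \cdot \tanh\!\left( W \left[ x_t,\ h_{t-1},\ C_{t-1} \right] + b_c \right)
\]
```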
where $i_t$, $b_i$, and $W$ denote the output of the input gate, the bias vector, and the weights, respectively. $h_{t-1}$ is the output of the preceding block, $C_{t-1}$ is the previous LSTM memory, and $\sigma$ denotes the activation function [70-72]. The output gate can be regarded as the branch where the output of the current LSTM block is generated using the following formulas:
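A common peephole-style form, assumed here, uses $o_t$ for the output-gate activation, $b_o$ for its bias, $C_t$ for the updated memory, and $h_t$ for the block output (all assumed symbols):

```latex
\[
o_t = \sigma\!\left( W \left[ x_t,\ h_{t-1},\ C_t \right] + b_o \right)
\]
\[
h_t = \tanh\!\left( C_t \right) \cdot o_t
\]
```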
Residual Neural Network (ResNet)
The ResNet was developed by He et al. as a CNN architecture up to 152 layers deep. The ResNet attempts to address the vanishing gradient problem that occurs during back-propagation in CNNs. The ResNet architecture introduced residual connections (skip connections) to prevent the loss of information during the training of deep networks. The skip-connection technique makes it possible to train very deep networks and thereby improve model performance. Residual blocks are the main building blocks of the ResNet architecture. In shallow neural networks, consecutive hidden layers are connected directly to one another, whereas the ResNet architecture adds connections that pass across the residual blocks. Two of the most significant advantages of residual connections are that they preserve the knowledge gained throughout training and that they speed up training while increasing the network capacity. In this study, we focus on ResNet-50, a residual DL network with 50 layers.
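The effect of a skip connection can be shown with a toy example. This is a simplified sketch, assuming an identity shortcut and omitting the convolutions, normalisation, and activation of a real ResNet-50 block; `transform` is a hypothetical stand-in for the layers inside the block.

```python
def residual_block(x, transform):
    """Return transform(x) + x: the block output with an identity skip connection."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]


def plain_block(x, transform):
    """Return transform(x) only: a block without a skip connection."""
    return transform(x)


def zero_transform(x):
    """Stand-in for layers that contribute nothing (for example, weights near zero)."""
    return [0.0 for _ in x]


signal = [1.0, -2.0, 3.0]
with_skip, without_skip = signal, signal
for _ in range(50):  # stack 50 blocks
    with_skip = residual_block(with_skip, zero_transform)
    without_skip = plain_block(without_skip, zero_transform)
```

With the skip connection the input passes through all 50 blocks unchanged, whereas the plain stack loses it after the first block.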