EXPERIMENTS

This section presents an empirical evaluation of the developed DeepFakes detection framework.

Datasets

Two publicly available datasets, namely DeepFakeTIMIT and Celeb-DF, were used in this study.

4.4.1.1 DeepFakeTIMIT Dataset

The DeepFakeTIMIT dataset [9] consists of two equal-sized subsets of low-quality (LQ) and high-quality (HQ) DeepFakes generated from the VidTIMIT dataset [70]. Each subset contains 320 videos, with face regions of 64 x 64 pixels (LQ) and 128 x 128 pixels (HQ). In this work, we used the HQ subset, as it is the more difficult of the two; results on this subset therefore demonstrate the efficacy of the proposed framework.

4.4.1.2 Celeb-DF Dataset

The Celeb-DF dataset [73] consists of 590 real and 5,639 DeepFake videos, corresponding to more than two million video frames. The real videos were taken from YouTube videos of 59 celebrities of different genders, age groups, and ethnicities. The DeepFake videos were generated by face swapping for each pair of the 59 subjects. Figure 4.5 shows some examples of real video frames from both datasets, whereas Figure 4.6 shows the corresponding detected face regions. Similarly, Figure 4.7 shows examples of fake video frames from both datasets, and Figure 4.8 shows the corresponding detected fake face regions.

Figures of Merit

A face DeepFakes detection framework is subject to two kinds of errors: the false rejection rate (FRR) and the false acceptance rate (FAR). The FRR is the percentage of real samples classified as DeepFakes, while the FAR is the percentage of DeepFake samples incorrectly classified as real. In this work, the efficacy of the developed framework was evaluated using the FRR, the FAR, and the equal error rate (EER). The EER describes the accuracy of the system at the operating point where the FAR equals the FRR, i.e., FAR% = FRR%.
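To make these figures of merit concrete, the following is a minimal MATLAB sketch (the function and variable names are ours, not from the chapter) of how the FAR, FRR, EER, and the FRR@FAR10% score used later in this section can be computed from a detector's scores, assuming that higher scores indicate real samples.

function [eer, frrAtFar10] = detectionMetrics(realScores, fakeScores)
    % Candidate thresholds: every observed score
    thresholds = sort([realScores(:); fakeScores(:)]);
    n = numel(thresholds);
    far = zeros(n, 1);
    frr = zeros(n, 1);
    for i = 1:n
        t = thresholds(i);
        far(i) = mean(fakeScores >= t);  % DeepFakes accepted as real
        frr(i) = mean(realScores < t);   % real samples rejected as fake
    end
    % EER: operating point where FAR and FRR are closest to equal
    [~, idx] = min(abs(far - frr));
    eer = 100 * (far(idx) + frr(idx)) / 2;
    % FRR at the threshold where FAR is closest to 10%
    [~, idx10] = min(abs(far - 0.10));
    frrAtFar10 = 100 * frr(idx10);
end

Sweeping the threshold over all observed scores traces the FAR/FRR trade-off; the EER is read off where the two curves cross.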

FIGURE 4.5 Real video frames: the first row shows some frames from the Celeb-DF dataset and the second row shows some frames from the DeepFakeTIMIT dataset [9].

FIGURE 4.6 Detected face regions on the given frames. The first row shows some detected faces from the Celeb-DF dataset and the second row shows some detected faces from the DeepFakeTIMIT dataset [9].

FIGURE 4.7 Fake video frames. The first row shows some fake frames from the Celeb-DF dataset and the second row shows some fake frames from the DeepFakeTIMIT dataset [9].

FIGURE 4.8 Detected face regions on the given fake frames. The first row shows some detected fake faces from the Celeb-DF dataset and the second row shows some detected fake faces from the DeepFakeTIMIT dataset [9].

Experimental Protocol

All empirical analyses were performed on a workstation with an Intel(R) Xeon(R) E5-1650 CPU @ 3.60 GHz, 64 GB of memory, and an NVIDIA Quadro M4000 GPU, using MATLAB R2018b. As mentioned earlier, the detected face regions were resized to 224 x 224 to be compatible with the input of the ResNet-50 model. The rescaled face regions were fed into ResNet-50 to extract features. The dimension of the sequence input layer was set to 1,000, the number of hidden units of the biLSTM layer was set to 100, and the fully connected (FC) layer had two outputs. Table 4.2 presents the training parameters and their values. 75% of each dataset was used for training the proposed method, and the remaining 25% was used for testing. With the parameters listed in Table 4.2, the training procedure ran for 7,560 iterations, with 63 iterations per epoch. Zero-centre normalisation was applied to the data before it was fed to the LSTM network: the feature-wise mean and standard deviation were computed over the entire training set, and then, for every training sample, the mean was subtracted and the result divided by the standard deviation. The "Adam" solver was selected for training the LSTM network.
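For illustration, the following MATLAB sketch reconstructs the per-video pipeline described above. It is an approximation under stated assumptions: the file name video.mp4 is a placeholder, the Viola-Jones detector stands in for whichever face detector was used, and the 'fc1000' layer name is our assumption for obtaining the 1,000-dimensional ResNet-50 features (the resnet50 function requires its MATLAB support package).

% Assumed pipeline: detect the face in each frame, crop, resize to
% 224 x 224, and take the 1,000-D ResNet-50 'fc1000' activations, so
% that each video becomes a 1000 x T feature sequence.
net = resnet50;                                  % needs the ResNet-50 support package
faceDetector = vision.CascadeObjectDetector();   % Viola-Jones detector (our stand-in)

frames = read(VideoReader('video.mp4'));         % placeholder file; H x W x 3 x T
numFrames = size(frames, 4);
features = zeros(1000, numFrames);
for k = 1:numFrames
    frame = frames(:, :, :, k);
    bbox = step(faceDetector, frame);            % face bounding boxes [x y w h]
    if isempty(bbox), continue; end
    face = imresize(imcrop(frame, bbox(1, :)), [224 224]);
    features(:, k) = activations(net, face, 'fc1000', 'OutputAs', 'columns');
end

% Zero-centre normalisation over the whole training set.
% XTrain: cell array with one 1000 x T feature matrix per training video.
allFeat = cat(2, XTrain{:});
mu = mean(allFeat, 2);
sd = std(allFeat, 0, 2);
XTrain = cellfun(@(x) (x - mu) ./ sd, XTrain, 'UniformOutput', false);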

TABLE 4.2
Training Variables and Values

Training variable        Value
Maximum epoch            120
Mini-batch size          20
Initial learning rate    0.001
Learn rate schedule      Piecewise
Learn drop period        100
Learn drop factor        0.001
Gradient threshold       1
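The settings in Table 4.2 map directly onto MATLAB's trainingOptions. The sketch below shows one plausible realisation; the softmax and classification layers and the exact layer ordering are standard choices we assume rather than details given in the text.

% Sketch: biLSTM classifier and Adam training options matching Table 4.2
layers = [
    sequenceInputLayer(1000)                % 1,000-D ResNet-50 features
    bilstmLayer(100, 'OutputMode', 'last')  % 100 hidden units
    fullyConnectedLayer(2)                  % real vs. DeepFake
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 120, ...
    'MiniBatchSize', 20, ...
    'InitialLearnRate', 0.001, ...
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropPeriod', 100, ...
    'LearnRateDropFactor', 0.001, ...
    'GradientThreshold', 1);

% XTrain: cell array of 1000 x T sequences; YTrain: categorical labels
net = trainNetwork(XTrain, YTrain, layers, options);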

Experimental Results

In Table 4.3, we report the evaluation of the devised DeepFake video detection framework on both datasets in terms of EER and FRR@FAR10%, i.e., the FRR at the threshold where FAR = 10%. It can be seen in Table 4.3 that the proposed method produced reasonable results on both datasets. For example, an EER of 2.4217% and an FRR@FAR10% of 0.0795% were obtained for the DeepFakeTIMIT dataset, while an EER of 0.5014% and an FRR@FAR10% of 0% were obtained for the Celeb-DF dataset. These error rates are quite low compared with other methods in the literature.

We further compared the performance of the proposed DeepFake video detection method with some of the existing methods from the literature. Table 4.4 shows the results obtained by the various local descriptors adopted in Ref. [37], namely LBP, Pyramid of Histogram of Oriented Gradients (PHOG), SIFT, CENTRIST, BSIF, Local Phase Quantisation (LPQ), BGP, Quaternionic Local Ranking Binary Pattern (QLRBP), FDLBP, and Speeded Up Robust Features (SURF) [37].

TABLE 4.3
Performance Evaluation of the Developed Method on Both Datasets in Terms of EER (%) and FRR@FAR10% (%)

Dataset          EER (%)    FRR@FAR10% (%)
DeepFakeTIMIT    2.4217     0.0795
Celeb-DF         0.5014     0.00

TABLE 4.4
Performance Comparison of the Proposed Method with Existing Methods on DeepFakeTIMIT Dataset in Terms of EER (%) and FRR@FAR10% (%)a

Method             EER (%)    FRR@FAR10% (%)
IQM + SVM [9]      8.97       9.05
LBP [37]           17.16      43.02
FDLBP [73]         37.19      88.30
QLRBP [73]         27.70      59.49
BGP [73]           13.33      15.99
LPQ [73]           13.69      16.53
BSIF [73]          60.88      93.69
CENTRIST [73]      11.43      13.12
PHOG [73]          89.70      100
SIFT [73]          57.58      95.43
SURF [73]          67.26      98.17
Proposed method    2.42       0.080

a Bold values indicate the best results.

TABLE 4.5
Accuracy Comparison of the Proposed Framework with Prior Techniques on Celeb-DF Dataset in Terms of EER (%)

Method                                   EER (%)
Face X-ray-Blended [74]                  31.16
Xception-based method [16]               59.64
Face X-ray-Blended+FaceForensics [74]    26.70
Proposed method                          0.5014

In Table 4.4, it can be observed that among the local image descriptors employed in Ref. [37], the CENTRIST descriptor attained the best classification accuracy, with an EER of 11.43% and an FRR@FAR10% of 13.12%. The BGP and LPQ descriptors produced EERs of 13.33% and 13.69%, with FRR@FAR10% scores of 15.99% and 16.53%, respectively. The worst performance came from the PHOG descriptor, with an EER of 89.70% and an FRR@FAR10% of 100%. The proposed method outperformed all of the considered local descriptors, attaining an EER of 2.42% and an FRR@FAR10% of 0.080%. Similarly, the proposed technique achieved better results than the image quality feature-based method developed in Ref. [9]. This high performance is attributable to the deep features, in which both colour and texture information are combined by the deep CNN model.

In Table 4.5, we report a comparison of the proposed framework with prior techniques on the Celeb-DF dataset in terms of EER (%). The table shows that the developed framework outperformed the prior techniques in detecting DeepFake videos. For instance, the proposed method attained an EER of 0.5014%, while Face X-ray-Blended [74] and the Xception-based method [16] achieved 31.16% and 59.64%, respectively. The authors in Ref. [74] proposed the Face X-ray technique based on a blending operation and CNNs with different training procedures, namely blended forgeries/images (Face X-ray-Blended) and state-of-the-art manipulations (Face X-ray-Blended+FaceForensics). On the other hand, the authors in Ref. [16] employed a pretrained XceptionNet [50] with fine-tuning for DeepFake video detection. Compared to the proposed method, the frameworks developed in Refs. [74] and [16] require large training datasets to attain low error rates, as also pointed out by the authors and in other publications. In contrast, the framework developed in this study demands comparatively few training samples to achieve good performance.

 