11.2 INDEXING FACIAL IMAGES

The human face is one of the most natural choices for biometric recognition, and authentication using facial data has been practised ever since photographs came into use [8]. The face is well suited to law enforcement and surveillance applications such as CCTV monitoring, suspect tracking, shoplifting detection, and criminal investigation. The authentication process remained largely manual until the early 2000s, after which automated face recognition against facial databases took over. Face recognition is a visual pattern recognition problem: it takes a face, a three-dimensional object that may vary in pose, illumination, or expression, and identifies it on the basis of its two-dimensional representation. An automatic face recognition system broadly involves four modules: face detection, alignment, feature extraction, and, lastly, matching and decision-making.


FIGURE 11.3 Block diagram of a face recognition system [27].

The block diagram of the face-recognition process is shown in Figure 11.3. Face detection aims at locating the face in an image and segmenting the face region from the background; in videos, the face must additionally be tracked across frames. Once the face has been extracted from the image/video, it is aligned in a uniform coordinate system. The face is then normalised geometrically and photometrically to account for pose and illumination variation, respectively. Pose variation is addressed by normalizing the face using the localization points of the face components, such as the eyes, nose, mouth, and facial outline. After normalization, representative yet discriminating features are extracted from the face. Lastly, matching is performed using the extracted features, which makes the feature extraction process highly important [27]. Features can be categorised as shallow or deep [53]. Shallow features are extracted using handcrafted local image descriptors such as SIFT, LBP, and HOG and are then concatenated to form a representation that describes the face as a whole, as in the sketch below. Deep features, on the other hand, are produced by a learned function, typically a deep neural network, that takes a face image as input and outputs salient features describing that image.
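To make the shallow/deep distinction concrete, the following sketch builds a shallow representation by concatenating HOG and LBP descriptors. It is illustrative only: it assumes the scikit-image library is available, and the descriptor settings (cell sizes, radii, bin counts) are arbitrary choices, not values from Ref. [53].

```python
# A minimal sketch of "shallow" feature extraction, assuming scikit-image;
# all descriptor parameters below are illustrative, not from Ref. [53].
import numpy as np
from skimage.feature import hog, local_binary_pattern

def shallow_face_features(face_gray):
    """Concatenate HOG and LBP-histogram descriptors for an aligned
    grey-scale face crop (e.g. 32 x 32)."""
    # HOG: histograms of gradient orientations over local cells.
    hog_vec = hog(face_gray, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2))
    # Uniform LBP: per-pixel texture codes, summarised as a histogram.
    lbp = local_binary_pattern(face_gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    # One vector describing the face as a whole.
    return np.concatenate([hog_vec, lbp_hist])

face = np.random.rand(32, 32)  # stand-in for an aligned face crop
print(shallow_face_features(face).shape)
```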

Face recognition can be run as either a verification or an identification process, depending on the target application. Beyond identifying individual facial images, there is a growing demand to scale databases up and to search for the best match within such large databases. For example, social networking websites such as Facebook and Instagram have a large number of users who upload millions of images daily, and the task at hand is to auto-tag the people appearing in them. Likewise, criminal investigations require finding a match for a probe image in a database containing millions of images. These processes are highly compute-intensive because the number of verifications they require is proportional to the size of the database, so efficiency degrades as the database grows. To counter this deterioration, a strategy is needed that can pre-filter the database in constant time and produce a small fixed-length candidate set of facial images with probabilistic guarantees on the hit rate. This is achieved by indexing the database, as sketched below.
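The idea can be illustrated with a toy index. The binary_code() helper below is a hypothetical stand-in for any real hashing scheme (such as the predictive hash code of the next subsection): gallery images are bucketed by their code, and a probe retrieves its candidate set with a single hash computation and one dictionary lookup, independent of database size.

```python
# Toy index-based pre-filtering; binary_code() is a hypothetical stand-in
# for a learned hashing scheme such as the PHC discussed below.
from collections import defaultdict
import numpy as np

def binary_code(image):
    # Stand-in hash: threshold a few pixels at the image mean.
    bits = (image.flatten()[:16] > image.mean()).astype(int)
    return tuple(bits)  # hashable key

def build_index(gallery):
    """Bucket gallery identities by the binary code of their image."""
    index = defaultdict(list)
    for identity, image in gallery:
        index[binary_code(image)].append(identity)
    return index

def candidate_set(index, probe_image):
    # One hash + one lookup (constant time), instead of one
    # verification per gallery entry (linear time).
    return index.get(binary_code(probe_image), [])
```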

11.2.1 Predictive Hash Code

Hashing methods, which learn binary code representations compared via Hamming distance, have lately been used for image retrieval in large-scale databases. These methods speed up the search, but variations of illumination, pose, and expression in facial images make the hash codes unstable. Therefore, to apply hashing to facial features, the features should be


FIGURE 11.4 Block diagram and the architecture: (a) block diagram depicting learning of predictable binary codes [22]; (b) architecture of the CNN that is utilized to improve the predictability of the binary code [22].

predictable even in the presence of facial variations. To address this, a predictable hash code (PHC) that embeds facial features into Hamming space has been proposed in Ref. [22]. The code is learned in such a way that the inter-class distance is maximised while the intra-class distance is minimised. To do so, the mean face of each class is identified. However, faces of the same class can still exhibit large variations in pose, illumination, and expression, which makes a direct Hamming-distance constraint too strong. It is therefore enforced that the codes of facial images of the same person be similar to the code of that subject's mean face. To maximise the inter-class distance between codes, the codes of the mean faces of different classes are constrained to be orthogonal to each other. Expectation maximization is used to find the linear mapping from the face image to a predictable binary code. The process is depicted in Figure 11.4a. It is implemented as two different models: one using the L1-norm and the other the L2-norm; the L1-norm yields a sparse mapping, while the L2-norm gives a dense one. A convolutional neural network (CNN)-based architecture has also been used that takes the unpreprocessed grey-scale face image as input and outputs its feature representation. It is employed to enhance the predictability of the binary codes and is trained using a softmax layer with as many nodes as there are classes. The architecture of the network is given in Figure 11.4b (images taken from Ref. [22]).
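The following sketch conveys the flavour of this construction under strong simplifications: orthogonal ±1 target codes for the class mean faces come from a Sylvester-Hadamard matrix, and a ridge-style least-squares fit replaces the expectation-maximization procedure of Ref. [22]. Every name and setting here is an assumption for illustration, not the authors' algorithm.

```python
# Simplified stand-in for PHC learning: least squares onto orthogonal
# class codes instead of the EM procedure of Ref. [22].
import numpy as np

def orthogonal_codes(n_classes):
    """Rows of a Sylvester-Hadamard matrix: mutually orthogonal
    +/-1 codes, one per class mean face."""
    H = np.array([[1.0]])
    while H.shape[0] < n_classes:
        H = np.block([[H, H], [H, -H]])
    return H[:n_classes]

def fit_linear_hash(X, y, n_classes, lam=1e-2):
    """X: (n, d) face images as rows, y: (n,) integer class labels.
    Each image is regressed onto its class's mean-face code, pulling
    same-class codes together while class codes stay orthogonal."""
    targets = orthogonal_codes(n_classes)[y]              # (n, k) targets
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ targets)
    return W                                              # dense linear map

def hash_code(X, W):
    return np.sign(X @ W)  # +/-1 binary codes
```

Replacing the ridge term with an L1 penalty would favour a sparse mapping, mirroring the PHC-L1/PHC-L2 split described above.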

11.2.2 Results

The aforementioned technique has been tested on three publicly available standard data sets: FRGC [54], AR [57], and YouTube Celebrities [52]. A subset of the FRGC data set, namely the first 20 images of each subject, has been taken for experimentation, giving a total of 3,720 images from 186 subjects, each cropped to a size of 32 × 32. Some cropped facial images of the same person are shown in Figure 11.5a. The experiment has been conducted in two phases, a closed-set and an open-set scenario. The first 10 images of 100 subjects have been taken as the training set. In the closed-set scenario, the training set has been considered


FIGURE 11.5 Facial images available in some of the popular open-source facial databases: (a) cropped facial images taken from the FRGC dataset [54]; (b) face images of a subject from the AR database [57]; (c) cropped facial images of three subjects taken from the YouTube Celebrities dataset [52].

as the gallery, and the remaining 10 images of the same subjects have been used for testing. In the open-set scenario, the first 10 images of the remaining 86 subjects have been taken as the gallery and their remaining 10 images have been used as query images. The proposed PHCs are denoted PHC-L2 and PHC-L1 (L2-norm and L1-norm, respectively). The technique has been compared against popular hashing methods, such as Locality Sensitive Hashing (LSH) [18], Spectral Hashing (SH) [62], Iterative Quantization (ITQ) [20], Linear Discriminative Analysis Hash (LDAH) [59], Binary Reconstruction Embedding (BRE) [35], Kernel-Based Supervised Hashing (KSH) [39], and Fast Supervised Hashing (FastH) [38]. For ITQ, both its supervised version (CCA-ITQ) and its unsupervised version (PCA-ITQ) are included. The recognition accuracies of PHC-L2 and PHC-L1 on the FRGC data set for the closed-set and open-set scenarios, reported in Ref. [22], surpass those of all the other methods, i.e. LSH, SH, ITQ, LDAH, BRE, KSH, FastH, CCA-ITQ, and PCA-ITQ.
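For contrast with the supervised PHC, the simplest of these baselines, random-hyperplane LSH, needs no labels at all: each bit is the sign of a projection onto a random direction. The sketch below is the textbook construction, not necessarily the exact variant benchmarked in Ref. [22].

```python
# Random-hyperplane LSH, the classic unsupervised baseline: each bit is
# the sign of a projection onto a random Gaussian direction.
import numpy as np

def lsh_codes(X, n_bits, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))  # random hyperplanes
    return (X @ planes > 0).astype(np.uint8)            # one bit per plane
```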

The AR database [57] consists of 4,000 images collected from 126 individuals, covering different illumination conditions, facial expressions, and occlusions. For the proposed technique, however, the experiment has been conducted with eight facial images from each of 100 subjects, down-sampled to a size of 28 × 23. Some of the down-sampled images from the AR database are shown in Figure 11.5b. Images of the first 50 subjects have been used for training, while images from the remaining 50 subjects have been used for testing. Of the testing images, the first four of each subject have been used as probe images and the remaining four as gallery images. The plot of recognition accuracy vs. the number of feature bits on the AR data set, shown in Ref. [22], depicts that in the initial phase supervised hashing methods perform better than the unsupervised ones because of the presence of illumination and pose variation in the frontal images of both sessions.

TABLE 11.1
Predictability Analysis (Average Recognition Accuracy (in %) ± Standard Deviation) for the Proposed Feature Representation [22]

           Feature Size (Bits)
Feature    32              64              128             —
Pixels     21.13 ± 3.01    46.31 ± 1.43    54.82 ± 2.68    58.37 ± 4.36
CNN        59.13 ± 2.10    74.29 ± 3.20    80.94 ± 1.65    83.15 ± 0.68

This is an open-set problem because the training and testing images come from different subjects; hence, it can be said that for such problems a larger number of bits is required to attain a good recognition rate.

The third data set, the YouTube Celebrities data set [52], consists of video clips instead of images: 1,910 clips of 47 individuals collected from YouTube. Forty-one images have been collected for every person by clipping three videos each, and these images have been cropped to a size of 30 × 30 as shown in Figure 11.5c. Six images for every person have been used for testing, making a total of 44,172 and 239,997 images in the testing and training sets, respectively. The training images have been used to train the CNN, which outputs a 1,152-dimensional feature vector for every image. The database has also been augmented by flipping the images for better training of the network. Similarity is computed between every test-train pair using the Hamming distance, as in the sketch below. The plots of recognition accuracy vs. number of bits for the AR and YouTube Celebrities data sets shown in Ref. [22] indicate that PHC-L2 attains the highest accuracy per bit among all compared methods, i.e. LSH, SH, ITQ, LDAH, BRE, KSH, FastH, CCA-ITQ, and PCA-ITQ. The predictability (accuracy ± standard deviation) of the feature representation under the proposed method, using both raw pixels and CNN features, has also been analysed for different feature lengths (numbers of bits); the results are shown in Table 11.1. A higher recognition rate with a lower standard deviation indicates that the proposed features are predictable. It has also been observed that the code obtained from CNN features attains a better recognition rate (68.79%) than the code obtained from the image pixels directly (58.37%).
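The matching step itself can be sketched as follows, assuming codes are stored as 0/1 arrays (±1 codes map over via (c + 1) / 2). Packing bits into bytes lets the XOR-and-count run over compact integers; the evaluation helper is a generic rank-1 nearest-neighbour accuracy, not the exact protocol of Ref. [22].

```python
# Hamming-distance matching over packed binary codes; a generic sketch,
# not the evaluation code of Ref. [22].
import numpy as np

def hamming_matrix(query_codes, gallery_codes):
    """Pairwise Hamming distances between two sets of 0/1 codes."""
    q = np.packbits(query_codes.astype(np.uint8), axis=1)  # 8 bits per byte
    g = np.packbits(gallery_codes.astype(np.uint8), axis=1)
    xor = np.bitwise_xor(q[:, None, :], g[None, :, :])     # differing bytes
    return np.unpackbits(xor, axis=2).sum(axis=2)          # differing bits

def rank1_accuracy(query_codes, query_labels, gallery_codes, gallery_labels):
    D = hamming_matrix(query_codes, gallery_codes)
    predicted = gallery_labels[np.argmin(D, axis=1)]       # nearest code wins
    return float((predicted == query_labels).mean())
```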

 