Multi-Spectral Samples Pre-Processing

The PAD method proposed in Tolosana et al. (2019) included a handcrafted conversion of the four SWIR samples to an RGB image, as depicted in Figure 5.3 (bottom). In spite of the low error rates reported in that work, such a manual pre-processing presents several drawbacks. On the one hand, the linear transformation from four to three channels was optimised in terms of the average pixel intensity variance for the bona fides (i.e., intra-class variability, to be minimised) and the corresponding differences between bona fide and attack presentations (i.e., inter-class variability, to be maximised). Such an approach optimises the problem from a human vision perspective and yields a single input which is then processed by different CNN models, each of which will in principle learn different features from the input data. On the other hand, a single transformation is applied to the whole image, even though the finger may be non-uniformly illuminated, as shown in Figure 5.2, and some PAI species may cover only part of the finger.

In contrast to the aforementioned handcrafted conversion, we propose to let the network itself convert the four grey-scale input channels into RGB images (i.e., tensors comprising three channels). This way, the network can apply different linear and non-linear combinations to each region of the image and learn the most suitable features for the subsequent layers. To that end, we include at the beginning of each CNN model the pre-processing module shown in Figure 5.3 (top). This new convolutional layer takes a four-channel tensor as input, uses a stride of one in order to preserve the image size, and has a filter of size P × P px. The value of P needs to be optimised ad hoc for each model, since different CNN models may learn features at different scales during training. In addition, to facilitate convergence, batch normalisation and a ReLU activation function are added to the convolutional layer. The corresponding parameters are trained together (i.e., end-to-end) with the last layers of the pre-trained models, or with the full residual network trained from scratch, so that the updates can propagate through the whole network in each training epoch.
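The learned channel conversion can be illustrated with a minimal NumPy sketch (not the actual Keras layer used in this work), taking the simplest case P = 1, i.e., a pointwise convolution in which every pixel's three output values are independent linear combinations of its four SWIR values. Batch normalisation is omitted for brevity, and all names and values are purely illustrative:

```python
import numpy as np

def swir_to_rgb(x, weights, bias):
    """Learnable 4-to-3 channel conversion, sketched for P = 1
    (a pointwise convolution): each output pixel is a linear
    combination of the four SWIR channels at that pixel, followed
    by a ReLU non-linearity.  Batch normalisation is omitted.

    x       : (H, W, 4) SWIR image
    weights : (4, 3) channel-mixing matrix (learned during training)
    bias    : (3,) bias vector
    """
    out = x @ weights + bias        # per-pixel linear combination
    return np.maximum(out, 0.0)     # ReLU activation

# toy example: a 2 x 2 four-channel patch of ones
x = np.ones((2, 2, 4))
w = np.full((4, 3), 0.25)           # e.g., average the four SWIR bands
b = np.zeros(3)
rgb = swir_to_rgb(x, w, b)
print(rgb.shape)                    # (2, 2, 3)
```

For P > 1, each output pixel would additionally depend on a P × P spatial neighbourhood, which is precisely why P must be tuned per model.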

CNN Models

We consider five different CNN models, whose architectures are shown in Figure 5.4. First, we analyse the three models studied in Tolosana et al. (2019), namely: (i) the 5-layer ResNet, (ii) a reduced version of MobileNet (Howard et al. 2017), and (iii) the pre-trained VGG19 (Simonyan and Zisserman 2015). In addition, we also study two further CNN architectures: (iv) a reduced version of MobileNetV2 (Sandler et al. 2018), which is an improved version of MobileNet, and (v) the pre-trained VGGFace (VGG16) (Parkhi, Vedaldi, and Zisserman 2015), which has been trained on facial images (thus containing skin) instead of the more general ImageNet database used for the remaining pre-trained models. All strategies have been implemented under the Keras framework using TensorFlow as backend, with an NVIDIA GeForce GTX 1080 GPU. The Adam optimiser is used with a learning rate of 0.0001 and a loss function based on binary cross-entropy.
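For reference, the binary cross-entropy loss mentioned above can be written out explicitly. The following NumPy sketch (the label convention and names are our own, not taken from this work) computes it for sigmoid outputs:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy loss for a two-class PAD problem.
    y_true holds the ground-truth labels (0/1; which class maps
    to which label is an assumption here), y_pred the sigmoid
    outputs of the network."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # numerical stability
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

# two confident, correct predictions yield a low loss
loss = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
print(round(loss, 4))
```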


FIGURE 5.4 CNN architectures. From left to right: (a) the residual CNN trained from scratch using only the SWIR fingerprint database (319,937 parameters); (b) the pre-trained MobileNet-based model (815,809 parameters); (c) the pre-trained MobileNetV2-based model (437,985 parameters, see Figure 5.5 for details on the bottlenecks); (d) the pre-trained VGG19-based model (20,155,969 parameters); and (e) the pre-trained VGGFace-based model (20,155,969 parameters). All pre-trained models are adapted using transfer learning techniques over the last white-background layers. Also, the first convolutional layer (purple) (i.e., “InputProc”, see Figure 5.3) is trained for all networks. This figure is extracted from Gomez-Barrero and Busch (2019).

Residual Network Trained from Scratch: As already pointed out, the first approach is focused on training a residual CNN (He et al. 2015) from scratch.

A residual connection consists of reinjecting previous representations into the downstream flow of data by adding a past output tensor to a later output tensor. These connections help prevent information loss along the data-processing flow and allow the use of deeper DNN architectures, decreasing their training time significantly (He et al. 2015; Szegedy, Ioffe, and Vanhoucke 2016). The five-layer ResNet utilised is depicted in Figure 5.4 (left). As may be observed, in order to be able to train it from scratch with a small training set, it comprises only five layers. In addition, two residual connections with pointwise convolutions are added. Batch normalisation is also applied right after each convolution and before the ReLU activation in order to facilitate convergence.

MobileNets and Transfer Learning: The main feature of both MobileNet (Howard et al. 2017) and MobileNetV2 (Sandler et al. 2018) is the use of depthwise separable convolutions. These layers perform a spatial convolution on each channel of their input independently, before mixing the output channels via a pointwise (i.e., 1 × 1) convolution. This is conceptually equivalent to separating the learning of spatial features, which will show correlations within an image, from the learning of channel-wise features, given the relative independence of each channel in an image. An additional advantage of this type of convolution is that it requires fewer parameters and computations, thereby allowing a fast training using less data. In both MobileNet networks, downsampling is directly applied by the convolutional layers that have a stride of 2 (represented by /2 in Figure 5.4), instead of adding some kind of pooling between layers.
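The parameter savings of depthwise separable convolutions are easy to quantify: a standard k × k convolution learns a joint spatial-and-channel filter, while the separable variant splits this into a per-channel spatial part plus a 1 × 1 channel-mixing part. A small sketch (bias terms ignored; function names are ours):

```python
def standard_conv_params(k, c_in, c_out):
    # a k x k kernel applied jointly over all input channels,
    # once per output channel
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # depthwise part: one k x k spatial filter per input channel,
    # followed by a 1 x 1 pointwise convolution to mix channels
    return k * k * c_in + c_in * c_out

# example: a 3 x 3 convolution from 64 to 128 channels
print(standard_conv_params(3, 64, 128))   # 73728
print(separable_conv_params(3, 64, 128))  # 8768
```

In this example the separable variant needs roughly 8× fewer parameters, which is why these layers can be trained quickly on comparatively small datasets.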

Compared to MobileNet, the main contribution of MobileNetV2 is the use of residual connections and inverted bottlenecks (see Figures 5.4 and 5.5). These blocks model the hypothesis that the manifold of interest, on which the discriminative information extracted by the internal layers of the network lies, has a low dimensionality. To account for this, linear bottleneck layers are introduced in the model, and the residual connections are established between the aforementioned bottlenecks (i.e., in contrast to more common approaches where the residuals connect layers with a higher number of filters or output channels).
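The shape bookkeeping of such an inverted bottleneck can be sketched as follows (a simplified illustration assuming "same" padding and spatial dimensions divisible by the stride; the function and its example values are ours, not taken from this work):

```python
def inverted_bottleneck_shapes(h, w, c_in, t, c, s):
    """Shape bookkeeping for a MobileNetV2-style bottleneck residual
    block (cf. Figure 5.5): expand the channels by factor t with a
    1x1 convolution, apply a 3x3 depthwise convolution with stride s,
    then project back down to c channels with a *linear* 1x1
    convolution.  A residual connection is only added when input and
    output shapes match, i.e. when s == 1 and c == c_in."""
    expanded  = (h, w, t * c_in)             # 1x1 expansion
    depthwise = (h // s, w // s, t * c_in)   # 3x3 depthwise, stride s
    projected = (h // s, w // s, c)          # linear 1x1 projection
    has_residual = (s == 1 and c == c_in)
    return expanded, depthwise, projected, has_residual

# illustrative block: 56x56 input, 24 channels, expansion factor t = 6
print(inverted_bottleneck_shapes(56, 56, 24, t=6, c=24, s=1))
```

Note the "inverted" structure: the wide representation (here 144 channels) lives inside the block, while the residual connects the narrow bottlenecks at its ends.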

Finally, given the depth of both MobileNet models and the limited amount of data available, out of the 13 blocks of MobileNet, we decided to keep only eight. Similarly, out of the 16 bottlenecks of MobileNetV2, 12 are used. In addition, the last two blocks (depicted in white) are re-trained.

VGGs and Transfer Learning: Two different VGG-based models have been studied, VGG19 (Simonyan and Zisserman 2015) and VGGFace (Parkhi, Vedaldi, and Zisserman 2015). These networks are older and simpler than the MobileNets; however, due to its simplicity, VGG19 is still one of the most popular network architectures, providing very good results in a wide range of competitions. In fact, VGG19 showed a superior performance with respect to MobileNet for fingerprint PAD in Tolosana et al. (2019).

FIGURE 5.5 Three-layer structure of the bottleneck residual block of MobileNetV2, where t denotes the expansion factor, and c and s the number of filters and stride of the last convolutional layer. This figure is extracted from Gomez-Barrero and Busch (2019).

Both VGG-inspired models consist of blocks of two to four convolutional layers separated by max pooling layers to reduce the dimensionality of the data, and thereby facilitate convergence during the training stage. Whereas VGG19 comprises 19 different layers, VGGFace is based on the smaller VGG16 model, including 16 layers. In addition, the latter has been trained on facial databases acquired in the wild (i.e., modelling realistic scenarios, in opposition to controlled environments with frontal poses and fixed illumination). Furthermore, VGG19 has been pre-trained on a multi-class task (ImageNet), in contrast to the two-class problem of face recognition for VGGFace. For our study, the last fully connected layers have been replaced with two fully connected layers (with a final sigmoid activation function). In addition, the last three convolutional layers, depicted in white in Figure 5.4, are re-trained in both models.

It should finally be noted that the fully connected layers trained on ImageNet (Krizhevsky, Sutskever, and Hinton 2012) or on facial classification tasks have been removed from all MobileNet- and VGG-based architectures and substituted by new fully connected layers with sigmoid activation functions for the binary classification task.
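The freeze-and-replace strategy described above can be summarised with a small, framework-free sketch (the layer names and the helper function are purely illustrative, not the real VGG layer names):

```python
def configure_transfer_learning(layer_names, n_retrain):
    """Toy sketch of the transfer-learning setup: all pre-trained
    layers are frozen except the last n_retrain ones, and a new
    binary classification head (two fully connected layers ending
    in a sigmoid) is appended.  Each entry is (name, trainable)."""
    frozen    = [(name, False) for name in layer_names[:-n_retrain]]
    retrained = [(name, True)  for name in layer_names[-n_retrain:]]
    head      = [("fc_1", True), ("fc_2_sigmoid", True)]
    return frozen + retrained + head

# illustrative 5-layer backbone with the last 3 layers re-trained
layers = ["conv1", "conv2", "conv3", "conv4", "conv5"]
config = configure_transfer_learning(layers, n_retrain=3)
for name, trainable in config:
    print(name, trainable)
```

In an actual Keras implementation this corresponds to setting each layer's `trainable` attribute before compiling the model; only the trainable layers (plus the "InputProc" layer at the front) receive gradient updates.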

Score Level Fusion

As already observed in Tolosana et al. (2019), some CNN models are more robust to specific PAI species than others. Therefore, fusing the final PAD scores output by several models yields a higher detection performance. In our case (see Section 5.6 for more details), we found that the optimal results are achieved by fusing three different models: VGGFace, VGG19, and MobileNetV2. Therefore, we define the final PAD score as follows:

s = α · s_VGGFace + β · s_VGG19 + (1 − α − β) · s_MobileNetV2

where α and β (with α + β < 1) are the weights assigned to VGGFace and VGG19, respectively, and the remaining weight, 1 − α − β, corresponds to MobileNetV2.
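The fusion rule can be sketched directly in code (the weight values below are illustrative only; the optimised values are discussed in Section 5.6):

```python
def fused_pad_score(s_vggface, s_vgg19, s_mobilenetv2, alpha, beta):
    """Weighted score-level fusion of the three best CNN models:
    alpha and beta weight VGGFace and VGG19, and the remaining
    weight 1 - alpha - beta is assigned to MobileNetV2."""
    assert 0 <= alpha and 0 <= beta and alpha + beta <= 1
    return (alpha * s_vggface
            + beta * s_vgg19
            + (1 - alpha - beta) * s_mobilenetv2)

# illustrative scores and weights, not the optimised values
score = fused_pad_score(0.8, 0.1, 0.9, alpha=0.5, beta=0.3)
print(score)
```

Since all three individual scores lie in [0, 1] (sigmoid outputs) and the weights form a convex combination, the fused score also lies in [0, 1] and can be thresholded like any single-model PAD score.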
