EXPERIMENTAL EVALUATION

Baseline: Handcrafted RGB Conversion

The DET curves for each CNN model are plotted in Figure 5.6. In all cases, the handcrafted RGB conversion is depicted with dashed dark blue lines, and different filter sizes P in the range [5, 50] are shown as thin solid lines. In addition, the best configuration in terms of P and the corresponding APCER0.2% value is highlighted with a thicker solid line.

Compared to the results presented by Tolosana et al. (2019) and Gomez-Barrero and Busch (2019), we may observe a detection performance drop for MobileNet and ResNet. This is due to the increased challenge posed by the newly acquired database: the number of bona fide samples has been multiplied by two and the number of PA samples by eight. In addition, the focus of PA fabrication has now been placed on the most challenging attacks. Therefore, the APCER0.2% has increased from 19.91% to 47.20% for MobileNet and from 6.79% to 48.99% for ResNet.

FIGURE 5.6 DET curves for each individual CNN model approach [handcrafted (RGB) and proposed], and different filter sizes P. (a) ResNet from scratch; (b) MobileNet; (c) MobileNetV2; (d) VGG19; (e) VGGFace.

On the other hand, the higher resolution of the SWIR images captured with the new sensor leads to a considerable improvement for the remaining three CNN models. The new data allow the standard input size of 224 × 224 px to be used for all CNNs, instead of the reduction to 58 × 58 px or 118 × 118 px used in the previous works. This in turn results in APCER0.2% values of 12.63% for VGG19, 44.02% for VGGFace, and 83.99% for MobileNetV2, whereas a BPCER of 0.2% could not even be reached by those models in Gomez-Barrero and Busch (2019) (i.e., APCER0.2% = ∞ for all three models).
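The APCER0.2% operating point used throughout this evaluation (the proportion of attack presentations accepted when the decision threshold is set so that at most 0.2% of bona fide presentations are rejected) can be computed from raw PAD scores roughly as follows. This is a minimal sketch, not the authors' evaluation code; the function and variable names are illustrative, and we assume higher scores indicate attacks:

```python
import numpy as np

def apcer_at_bpcer(bona_fide_scores, attack_scores, bpcer_target=0.002):
    """APCER at a fixed BPCER (ISO/IEC 30107-3 style operating point).

    Assumes higher PAD scores indicate attack presentations, so a sample
    is classified as an attack when its score reaches the threshold.
    """
    bona = np.sort(np.asarray(bona_fide_scores))
    # BPCER is the fraction of bona fide samples wrongly classified as
    # attacks (score >= threshold); the (1 - target) quantile of the bona
    # fide scores approximately meets the target rejection rate.
    threshold = np.quantile(bona, 1.0 - bpcer_target)
    # APCER is the fraction of attacks accepted as bona fide presentations.
    apcer = float(np.mean(np.asarray(attack_scores) < threshold))
    return apcer, float(threshold)
```

Sweeping `bpcer_target` over a range of values yields the DET curves of Figure 5.6.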

Input Pre-Processing Optimisation

In spite of the aforementioned enhancement, the detection rates for the handcrafted RGB conversion are far from the state of the art, with the only exception of VGG19. However, this changes when the input pre-processing module described in Section 5.4.2.1 is included in the CNN models. As may be observed in Figure 5.6, the APCER0.2% values are improved for all filter sizes P shown, reaching values below 3% for VGGFace and MobileNetV2.

For the particular case of the ResNet trained from scratch (Figure 5.6a), the APCER0.2% can be reduced to 14.08% for P = 5 (i.e., a relative improvement of 73%). In addition, the smaller the value of P, the bigger the improvement. On the other hand, the best detection performance for low APCERs is reached for P = 7. Therefore, depending on the application scenario (i.e., whether convenience is preferred over security, or vice versa), different P values could be selected.

Regarding MobileNet (Figure 5.6b), the APCER0.2% can be further decreased to 10.61% for P = 11 (i.e., a 78% relative improvement), at the cost of not being able to achieve APCERs under 0.5% for all P ≠ 13. On the other hand, even if the performance of MobileNetV2 is considerably worse than that of MobileNet for the handcrafted RGB conversion, in this case a state-of-the-art APCER0.2% of 2.46% (i.e., a 97% relative improvement) can be obtained for P = 9. The performance for low APCERs (e.g., of 0.2%) is also optimised for the same filter size.

Finally, we can see in Figure 5.6d that VGG19 achieves an APCER0.2% of 3.04% for P = 1 (i.e., a 76% relative improvement), similar to the performance reported for MobileNetV2. In addition, its BPCER remains lower than that of MobileNetV2 for low APCERs, thereby yielding a more stable system across different operating points. This is also the case for VGGFace, which achieves an APCER0.2% of 2.74% for P = 20 (i.e., a 94% relative improvement) and an even lower BPCER of around 3% for any APCER < 0.2%. We may thus conclude that the VGG-based models achieve a higher overall performance for this particular PAD task, as was already pointed out by Tolosana et al. (2019). In addition, since VGGFace has been pre-trained on facial databases, it is able to achieve lower BPCERs than any other CNN model. It does so for a filter size P = 20, in comparison to the smaller filter sizes between 5 and 11 found to be optimal for the remaining models. This means that VGGFace focuses on features captured at a lower resolution and will therefore complement the other models in an eventual fusion to achieve more robust results.

Final Fused System

Given the similarities between both MobileNet models and the clearly superior performance of MobileNetV2, only the latter model is further considered for a score-level fusion. Similarly, the ResNet trained from scratch reports, together with MobileNet, the worst results among the CNN models tested, since it has not been able to deal with the larger images using only five layers. Therefore, it is also excluded from the final fused scheme.

Keeping those thoughts in mind, only MobileNetV2, VGG19, and VGGFace have been considered for the final fusion. First, the CNN models have been fused on a two-by-two basis, with no significant improvement of the detection accuracy. In contrast, when the three networks are fused with α = 0.18 and β = 0.58 (i.e., the weights are 18% for VGG19, 58% for VGGFace, and 24% for MobileNetV2), the detection performance improves, as shown in Figure 5.7a. In particular, a final APCER0.2% of 1.16% can be achieved. That is, only 24 PA samples are misclassified when only two bona fide samples in 1,000 are wrongly detected as attacks. On the other hand, since VGGFace shows lower error rates than any of the other models for low APCERs, the performance of the fused scheme in that area of the DET plot is worse. As a consequence, if the deployment scenario requires very low APCERs, for instance 0.2%, a fusion of the aforementioned CNN models with different filter sizes can yield better results, as depicted in Figure 5.7b. In this case, a BPCER0.2% of 1.10% is obtained - that is, only 77 bona fide and four attack presentation samples are misclassified. Therefore, depending on the application scenario, different models will be chosen and fused to optimise the performance of the system for the particular case study.
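The weighted score-level fusion described above can be sketched as follows. The weights α = 0.18 (VGG19) and β = 0.58 (VGGFace), with the remaining 24% assigned to MobileNetV2, are taken from the text; the function and parameter names are illustrative, not the authors' implementation:

```python
def fuse_pad_scores(s_vgg19, s_vggface, s_mobilenetv2,
                    alpha=0.18, beta=0.58):
    """Weighted score-level fusion of three CNN PAD scores in [0, 1].

    The third weight is 1 - alpha - beta (0.24 for the values above),
    so the fused score also stays in [0, 1].
    """
    gamma = 1.0 - alpha - beta
    return alpha * s_vgg19 + beta * s_vggface + gamma * s_mobilenetv2

# Example: a sample that only VGGFace classifies correctly as an attack
# still receives 58% of its fused score from that model.
fused = fuse_pad_scores(0.30, 0.95, 0.20)  # ≈ 0.653
```

The convex combination keeps the fused score interpretable as a PAD score, so a single decision threshold can be applied to it directly.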


FIGURE 5.7 DET curves for the score level fusion of the best configurations found in Figure 5.6. (a) Fusion optimal APCER0.2%; (b) fusion optimal BPCER0.2%.


FIGURE 5.8 (a) Bona fide sample and (b) PA sample of a conductive silicone overlay captured at all wavelengths, with the corresponding final fused PAD scores.

Since the main aim of the ODIN Program is the achievement of convenient PAD systems, we will further analyse the APCEs and BPCEs made by the optimal APCER0.2% fusion. Figure 5.8 shows a bona fide sample and a sample of one of the most challenging PAIs for conventional fingerprint capture devices: an overlay made with conductive silicone. As may be observed, the trend shown by the bona fide sample across the acquired wavelengths, with a darkening effect, is not reflected by the conductive silicone material, which thus yields the highest possible PA score: 1.

Now, in order to see to what extent the CNN models complement each other, the PAD scores of all APCEs and the lowest BPCE scores are plotted in Figure 5.9: the fused scores are included on the x axis, and the individual scores for each CNN model on the y axis.

FIGURE 5.9 Score analysis for (a) the BPCEs yielding the highest fused PAD scores, and (b) all APCEs (24). The decision threshold δ for BPCER = 0.2% is depicted with a dashed black line. The fused PAD scores are depicted on the x axis, and the individual CNN scores are included on the y axis.

The decision threshold δ for a BPCER of 0.2% is depicted with a dashed horizontal line: the BPCEs show PAD scores over δ and the APCEs below δ. First, we may see in Figure 5.9a for the BPCEs that the PAD scores reported by VGGFace are always higher than δ and in fact extremely close to the maximum PAD score of 1. In addition, at least one of the other CNN models also misclassifies the sample. Therefore, even if the third model is able to classify the sample as a bona fide presentation, it contributes only 18%-24% of the final score, and the fused scheme is not able to correctly classify the sample. On the other hand, for the 24 APCEs (see Figure 5.9b), VGGFace reports in most cases (14) a PAD score higher than δ (i.e., a correct decision). However, in almost all cases (22), MobileNetV2 outputs a PAD score below δ, and in 14 cases even below 0.4. Similarly, VGG19 yields a PAD score below 0.4 for 18 of the APCE samples. Therefore, given that the threshold δ is set at 0.77 in order to achieve a low BPCER of 0.2%, those samples are not detected as attacks by the fused system. It should also be noted that for all APCEs where the fused score s is lower than 0.2, all CNN models have also reported very low scores, thereby making it infeasible to detect those samples for any reasonable BPCER.
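The arithmetic behind this observation can be checked directly: with the fusion weights from the text (18% VGG19, 58% VGGFace, 24% MobileNetV2) and δ = 0.77, a sample on which both VGG19 and MobileNetV2 score at most 0.4 cannot reach the threshold even if VGGFace outputs the maximum score of 1. A small sketch of this bound (the dictionary keys are illustrative):

```python
DELTA = 0.77  # operating threshold yielding BPCER = 0.2% (from the text)
W = {"vgg19": 0.18, "vggface": 0.58, "mobilenetv2": 0.24}  # fusion weights

# Worst-case APCE pattern discussed above: VGG19 and MobileNetV2 output
# scores of at most 0.4, while VGGFace correctly scores the attack at 1.0.
ceiling = W["vgg19"] * 0.4 + W["vggface"] * 1.0 + W["mobilenetv2"] * 0.4

assert ceiling < DELTA  # 0.748 < 0.77: such a sample evades detection
```

This explains why a single strong model cannot compensate for two weak ones under a convex fusion with a high operating threshold.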

The APCEs are summarised in Table 5.4, and the corresponding samples for each PAI species are presented in Figure 5.10. A significant number of errors stems from the orange playdoh fingers: over 63% of the test samples are not detected. Furthermore, for six of them, the corresponding PAD scores remain below 0.03, and all scores s below 0.2 depicted in Figure 5.9b correspond to this PAI species. In order to detect those samples, the decision threshold δ would have to be placed close to 0, thereby significantly increasing the BPCER of the system. This thus remains an open challenge for the PAD approach described in this chapter. On the other hand, for each of the remaining PAI species reporting some APCEs, only one or two samples are misclassified out of up to 275 samples included in the test set. Therefore, we may conclude that the proposed method is robust against these PAI species. Moreover, one of the main issues reported in Gomez-Barrero and Busch (2019) has now been tackled with the new capture device. In that work, out of the 222 PA samples included in the test set, three APCEs were reported for a full finger made of silicone with conductive coating and a conductive silicone overlay. Now, a total of 83 samples are included in the test set for such full fingers and 232 for conductive silicone overlays, and all those PA samples were correctly detected.

TABLE 5.4
Summary of the APCEs of the Fused Scheme Including the PAI Species

Type          Material            # Samples   # APCEs
Full finger   Playdoh orange      30          19 (63.3%)
Full finger   Dragonskin          275         1 (0.36%)
Overlay       Dragonskin          89          2 (2.2%)
Overlay       School glue white   14          1 (7.1%)
Overlay       Silicone two part   64          1 (1.6%)


FIGURE 5.10 Samples acquired from all the PAI species which are partly not detected by the fused approach. (a) Orange playdoh, s = 0.0040; (b) Dragonskin overlay, s = 0.2253; (c) Silicone overlay, s = 0.2957; (d) Dragonskin finger, s = 0.6039; (e) School glue white overlay, s = 0.7669.

 