# THE PROPOSED FRCNN-GAN MODEL

The working principle involved in the presented FRCNN-GAN model is shown in Figure 9.1.

First, the FRCNN-GAN model acquires images from public places using the OpenMV Cam M7 Smart Vision Camera, which captures the images and stores them in memory. Next, the model performs face recognition using the Faster R-CNN model, which identifies the faces in the captured image. The GAN-based face sketch synthesis (FSS) module then synthesizes a face sketch from each recognized face. Finally, the generated face sketch is compared with the sketches in the forensic databases, and the most relevant image is identified.

FIGURE 9.1 Block diagram of FRCNN-GAN model.

## Data Collection

At the data collection stage, the proposed method uses a 5G-enabled IoT device, the OpenMV Cam M7 Smart Vision Camera. It carries an OV7725 image sensor that can capture 640 × 480 8-bit greyscale images or 320 × 240 16-bit RGB565 images at 30 FPS.

The OpenMV camera has a 2.8-mm lens on a standard M12 lens mount. It includes a microSD card socket that supports reads and writes at 100 Mb/s, and an SPI bus that runs at up to 54 Mb/s and allows easy streaming of the image data. A sample image is shown in Figure 9.2.

FIGURE 9.3 Overall architecture of Faster RCNN model.

## Faster R-CNN-Based Face Recognition

Faster R-CNN is an improved version of R-CNN that is faster and more accurate. Its main change is the use of a CNN, known as the region proposal network (RPN), to generate object proposals in place of the Selective Search used in the earlier versions. At a high level, the RPN first applies a base CNN, VGG-19, to extract features from the image. The RPN takes the image feature map as input and outputs a set of object proposals, each with an objectness score. A small network assigns a set of object classification scores and bounding boxes directly to every object location. Figure 9.3 shows the overall architecture of the Faster R-CNN model. The steps involved in Faster R-CNN are as follows:

• An image is passed through VGG-19, which outputs a feature map for the image. The RPN is applied to this feature map and returns the object proposals together with their objectness scores. ^{[1]}

• Finally, the proposals are passed to a fully connected (FC) layer, which is topped by a softmax layer and a linear regression layer that classify the objects and output their bounding boxes.

The RPN begins with the input image being fed into the backbone CNN. The input image is first resized so that its shorter side is 600 px and its longer side does not exceed 1,000 px. The output features of the backbone network are generally much smaller than the input image, depending on the stride of the backbone network. The backbone network used in this work is VGG16, whose stride is 16. This means that two consecutive pixels in the backbone output features correspond to two points 16 pixels apart in the input image. For every point in the feature map, the network learns whether an object is present in the input image at the corresponding location and estimates its size. This is done by placing a set of “anchors” on the input image for each location on the output feature map of the backbone network. These anchors indicate possible objects of different sizes and aspect ratios at that location. In total, nine anchors, in three different aspect ratios and three different sizes, are placed on the input image for a point A on the output feature map. The anchors have three scales of box area, $128^2$, $256^2$, and $512^2$, and three aspect ratios, 1:1, 1:2, and 2:1.
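As a rough illustration of the anchor layout described above, the following sketch generates the nine anchors for a single feature-map location. It assumes a stride of 16, that each anchor keeps an area of scale squared while its height-to-width ratio varies, and an (x1, y1, x2, y2) box convention; the function name `anchors_at` is hypothetical and not part of the original model description.

```python
import math

def anchors_at(fx, fy, stride=16, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return the nine (x1, y1, x2, y2) anchors centred on feature-map cell (fx, fy).

    Assumption: ratio means height / width, and every anchor keeps area scale**2.
    """
    # Centre of the feature-map cell projected back onto the input image.
    cx = fx * stride + stride / 2.0
    cy = fy * stride + stride / 2.0
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)  # width shrinks as the box gets taller
            h = s * math.sqrt(r)  # height grows with the ratio, so w * h == s**2
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# Nine anchors for the feature-map cell in column 10, row 5.
anchors = anchors_at(10, 5)
```

Repeating this for every cell of an H × W feature map would give H × W × 9 anchors in total.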

As the network moves through each pixel in the output feature map, it checks whether the $k$ corresponding anchors spanning the input image actually contain objects, and it refines the coordinates of these anchors to give bounding boxes as “object proposals”, or regions of interest. First, a 3 × 3 convolutional layer with 512 units is applied to the backbone feature map to give a 512-d feature map for every location. This is followed by two sibling layers: a 1 × 1 convolutional layer with 18 units for object classification, and a 1 × 1 convolutional layer with 36 units for bounding box regression. The 18 units in the classification branch give an output of size (H, W, 18). This output gives the probability that each point in the backbone feature map contains an object within each of the nine anchors at that point. The 36 units in the regression branch give the four regression coefficients of each of the nine anchors for every point in the backbone feature map. These regression coefficients are used to refine the coordinates of the anchors that contain objects.
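To make the role of the four regression coefficients concrete, the sketch below applies them to one anchor. The text does not spell out the parameterization, so the code assumes the one commonly used with Faster R-CNN (centre shifts scaled by the anchor size, and log-scale width and height changes); `decode_box` is a hypothetical helper name.

```python
import math

def decode_box(anchor, deltas):
    """Refine an (x1, y1, x2, y2) anchor with coefficients (tx, ty, tw, th).

    Assumption: tx, ty shift the centre in units of the anchor size, and
    tw, th rescale width and height on a log scale.
    """
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + aw / 2.0, ay1 + ah / 2.0
    tx, ty, tw, th = deltas
    cx = acx + tx * aw       # shifted centre
    cy = acy + ty * ah
    w = aw * math.exp(tw)    # rescaled size
    h = ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Shift a 128 x 128 anchor right by a quarter of its width and halve its height.
refined = decode_box((104.0, 24.0, 232.0, 152.0), (0.25, 0.0, 0.0, math.log(0.5)))
```

With all four coefficients equal to zero, the anchor is returned unchanged, which is why a regression branch that predicts small values only nudges the anchors.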

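• An anchor is considered “negative” when its IoU with all ground-truth boxes is less than 0.3. The remaining anchors (neither positive nor negative) are ignored during RPN training.

• The training loss for the RPN is a multitask loss given by:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

Here, $i$ is the index of an anchor, $p_i$ is its predicted probability of containing an object, and $p_i^*$ is its ground-truth label. $L_{cls}$ is the classification loss, $N_{cls}$ and $N_{reg}$ are normalization terms, and $\lambda$ balances the two terms.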

• The regression loss $L_{reg}(t_i, t_i^*)$ is activated only when the anchor contains an object, that is, when the ground truth $p_i^*$ is 1. The term $t_i$ is the output prediction of the regression layer and contains four variables $[t_x, t_y, t_w, t_h]$.

• The regression coefficients are applied to the anchors for accurate localization and yield the refined bounding boxes.

• All boxes are sorted by their *cls* scores. Next, nonmaximum suppression (NMS) is applied with a threshold of 0.7. Going from the top down, the bounding boxes that have an IoU higher than 0.7 with another bounding box are discarded. Thus, the highest-scoring bounding box is kept for each group of overlapping boxes.
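The anchor labelling and NMS steps in the list above can be sketched in a few lines of plain Python. This is only an illustrative version: the 0.3 negative threshold and the 0.7 NMS threshold come from the text, while the 0.7 positive threshold, the simplified labelling rule, and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, pos=0.7, neg=0.3):
    """Return 1 (positive), 0 (negative), or -1 (ignored during training).

    Assumption: an anchor is positive when its best IoU exceeds `pos`.
    """
    best = max(iou(anchor, g) for g in gt_boxes)
    if best > pos:
        return 1
    return 0 if best < neg else -1

def nms(boxes, scores, threshold=0.7):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if no higher-scoring kept box overlaps it too much.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.82, so only the first and third boxes survive.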

The Fast R-CNN detector consists of a CNN (usually pretrained on the ImageNet classification task) whose last pooling layer is replaced by an “ROI pooling” layer and whose last FC layer is replaced by two branches: a $(K + 1)$-category softmax layer branch and a category-specific bounding box regression branch.

• The input image is first passed through the backbone CNN to generate the feature map. Besides test-time efficiency, another key reason for using an RPN as the proposal generator is the benefit of *weight sharing between the RPN and Fast R-CNN detector backbones*.

• Next, the bounding box proposals from the RPN are used to pool features from the backbone feature map. This is done by the ROI pooling layer, which works by (a) taking the region corresponding to a proposal from the backbone feature map; (b) dividing this region into a fixed number of sub-windows; and (c) performing max-pooling over these sub-windows to give a fixed-size output.

ROI pooling is a neural network layer used for object detection. It was first proposed by Ross Girshick in April 2015. It is an operation widely used in CNN-based object detection; its aim is to perform max-pooling on inputs of nonuniform sizes to obtain fixed-size feature maps (e.g., 7 × 7). It speeds up both training and testing while maintaining high detection accuracy. The output of the ROI pooling layer has a size of ($N$, 7, 7, 512), where $N$ is the number of proposals from the region proposal algorithm. After passing through the two FC layers, the features are fed into the sibling classification and regression branches. Note that these classification and detection branches differ from those of the RPN. Here, the classification layer has $C$ units, one for each class in the detection task. The features are passed through a softmax layer to obtain the classification scores, that is, the probability of a proposal belonging to each class.
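The three ROI pooling steps (take the region, split it into a fixed grid of sub-windows, max-pool each one) can be sketched with NumPy as follows. This is a minimal illustration, not the layer used in the model: the function name, the integer feature-map coordinates with exclusive upper bounds, and the way uneven sub-window boundaries are rounded are all assumptions.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=7):
    """Max-pool region roi = (x1, y1, x2, y2) of an (H, W, C) map to (out_size, out_size, C).

    Coordinates are in feature-map cells; x2 and y2 are exclusive.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2, :]
    h, w, c = region.shape
    # Sub-window boundaries; windows differ by a cell when sizes do not divide evenly.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size, c), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Guarantee at least one cell per sub-window for small regions.
            y_end = max(ys[i + 1], ys[i] + 1)
            x_end = max(xs[j + 1], xs[j] + 1)
            out[i, j] = region[ys[i]:y_end, xs[j]:x_end].max(axis=(0, 1))
    return out

# Toy 14 x 14 feature map with 2 channels, pooled over its full extent.
feature_map = np.arange(14 * 14 * 2).reshape(14, 14, 2)
pooled = roi_max_pool(feature_map, (0, 0, 14, 14))
```

Stacking the pooled outputs of $N$ proposals from a 512-channel feature map would give the ($N$, 7, 7, 512) tensor mentioned above.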

## GAN-Based Synthesis Process

First, the notation for face sketch synthesis (FSS) is defined. Given a test (observed) image $t$, the objective is to generate the result $s$ from $M$ pairs of training face sketches and photos. The conditional GAN learns a nonlinear mapping from the test image $t$ and a random noise vector $z$ to the result $s$, $\mathcal{G}: \{t, z\} \to s$, rather than $\{z\} \to s$ as the original GAN does. The generator $\mathcal{G}$ is trained to produce results that cannot be distinguished from “real” images by a discriminator $\mathcal{D}$, which is trained to detect the generator’s “fakes”.

The objective of the conditional GAN is written as follows:

where $\lambda$ balances the GAN loss and the regularization loss, and the GAN loss is defined as follows:

The regularization loss, which encourages less blurring, is represented as follows:
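For orientation, the three expressions referred to above can be sketched in the commonly used conditional GAN form, which the description appears to follow. Here $\mathcal{G}$ is the generator, $\mathcal{D}$ is the discriminator, and $t$, $s$, and $z$ are the test image, the sketch, and the noise vector. The expectation notation, the symbol names $\mathcal{L}_{cGAN}$ and $\mathcal{L}_{L1}$, and the choice of an L1 distance as the regularization term are assumptions.

```latex
% Overall objective: GAN loss plus lambda-weighted regularization loss
\mathcal{G}^{*} = \arg\min_{\mathcal{G}} \max_{\mathcal{D}} \;
    \mathcal{L}_{cGAN}(\mathcal{G}, \mathcal{D}) + \lambda \, \mathcal{L}_{L1}(\mathcal{G})

% Conditional GAN loss: D sees the input t together with a real or generated sketch
\mathcal{L}_{cGAN}(\mathcal{G}, \mathcal{D}) =
    \mathbb{E}_{t,s}\big[\log \mathcal{D}(t, s)\big]
  + \mathbb{E}_{t,z}\big[\log\big(1 - \mathcal{D}(t, \mathcal{G}(t, z))\big)\big]

% Regularization loss (assumed L1 distance, which encourages less blurring)
\mathcal{L}_{L1}(\mathcal{G}) =
    \mathbb{E}_{t,s,z}\big[\lVert s - \mathcal{G}(t, z) \rVert_{1}\big]
```

A larger $\lambda$ pulls the generated sketch closer to the training sketch pixel by pixel, while a smaller one leaves more freedom to the adversarial term.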

The generator and discriminator architectures are adapted from existing designs that use modules of the form Convolution-BatchNorm-ReLU.

The sketch synthesized by the GAN preserves fine texture. However, noise appears along with the fine texture because of the pixel-to-pixel mapping. To remove this noise, the synthesized sketch $s$ is projected back onto the training sketches. All face images are aligned and cropped to the same size (250 × 200) based on the eye centers and the mouth center.

Assume $X_1, \ldots, X_M$ denote the $M$ training sketches. First, every training sketch and the sketch $s$ are split into patches (patch size: $p$) with an overlap (overlap size: $o$) between neighboring patches. Let $s_{i,j}$ denote the $(i, j)$th patch of $s$, where $1 \le i \le R$ and $1 \le j \le C$. Here, $R$ and $C$ refer to the number of patches along the rows and columns of an image, respectively. As the synthesized sketch $s$ has a texture very similar to that of the training sketches, the sketch $s$ is reconstructed in a data-driven manner based on the Euclidean distance between image patches.
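The patch grid described above can be illustrated with a short sketch that computes where the overlapping patches start along each image axis, and hence $R$ and $C$. The patch size of 10 and overlap of 5 are hypothetical values chosen only for the example, as is the rule of adding one extra patch flush with the border; the 250 × 200 image size is taken from the text, assuming 250 is the number of rows.

```python
def patch_origins(length, p, o):
    """Start coordinates of patches of size p with overlap o along one image axis."""
    step = p - o
    starts = list(range(0, length - p + 1, step))
    # Assumption: add a final patch flush with the border if the grid stops short.
    if starts[-1] != length - p:
        starts.append(length - p)
    return starts

# Hypothetical patch size and overlap for a 250 x 200 sketch.
rows = patch_origins(250, p=10, o=5)
cols = patch_origins(200, p=10, o=5)
R, C = len(rows), len(cols)
```

The patch $s_{i,j}$ then covers rows `rows[i-1]` to `rows[i-1] + p - 1` and columns `cols[j-1]` to `cols[j-1] + p - 1` of the sketch.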

To reconstruct a patch $s_{i,j}$, the method first searches for the $K$ nearest neighbors among the training sketches $X_1, \ldots, X_M$ around the location $(i, j)$ with respect to the Euclidean distance between patch intensities. Because different face sketches are misaligned, the search region around the position $(i, j)$ is extended by $l$ pixels in the top, bottom, left, and right directions. This gives $(2l + 1) \times (2l + 1)$ patches to match on each training sketch. For the sketch patch $s_{i,j}$, $K$ candidate neighbors are chosen from all $M(2l + 1)^2$ training sketch patches, denoted as $X_{i,j}^{1}, \ldots, X_{i,j}^{K}$. The reconstruction is formulated as a simple regularized linear least-squares problem, as given in equation (9.5):

where $w_{i,j} = (w_{i,j}^{1}, \ldots, w_{i,j}^{K})^{T}$ is the reconstruction weight. It has a closed-form solution, as given in equation (9.6):

where $X_{i,j} \in \mathbb{R}^{p^2 \times K}$ is the matrix of the $K$ neighbors and $\mathbf{1}$ is the vector of all 1s. The sketch patch $s_{i,j}$ is reconstructed as given in equation (9.7):

Finally, all reconstructed patches $s_{i,j}$ ($1 \le i \le R$, $1 \le j \le C$) are assembled into a complete sketch $s$ by averaging the overlapping areas.
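The neighbor search and patch reconstruction can be sketched with NumPy as below. Because the exact form of equations (9.5) to (9.7) is not reproduced here, the code assumes one common formulation that fits the description: the weights minimize the regularized residual $\lVert s_{i,j} - X_{i,j} w_{i,j} \rVert^2 + \lambda \lVert w_{i,j} \rVert^2$ subject to $\mathbf{1}^T w_{i,j} = 1$, which has the closed form $w_{i,j} \propto (G + \lambda I)^{-1} \mathbf{1}$ with $G = (s_{i,j}\mathbf{1}^T - X_{i,j})^T (s_{i,j}\mathbf{1}^T - X_{i,j})$. The function name, the default values of `K` and `lam`, and the sum-to-one constraint are assumptions.

```python
import numpy as np

def reconstruct_patch(s_patch, candidates, K=5, lam=1e-3):
    """Rebuild one flattened sketch patch from its K nearest training patches.

    s_patch:    (p*p,) intensities of the GAN-synthesized patch.
    candidates: (n, p*p) training patches gathered from the search region.
    Assumption: weights minimise the regularised residual and sum to one.
    """
    # K nearest neighbours by Euclidean distance between patch intensities.
    dists = np.linalg.norm(candidates - s_patch, axis=1)
    nearest = np.argsort(dists)[:K]
    X = candidates[nearest].T                   # (p*p, K) neighbour matrix
    # With weights summing to one, the residual equals (s 1^T - X) w.
    D = s_patch[:, None] - X
    G = D.T @ D + lam * np.eye(X.shape[1])      # regularised Gram matrix
    w = np.linalg.solve(G, np.ones(X.shape[1]))
    w /= w.sum()                                # enforce 1^T w = 1
    return X @ w, w

# Toy example: 50 random 4 x 4 candidate patches, one of which equals the query.
rng = np.random.default_rng(0)
candidates = rng.random((50, 16))
query = candidates[3].copy()
patch, weights = reconstruct_patch(query, candidates, K=5, lam=1e-6)
```

In a full pipeline, this function would be called once per patch position $(i, j)$, and the reconstructed patches would then be averaged where they overlap.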

*[1] An ROI pooling layer is used in this method to reduce all proposals to the same size.*