Convolutional neural network-based face recognition using non-subsampled shearlet transform and histogram of local feature descriptors

Received Feb 6, 2021; Revised Sep 23, 2021; Accepted Oct 4, 2021

Face recognition has been used in a variety of applications such as preventing retail crime, unlocking phones, smart advertising, finding missing persons, and supporting law enforcement. However, the performance of face recognition techniques degrades substantially under changes in pose, illumination, and expression of the individual. In this paper, a novel face recognition approach based on a non-subsampled shearlet transform (NSST), histogram-based local feature descriptors, and a convolutional neural network (CNN) is proposed. Initially, the Viola-Jones algorithm is used for face detection, and the extracted face region is preprocessed by an image resizing operation. Then, NSST decomposes the input image into low- and high-frequency components. The local feature descriptors, local phase quantization (LPQ) and pyramid of histogram of oriented gradients (PHOG), together with the proposed CNN, are used to extract features from the low-frequency component of the NSST decomposition. The extracted features are fused to generate the feature vector and classified using a support vector machine (SVM). The efficiency of the suggested method is tested on face databases including Olivetti Research Laboratory (ORL), Yale, and Japanese female facial expression (JAFFE). The experimental outcomes reveal that the suggested face recognition method outperforms some of the state-of-the-art recognition approaches.


INTRODUCTION
Face recognition has attracted considerable attention in several areas like surveillance, information security, and entertainment [1]-[3] due to its uniqueness, low cost, and easy accessibility compared to other biometric approaches. Face recognition is the process of recognizing an individual from an available face database [4]. A general face recognition methodology comprises pre-processing, feature extraction, and classification stages. The pre-processing step involves operations like image de-noising, scaling, image registration, face detection, and normalization. In the feature extraction phase, features are obtained for efficient image representation and visual description. Feature extraction plays a major role in computer vision applications like face recognition [5]-[7], texture analysis [8], [9], and sketch synthesis [10]-[12]. A precise image feature should be both robust and discriminative against distinct variations like noise and illumination changes. The last step of the face recognition system is the classification stage, which employs robust classifiers to assign the probe image to an identity in the database.

Non-subsampled shearlet transform
Traditional multiscale methods like wavelets, curvelets, and contourlet transforms are unable to capture the anisotropic features in multidimensional data. Shearlets overcome these problems since they can efficiently represent multidimensional phenomena [44]. For dimension n = 2, the discrete shearlet transform is given by

SH_ψ f(j, l, k) = ⟨f, ψ_{j,l,k}⟩,  ψ_{j,l,k}(x) = |det A|^{j/2} ψ(B^l A^j x − k),  (1)

where {ψ_{j,l,k}} is a family of basis functions in L²(ℝ²), A denotes the anisotropy matrix, B is a shear matrix, and j, l, k are the scale, direction, and shift parameters. Both A and B are invertible matrices of size 2 × 2 with |det B| = 1. For each a > 0 and s ∈ ℝ, the matrices A_a and B_s are given by

A_a = [a 0; 0 √a],  B_s = [1 s; 0 1],  (2)

where A_a controls the scaling of the shearlet and B_s controls its orientation. For a = 4 and s = 1, (2) becomes

A = [4 0; 0 2],  B = [1 1; 0 1],  (3)

which yields the classical shearlet whose generating function ψ̂^(0), for any ξ = (ξ1, ξ2) ∈ ℝ², ξ1 ≠ 0, is given by

ψ̂^(0)(ξ1, ξ2) = ψ̂1(ξ1) ψ̂2(ξ2/ξ1),  (4)

where ψ̂ denotes the Fourier transform of ψ, and ψ̂1 ∈ C^∞(ℝ), ψ̂2 ∈ C^∞(ℝ) are both wavelets. The NSST decomposition consists of multi-scale and multi-directional factorization steps. Multiscale factorization is achieved with the non-subsampled Laplacian pyramid (NSLP), a dual-channel non-subsampled filter bank that preserves the multi-scale property and separates the input image into low- and high-frequency components. Successive NSLP decompositions are applied to the low-frequency component so that singularities in images are captured. Multi-directional factorization, in turn, is realized with improved shearing filters. In our proposed approach, we first detect the face region, resize it to 64×64, and then apply NSST to it. Figure 2(a) shows the input face image, Figure 2(b) the detected face region, and Figure 2(c) the low-frequency sub-band component from the NSST.
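The multiscale half of this pipeline can be illustrated with a minimal sketch of a non-subsampled (undecimated) Laplacian split. Note that the Gaussian smoothing filter below is an assumption for illustration; the paper uses dedicated NSLP filter banks ('pyr', 'kos', 'pyrexc', 'maxflat').

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def nslp_decompose(image, levels=2, sigma=2.0):
    """Sketch of a non-subsampled Laplacian pyramid: each level splits the
    current approximation into a low-frequency component (smoothing, no
    downsampling) and a high-frequency residual, then recurses on the
    low-frequency part."""
    low = image.astype(np.float64)
    highs = []
    for _ in range(levels):
        smoothed = gaussian_filter(low, sigma)
        highs.append(low - smoothed)   # high-frequency detail at this scale
        low = smoothed                 # approximation passed to next level
    return low, highs

face = np.random.rand(64, 64)          # stand-in for a detected 64x64 face
low, highs = nslp_decompose(face, levels=2)
# Because nothing is subsampled, low + all highs reconstructs the input exactly
recon = low + sum(highs)
```

Since no subsampling occurs, every sub-band keeps the 64×64 resolution of the input, which is what makes the decomposition shift-invariant.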

Local phase quantization
LPQ is a well-known local texture feature descriptor used to extract textural details that are robust to blurring [45]. LPQ first performs a short-time Fourier transform (STFT) to obtain the phase for every pixel of the source image and then encodes the corresponding phase information. Finally, the distribution of the encoded values is estimated to obtain the LPQ features. The mathematical description of LPQ is as follows. Let f(x, y) be an original image. Spatially invariant blurring of f(x, y) is modeled by a convolution operation:

g(x, y) = f(x, y) ⊗ h(x, y),  (5)

where g(x, y) is the blurred image, h(x, y) is the point spread function (PSF), and ⊗ represents convolution [46]. The Fourier representation of (5) is given by

G(u, v) = F(u, v) · H(u, v),  (6)

where G(u, v), F(u, v), and H(u, v) are the Fourier transforms of g(x, y), f(x, y), and h(x, y) respectively. The phase of the blurred image then satisfies

∠G(u, v) = ∠F(u, v) + ∠H(u, v),  (7)

where ∠G(u, v), ∠F(u, v), and ∠H(u, v) are the phases of G(u, v), F(u, v), and H(u, v) respectively. When the PSF h(x, y) is centrally symmetric, its phase takes only two values:

∠H(u, v) = 0 if H(u, v) ≥ 0, and π otherwise.  (8)

Thus, the phase invariance between G(u, v) and F(u, v) is obtained as

∠G(u, v) = ∠F(u, v) for all (u, v) such that H(u, v) ≥ 0.  (9)

In LPQ the phase details are evaluated over an M × M neighborhood region N_x of the image f(x, y). These local spectral features are estimated with the STFT

F(u, x) = Σ_{y ∈ N_x} f(x − y) e^{−j2π uᵀ y},  (10)

where x indexes the pixel and N_x its neighborhood region. LPQ evaluates the phase at the frequency points u1 = (α, 0), u2 = (0, α), u3 = (α, α), u4 = (α, −α) using the STFT, where α is a small scalar chosen so that (9) holds. The acquired responses are arranged as

F_x = [F(u1, x), F(u2, x), F(u3, x), F(u4, x)],  G_x = [Re{F_x}, Im{F_x}],  (11)

where Re{·} represents the real part and Im{·} denotes the imaginary part. The textural details are obtained by encoding the elements of G_x as

b(x) = Σ_{j=1}^{8} q_j(x) 2^{j−1},  (13)

where q_j(x) is the quantization of the j-th element g_j(x) of G_x, given by

q_j(x) = 1 if g_j(x) ≥ 0, and 0 otherwise.  (14)

Finally, the LPQ descriptor is the distribution histogram of the encoded values b(x).
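The steps (10)-(14) can be sketched in a few lines of numpy. This is a simplified illustration, not the authors' implementation: the separable STFT kernels and the omission of LPQ's optional decorrelation step are assumptions made for brevity.

```python
import numpy as np

def lpq_descriptor(img, win=7):
    """Simplified LPQ: STFT responses at four frequency points over a
    win x win neighborhood, sign-quantization of real/imaginary parts
    into an 8-bit code, then a 256-bin histogram of the codes."""
    img = img.astype(np.float64)
    alpha = 1.0 / win
    r = (win - 1) // 2
    x = np.arange(-r, r + 1)
    w0 = np.ones_like(x, dtype=complex)          # DC kernel
    w1 = np.exp(-2j * np.pi * alpha * x)         # frequency-alpha kernel

    def conv2_sep(f, w_row, w_col):
        # separable 2-D convolution: rows first, then columns
        tmp = np.apply_along_axis(lambda v: np.convolve(v, w_row, 'same'),
                                  1, f.astype(complex))
        return np.apply_along_axis(lambda v: np.convolve(v, w_col, 'same'),
                                   0, tmp)

    F1 = conv2_sep(img, w1, w0)           # u1 = (alpha, 0)
    F2 = conv2_sep(img, w0, w1)           # u2 = (0, alpha)
    F3 = conv2_sep(img, w1, w1)           # u3 = (alpha, alpha)
    F4 = conv2_sep(img, w1, np.conj(w1))  # u4 = (alpha, -alpha)

    # Stack [Re, Im] of the four responses -> 8 binary planes, as in (11)
    G = np.stack([F1.real, F1.imag, F2.real, F2.imag,
                  F3.real, F3.imag, F4.real, F4.imag])
    bits = (G >= 0).astype(np.int64)              # quantization (14)
    weights = (1 << np.arange(8)).reshape(8, 1, 1)
    codes = (bits * weights).sum(axis=0)          # 8-bit code (13)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()                      # normalized LPQ histogram

feat = lpq_descriptor(np.random.rand(64, 64))
```

Each pixel thus contributes one code in 0..255, and the 256-bin normalized histogram is the blur-insensitive texture feature used downstream.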
In the proposed method, after applying NSST to the detected face image, the obtained low-frequency sub-band component is passed to LPQ to obtain blur-insensitive texture features. The detected face region from the input face image is shown in Figure 3.

Pyramid of histogram of oriented gradients
For effective face recognition, we require shape information along with texture details. To obtain such shape information we apply the PHOG descriptor, which is built from histogram of oriented gradients (HOG) features and a pyramid representation of the image [47]. The HOG descriptor captures the local shape of objects in images, while the pyramid representation addresses spatial structure. The image is split into small regions (cells) and HOG features [48] are computed for every spatial region. The cells are split recursively so that local shape information is preserved. The features extracted from all cells are concatenated across the pyramid structure to incorporate spatial layout. The Canny edge detection algorithm is utilized to identify the edges in the face image, and then the face image is split into cells following the quad-tree concept. Let L be the number of levels and K the number of bins for the HOG features; then the dimension of the PHOG descriptor is K · Σ_{l=0}^{L−1} 4^l. In this work, we choose L = 3 levels (l = 0, 1, 2) and K = 8 bins, so the resultant feature vector has size 168. The PHOG representation of the face image is shown in Figure 4.
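The quad-tree pooling and the 168-dimensional count can be verified with an illustrative sketch. Unlike the paper, which votes with Canny edge responses, this simplified version lets every gradient vote, weighted by its magnitude; that substitution is an assumption for brevity.

```python
import numpy as np

def phog(img, levels=3, bins=8):
    """Illustrative PHOG: orientation histograms of image gradients pooled
    over a quad-tree of cells (1, 4, 16 cells at levels 0, 1, 2), then
    concatenated and L2-normalized."""
    img = img.astype(np.float64)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation
    feats = []
    h, w = img.shape
    for level in range(levels):                   # level l has 2^l x 2^l cells
        n = 2 ** level
        for i in range(n):
            for j in range(n):
                rs, re = i * h // n, (i + 1) * h // n
                cs, ce = j * w // n, (j + 1) * w // n
                hist, _ = np.histogram(ang[rs:re, cs:ce], bins=bins,
                                       range=(0, np.pi),
                                       weights=mag[rs:re, cs:ce])
                feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-12)

vec = phog(np.random.rand(64, 64))
# bins * (1 + 4 + 16) = 8 * 21 = 168 dimensions, matching the text
```

With L = 3 and K = 8 the concatenation yields 8 · (1 + 4 + 16) = 168 components, which matches the dimension stated above.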

Proposed convolutional neural network
Convolutional neural networks have attained noticeable progress in image classification and they have been utilized in face recognition applications because they can extract robust facial features. The CNNs are generally made up of three types of layers namely, convolutional, pooling, and fully connected layers. A convolutional layer includes many convolutional kernels that are utilized to generate different feature maps. After each convolutional layer, a pooling layer is utilized that decreases the dimension of the feature maps and thus reduces the computational complexity of the CNN model. A fully connected layer considers all the neurons in the previous layer and associates them with every neuron of the current layer.
The architecture of the proposed CNN is shown in Figure 5. The proposed convolutional neural network contains three convolutional, three pooling, and two fully connected layers. The input to the proposed CNN is a 64x64x1 grayscale image. The first convolutional layer has six 5x5 filters and the convolution stride is set to one pixel. Thus, the output of the first convolutional layer contains six feature maps of size 60x60. The ReLU non-linear activation function is used in each convolutional layer. After each convolutional layer, max-pooling is performed over a 2x2 window with stride two. Hence, the output of maxpooling1 consists of feature maps of dimension 30x30. In each convolutional layer the stride is one, whereas for the max-pooling layers it is two. The depths of the second and third convolutional layers, each also using 5x5 filters, are eight and ten, with output feature map dimensions 26x26x8 and 9x9x10 respectively. The maxpooling2 and maxpooling3 layers generate outputs of 13x13x8 and 5x5x10 respectively. The last two layers are fully connected layers with 200 and 120 hidden units.
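The stated feature map sizes follow from simple shape arithmetic, which the short check below traces layer by layer. One detail worth making explicit: the 5x5x10 output after maxpooling3 (from a 9x9 input) only works out if pooling uses ceil-mode rounding, an assumption inferred from the dimensions rather than stated in the text.

```python
import math

def conv_out(size, k=5, stride=1):
    """Output size of a 'valid' convolution (no padding)."""
    return (size - k) // stride + 1

def pool_out(size, k=2, stride=2):
    """Max-pooling output size with ceil rounding (assumed; see lead-in)."""
    return math.ceil((size - k) / stride) + 1

size, shapes = 64, []
for depth in (6, 8, 10):          # filter counts of conv1..conv3
    size = conv_out(size)         # 5x5 convolution, stride 1
    shapes.append((size, size, depth))
    size = pool_out(size)         # 2x2 max-pooling, stride 2
    shapes.append((size, size, depth))
# shapes traces 60 -> 30 -> 26 -> 13 -> 9 -> 5, as described in the text
```

With floor-mode pooling the last stage would be 4x4x10 instead, so the rounding convention matters when reproducing the 200-unit fully connected layer's input size.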

EXPERIMENTAL RESULTS AND DISCUSSION
The capability of the suggested method, with different filters for the Laplacian pyramid decomposition [49], is tested using three face databases: i) ORL [50], ii) Yale [51], and iii) Japanese female facial expression (JAFFE) [52]. In every class of each database, 70% of the images were utilized for training and the remaining images were used for testing. While training the proposed CNN, stochastic gradient descent was used for optimization with a base learning rate of 0.0001 and a maximum of 20 epochs. Each experiment was repeated 10 times on the chosen datasets and the average recognition rate is reported.
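The per-class 70/30 split described above can be sketched as follows; the function name and the use of a seeded numpy generator are illustrative choices, not taken from the paper.

```python
import numpy as np

def per_class_split(labels, train_frac=0.7, seed=0):
    """Within every class, draw train_frac of the samples for training and
    keep the rest for testing. Re-running with different seeds and averaging
    the recognition rate mirrors the 10-run protocol in the text."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# ORL-like setup: 40 subjects x 10 images -> 280 training / 120 test images
labels = np.repeat(np.arange(40), 10)
tr, te = per_class_split(labels)
```

Splitting within each class (rather than over the whole pool) guarantees every subject appears in both partitions, which matters for identification benchmarks like ORL where each class has only ten images.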
The ORL database comprises 40 different subjects. Each subject has ten different images with distinct lighting environments, facial expressions, and attributes. In total, the ORL database includes 400 images, each with 112x92 resolution. To observe the effect of various NSLP filters [49], the recognition rate of the proposed approach for different filters is tabulated in Tables 1-4. The 'kos' filter produces a recognition rate of 97.61% with SVM and 97.45% with the KNN classifier on the ORL database, as given in Table 1. On the Yale database, the suggested method attains a recognition rate of 97.88% with SVM and 96.52% with KNN. The recognition rate achieved on the JAFFE database is 98.75% with SVM and 97.48% with KNN. Using the 'pyr' filter, the face recognition rate achieved is 99.32% with SVM and 98.67% with KNN on the ORL database, as tabulated in Table 2. On the Yale database, the face recognition rate is 98.72% with SVM and 97.85% with the KNN classifier. On the JAFFE database, the suggested method attains a recognition accuracy of 99.45% with SVM and 98.54% with KNN. The recognition rates of the proposed method for the 'pyrexc' and 'maxflat' filters are tabulated in Tables 3 and 4 respectively. From Tables 1-4, it is observed that, among the chosen NSLP filters, 'pyr' gives the best recognition rate for the proposed technique. The performance metrics for the proposed method with different classifiers on the three databases are given in Table 5. The ROC curve of the proposed face recognition system for the ORL database is shown in Figure 7. From the values in Tables 1-4, it is inferred that the proposed technique achieves maximum face recognition rates of 99.32%, 98.72%, and 99.45% on the ORL, Yale, and JAFFE databases respectively with the 'pyr' filter. To demonstrate its effectiveness, the recognition rate of the proposed face recognition system on the ORL, Yale, and JAFFE databases is compared with some of the existing methods in Table 6 in the appendix.

CONCLUSION
A reliable and effective face recognition system using the NSST, histograms of local feature descriptors, and a CNN is proposed. The significant contribution of this work is a novel method that combines histogram-based local feature descriptors and CNN features computed on a transformed image for robust face recognition. NSST decomposes the input face image into low- and high-frequency sub-band components using the non-subsampled Laplacian pyramid. Histograms of the local feature descriptors, namely LPQ and PHOG, together with the deep features from the CNN, are obtained from the low-frequency sub-band component and concatenated to form the feature space. In the proposed method, the SVM classifier produces better results than KNN on the chosen face databases. The experimental results reveal that the suggested method effectively recognizes faces under different illuminations, poses, and expressions. Compared to some of the existing approaches, the proposed method achieves a better recognition rate.

Table 6. Comparison of recognition rate (%) of the proposed method with some of the existing methods

ORL database:
    PCA [21]                                 89.50
    EFLDA [21]                               93.00
    CLDA [21]                                94.06
    PCA image reconstruction+LDA+SVM [23]    97.48
    GFDBN [32]                               94.98
    DIWTLBP [33]                             97.00
    DSDSA [42]                               98.00
    Proposed                                 99.32

Yale database:
    OPR [34]                                 94.15
    PLR [35]                                 96.23
    Yin [41]                                 95.02
    RDCDL [38]                               97.22
    DSDSA [42]                               98.16
    Proposed                                 98.72

JAFFE database:
    FLLEPCA [20]                             94.98
    Single 2D-NNRW [31]                      97.00
    PSO [36]                                 98.80
    Proposed                                 99.45