Online dictionary learning for car recognition using sparse coding and LARS

The bag of features method coupled with online dictionary learning is the basis of our car make and model recognition algorithm. Using a sparse coding technique named LARS (Least Angle Regression), we learn a dictionary of codewords over a dataset of Square Mapped Gradient feature vectors obtained from a densely sampled narrow patch of the front part of vehicles. We then apply support vector machines (SVM) and a supervised k-means classifier, obtaining promising results.

INTRODUCTION
Vehicle make and model recognition (VMMR) systems are part of intelligent transportation systems (ITS), a field that has seen significant advances over the last twenty years. An intelligent transportation system is the integration of "smart" communication and information processing technologies into the transportation system. Although it usually encompasses all modes of transport, the term mostly describes road, vehicle and driver interactions, and aims to make them more dynamic by enabling better and faster decision making for both road users and supervisors (if any). The push for better ITS and the massive investments made in it by countries like Japan or the USA have naturally sowed the fields of innovation in all its related domains. Moreover, the progress in computational power, video sensing and miniaturization technologies since the early nineties, or even since the last decade, continues to accelerate, leading to denser and more powerful devices along with more sophisticated and better-performing VMMR software. The most widespread taxonomy of VMMR methods splits them into two broad categories, namely appearance based and model based. Appearance based methods rely on photometric information and the features extracted from it (edges, corners, gradients, ...), whereas model based methods seek to obtain a good geometrical representation of the objects in the image, often relying on stereo vision and two-dimensional techniques for primitive extraction. Here we use the front part of the car, which is considered to be the most discriminative [1][2][3][4][5][6][7][8][9][10][11], but different parts have been used too, like a combination of parts [12], the rear [13,14], the logo [15] and even 3D modeling [16]. In this paper we build an appearance based VMMR system on the most discriminative part of the car, located at the front, using various feature extraction and classification techniques.
We consider the VMMR problem in a bag of features framework that combines local low-level features, sparse coding, dictionary learning and support vector machines (SVM). These local low-level features are used to form a global descriptor, which can be seen as a mid-level feature as mentioned in [17]; they are also extensively used in deep neural networks [18,19]. Sparse coding originally used fixed dictionaries, until Olshausen and Field [20,21] used data to generate dictionaries representing an underlying hidden structure of said data. Raina et al. [22] compare PCA and sparse coding, building a dictionary from unlabeled images to obtain sparse representations of labeled images and feed them to a support vector machine. Yang et al. [23] build the dictionary from SIFT descriptor data instead of raw images and use a one-against-all multiclass linear SVM. Zeiler et al. [24] build a dictionary by stacking deconvolution and max pooling layers and use conjugate gradients to update the filters (K filters per layer) of the hierarchical dictionary; each deconvolution layer seeks to minimize the reconstruction error of an input image under a sparsity penalty. In general, the aforementioned methods use the alternating or sequential optimization approach, which updates either the dictionary or the coding matrix while fixing the other. More modern methods, like Dir of Rakotomamonjy [25], jointly optimize the dictionary D and the coding matrix A, which improves the algorithm's runtime.

THE BAG OF FEATURES METHOD
The general bag of features framework follows the six steps detailed below and shown in Figure 1. Although supervised by nature (through the use of SVMs), the bag of features relies more or less exclusively on an unsupervised learning method in its arguably most important step, i.e. the dictionary generation or learning step. Dictionary learning can be seen as a matrix factorization problem, alongside other methods such as principal component analysis (PCA), clustering or vector quantization [26,27], non-negative matrix factorization (NMF) [28,29], archetypal analysis [30], or independent component analysis (ICA) [31][32][33][34].

Patch extraction
To extract the relevant image patches, we use the algorithm of [35], illustrated in Figure 2. Simple and straightforward, it relies on the horizontal gradient, morphological operations and connected component analysis to locate the license plate in the image, which in turn lets us obtain the front part of the vehicle for further processing. We then sample local areas of the previously acquired image patches, either densely [36,37] or sparsely [38][39][40]. In our implementation we choose a dense grid coverage of 2x2, 4x4, 8x8 or 16x16 patches over the image with no scaling; because of the tight segmentation, background clutter is kept to a strict minimum. However, as pointed out by Nowak et al. [41], there are better sampling methods than dense sampling: many regions of the image (most of them, in fact) have low discriminative power, which hurts performance.
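As an illustrative sketch of the dense sampling step (the helper name and toy image are placeholders; the paper's grids are 2x2, 4x4, 8x8 or 16x16 with no scaling):

```python
import numpy as np

def dense_patches(image, grid=(4, 4)):
    """Split an image into a dense, non-overlapping grid of patches.

    Hypothetical helper illustrating dense grid sampling; edge pixels
    that do not divide evenly are truncated.
    """
    h, w = image.shape[:2]
    gh, gw = grid
    ph, pw = h // gh, w // gw  # per-patch height and width
    patches = []
    for i in range(gh):
        for j in range(gw):
            patches.append(image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw])
    return patches

img = np.arange(64 * 64).reshape(64, 64)
patches = dense_patches(img, grid=(4, 4))
# a 4x4 grid over a 64x64 image yields 16 patches of 16x16 pixels
```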

Feature description
This process maps the input pixels from a sparse or dense sample, see Figure 3, into feature vectors. The mapping is usually gradient based, as in the Scale Invariant Feature Transform (SIFT) [39], Speeded-Up Robust Features (SURF) [40], Histogram of Gradients (HOG) [37] and Laplacian of Gaussian (LOG) [42] descriptors, or statistically based, as in Principal Component Analysis (PCA) [43] or Linear Discriminant Analysis (LDA) [44,45]. Here we use the Square Mapped Gradients (SMG) descriptor (1) [1], which is gradient based.
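A simplified sketch of an SMG-style descriptor follows. The exact mapping used in the paper is defined in [1]; here we assume the common square-mapped form that sends each gradient (gx, gy) to the polarity-invariant pair (gx^2 - gy^2, 2*gx*gy), normalized by the squared magnitude, and this formulation is an assumption, not the paper's equation (1):

```python
import numpy as np

def square_mapped_gradients(img, eps=1e-8):
    """Sketch of a square-mapped-gradient style descriptor.

    Assumption: the gradient (gx, gy) is mapped to
    (gx^2 - gy^2, 2*gx*gy) / (gx^2 + gy^2), which is invariant to
    gradient polarity; see [1] for the exact definition.
    """
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]  # central differences along x
    gy[1:-1, :] = img[2:, :] - img[:-2, :]  # central differences along y
    mag2 = gx ** 2 + gy ** 2 + eps
    return (gx ** 2 - gy ** 2) / mag2, (2 * gx * gy) / mag2

smg_x, smg_y = square_mapped_gradients(np.random.rand(128, 512))
# concatenating both channels matches the 128x1024 input described
# in the experimental section
```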

Dictionary generation
This step is crucial to the overall performance of the system, as it computes codewords from the feature descriptors of the second step; these codewords then form the dictionary, as shown in Figure 4. The standard way of generating codewords (i.e., the dictionary) is simply to cluster the feature vector set (using k-means) [46], but other unsupervised [46,47,17] and supervised [48] methods are used as well. Dictionary learning in this paper follows the algorithm described in [49]: given a finite set of feature vectors X = [x_1, ..., x_n] ∈ R^{m×n}, optimize the following problem, also known as the Lasso [50]:

\min_{D,\alpha} \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1\right) \qquad (2)

where D ∈ R^{m×k} is an overcomplete dictionary (k > m, k being the number of codewords and m the dimension of the feature vectors), α = [α_1, ..., α_n] ∈ R^{k×n} is the coding matrix, λ the l1 regularization parameter and ||α||_1 the l1 Lasso penalty, which induces sparsity [49] in the coding matrix α.
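As an illustrative sketch of this optimization, scikit-learn's `DictionaryLearning` stands in for the implementation of [49]; the toy data and dimensions are placeholders:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy stand-in for the feature matrix X: n = 100 samples, m = 16 dims.
rng = np.random.RandomState(0)
X = rng.rand(100, 16)

# Overcomplete dictionary: k = 32 codewords > m = 16 dimensions.
# Note scikit-learn's convention is X ~ codes @ D, i.e. the transpose
# of the D * alpha layout used in the text.
dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=50,
                        fit_algorithm='lars',
                        transform_algorithm='lasso_lars',
                        random_state=0)
codes = dl.fit_transform(X)   # coding matrix (n samples x k codewords)
D = dl.components_            # dictionary (k codewords x m dims)
```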

Feature coding
With the dictionary and feature vectors in hand, we can obtain the coding matrix, which broadly describes the relationship between the feature vectors and the codewords (dictionary): each feature vector activates a number of codewords, forming a coding vector with either binary elements (hard vector quantization) or continuous ones (soft vector quantization, sparse coding) [51]. In our setup we use what [51] calls reconstruction based coding, which describes the features with a subset of the codewords by solving a least-squares optimization problem, here the Lasso (2); in fact, both the dictionary and the coding matrix are obtained during the optimization step.
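A minimal sketch of reconstruction-based coding with a fixed dictionary, using scikit-learn's `sparse_encode` (the dictionary and feature vectors are random placeholders):

```python
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.RandomState(0)
D = rng.randn(32, 16)                            # hypothetical dictionary
D /= np.linalg.norm(D, axis=1, keepdims=True)    # unit-norm atoms
X = rng.randn(5, 16)                             # 5 feature vectors

# Soft (sparse) coding: each feature vector is described by a small
# subset of codewords, found by solving the Lasso with LARS.
alpha = sparse_encode(X, D, algorithm='lasso_lars', alpha=0.1)
reconstruction = alpha @ D   # approximate X from the active codewords
```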

Feature pooling
The coding vectors obtained previously are pooled over the whole image to form one pooling vector, which is the final representation of the input image. Of the two popular pooling strategies used in the literature, namely max pooling and average pooling, it has been shown [17] that max pooling is superior. Given the coding matrix α = [α_1, ..., α_n] ∈ R^{k×n} and N_m the set of indices of the features located in image m = 1, ..., M, we obtain the following max pooled vector:

z_m[j] = \max_{i \in N_m} |\alpha_i[j]|, \quad j = 1, \ldots, k \qquad (3)

where z_m ∈ R^k is the vector representing the whole image m.
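Max pooling reduces to a component-wise maximum over the local coding vectors of one image. A small sketch (assuming, as in [17], that the maximum is taken over absolute code values):

```python
import numpy as np

def max_pool(alpha):
    """Max pooling: component-wise maximum of absolute code values
    over all local features of one image, giving one k-dim vector."""
    return np.abs(alpha).max(axis=0)

# 2 local features, k = 3 codewords (rows are coding vectors).
codes = np.array([[0.0, -2.0, 0.5],
                  [1.0,  0.0, 0.1]])
z = max_pool(codes)
# z == [1.0, 2.0, 0.5]
```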

3. ONLINE DICTIONARY LEARNING
The algorithm described in this section (Algorithm 1) assumes a training set of independent and identically distributed samples and alternates, at each iteration t, between computing the decomposition α_t of the training sample x_t over the dictionary D_{t−1} obtained during the previous iteration:

\alpha_t = \arg\min_{\alpha \in \mathbb{R}^k} \frac{1}{2}\|x_t - D_{t-1}\alpha\|_2^2 + \lambda\|\alpha\|_1 \qquad (4)

and updating the dictionary D_t by minimizing the function:

f_t(D) = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1\right) \qquad (5)

Algorithm 1 Online dictionary learning
Precondition: x ∈ R^m ~ p(x) (random variable and an algorithm to draw i.i.d. samples of p), λ ∈ R (regularization parameter), D_0 ∈ R^{m×k} (initial dictionary), T (number of iterations)
1: A_0 ← 0, B_0 ← 0 (reset past information)
2: for t ← 1 to T do
3:   Draw x_t from p(x)
4:   Sparse coding: compute α_t by solving (4) with LARS
5:   A_t ← A_{t−1} + α_t α_t^T
6:   B_t ← B_{t−1} + x_t α_t^T
7:   Compute D_t using Algorithm 2, with D_{t−1} as warm restart, so that D_t minimizes (5)
8: end for
9: return D_T (learned dictionary)
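A minimal online version of this draw / sparse-code / update loop can be sketched with scikit-learn's `MiniBatchDictionaryLearning`, whose `partial_fit` performs one code-then-update step per mini-batch (toy data and dimensions are placeholders, and the library's internal solver details differ from Algorithm 1):

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)

# Online learning: the dictionary is refined one mini-batch at a time.
odl = MiniBatchDictionaryLearning(n_components=32, alpha=1.0,
                                  batch_size=10, random_state=0)
for t in range(20):                      # T iterations
    x_t = rng.rand(10, 16)               # i.i.d. mini-batch drawn from p(x)
    odl.partial_fit(x_t)                 # code x_t, then update the dictionary

D_T = odl.components_                    # learned dictionary D_T
alpha = odl.transform(rng.rand(3, 16))   # code new samples over D_T
```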

Sparse coding
The sparse coding problem in (2) with fixed dictionary is an l1-regularized linear least-squares problem. When the dictionary columns have low correlation, simple methods based on coordinate descent with soft thresholding [52,53] are enough. But the columns of the dictionary are more often than not highly correlated, and it has been shown that a Cholesky-based implementation of the LARS algorithm [54,55], which provides the entire regularization path, can be as fast as simpler soft-thresholding-based methods while being more accurate.

Algorithm 2 Dictionary update
Precondition: D ∈ R^{m×k} (input dictionary), A = Σ_t α_t α_t^T, B = Σ_t x_t α_t^T
1: repeat
2:   for j ← 1 to k do
3:     Update the j-th column of D to optimize (6)
4:   end for
5: until convergence
6: return D (updated dictionary)
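The regularization-path property can be seen with scikit-learn's `lars_path`, which returns the coefficients at every breakpoint of the regularization parameter in a single run (the design matrix and target here are random placeholders; in the paper the design matrix would be the learned dictionary):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.RandomState(0)
D = rng.randn(16, 32)   # dictionary atoms as columns (design matrix)
x = rng.randn(16)       # one feature vector to encode

# LARS with the lasso modification: one coefficient vector per
# breakpoint of lambda, from fully regularized (all zeros) onward.
alphas, active, coefs = lars_path(D, x, method='lasso')
# coefs[:, 0] is the solution at the largest lambda (all zeros);
# later columns activate atoms one breakpoint at a time.
```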

Dictionary update
The dictionary update algorithm uses block coordinate descent with warm restarts; it does not require any learning-rate tuning and is parameter-free, and since the vectors are sparse, the coefficients of A are concentrated on the diagonal, which makes the block coordinate descent more efficient. In practice, Algorithm 2 updates each j-th column of D sequentially, and (6) gives the solution of the dictionary update with respect to the j-th column, the others being kept fixed, under the constraint d_j^T d_j ≤ 1:

u_j \leftarrow \frac{1}{A_{jj}}(b_j - D a_j) + d_j, \qquad d_j \leftarrow \frac{u_j}{\max(\|u_j\|_2, 1)} \qquad (6)

It has been shown that this optimization problem converges to a global optimum [56]. This algorithm uses a warm restart, but other approaches have been proposed to update D; for instance, [57] suggests using a Newton method on the dual of (5).
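The column update of (6) can be sketched in a few lines of NumPy; A and B follow the accumulation definitions in [49], and all dimensions here are illustrative:

```python
import numpy as np

def dictionary_update(D, A, B, n_iter=10, eps=1e-10):
    """Block coordinate descent dictionary update, as in (6).

    A = sum_t alpha_t alpha_t^T (k x k) and B = sum_t x_t alpha_t^T
    (m x k) accumulate past information; each column d_j is updated in
    closed form with the others fixed, then projected onto the unit
    ball to enforce ||d_j|| <= 1.
    """
    m, k = D.shape
    for _ in range(n_iter):          # stands in for "repeat until convergence"
        for j in range(k):
            u = (B[:, j] - D @ A[:, j]) / (A[j, j] + eps) + D[:, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D

# Toy dimensions: m = 5 feature dims, k = 8 atoms, 50 past samples.
rng = np.random.RandomState(0)
alpha = rng.rand(8, 50)
X = rng.rand(5, 50)
A = alpha @ alpha.T
B = X @ alpha.T
D = dictionary_update(rng.rand(5, 8), A, B)
```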

4. THE VEHICLE LICENSE PLATE LOCATION ALGORITHM
4.1. Horizontal gradient
It is known that the most discriminative parts of a vehicle, namely its front and its rear, are composed mostly of horizontal lines [58], while the license plate region has a predominance of vertical lines. We apply the Sobel operator to extract these vertical lines [59]. Figure 5(b) shows the result of horizontal gradient detection on the original image of Figure 5(a). To emphasize the license plate region more clearly, owing to its great concentration of high-valued pixels, we then apply a mean filter [60] to the image. The final stage of the horizontal gradient phase can be seen in Figure 5(c).
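The horizontal gradient followed by mean filtering can be sketched in plain NumPy (the naive sliding-window helper, toy image and 3x3 filter size are placeholders; the operation implemented is cross-correlation, which for the Sobel kernel differs from convolution only in sign, and the absolute value is taken anyway):

```python
import numpy as np

def filter2d_same(img, kernel):
    """Naive 'same'-size 2-D cross-correlation with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img.astype(float), ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0],
                                         j:j + img.shape[1]]
    return out

# Sobel kernel for the horizontal gradient: responds to the vertical
# edges that dominate the license plate region.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
mean_3x3 = np.ones((3, 3)) / 9.0   # mean filter to emphasize dense
                                   # high-gradient regions

img = np.random.rand(64, 96)
grad = np.abs(filter2d_same(img, sobel_x))
smoothed = filter2d_same(grad, mean_3x3)
```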

Filtering
The goal of this phase is to darken every non license plate region. Morphological operations proposed in [61,62] are applied to the image to darken high-valued regions that do not fit the expected size. Small non-VLP salient regions are darkened by a morphological opening operation, as shown in Figure 5(d). The joint application of the mean filter and the opening operation, however, causes undesirable variation among the pixel values of the license plate region, perceived as gaps between its vertical saliences. We restore smoothness in these values, which amounts to filling the artificially created gaps, using a morphological closing operation. Big non license plate regions are also darkened by a top-hat filtering, so that big salient regions have their pixel values lowered, as shown in Figure 5(e). As a result, the license plate region may exhibit unwanted artifacts, in that the space between the letters and the numbers may be less salient, and the binarization process may therefore perform poorly, splitting the license plate region into multiple parts; that problem is fixed by applying a closing operation on the image. We then remove the saliences at the license plate's borders that may appear during the horizontal gradient step, to find the tightest boundary of the license plate characters. We finally close the filtering phase by applying an erosion operation followed by a dilation operation.
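A compact sketch of the grayscale morphology in this phase, using SciPy's `ndimage` module (the image and all structuring-element sizes are placeholders; in practice the sizes are tuned to the expected plate dimensions):

```python
import numpy as np
from scipy import ndimage

img = np.random.rand(64, 96)   # stand-in for the gradient image

# Opening darkens small bright regions; closing fills the small dark
# gaps it creates inside the plate region; the white top-hat
# (input minus opening) suppresses large bright regions while keeping
# small salient ones.
opened = ndimage.grey_opening(img, size=(3, 9))
closed = ndimage.grey_closing(opened, size=(3, 3))
no_big = ndimage.white_tophat(closed, size=(15, 45))
```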

Adjustment
The adjustment phase generates potential license plate regions in a binary image. The first step is to binarize the image to separate salient regions from the background, using Otsu's method [63] to automatically define the binarization threshold and minimize the chance of a miss, as depicted in Figure 5(g). A problem that could arise is that the contrast between dark non license plate regions and the brighter license plate region may not be big enough, resulting in a "botched" binarization. In the connected component analysis, we remove from the binary image any region that is not license plate shaped, and keep only the bounding boxes of the license plate region candidates. We also eliminate candidates that intercept each other: the license plate region is essentially composed of vertical edges that do not intercept each other, and other vertical-edge-dominant candidates, such as signs containing letters, are not close enough for their regions to intercept, whereas candidates originating from background noise have a greater chance of touching each other given their random locations. For the remaining candidates, we first cut the corresponding bounding box out of the monochrome image of the filtering phase, as shown in Figure 5(f), and consider it as a separate candidate image. Second, a new binarization takes place with Otsu's method, and the candidate's boundary is expanded to include lower-valued pixels by a dilation operation. The resulting binary image may still contain undesired regions, which are erased by an erosion operation followed by a dilation. Finally, the resulting candidate bounding box is checked against minimum dimension constraints; if they are not satisfied, the candidate is discarded, unless it is the last one, in which case it is kept.
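Otsu's method picks the threshold that maximizes the between-class variance of the grayscale histogram. A self-contained sketch (the bimodal toy image is a placeholder for the filtered plate image):

```python
import numpy as np

def otsu_threshold(img, bins=256):
    """Otsu's method: choose the threshold maximizing the
    between-class variance of the histogram."""
    hist, edges = np.histogram(img.ravel(), bins=bins)
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = hist[:i].sum(), hist[i:].sum()   # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (hist[:i] * centers[:i]).sum() / w0  # class means
        mu1 = (hist[i:] * centers[i:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[i]
    return best_t

# Bimodal toy image: dark background plus a bright "plate" block.
img = np.zeros((40, 60))
img[10:20, 15:45] = 1.0
t = otsu_threshold(img)
binary = img > t   # salient regions separated from the background
```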
If there is no candidate, or all of them are discarded, a histogram equalization is performed on the original image to improve contrast, and the location process is repeated from the very beginning.

5. CLASSIFICATION
Multi-class classification is achieved here by using a "one against one" scheme with probability estimates [64][65][66], thus training, for k classes, k(k−1)/2 classifiers, one for each pair of classes. Given k classes, estimate the pairwise class probabilities:

r_{ij} \approx \frac{1}{1 + e^{Af + B}}

where f is the decision value at x, and A, B are estimated by minimizing the negative log likelihood of the training data. With all the r_{ij}, we can obtain the p_i by optimizing the following objective function:

\min_{p} \frac{1}{2}\sum_{i=1}^{k}\sum_{j \neq i}\left(r_{ji}p_i - r_{ij}p_j\right)^2 \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \; p_i \geq 0

and any x is assigned to the class with the highest p_i.
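This one-against-one scheme with coupled probability estimates is what libsvm-style SVMs implement; a sketch with scikit-learn's `SVC` (the toy pooled vectors and class shifts are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Toy pooled vectors for 3 well-separated classes, standing in for the
# k-dimensional image vectors fed to the classifier.
X = np.vstack([rng.randn(20, 8) + c * 3 for c in range(3)])
y = np.repeat([0, 1, 2], 20)

# One classifier per pair of classes; probability=True couples the
# pairwise Platt-scaled probabilities into per-class estimates p_i.
clf = SVC(kernel='linear', probability=True, random_state=0).fit(X, y)
p = clf.predict_proba(X[:1])   # p_i for one sample, sums to 1
pred = p.argmax(axis=1)        # assign to the class with highest p_i
```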

Supervised K-means
The k-means [67] algorithm is one of the simplest clustering algorithms; given a set of data points x, it aims to separate them into k clusters by optimizing the following objective function:

\min_{c_1, \ldots, c_k} \sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|_2^2

where c_i is the center of cluster C_i. The standard implementation uses Lloyd's algorithm [68], which consists of the following steps:
(a) Initialization step: the centroids are initialized.
(b) Assignment step: each data point is assigned to a cluster based on its distance to the cluster's centroid.
(c) Update step: the new centroids are computed for each cluster.
In our implementation, we skip the update step, so the centroids are initialized once, using only the training data. Then each test point is assigned to a cluster using its Euclidean distance to the respective centroid.
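This supervised variant reduces to a nearest-centroid classifier: one centroid per class, computed once from the labeled training data, with no update step. A minimal sketch (toy 2-D data; helper names are hypothetical):

```python
import numpy as np

def fit_centroids(X, y):
    """'Supervised k-means': one centroid per class, computed once
    from the labeled training data (the update step is skipped)."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict(X, classes, centroids):
    """Assign each test point to the class of its nearest centroid,
    using the Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

X_train = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
y_train = np.array([0, 0, 1, 1])
classes, cents = fit_centroids(X_train, y_train)
labels = predict(np.array([[0.1, 0.1], [5.1, 4.9]]), classes, cents)
# labels == [0, 1]
```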

6. EXPERIMENTAL RESULTS
Using the COMVis car dataset [7], we build our set by automatically clipping from each image the most discriminative region of interest for make and model recognition, the front part of the car, using the license plate detection algorithm of [35]. In our experiment we use five car makes, Suzuki, Toyota, Mitsubishi, Hyundai and Honda, with respectively 153, 78, 36, 37 and 97 images. Note that these images contain different models, 8 for Suzuki, 4 for Toyota, 2 for Mitsubishi and Hyundai and 5 for Honda, which increases the overall difficulty of the recognition task due to the unbalanced datasets. We build our feature matrix by concatenating the feature vectors (2x2, 4x4, 8x8, 16x16 blocks) of each image and then optimize the Lasso (2) using the online dictionary learning [49] and least angle regression (LARS) [69] algorithms to obtain the dictionary and the coding matrix. The pooling strategy (3) applied to the coding matrix gives us a representative vector for each image. These vectors are then fed to a linear kernel support vector machine (SVM) in a cross-validation framework holding out 80% as training data; this process is repeated one hundred times to obtain the recognition rates of Tables 1 and 2. The results are obtained using the square mapped gradient descriptor (1) applied to a 128x512 image, yielding a 128x1024 matrix, which is the concatenation of the SMG descriptor of the image along the x direction and along the y direction. This new image is the input of the feature matrix generation step. In Table 2, we use supervised k-means instead of the SVM classifier, which achieves performance similar to the SMG+SVM setup without dictionary learning, with an accuracy of 85.01±2.68, compared to 80.66±2.78 for the SMG+KMeans setup without dictionary learning.

CONCLUSION
Despite the improvement in performance, the optimization cost can be quite heavy, especially in an online framework. Rigamonti et al. [70] showed that enforcing sparsity does not help recognition rates at feature-extraction time, and that the same performance can be achieved with plain convolution; sparsity remains important, however, when learning the filters. They used a different formulation than (2), where the matrix-vector product is replaced by a convolution, and a two-step algorithm in which learning the filters and extracting the features are separated. The problem remains that sparsity does not dramatically increase performance, nor is it required to obtain good results. To put sparsity in perspective, it must be inserted into a bigger framework, such as hierarchical models or deep belief networks (DBNs) [71], to show its usefulness.