A deep learning based stereo matching model for autonomous vehicles

ABSTRACT


INTRODUCTION
Autonomous vehicles are a prominent research topic in computer vision. To make driving decisions, the three-dimensional (3D) structure of the region surrounding the vehicle must be measured accurately and in real time. The precision of the depth map is crucial for the safety of autonomous vehicles. In these vehicles, depth information about the surroundings is usually obtained with hardware such as light detection and ranging (LiDAR) sensors. These sensors are expensive to install and have drawbacks that may lower the quality of the depth information. They also do not provide additional information, such as traffic light color, that plays a major role in decision making. Computer vision based stereo matching is an alternative that can overcome these drawbacks. The aim of stereo matching is to find matching pixels in images taken from different viewpoints and then estimate depth [1]-[3]. It has applications in augmented reality, robotics, and 3D reconstruction [4]-[9]. Stereo vision imitates the process carried out by the human eyes and brain. A scene captured by two horizontally displaced cameras forms two slightly different projections. Disparity is the horizontal displacement of an object between the two projections. A map that contains the displacement of every pixel in an image is known as a disparity map, and the depth of a scene can be estimated from it. Many stereo algorithms have been proposed in the recent past [10]-[12]. A classic stereo algorithm follows three main steps: computation of pixel-wise features, construction of the cost volume, and post-processing. Traditional stereo matching methods are grouped into local, global, and semi-global methods. Local methods rely on low-level pixel features to compute similarity in the cost computation step. They estimate correspondence by means of a window or support region [13]-[15].
Since the pixel-wise representation plays a major role, researchers have used a wide variety of representations, ranging from simple RGB pixel values to descriptors such as the census transform and the scale-invariant feature transform (SIFT). A segment-based superpixel technique is proposed in [16]: after the edges and matching cost are found, adaptive support weights are used in cost aggregation, and a dual-path refinement corrects the disparities. Stereo matching based on adaptive cross area and guided filtering with orthogonal weights (ACR-GIF-OW) is proposed in [17]. These techniques are computationally inexpensive but do not produce accurate results in textureless, discontinuous, and occluded areas.
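To make the idea of window-based local matching concrete, the following is a minimal sketch and not the method of any cited work. It assumes a rectified greyscale pair, uses the sum of absolute differences (SAD) over a square window as the matching cost, and picks the disparity with the lowest cost (winner-take-all). The function names `window_sum` and `local_block_matching` and the parameters `max_disp` and `radius` are illustrative.

```python
import numpy as np

def window_sum(values, radius):
    """Sum values over a (2*radius+1) x (2*radius+1) window around each pixel."""
    h, w = values.shape
    padded = np.pad(values, radius, mode="edge")
    out = np.zeros((h, w))
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out

def local_block_matching(left, right, max_disp, radius=2):
    """Winner-take-all disparity for the left image of a rectified grey pair."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    h, w = left.shape
    # Positions where a disparity cannot be evaluated keep an infinite cost.
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # Left pixel (y, x) is compared with right pixel (y, x - d).
        diff = np.abs(left[:, d:] - right[:, :w - d])
        cost[d, :, d:] = window_sum(diff, radius)
    return np.argmin(cost, axis=0)
```

Because the cost depends only on raw intensities inside the window, a sketch like this is fast but becomes ambiguous in textureless or occluded areas, which is the weakness of local methods noted above.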
Global methods handle textureless regions and uneven surfaces by including a smoothness cost. They treat matching as a labelling problem in which pixels are nodes and the estimated disparities are labels, and they compute the disparity by iteratively minimizing a global energy function. The energy function combines a data term and a smoothness term to produce a smooth disparity map. Graph cut [18], dynamic programming [19], and belief propagation [20] are the classic global matching algorithms. A tree structure named pyramid-tree is proposed in [21]; it performs cross-regional smoothing and handles regions of low texture. In addition, the authors of [21] use the log-angle for cost computation, which is robust to inconsistencies. The performance of global methods is limited because they depend on hand-crafted features, which reduces the accuracy of the results.
Convolutional neural networks (CNNs) are popular in various vision applications [22]-[24] and are widely used in stereo matching, where they improve performance over traditional methods. Kendall et al. [25] presented an architecture that learns disparity without regularization. Features are extracted automatically by the CNN without manual intervention and are then used for stereo matching that can handle textureless regions and uneven surfaces. Eigen et al. [26] used basic neural networks to determine the depth of a scene: an AlexNet architecture generates a coarse map, and a second network performs local refinements. The work in [27] proposed a multi-stage framework that combines random forests and CNNs. The architecture, named neural regression forest, estimates depth from a single input image and allows all the CNNs to be trained in parallel. Finally, a bilateral filter is used to obtain a refined disparity map. A similar concept is presented in [28], where many tiny neural networks are trained across overlapping patches. DispNet is one of the basic networks used for disparity estimation. A cascade residual learning network that extends the DispNet structure is used in [29]. It consists of two stages, DispFullNet and DispResNet. The first stage uses DispNet with additional up-convolution modules, which helps to extract more information. The second stage generates residual signals that refine the disparity. A trainable network is described in [30]. It uses a robust differentiable PatchMatch module that discards most disparities without fully evaluating the cost volume. This reduces the search space and improves memory and time efficiency. The main drawback of existing methods is that ill-posed regions are not handled effectively. In the proposed method, a CNN is combined with an optimization technique.
The CNN replaces the hand-crafted term with learned features. The output of the CNN is used to calculate the unary and smoothness costs. The smoothness cost incorporates information from neighbouring pixels and uses contrast-sensitive information to obtain a smooth disparity map. Post-processing is performed to handle occlusion.
In stereo vision, areas visible in one view may not be visible in the other, and it is often difficult to reconstruct such regions in one image by looking at the other. The losses computed in these areas are noisy, which leads to inaccurate results, particularly in occluded areas. Disparity refinement is applied to improve matching accuracy in ill-posed areas. The left-right consistency check is the common method used to identify and handle outliers. Although several methods have been proposed to improve matching, low accuracy in ill-posed areas has not been handled well. To handle these areas, post-processing is performed with a generative adversarial network (GAN), a model put forward by Goodfellow [31]. A GAN is a framework for training generative models that is based on a min-max game. Two models, a generator and a discriminator, are used to learn the distribution of the data. The generator tries to learn a distribution that is close to the real data distribution. The ability of GANs to generate high-quality images makes them applicable in many image processing tasks. An encoder-decoder structure is trained to reconstruct images, and the resulting model can produce various realistic representations of the input by altering attribute values. A conditional adversarial network [32] can be used for image translation, which converts an image from one representation to another, such as day to night. We propose a hybrid CNN based deep stereo network (CDSN) model that estimates an accurate disparity map. Loopy belief propagation is used to compute the initial disparity map from features extracted by the CNN. A generative adversarial network is then used to handle the ill-posed regions in the disparity map. The generated disparity maps look more realistic and are closer to the ground truth disparity map.
The results show that the proposed CDSN model effectively handles ill-posed regions such as discontinuities at image boundaries and occluded areas. The proposed model outperforms existing techniques on the Middlebury dataset [33]. The paper is organized as follows: section 2 explains the proposed CDSN model, section 3 presents the results of the proposed model, and section 4 presents the conclusions.
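For reference, the left-right consistency check mentioned in the introduction can be sketched as follows. This is a generic illustration, not part of the proposed CDSN pipeline. It assumes two disparity maps of equal size, one per view, and a tolerance `tol` in pixels; the function name and the default tolerance are illustrative choices.

```python
import numpy as np

def left_right_consistency(disp_left, disp_right, tol=1.0):
    """Boolean mask that is True where the left disparity passes the check."""
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    ys = np.tile(np.arange(h)[:, None], (1, w))
    # Column in the right image that each left pixel maps to.
    xr = np.rint(xs - disp_left).astype(int)
    inside = (xr >= 0) & (xr < w)
    xr_safe = np.clip(xr, 0, w - 1)
    # A consistent pixel sees (almost) the same disparity from both views.
    diff = np.abs(disp_left - disp_right[ys, xr_safe])
    return inside & (diff <= tol)
```

Pixels where the mask is False are typically occluded or mismatched and are the ones a refinement stage has to fill in.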

METHOD
A CNN based model is proposed for stereo matching to find the disparity map. The features extracted by the CNN are used to compute the unary cost and the smoothness cost. A global energy function is adopted to obtain the initial disparity map. A GAN model is used to handle ill-posed regions. Table 1 lists the symbols with their descriptions. The flow chart of the proposed model is shown in Figure 1.

CNN feature extraction
Conventional stereo matching algorithms focus on hand-crafted features, which capture inadequate image information. CNNs are used for various vision problems, including stereo matching. A CNN extracts local context better and is therefore robust to photometric differences. The feature descriptors are extracted from rectified stereo images using a pre-trained visual geometry group (VGG-16) model [34]. The VGG-16 model is pre-trained on the 1,000-class subset of the ImageNet dataset, which as a whole contains about 14 million labelled high-resolution images. The output of the 9th layer is used for stereo matching in the proposed model, as it provides an appropriate feature space for computing disparity. VGG-16 uses max pooling layers that select the maximum element from each 2×2 window of the input map. The first and second layers have 64 channels with 3×3 kernels and are followed by a max pooling layer with stride (2, 2). The third and fourth layers have 128 channels with 3×3 kernels and are followed by a max pooling layer with stride (2, 2). The next three layers have 256 channels with 3×3 kernels and are followed by a max pooling layer with stride (2, 2). The eighth and ninth layers have 512 channels with 3×3 kernels. An N-dimensional feature vector is obtained for every pixel location.
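The layer description above fixes the shape of the feature map at the 9th convolution layer. The short sketch below traces that shape. It assumes the standard VGG-16 convention that 3×3 convolutions use stride 1 and padding 1 (so they preserve height and width), and the names `LAYERS` and `feature_shape` are illustrative. Whether the features are resampled to the input resolution before matching is not stated in the text, so the sketch only reports the raw output size.

```python
# (kind, out_channels) for the layers up to the 9th convolution, as described.
LAYERS = [
    ("conv", 64), ("conv", 64), ("pool", None),
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("conv", 256), ("pool", None),
    ("conv", 512), ("conv", 512),
]

def feature_shape(height, width, layers=LAYERS):
    """Return (channels, height, width) of the last layer's output."""
    channels = 3  # RGB input
    for kind, out_channels in layers:
        if kind == "conv":
            channels = out_channels  # 3x3, stride 1, padding 1 keeps H and W
        else:
            height, width = height // 2, width // 2  # 2x2 max pool, stride 2
    return channels, height, width

print(feature_shape(480, 640))  # (512, 60, 80)
```

Under these assumptions the descriptor length is N = 512 and the three pooling layers reduce each spatial dimension by a factor of 8.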

Initial disparity map estimation
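The extracted feature descriptors are used to determine the matching cost of every pixel in the left feature map. For each pixel, we search horizontally along the right feature map for the best match. The unary matching cost is the Euclidean distance between the two feature descriptors, calculated using (1),

$$C(p, d) = \lVert F_L(x, y) - F_R(x - d, y) \rVert_2 \tag{1}$$

where $p = (x, y)$ is a pixel in the left feature map, $d$ is a candidate disparity, and $F_L$ and $F_R$ are the left and right feature descriptors.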
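A minimal sketch of this unary cost follows. It assumes the left and right descriptors are stored as arrays of shape (H, W, N) and that a left pixel at column x is compared with the right pixel at column x - d. The function name `unary_cost_volume` and the choice to mark unreachable positions with infinity are illustrative, not taken from the paper.

```python
import numpy as np

def unary_cost_volume(feat_left, feat_right, max_disp):
    """Euclidean descriptor distance for every pixel and candidate disparity.

    feat_left, feat_right: arrays of shape (H, W, N).
    Returns an array of shape (H, W, max_disp + 1).
    """
    h, w, _ = feat_left.shape
    # Positions with x - d < 0 have no match in the right map.
    cost = np.full((h, w, max_disp + 1), np.inf)
    for d in range(max_disp + 1):
        diff = feat_left[:, d:, :] - feat_right[:, :w - d, :]
        cost[:, d:, d] = np.linalg.norm(diff, axis=2)
    return cost
```

Taking `np.argmin(cost, axis=2)` would give a winner-take-all disparity from the unary cost alone. An optimizer that adds costs together would need the infinite entries replaced with a large finite value first.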
The unary cost alone may not yield optimal results in textureless regions, repetitive patterns, and discontinuities. The smoothness cost is used to smooth the unary cost. Many smoothing techniques have been proposed in the recent past. Most of these methods treat the disparity of a pixel as a random variable and encode the smoothness cost with a fixed constant. Here, the smoothness cost is estimated from neighbouring pixel information and penalizes inconsistent disparity values. The smoothness cost $S(d_p, d_q)$ between neighbouring pixels $p$ and $q$ is computed using (2). Let $P$ represent the pixels in the image. The initial disparity $d_p$ of each pixel $p \in P$ is estimated by minimizing the energy function in (3),

$$E(d) = \sum_{p \in P} C(p, d_p) + \sum_{p \in P} \sum_{q \in N(p)} S(d_p, d_q) \tag{3}$$

where $C(p, d_p)$ is the unary cost from (1) and $N(p)$ denotes the neighbours of $p$. The proposed method uses the max-product variant of loopy belief propagation (LBP) [20], written here in its equivalent min-sum form, to obtain the best disparity map. LBP assigns a label to each pixel by imposing global constraints through message passing. It is an iterative method in which messages are passed to the left, right, top, and bottom neighbours in each iteration. In each iteration $t$, the message passed from pixel $i$ to pixel $j$ is computed using (4),

$$m^{t}_{i \to j}(d_j) = \min_{d_i} \Big( C(i, d_i) + S(d_i, d_j) + \sum_{k \in N(i) \setminus j} m^{t-1}_{k \to i}(d_i) \Big) \tag{4}$$

Here $N(i) \setminus j$ represents all neighbours of $i$ except $j$. The belief is calculated by (5),

$$b_j(d_j) = C(j, d_j) + \sum_{k \in N(j)} m^{T}_{k \to j}(d_j) \tag{5}$$

The values of $d_j$ range from 0 to the maximum disparity, and $N(j)$ represents the neighbours of pixel $j$. The smooth disparity of pixel $j$ is the value of $d_j$ that minimizes $b_j(d_j)$ at the final iteration $T$. It was observed that the energy stopped decreasing after 10 iterations; hence the proposed algorithm uses 10 iterations.
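The message passing described above can be sketched as a min-sum loopy belief propagation on a 4-connected grid. Min-sum is the cost-domain (negative log) form of max-product. The exact form of the smoothness cost in (2) is not given in this text, so the sketch assumes a truncated linear penalty with illustrative parameters `lam` and `trunc`. The unary costs are assumed to be finite, and all names are hypothetical.

```python
import numpy as np

def loopy_belief_propagation(unary, lam=1.0, trunc=2.0, iters=10):
    """Min-sum LBP on a 4-connected grid; unary has shape (H, W, L), finite."""
    h, w, n_labels = unary.shape
    labels = np.arange(n_labels)
    # Assumed smoothness cost: truncated linear in the label difference.
    pairwise = lam * np.minimum(np.abs(labels[:, None] - labels[None, :]), trunc)
    # msgs[k, y, x] is the message pixel (y, x) receives from its
    # left (k=0), right (k=1), upper (k=2) or lower (k=3) neighbour.
    msgs = np.zeros((4, h, w, n_labels))
    for _ in range(iters):
        total = unary + msgs.sum(axis=0)
        new = np.zeros_like(msgs)
        for k, opposite in ((0, 1), (1, 0), (2, 3), (3, 2)):
            # Sender's cost, excluding what it heard from the recipient.
            send = total - msgs[opposite]
            # Minimize over the sender's label for every recipient label.
            out = (send[..., :, None] + pairwise).min(axis=2)
            out -= out.min(axis=-1, keepdims=True)  # keep values bounded
            if k == 0:
                new[0][:, 1:] = out[:, :-1]   # sent rightwards
            elif k == 1:
                new[1][:, :-1] = out[:, 1:]   # sent leftwards
            elif k == 2:
                new[2][1:, :] = out[:-1, :]   # sent downwards
            else:
                new[3][:-1, :] = out[1:, :]   # sent upwards
        msgs = new
    # Belief: unary cost plus all incoming messages; pick the cheapest label.
    belief = unary + msgs.sum(axis=0)
    return np.argmin(belief, axis=2)
```

The default of 10 iterations mirrors the number reported above. In this synchronous schedule every pixel sends to all four neighbours in each iteration.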

Disparity refinement using GAN
A GAN is used to refine the disparity map and to handle ill-posed regions. A GAN learns automatically by identifying patterns or irregularities in the input data, and it can handle missing data such as occluded pixels in the disparity map. The two sub-models of a GAN are the generator and the discriminator. The generator model generates new images, and the discriminator model tries to distinguish the generated images from real ones (see Figure 2). The proposed model uses the Pix2Pix GAN [32], which is simple and produces high-quality images in image translation applications. Its efficiency compared with other GANs such as CycleGAN [35] and DualGAN [36] is examined in the ablation study. The generator in Pix2Pix is a convolutional network that accepts the initial disparity map as the input image and passes it through several convolution and up-sampling layers. It produces a refined disparity map in which the occluded areas are filled with valid data. The U-Net auto-encoding generator is trained with an adversarial loss that encourages it to create plausible images. The encoder and decoder are made up of blocks of convolutional, batch normalization, and activation layers. The generator is also updated by a loss computed between the generated image and the ground truth image, which helps it create images that are closer to the ground truth. The generator G is trained to generate output that cannot be distinguished from the ground truth image by the discriminator D. The GAN objective is

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))]$$

Here $y$ denotes a ground truth image, $G(x)$ represents the generated image, and $x$ represents the initial disparity map. In $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$, $G$ aims to decrease the objective and $D$ aims to increase it. The generator G also tries to move the generated image closer to the ground truth image using the L1 loss, which is calculated as

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x, y}[\lVert y - G(x) \rVert_1]$$

The final objective is

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G)$$

Visual artifacts were reduced for $\lambda = 100$.
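The way the adversarial and L1 terms combine can be illustrated numerically. The sketch below is written with NumPy for clarity and is not the authors' PyTorch implementation. It assumes the discriminator outputs probabilities in (0, 1), and it uses the non-saturating generator loss (the generator maximizes log D of its output), which is a common practical substitute for minimizing log(1 - D). The function names are illustrative; `lam=100.0` follows the weight reported above.

```python
import numpy as np

def bce(prob, target):
    """Binary cross-entropy averaged over all discriminator outputs."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def generator_loss(d_fake, fake, real, lam=100.0):
    """Adversarial term (fool D) plus lam-weighted L1 distance to ground truth."""
    adv = bce(d_fake, np.ones_like(d_fake))
    l1 = float(np.mean(np.abs(real - fake)))
    return adv + lam * l1

def discriminator_loss(d_real, d_fake):
    """D is rewarded for scoring real pairs as 1 and generated pairs as 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))
```

With a weight of 100, a mean absolute error of only 0.01 already contributes 1.0 to the generator loss, so the L1 term dominates and keeps the refined disparity map close to the ground truth.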
The network is trained on images from the Middlebury dataset [33]. It is evaluated after 100, 200, 300, and 400 epochs of training, and the best disparity map is achieved at 300 epochs. The output of the generator is fed to the discriminator together with the ground truth image. The gradients of the loss are computed with respect to the generator and the discriminator to update the model. The trained model is then tested to produce the final disparity map. The results show that the best disparity map is obtained when the ill-posed regions are handled. Figure 3 shows the performance of the model in terms of training loss and training accuracy: Figure 3(a) depicts the training loss and Figure 3(b) the training accuracy against the number of epochs. The lower the loss, the better the accuracy.
To measure the efficacy of the proposed model, we deployed and tested it on a system with dual Intel Xeon E5-2609 v4 processors (8 cores, 1.7 GHz, 20 MB cache, 6.4 GT/s), 128 GB of memory, and dual NVIDIA Tesla P100 graphics processing units (GPUs) with 3,584 cores and a maximum of 18.7 TeraFLOPS. The proposed CDSN model is evaluated on images from the Middlebury dataset, which are pre-processed, rectified stereo images. The output of the 9th layer of the pre-trained VGG-16 architecture is used to estimate the initial disparity map with loopy belief propagation. The initial disparity estimation is implemented in Python, and the GAN is implemented in PyTorch. The Adam optimizer is used to train the Pix2Pix GAN for 300 epochs to handle the ill-posed regions, with the learning rate initialised to 0.0002. The complexity of the GAN model is summarized in Table 2.

RESULTS AND DISCUSSION
The proposed model is analyzed on images from the Middlebury dataset, namely "Jade plant", "Piano", "Pipes", and "Recycle". The test images and their resolutions are listed in Table 3. The Middlebury 2014 dataset contains 33 scenes that are divided into training, additional, and test images. Some scenes are captured more than once under different exposures. Very high-resolution images are the salient feature of the dataset. Ground truth maps and images are provided at quarter, half, and full resolution.

Qualitative comparison
The qualitative results of disparity map estimation are depicted in Figure 4. From top to bottom, the rows show Jade plant, Piano, Pipes, and Recycle. Figure 4(a) shows the left image, Figure 4(b) the right image, Figure 4(c) the ground truth disparity map, and Figure 4(d) the estimated disparity map.

Quantitative comparison
The percentage of bad matching pixels (PBMP) and the root mean square error (RMSE) were used for quantitative analysis. Lower values of PBMP and RMSE indicate better performance. PBMP is the percentage of pixels whose estimated disparity differs from the ground truth by more than a given threshold. For evaluation purposes, we compared the CDSN model with existing stereo matching models: DeepPruner [30], ACR-GIF-OW [17], and efficient stereo matching by log-angle and pyramid-tree (LPSM) [21]. Occluded areas are not dealt with efficiently in [30]. The method proposed in [17] is computationally inexpensive but does not produce accurate results in textureless and discontinuous areas. The method proposed in [21] relies on hand-crafted matching costs, and hence its results are not accurate. The Middlebury evaluation leaderboard results of the existing methods are used for comparison. The PBMP and RMSE results of the proposed model and the existing techniques are shown in Table 4 and Table 5, respectively. The PBMP and average RMSE of the proposed CDSN model are lower than those of all three compared methods. Hence, the proposed model outperforms the compared methods and is suitable for disparity map estimation.
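The two metrics can be computed as in the following sketch. The error threshold used in the paper is not stated in this text, so `threshold=1.0` is only an illustrative default. The sketch also assumes that pixels without ground truth are marked with a non-finite value and are excluded from both metrics. The function names and the `valid` parameter are hypothetical.

```python
import numpy as np

def pbmp(disp, gt, threshold=1.0, valid=None):
    """Percentage of valid pixels whose absolute error exceeds the threshold."""
    if valid is None:
        valid = np.isfinite(gt)  # assumed marker for missing ground truth
    err = np.abs(disp[valid] - gt[valid])
    return 100.0 * float(np.mean(err > threshold))

def rmse(disp, gt, valid=None):
    """Root mean square disparity error over valid pixels."""
    if valid is None:
        valid = np.isfinite(gt)
    err = disp[valid] - gt[valid]
    return float(np.sqrt(np.mean(err ** 2)))
```

PBMP counts how many pixels are wrong regardless of by how much, while RMSE is sensitive to the size of the errors, so the two metrics complement each other.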

Ablation study
We carried out an ablation study by comparing the proposed model with variants that use CycleGAN and DualGAN. CycleGAN performs image translation without paired examples and is trained in an unsupervised manner. DualGAN is made up of two generators and two discriminators and is trained to translate images from source to target and from target to source. The metrics used are absolute relative distance (ARD), squared relative difference (SRD), and RMSE; lower values indicate better performance. The proposed model performs significantly better on these metrics, as presented in Table 6.
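The exact formulas for ARD and SRD are not given in this text. The sketch below assumes the definitions commonly used in depth estimation benchmarks: the mean of |prediction - ground truth| / ground truth for ARD, and the mean of (prediction - ground truth)² / ground truth for SRD. Both are computed only over pixels with positive, finite ground truth, and the function names are illustrative.

```python
import numpy as np

def _valid(gt):
    """Pixels with usable ground truth (finite and strictly positive)."""
    return np.isfinite(gt) & (gt > 0)

def ard(pred, gt):
    """Absolute relative distance under the assumed definition."""
    m = _valid(gt)
    return float(np.mean(np.abs(pred[m] - gt[m]) / gt[m]))

def srd(pred, gt):
    """Squared relative difference under the assumed definition."""
    m = _valid(gt)
    return float(np.mean((pred[m] - gt[m]) ** 2 / gt[m]))
```

Both metrics normalize the error by the ground truth value, so an error of a given size counts for more where the true value is small.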

CONCLUSION
This paper presents a novel CNN based model for stereo matching that estimates the disparity map from rectified stereo images, which is useful in autonomous vehicles. The features extracted by the CNN are used to compute the unary and smoothness costs. The initial disparity map is obtained using loopy belief propagation and is then refined with a GAN model to handle the ill-posed regions. The proposed CNN based model generates disparity maps that are smoother than those generated by a naive model, and the GAN handles the ill-posed regions well. The proposed model is evaluated qualitatively and quantitatively on various images from the Middlebury stereo dataset. The results show that the proposed model achieves the best disparity maps among the compared methods and outperforms existing methods.