A high frame-rate cell-based histogram of oriented gradients human detector architecture implemented in field programmable gate arrays

ABSTRACT


INTRODUCTION
Image processing has been widely utilized to detect humans [1], [2]. However, detecting humans in an image is challenging because of the wide range of variation in human appearance [3]. Recognizing the presence of humans accurately requires extensive computation, and the system should remain highly accurate while maintaining low energy and resource consumption in real-time processing. With respect to accuracy, one of the best-known techniques for human detection is the histogram of oriented gradients (HOG) with a 64×128-pixel detection window. This combination of cell size, block size, window size, and block overlap gives the lowest miss rate compared to other sizes and overlaps [3]. The window is moved by one cell column or one cell row after each evaluation, and the pixels in each window are classified by a support vector machine (SVM) for human detection. In total, 3,285 windows must be analyzed in a 640×480-pixel image. To make the HOG algorithm suitable for hardware implementation, we modify it by proposing cell-based processing, cell derivatives with neighboring-edge anti-aliasing, magnitude calculation using a linear approach, fixed-weighted binning, block normalization using the Newton-Raphson algorithm, block-wise SVM classification, and fixed-point representation. Each technique is described in detail in the following sections.

Cell-based processing
Instead of window-based processing, we use cell-based processing for computing the derivative values in the x-direction (dx) and y-direction (dy). Figures 2(a) and 2(b) illustrate how window-based and cell-based processing are executed in raster scan, respectively. With cell-based processing, we can dramatically reduce derivative computation redundancy by skipping the overlapped cell computations inherent in window-based processing.
As shown in Figure 2(b), cell-based processing eliminates the large overlapped cell area, which results in low computational complexity as well as low memory-bandwidth requirements [28], while the derivative values (dx and dy) remain identical to those of the original window-based HOG algorithm. This works because the computed cell derivatives (dx and dy) can be reused across different windows instead of being recalculated for every cell inside a window each time a new window is evaluated. The proposed method differs from [22], where the overlapped data is handled using a complex pipeline stage. In our method, the calculated cell data is stored in temporary random-access memory (RAM); each time the system analyzes a new window, it fetches the corresponding cell results from the RAM, avoiding unnecessary recalculation.
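This reuse can be sketched in software as follows. It is a minimal illustration, not the paper's RTL: we assume 8×8-pixel cells and a simple central-difference kernel, and the function names are our own.

```python
import numpy as np

def cell_derivatives(image, cell=8):
    """Compute dx/dy once per pixel and store the results per cell,
    so overlapping windows reuse them instead of recomputing."""
    h, w = image.shape
    img = image.astype(np.int32)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # central difference [-1, 0, 1]
    dy[1:-1, :] = img[2:, :] - img[:-2, :]
    # Cache keyed by cell coordinates; this models the temporary RAM.
    cache = {}
    for cy in range(h // cell):
        for cx in range(w // cell):
            ys, xs = cy * cell, cx * cell
            cache[(cy, cx)] = (dx[ys:ys + cell, xs:xs + cell],
                               dy[ys:ys + cell, xs:xs + cell])
    return cache

def window_cells(cache, wy, wx, cells_y=16, cells_x=8):
    """A 64x128 window at cell position (wy, wx) just gathers its
    16x8 cached cells -- no derivative is ever recomputed."""
    return [cache[(wy + i, wx + j)]
            for i in range(cells_y) for j in range(cells_x)]
```

Shifting the window by one cell column changes only which cache entries are fetched, which is exactly why the overlapped computation disappears.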

Cell derivatives with neighboring-edge anti-aliasing
In the HOG algorithm, the derivative values (dx and dy) are computed for every pixel using the convolution kernel in (1). Since we use cell-based calculation, there are many cell edges within a window [29]. These edges may produce invalid dx and dy values for pixels located in border areas, because such pixels have only one adjacent pixel instead of two. To combat this problem, we assign the dx and dy values of pixels in the edge areas the same values as their neighboring pixels, as shown in Figure 3. We apply this method to pixels on both horizontal and vertical edges. As illustrated in Figure 3, grey squares represent pixels with distinct derivatives, while blue and yellow squares represent pixels whose derivatives are duplicated from adjacent pixels.
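A minimal sketch of the duplication step (the array layout and function name are our illustration, not the paper's circuit):

```python
import numpy as np

def replicate_edge_derivatives(d):
    """Copy derivative values from interior neighbors into the border
    rows/columns, where the central-difference kernel is undefined."""
    out = d.copy()
    out[:, 0]  = out[:, 1]    # left edge copies its right neighbor
    out[:, -1] = out[:, -2]   # right edge copies its left neighbor
    out[0, :]  = out[1, :]    # top edge copies the row below
    out[-1, :] = out[-2, :]   # bottom edge copies the row above
    return out
```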

Magnitude calculation using linear approach
The original HOG method uses the Euclidean distance (L2-norm) of the derivative values (dx and dy) to obtain the magnitude of each pixel. However, the L2-norm involves a square-root calculation [30], [31], which is very complex to implement in hardware [32]. This complicated computation therefore needs to be replaced with an approximation. In this work, we use a linear method (2) to calculate the magnitude instead of the L2-norm. As formulated in (2), the magnitude is denoted M(x, y). Our linear method significantly reduces computational complexity because it uses only addition and division-by-constant operations; the division by three can be implemented with simple shift and add operations, as described in (3).
As Figure 4 shows, this simplification delivers a satisfactory estimate of the actual L2-norm. Three different approximations (for calculating magnitude, angle, and distance) are used to avoid square roots and divisions in the HOG processing. In this work we do not quantify the effects of these simplifications; Figure 4 only compares the proposed linear method against the L2-norm graphically. The overall accuracy of the proposed system will be reported in future work, where we will provide error rates on a specific set of benchmarks.
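Since equations (2) and (3) are not reproduced here, the sketch below only illustrates the idea: a linear magnitude estimate of the form max + min/3 (the exact weighting in (2) may differ; this form is our assumption), with the division by three realized as a shift-and-add, as the text describes.

```python
def div3(x):
    """Approximate x/3 with shifts and adds: 1/3 = 0.010101..._2,
    truncated here to 2^-2 + 2^-4 + 2^-6."""
    return (x >> 2) + (x >> 4) + (x >> 6)

def magnitude(dx, dy):
    """Illustrative linear magnitude estimate (assumed form, not the
    paper's (2)): M = max(|dx|,|dy|) + min(|dx|,|dy|)/3, no sqrt."""
    ax, ay = abs(dx), abs(dy)
    big, small = (ax, ay) if ax >= ay else (ay, ax)
    return big + div3(small)
```

For example, `magnitude(30, 40)` yields 48 against an exact L2-norm of 50, using only integer shifts and additions.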

Fixed-weighted binning
For the histogram function, the pixel angle can be calculated using arctangent and division operations. However, computing the pixel angle this way results in a very complex computation that is not suitable for hardware implementation. Furthermore, since angles are computed for all pixels, there is a large amount of data to analyze, demanding many computation cycles and high latency. To cope with these requirements, we use a simplified method that assigns a fixed bin region for every 10°. For example, tan 10° = 0.1763269807 is approximated by 2⁻³ + 2⁻⁵ + 2⁻⁶ = 0.171875. Thus, we have 9 bins with their approximated tangent values, as shown in Table 2. The pixel angle of pixel A(x, y) is computed from the derivative values dx and dy by comparing dy against these tangent-weighted multiples of dx, and the tangent multiplications are realized with bit-shift and addition operations in hardware. Given the computed pixel value A(x, y), the histogram is calculated using the rules described in Table 3.
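This fixed-region comparison can be sketched as follows. Only the tan 10° shift-add decomposition comes from the text; the other boundary values are derived greedily here for illustration, not taken from Table 2.

```python
import math

def tan_shift_add(theta_deg, terms=3):
    """Greedy sum-of-powers-of-two approximation of tan(theta), in the
    spirit of the paper's tan 10 deg = 2^-3 + 2^-5 + 2^-6 example.
    Negative s means a left shift (powers >= 1)."""
    target = math.tan(math.radians(theta_deg))
    approx, shifts = 0.0, []
    for s in range(-3, 12):
        if len(shifts) < terms and approx + 2.0 ** -s <= target:
            approx += 2.0 ** -s
            shifts.append(s)
    return approx, shifts

def angle_bin(dx, dy, boundaries):
    """Assign a pixel to a bin by comparing |dy| against tan-weighted
    multiples of |dx|, avoiding arctangent and division. In hardware
    each t*ax product becomes shifts and adds."""
    ax, ay = abs(dx), abs(dy)
    for i, t in enumerate(boundaries):
        if ay <= t * ax:
            return i
    return len(boundaries)
```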

Block normalization using Newton-Raphson method
Several normalization methods can be employed to normalize the histogram, such as the L2-norm and the Manhattan distance (L1-norm) [36]-[38]. The L1-norm is more suitable for hardware implementation because, unlike the L2-norm, it does not require square-root operations, although further simplification is still needed. Vector normalization is obtained using (4), where L1-sum is the Manhattan-distance summation.
Since vnorm = v×d, the distance d is stated as (5). To calculate d, the Newton-Raphson approximation (8) is used; it is derived from (6) and (7), which give the formulas for x0 and x1 in the Newton-Raphson digital blocks, where n is defined as n = MSB(sum). For instance, if sum = 13, then n = bit 4. The result is delivered as a binary fraction.
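A minimal model of this reciprocal computation (floating point here for clarity; the hardware works in fixed point, and the iteration count used in (8) is assumed to be small):

```python
def reciprocal_nr(s, iterations=1):
    """Newton-Raphson reciprocal d = 1/s via x_{k+1} = x_k * (2 - s*x_k),
    seeded with x0 = 2^-n where n is the MSB position of s."""
    assert s > 0
    n = s.bit_length()            # e.g. sum = 13 -> MSB is bit 4
    x = 2.0 ** -n                 # initial guess, already within 2x of 1/s
    for _ in range(iterations):
        x = x * (2.0 - s * x)     # each step roughly doubles the precision
    return x
```

For sum = 13 the seed x0 = 2⁻⁴ = 0.0625 and a single step gives x1 = 0.07421875, already close to 1/13 ≈ 0.076923; the normalized vector is then obtained as vnorm = v×d.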

Blockwise SVM classification
The idea of this method is to multiply the SVM coefficients block-wise instead of per-window. Note, however, that block#1 corresponds only to window#1, while block#2 corresponds to both window#1 and window#2; thus block#2 is used in the SVM classification of both windows, and the same applies to every block shared by multiple windows. Section 3 (results and discussion) further explains the hardware design, which uses a pipelined architecture to handle these calculations. The SVM coefficients are trained with the simplified algorithm. We used the LIBSVM library to train our SVM on the Massachusetts Institute of Technology (MIT) pedestrian dataset, then examined the linear SVM and retrained on the false positives. The classifier was trained on 924 positive and 13,680 negative images.
ISSN: 2252-8938
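The block-wise accumulation can be sketched as below. The data layout (dicts mapping block indices to window/position pairs) is our illustration of the idea, not the paper's memory organization.

```python
def blockwise_svm(blocks, weights, windows, bias):
    """Accumulate SVM partial scores block by block. `windows` maps each
    block index to the (window, position-in-window) pairs that contain
    it, so a shared block (e.g. block #2 in windows #1 and #2) is
    consumed once and contributes a partial score to every window."""
    scores = {}
    for b_idx, feat in blocks.items():
        for w_idx, pos in windows.get(b_idx, []):
            partial = sum(f * w for f, w in zip(feat, weights[pos]))
            scores[w_idx] = scores.get(w_idx, 0.0) + partial
    # A window contains a person when its biased score is non-negative.
    return {w: s - bias for w, s in scores.items()}
```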

Fixed-point representation
Fixed-point arithmetic is used to represent fractional data. The data width of each module (input pixel, derivatives, magnitude, and so on) is listed in Table 4. Each bit-width was determined by searching for the shortest width that causes no overflow and does not disturb the calculation results.
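For illustration, fixed-point quantization and the minimum-width search might look like the following (a simple truncating quantizer is assumed; the paper's exact widths are in Table 4):

```python
def to_fixed(x, frac_bits):
    """Quantize a real value to an integer with `frac_bits` fractional
    bits (truncation, as in simple hardware)."""
    return int(x * (1 << frac_bits))

def from_fixed(q, frac_bits):
    """Recover the real value represented by fixed-point integer q."""
    return q / (1 << frac_bits)

def min_int_bits(max_magnitude):
    """Smallest signed bit-width that holds values up to max_magnitude
    without overflow -- the search described in the text."""
    return max_magnitude.bit_length() + 1   # +1 for the sign bit
```

For example, a central-difference derivative of 8-bit pixels spans ±255, so `min_int_bits(255)` gives the 9-bit signed width consistent with the dx/dy width quoted later.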

Hardware architecture implementation
Our system block diagram is shown in Figure 5. The input of the system is a 640×480 image; the output is a grayscale image on an external display, with markings over the parts of the image believed to contain human figures. The system comprises a control unit, derivative, gradient binning using the linear approach, cell grouping, histogram normalization using the L1-norm, and sliding-window and SVM classification modules. To increase processing speed and enable real-time operation, we applied a pipelined architecture, as reported in Table 5. The M9K is the memory block of the Altera DE2-115 FPGA. The pipelined architecture also reduces memory bandwidth, since not all cell and block values need to be stored, only the blocks corresponding to the windows being processed at the time. Consequently, we can use embedded (internal) RAM as processing storage instead of external RAM, which translates to a significant reduction in pin utilization and power consumption. Table 6 shows the benefits of the pipelined mode in the proposed hardware, obtained from the synthesis process. Pipelining reduces the clock-cycle latency dramatically: the overall latency without pipelining is 13,112 cycles, obtained by summing the latencies of three digital blocks (the cell grouping, block normalization, and SVM classification processes). Note, however, that pipelining does not improve the individual latencies of the SVM classification, block normalization, and cell grouping modules themselves.
The control unit module generates the read and write addresses for the embedded RAM used in each module. It plays an important role in our pipelined architecture, since RAM access scheduling must be accurate at all times. The embedded RAMs used and their memory sizes are listed in Table 6.
The derivative module calculates the cell-based derivatives (dx and dy) of the image; this block contains the pixel derivative unit shown in Figure 6 and the anti-aliasing filter shown in Figure 3. The anti-aliasing filter applied to pixels at the image edges overcomes the zero-padding convolution problem. The structure of this stage is shown in Figure 7: our architecture calculates the derivatives and gradient bins for 64 pixels simultaneously, which significantly reduces clock latency.
The gradient binning module consists of a rotator, a magnitude calculator, and a binning unit, as shown in Figure 8. The rotator unit comprises four digital blocks (inverse, bit extender, comparator, and multiplexer units), as shown in Figure 9. The inverse unit negates a number using two's-complement notation. The bit extender represents a number with more bits without altering its value; dx and dy use 12 bits instead of 9 bits to avoid overflow, since the magnitude calculator and binning unit shift and add the dx and dy values. The comparators and multiplexers determine the quadrant of dx and dy. The magnitude calculator performs the approximation in (3) using four digital blocks (multiplexer, right shifter, comparator, and adder), as shown in Figure 10. Finally, all magnitudes are grouped and summed into their respective bins based on the rule specified in (5). The output of this module is cell histogram data consisting of 9 bins × 15 bits, concatenated into a single line as in Figure 11.
The cell grouping module groups cells into a block and performs block normalization. To group cells in raster-scan sequence, we use a line delay, as shown in Figure 12. It is implemented using RAM to delay 80 cells of data, which is the cell-row size of the input image, as shown in Figure 1. Each cell consists of 9 bins × 15 bits of data.
With this configuration, we can group the four cells of data that comprise a single block. For example, when cell#82 is fed to the module, block#1, which consists of cells #1, #2, #81, and #82, is constructed. The block data is then normalized using combinational circuits.
To minimize memory resource usage, we use only a line-delay RAM of 80 cells to hold the portion of the 4,800 cells needed to form blocks. The first 80 cells are stored in the RAM initially. When cell#81 is fed into the module, the module reads cell#1 from the RAM and overwrites it with cell#81. Block#1 is fully constructed when cell#82 is dispensed to the module. Starting the block processing takes 82 clock cycles of latency: 1 cycle for the register at the line-delay input, 1 cycle for the register at the output, and 80 cycles to fill the line-delay RAM. After the first 82 cells have been provided, the module takes 3 clock cycles to generate a block. The complete block diagram is depicted in Figure 13. As stated before, our architecture avoids division operations, since they are too complex for hardware implementation; each block is normalized using the L1-norm to simplify the calculation. The module contains an L1-sum unit to find the Manhattan distance of the 9 bins × 4 cells (36 bin values in total). The Newton-Raphson algorithm is then used to approximate d as in (5). This part uses two multipliers, two subtractors, and bit shifters. Finally, the computed vectors are multiplied by d and concatenated into a single line of data as in Figure 14. This step consumes 1 clock cycle, and the results are stored in the RAM.
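The line-delay grouping can be modeled in a few lines; in this sketch a deque stands in for the 80-cell line-delay RAM, and the block tuple layout is our own convention.

```python
from collections import deque

ROW_CELLS = 80   # cells per image row (640 pixels / 8-pixel cells)

def group_blocks(cell_stream):
    """Form 2x2-cell blocks from a raster-scan cell stream using a
    single line delay of ROW_CELLS entries, as in the RAM-based design:
    the delayed cell is read out just before it is overwritten."""
    line = deque(maxlen=ROW_CELLS)   # models the line-delay RAM
    prev_above = prev_cur = None     # registers holding the previous column
    blocks = []
    for i, cell in enumerate(cell_stream):
        # The cell one row above is the oldest entry once the RAM is full.
        above = line[0] if len(line) == ROW_CELLS else None
        line.append(cell)            # overwrite: new cell replaces old
        if above is not None and i % ROW_CELLS > 0:
            # (top-left, bottom-left, top-right, bottom-right)
            blocks.append((prev_above, prev_cur, above, cell))
        prev_above, prev_cur = above, cell
    return blocks
```

Streaming cells numbered 1..160 (two rows), the first block emitted is (1, 81, 2, 82), matching the text: block#1 appears when cell#82 arrives.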
The sliding window works in pace with the block-wise SVM classification. We designed a highly parallel, pipelined architecture able to calculate 7 windows simultaneously. The SVM classification is done column-wise, because some columns are used to calculate more than one window. For example, column #1, which consists of blocks #1, #81, #161, ..., #1121, is used to calculate window #1 only, while column #2, which consists of blocks #2, #82, #162, ..., #1122, is used to calculate both windows #1 and #2, and so on. After the 7 columns of a window, comprising 105 blocks, have been calculated, the score is reduced by the SVM bias, and the comparator decides from the sign bit whether a person is present. The system then continues to analyze a new window until the whole image has been inspected. The hardware architecture is shown in Figure 15.
The final module of this design is the display module. Working with the video graphics array (VGA) controller, it generates read addresses for the frame-buffer RAM (input image) and the detection RAM (HOG results). There are no special requirements for the display module, as its only function is to draw a detection box over the detected objects. It uses 640×480 pixels with a 25 MHz VGA clock, and both RAMs are read with the same clock. Counting the pixels including the front and back porches, the VGA scans 800×600 pixel periods per frame; at 25 MHz this yields 25,000,000/(800×600) ≈ 52 frames per second (Fps). The design must therefore operate at a minimum of 52 Fps to fit the VGA configuration. The display module also generates markings on windows considered to contain human figures.

Performance implementation
To evaluate the effectiveness of our architecture, we implemented the design on an FPGA, using the Altera DE2-115 board (Cyclone IV EP4CE115 FPGA chip) connected to an external VGA display to show the resulting image. We tested several 640×480-pixel color images to verify system functionality. For static detection, we chose images containing relatively small pedestrians, around 128×64 pixels, rather than pedestrians filling the full 640×480 image. The image in Figure 16(a) shows the best pedestrian detection: the pedestrians are quite similar to our positive training dataset. The image in Figure 16(b) has varied lighting conditions; however, since HOG uses gradient features, our detector still detects the pedestrians and does not interpret the shadows as humans. On the other hand, in Figure 16(c), the HOG detector may not detect varied poses reliably, because their gradients vary; to increase detector performance, images containing varied poses should be added to the training dataset. Figure 17(a) shows the FPGA displaying the detection result on a monitor, with the marks drawn on humans indicating successful detections. Our architecture is coded in Verilog hardware description language (HDL). We used a top-down approach to design the system architecture and a bottom-up approach to code the hardware modules, designing and testing each sub-module before integration. Figure 17(b) shows the flow summary of the analysis and synthesis results. The design consumes 48,360 logic elements (LEs), 4,363 registers, 84 9-bit embedded multipliers, and only 0.141 Mbits of memory. Our architecture requires 4,888 clock cycles to complete the detection of one frame, plus 640×480 = 307,200 clock cycles to receive the image data and store it in the frame buffer. Therefore, the overall system needs 307,200 + 4,888 = 312,088 clock cycles to process one frame of image data.
It is important to note that the frame buffer embedded RAM is a huge speed bottleneck as it is only able to deliver one pixel every clock cycle.
The TimeQuest Timing Analyzer shows a maximum operating frequency of 28.62 MHz for the design. Setting the system clock to 27 MHz, we obtain 86.51 Fps (27,000,000/312,088). Since the VGA refreshes at 52 Fps, our design is suitable for VGA output, because a frame is always available at every refresh. However, the Fps could be improved significantly by reducing the frame-buffer latency, since most of the memory resources are consumed by the input buffer: the processing unit alone (without the frame buffer) can deliver 5,523.732 Fps (27,000,000/4,888). This indicates that the frame-buffer latency severely overshadows the actual capability of the processing unit. It may also be possible to reduce resources in particular modules while keeping the algorithm consistent at a smaller throughput. As shown in Figure 15, the buffer acts as a first-in-first-out (FIFO), so a static random-access memory (SRAM) could replace it, saving gate count and power: a FIFO toggles continuously and consumes power, whereas an SRAM activates only one cell to access data, and ping-pong mode lets an SRAM read and write concurrently. The detection performance of the proposed algorithm will be compared with a software implementation of the original HOG algorithm in future work; this remains an open challenge. Tables 7 and 8 compare the performance with other works. Our strongest points relative to the others are frames per second (Fps) and the Fps-to-clock ratio. All the competitors use standard FPGA boards. Compared to the earlier works, our implementation performs significantly well in terms of the ratio of delivered Fps to operating clock frequency. This is made possible by our pipelined architecture and custom-designed hardware modules rather than general-purpose processors.
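The throughput arithmetic in this paragraph can be checked directly (all values are taken from the text):

```python
CLOCK_HZ   = 27_000_000          # chosen system clock
FRAME_LOAD = 640 * 480           # cycles to fill the frame buffer
DETECT     = 4_888               # cycles for one frame of detection

cycles_per_frame = FRAME_LOAD + DETECT        # 312,088 cycles per frame
fps_system = CLOCK_HZ / cycles_per_frame      # ~86.51 Fps end to end
fps_core   = CLOCK_HZ / DETECT                # ~5,523.73 Fps without buffer
```

The two Fps figures differ by a factor of about 64, confirming that the single-pixel-per-cycle frame buffer, not the HOG pipeline, bounds the frame rate.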
Processors may be smaller in size, but dedicated modules are more efficient than processors. Moreover, not all cells and blocks are kept in RAM concurrently; unused data is overwritten to minimize memory usage. Additionally, the low operating clock frequency translates to lower power consumption.

Performance comparison
Working at a low frequency surely allows power consumption to be reduced. For instance, in Table 7, [23] has a much higher image resolution (1920×1080) and operating clock (270 MHz) but a lower frame rate (64 Fps), whereas the proposed work has a lower image resolution (640×480), a ten-times-lower operating clock (27 MHz), and a higher frame rate (86.51 Fps); this is an architectural design trade-off. [39] has the highest Fps (162 Fps), but its operating clock is also the highest (150 MHz), resulting in a lower Fps-to-clock ratio. In contrast, the proposed work has the highest Fps-to-clock ratio (3.2041), together with a lower frame rate, a lower operating clock (27 MHz), more efficient memory usage (141,872 bits), and the fewest registers (4,363). Using the same image resolution (640×480), [40] has the fewest embedded multipliers (40 DSP blocks) and an operating frequency (25 MHz) close to ours, but our architecture delivers a higher Fps and Fps-to-clock ratio while remaining resource-efficient. In summary, it is safe to say that our architecture offers the best trade-off among previous works in terms of operating clock frequency (MHz), Fps (Hz), Fps-to-clock ratio, and register usage.

CONCLUSION AND FUTURE WORKS
This paper presents a hardware architecture that implements a simplified HOG algorithm. We designed a cell-based raster-scan computation, instead of a window-based one, to reduce computation redundancy. The magnitude calculation using a linear approach provides a reasonable approximation of the magnitude without exponentiation or square-root operations, since the L2-norm is too difficult to implement in hardware. By using fixed-weighted binning for the histogram, we avoid arctangent and division operations, and by using the Newton-Raphson algorithm, we execute block normalization without any division. Finally, the overall parallel and pipelined architecture yields accurate detection with low memory usage and a maximal Fps-to-clock-frequency ratio. This work used the MIT pedestrian dataset for training. The primary contribution of this work is the simplification of HOG for an efficient hardware implementation. This simplification certainly degrades the performance of the original HOG to some extent, and it is important to examine in detail how the original HOG compares with this simplified version. We will also address several interesting issues, e.g., the impact of the computational reductions on detection accuracy, the throughput achieved by GPU implementations, and a more objective figure of merit (FOM), in order to demonstrate that the proposed hardware architecture is more efficient than its competitors. Later, the effect of accuracy improvements on computational cost and system complexity will be evaluated further. In recent years, various other challenging datasets have been introduced by many researchers; we will therefore use publicly available datasets, as well as datasets produced by ourselves, to train and evaluate our proposed system comprehensively.