Pedestrian detection implementation on HLS
Figure 4 shows the simulation results from the HLS tool for pedestrian detection using HoG + SVM. An input image containing a pedestrian is fed to the code as the test input, and the output image with the detected pedestrians is displayed. Each result has two parts. In the first, the raw detections place many overlapping bounding boxes around the same pedestrian. In the second, the overlapping boxes are suppressed, leaving only the main detection box for each pedestrian.

Figure 4: Simulation result from HLS tool. (A,B) Two different input images and the resultant images with the detected pedestrians.
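The suppression of overlapping boxes described above can be illustrated with greedy non-maximum suppression (NMS), a common technique for this step. This is a minimal sketch and not necessarily the exact method used in the supplementary code. The box coordinates, the scores, and the `iou_threshold` of 0.3 are hypothetical values chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float],
                        iou_threshold: float = 0.3) -> List[int]:
    """Return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: three windows on one pedestrian, one on another.
boxes = [(100, 50, 164, 178), (104, 54, 168, 182),
         (96, 48, 160, 176), (300, 60, 364, 188)]
scores = [0.90, 0.80, 0.70, 0.95]
kept = non_max_suppression(boxes, scores)
print(kept)  # indices of the boxes that survive suppression
```

A lower threshold suppresses more aggressively, which helps with duplicate boxes but can merge detections of pedestrians standing close together.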
The HLS tool also provides synthesis reports for timing and resource utilization. The timing summary gives the estimated clock period of the design and the minimum and maximum latency in clock cycles. This information is useful for estimating the execution time of the design and for choosing the clock frequency for the actual hardware implementation. Table 2 below shows the timing report after HLS synthesis. The target clock period was 6 ns and the estimated period is 5.25 ns, which is below the target. The clock period can therefore be 6 ns or longer, but it should not be set below the estimated 5.25 ns.
| Timing Summary | Target | Estimated |
| Clock | 6.00 ns | 5.250 ns |
| Utilization Summary | Total / Available | Percentage of Utilization |
| BRAM18K | 22 / 432 | 5% |
| DSP48E | 13 / 360 | 3% |
| FF | 5611 / 141120 | 3% |
| LUT | 9904 / 70560 | 14% |
| URAM | 0 | 0 |
Table 2: Estimated timing and resource utilization report from HLS tool for pedestrian detection using HoG-SVM.
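The timing figures in Table 2 can be converted into clock frequencies and a timing margin with simple arithmetic. The following is a minimal sketch. The variable and function names are illustrative, and only the two period values come from Table 2.

```python
def period_ns_to_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


target_ns = 6.00      # target clock period from Table 2
estimated_ns = 5.250  # estimated clock period from Table 2

margin_ns = target_ns - estimated_ns       # time left within the target period
target_mhz = period_ns_to_mhz(target_ns)   # frequency implied by the target
max_mhz = period_ns_to_mhz(estimated_ns)   # upper bound implied by the estimate

print(f"margin: {margin_ns:.2f} ns")
print(f"target frequency: {target_mhz:.1f} MHz")
print(f"estimated maximum frequency: {max_mhz:.1f} MHz")
```

The estimate comes from HLS synthesis only, so the frequency achievable after placement and routing can be lower.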
Table 2 also shows the utilization report, which gives the percentage utilization of the main on-chip FPGA resources for the selected target board. For this pedestrian detection design, the report shows that the design consumes 14% of the look-up tables (LUTs), 3% of the flip-flops (FFs), 3% of the digital signal processing (DSP) blocks, and 5% of the block random access memory (BRAM). These figures are only estimates calculated by the HLS tool. The utilization after actual implementation can differ from them.
Actual implementation results from hardware programming
After the code is packaged as an IP, imported into the FPGA programming tool, and implemented on the actual FPGA hardware, several further reports are generated. The first is the timing summary, which shows whether the design can run at the clock frequency provided to it. If all the timing constraints are met and there are no violations, the design can proceed. Table 3 below shows the timing summary generated by the tool. The worst negative slack (WNS) is 4.073 ns. Despite its name, a positive WNS means that the slowest path still has this much time to spare within the clock period. A negative value would indicate that some path needs more time than the clock period allows, that is, the clock is too fast for the design. Since none of the slack values is negative in this case, the timing constraints are met.
| Design Timing Summary | Metric | Value |
| Setup | Worst Negative Slack | 4.073 ns |
| Hold | Worst Hold Slack | 0.010 ns |
| Pulse Width | Worst Pulse Width Slack | 3.500 ns |
Table 3: Actual timing summary for pedestrian detection on FPGA board.
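The rule for reading the slack values in Table 3 can be written as a small check. This is a sketch under the usual convention that timing is met when no slack value is negative. The function name `timing_met` and the failing example are hypothetical.

```python
def timing_met(setup_slack_ns: float, hold_slack_ns: float,
               pulse_width_slack_ns: float) -> bool:
    """Timing is met when none of the worst-case slack values is negative."""
    return min(setup_slack_ns, hold_slack_ns, pulse_width_slack_ns) >= 0.0


# Values from Table 3.
reported = timing_met(4.073, 0.010, 3.500)

# Hypothetical failing case: the slowest setup path overshoots by 0.2 ns.
failing = timing_met(-0.200, 0.010, 3.500)

print(reported, failing)
```

The hold slack of 0.010 ns is small but still non-negative, so it does not count as a violation.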
The tool also generates resource utilization reports, which give the actual utilization of the on-board resources for the selected FPGA board. In this case, the selected board is a Zynq UltraScale+ MPSoC (Multi-Processor System on Chip) based FPGA development board (ref. 27). Table 4 below shows the resource utilization, and Figure 5 shows it in diagrammatic form.
The utilization summary gives the actual consumption of the on-board resources when 8 HoG IPs are used in parallel, whereas the estimates reported by the HLS synthesis were for a single HoG IP. Even with 8 IPs, every resource except the LUTs (57.45%) stays below 50% utilization. Table 4 lists the utilization and the utilization percentage for each resource, which are represented pictorially in Figure 5.
| Resource | Utilization | Available | Utilization % |
| LUT | 40536 | 70560 | 57.45% |
| LUTRAM | 7304 | 28800 | 25.36% |
| FF | 33342 | 141120 | 23.63% |
| BRAM | 68 | 216 | 31.48% |
| DSP | 128 | 360 | 35.56% |
| BUFG | 2 | 196 | 1.02% |
Table 4: Actual utilization report for pedestrian detection on FPGA board.
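The percentages in Table 4 follow from dividing the used count by the available count for each resource. The short sketch below recomputes them from the table and lists the resources above 50%. The dictionary layout and the names are illustrative.

```python
# (used, available) pairs taken from Table 4.
table4 = {
    "LUT": (40536, 70560),
    "LUTRAM": (7304, 28800),
    "FF": (33342, 141120),
    "BRAM": (68, 216),
    "DSP": (128, 360),
    "BUFG": (2, 196),
}

# Utilization percentage = used / available * 100.
percent = {name: 100.0 * used / avail for name, (used, avail) in table4.items()}

# Resources that use more than half of what the device offers.
over_half = [name for name, p in percent.items() if p > 50.0]

for name, p in percent.items():
    print(f"{name}: {p:.2f}%")
print("above 50%:", over_half)
```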

Figure 5: Resource utilization for pedestrian detection on FPGA board after actual implementation. Look-up tables (LUT): 57%, LUTRAM: 25%, Flip-flops (FF): 24%, Block RAM (BRAM): 31%, Digital signal processors (DSP): 36%, Buffers: 1%.
The third report covers the power estimates for the design. Figure 6 below shows the power report: the total on-chip power is 2.435 W, and the report also gives the junction temperature and the power consumed by each major net and component. No component shows an unusually high power draw, so the power consumption of the design can be considered low.

Figure 6: Power estimation for pedestrian detection on FPGA board after actual implementation. The power report generated by the tools gives the total consumed power as 2.435 W and shows the distribution of the power among the various resources on the FPGA board.
A further analysis examines the advantage of using 8 HoG IPs in the block diagram shown in Figure 3, rather than a single HoG IP or more than 8. The hardware-related performance metrics were calculated for a single HoG IP and for 8 HoG IPs in parallel. Table 5 below shows the comparison.
| Performance Metric | 1 IP | 8 IPs |
| Timing (ns) | 5.312 | ~5.25 |
| Freq (MHz) | 188 | 150 |
| Power (W) | 1.9 | 2.43 |
| LUTs | 4998 | 40536 |
| FF / Registers | 4031 | 33342 |
| DSP | 16 | 128 |
| BRAM | 8.5 | 68 |
| FPS | ~10–11 | 83 |
Table 5: Comparison of performance metrics using single vs multiple HoG IPs.
Table 5 shows that resource usage (LUTs, FFs, DSPs, and BRAM) scales approximately linearly between a single HoG IP and 8 HoG IPs, with roughly an 8-fold increase in every resource. This is expected, as more IPs consume more resources. The maximum frequency degrades by about 20%, from 188 MHz to 150 MHz. This is also expected, as more blocks lead to more connections and longer routing paths, which lengthen the critical path. In return, the frame rate improves from about 10 FPS to 83 FPS, roughly an 8-fold gain that tracks the number of parallel HoG IPs. The power rises only from 1.9 W to 2.43 W, so the energy spent per frame drops considerably, indicating improved energy efficiency through parallelism. Thus, the introduction of 8 HoG IPs is beneficial for the design. Scaling beyond 8 would overconsume resources: with LUT utilization already at 57%, doubling the number of IPs would, under the same linear scaling, exceed the available LUTs. Hence, more than 8 blocks are not considered favorable.
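The scaling behavior discussed above can be checked directly from the numbers in Table 5. The sketch below rests on two assumptions that are not in the original text: the single-IP frame rate is taken as 10.5 FPS (the midpoint of the reported range of about 10 to 11), and the projection to 16 IPs assumes that LUT usage keeps scaling linearly. The available LUT count is taken from Table 4.

```python
# Values from Table 5. The single-IP FPS of 10.5 is an assumed midpoint.
single = {"LUT": 4998, "FF": 4031, "DSP": 16, "BRAM": 8.5,
          "power_w": 1.9, "fps": 10.5, "freq_mhz": 188}
eight = {"LUT": 40536, "FF": 33342, "DSP": 128, "BRAM": 68,
         "power_w": 2.43, "fps": 83, "freq_mhz": 150}

# Ratio of the 8-IP design to the single-IP design for every metric.
ratio = {key: eight[key] / single[key] for key in single}

# Energy per frame in joules = power (W) / frame rate (frames per second).
energy_single = single["power_w"] / single["fps"]
energy_eight = eight["power_w"] / eight["fps"]

# Assumed linear projection of LUT usage to 16 IPs against the device limit.
luts_available = 70560
luts_16_ips = eight["LUT"] * 16 // 8
fits_16_ips = luts_16_ips <= luts_available

for key, value in ratio.items():
    print(f"{key}: x{value:.2f}")
print(f"energy per frame: {energy_single:.3f} J -> {energy_eight:.3f} J")
print(f"projected LUTs for 16 IPs: {luts_16_ips} (fits: {fits_16_ips})")
```

Under these assumptions, the resources and the frame rate both grow by about a factor of 8, while power grows by less than a factor of 1.3.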
Pedestrian detection results after FPGA implementation
Finally, the entire system is integrated on the FPGA board and the bitstream file is generated, which is then programmed onto the board. The board boots from an SD card image that provides Python programmability. Once the board has booted from the SD card, the Jupyter interface can be accessed, and Python code can be written and run on the platform. The Python code was run and tested for pedestrian detection on different input images. The results for a few images are shown in Figure 7 below. The images are taken from the INRIA dataset as well as from pedestrian images obtained from open-source online sources (refs. 26, 27).

Figure 7: Pedestrian detection results on still images through FPGA board. The tested images include images from the INRIA dataset and open-source images available on Google, used to test the detection accuracy on crowded streets of India.
The system is also tested on frames captured in real time through a web camera and on pre-recorded video inputs of pedestrians. The results are shown in Figure 8 and Figure 9. Figure 8 shows a set of example frames captured by the web camera with the pedestrian detection results for each frame, whereas Figure 9 shows the pedestrian detection results for an input video provided to the system.

Figure 8: Pedestrian detection results on frames captured by a camera in real time through the FPGA board. Real-time video capture through a 720p web camera, demonstrating the real-time detection of pedestrians. The images appear blurred because the snapshots were taken from the ongoing live video.

Figure 9: Pedestrian detection results on videos provided as input to the FPGA board. The videos were taken from open-source links.
Estimation of performance metrics
To analyze the performance of the implemented design, it is essential to calculate suitable performance metrics. The metrics for a detection algorithm depend on the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From these values, precision, recall, F1 score, false positives per image (FPPI), and accuracy can be calculated as per the equations given below. Most research papers report detection performance through accuracy. However, accuracy involves TN and can therefore be misleading, because TN cannot be counted reliably: it is the number of detection windows in an image that contain no pedestrian and that the algorithm also reports as non-detections. This number is generally very large, as the total number of detection windows in an image is large and the background areas in every image usually contain no pedestrians. From the accuracy formula among equations [1] – [5], it follows that when TN is much larger than TP + FP + FN, accuracy is close to 1 regardless of how well the pedestrians themselves are detected. To evaluate the performance properly, it is better to report precision, recall, and F1 score, which do not depend on TN and are therefore more informative.
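$$\text{Precision} = \frac{TP}{TP + FP}$$ [1]
$$\text{Recall} = \frac{TP}{TP + FN}$$ [2]
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$ [3]
$$\text{FPPI} = \frac{FP}{\text{Number of images}}$$ [4]
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$ [5]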
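The effect of a large TN count on accuracy can be illustrated numerically. The sketch below uses the standard definitions of accuracy and precision together with the TP, FP, and FN counts reported in Table 6. The TN values are hypothetical, chosen only to show the trend, since TN is not reported.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard accuracy: correct decisions over all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """Standard precision: correct detections over all detections."""
    return tp / (tp + fp)


tp, fp, fn = 143, 39, 19  # counts reported in Table 6

# Hypothetical TN counts: accuracy rises with TN while precision is unchanged.
results = {tn: accuracy(tp, tn, fp, fn) for tn in (0, 1_000, 100_000)}

for tn, acc in results.items():
    print(f"TN = {tn:>7}: accuracy = {acc:.4f}, precision = {precision(tp, fp):.4f}")
```

With the same detections, accuracy moves from about 0.71 to above 0.999 purely because more background windows are counted, which is why metrics that exclude TN are preferred here.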
To find the values of TP, FP, and FN for this paper, the experiment on still images was repeated on a large number of images. From the results for every image, three counts were obtained: the true positives, which are the pedestrians detected correctly, the false positives, which are the detections that do not correspond to a pedestrian, and the false negatives, which are the actual pedestrians that went undetected. The resulting values are shown in Table 6 below.
| Performance Metric | Value |
| TP | 143 |
| FP | 39 |
| FN | 19 |
| Precision | 0.786 (78.6%) |
| Recall | 0.883 (88.3%) |
| F1 Score | 0.831 (83.1%) |
| FPPI | 0.867 |
Table 6: Performance metrics for the FPGA-based implementation of the pedestrian detection algorithm.
Table 6 thus summarizes the detection performance of the pedestrian detection algorithm on the hardware platform through precision, recall, F1 score, and FPPI.
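The values in Table 6 can be reproduced from the TP, FP, and FN counts using the standard definitions of the metrics. In the sketch below, the number of test images (`n_images = 45`) is an assumption inferred from the reported FPPI of 0.867 and FP of 39. It is not stated in the text. The function name `detection_metrics` is illustrative.

```python
def detection_metrics(tp: int, fp: int, fn: int, n_images: int) -> dict:
    """Compute precision, recall, F1 score, and false positives per image."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fppi = fp / n_images
    return {"precision": precision, "recall": recall, "f1": f1, "fppi": fppi}


# TP, FP, FN from Table 6. n_images = 45 is an inferred assumption (39 / 0.867).
metrics = detection_metrics(tp=143, fp=39, fn=19, n_images=45)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```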
Performance comparison with existing FPGA-based HoG implementations
Finally, the work is compared with the previous literature to identify the contributions of this research. The comparison is shown in Table 7 below (refs. 15, 16, 17, 21, 24). All the articles used for comparison implement pedestrian detection on FPGA platforms, and all use the same type of algorithm: HoG combined with a classifier, which is either an AdaBoost classifier or an SVM. The image size is also the same for each (640 × 480). The comparison is based on the clock frequency, which affects the speed, the frames per second, the power consumption, and the resource utilization in terms of LUTs, DSPs, memory, slices, and registers. To keep the comparison fair, the selected papers use the same image resolution, and every resource count is normalized by dividing the number of consumed resources by the total number of resources available on the FPGA board used.
| Reference | Image Size | FPGA Board | Clock Frequency | Frames per second (FPS) | Power | Pixels /clock | LUTs (%) | DSP48s (%) | BRAMs /memory Bits (%) | Registers/FF (%) |
| 15 | 640×480 | Xilinx Zynq | 82.2 MHz | 40 | - | 1 | 40 | 2 | 0 | - |
| 24 | 640×480 | Virtex 6 | 150 MHz | 10 | 19 W | - | 39 | 53 | 22 | - |
| 16 | 640×480 | Cyclone V | 162 MHz | 526 | 9 W | 0.99 | 21 | 86 | 100 | 21 |
| 17 | 640×480 | Altera DE2-115 | 50 MHz | 129 | 3.6 W | - | 73 | - | 72 | 60 |
| 21 | 640×480 | Zynq 7000 | 100 MHz | 240 | 1.6 W | - | 13 | 3 | 1 | 10 |
| THIS WORK | 640×480 | Ultra 96 v2 | 150 MHz | 83 | 2.435 W | 0.0632 | 57 | 35 | 31 | 24 |
Table 7: Comparison of parameters and performance for implementations of pedestrian detection on FPGA
As Table 7 shows, when the implementation in this research is compared with the previous works, it improves on the speed of some of them, for example 83 FPS against 40 FPS (ref. 15) and 10 FPS (ref. 24). The FPGA board runs at a clock frequency of 150 MHz, which corresponds to a clock period of about 6.67 ns. Although some prior works report significantly higher FPS, closer examination shows that this advantage generally comes at the cost of higher power consumption or almost complete utilization of certain resources. The power reported in this work, 2.435 W, is on the lower side, below all but one of the prior works that report power. The resource utilization is slightly higher than in certain implementations, but only the LUTs exceed 50% (57% LUTs, 35% DSPs, and 31% BRAM), which leaves room for more tasks to be implemented in this design. Overall, the work presented in this paper achieves a balanced trade-off between performance, power, and resource utilization. Additionally, it demonstrates scalable parallelism through multiple IP blocks without drastically affecting the performance parameters.
Supplementary File 1: Script_1_train_test.py.
Supplementary File 2: Script_2_HLS_hog.cpp.
Supplementary File 3: Script_3_HLS_test_bench.cpp.
Supplementary File 4: Script_4_HLS_consts.h.
Supplementary File 5: Script_5_jupyter_code.txt.