In the last lab, you optimized the memory accesses between the kernel and the global memory. In this lab, you will focus on improving the efficiency of the kernel by converting from floating point to fixed point data types.
The original code uses floating point for the sums and coefficient. Here you will use the ap_fixed<16,9> type, representing a 9-bit signed integer with seven decimal bits. This type was chosen because it improves performance and resource utilization, while maintaining the necessary precision for the application.
Look at the following inner loop of your convolution kernel.
for(int pixel = 0; pixel < img_width; ++pixel)
{
float sum_r = 0, sum_g=0, sum_b=0;
for(int m = 0; m < coefficient_size; ++m)
{
for(int n = 0; n < coefficient_size; ++n)
{
int jj = pixel + n - center;
if(jj >= 0 && jj < img_width)
{
sum_r += window_mem[window_line_idx][jj].r * coef[m * coefficient_size + n];
sum_g += window_mem[window_line_idx][jj].g * coef[m * coefficient_size + n];
sum_b += window_mem[window_line_idx][jj].b * coef[m * coefficient_size + n];
}
}
window_line_idx=(window_line_idx + 1) == MAX_FILTER ? 0 : window_line_idx + 1;
}
window_line_idx = top_idx;
out_line[pixel].r = fabsf(sum_r);
out_line[pixel].g = fabsf(sum_g);
out_line[pixel].b = fabsf(sum_b);
}
The inner loop is multiplying individual members of an RGBPixel
object which are unsigned char with the floating coef
array. The operation result is stored back into the floating point variables sum_r
, sum_g
, sum_b
, and finally to a RGBPixel
. Based on these calculations, you can assume that the largest number that can be represented by the sum argument would be 256 because that is the maximum value of an unsigned char. Based on this, you can use a fixed point data type that is 16-bits wide and 8-bits dedicated to the integer side.
TIP: The completed kernel source file is provided in the
reference-files/fixedpoint
folder. You can use it as a reference if needed.
Open the convolve_fpga.cpp
file from src/fixedpoint
, and make the following modifications.
-
Include the
ap_fixed.h
header at the top of the file.#include "ap_fixed.h"
-
Create a typedef for a fixed point type that maps to
ap_fixed<16,9>
.typedef ap_fixed<16,9> fixed;
-
Replace the following line (line 39).
float coef[MAX_FILTER * MAX_FILTER];
with:
fixed coef[MAX_FILTER * MAX_FILTER];
This modifies the type of the
coef
array to a fixed array. -
Because the type of
coef
is different fromcoefficient
, thememcpy
command is not supported by the Vivado® High-Level Synthesis (HLS) tool. Instead, convert it to afor
loop implementation. Replace (line 40).memcpy(coef, coefficient, coefficient_size * sizeof(float));
with:
int num_coefficients = coefficient_size * coefficient_size; for(int i = 0; i < num_coefficients; i++) { coef[i] = coefficient[i]; }
This performs the same operation, but also converts the floating point elements in the
coefficient
array to fixed point elements in thecoef
array. -
Next, replace the types of the
sum_r
,sum_g
, andsum_b
variables to the fixed type. Replace (line 70):float sum_r = 0, sum_g=0, sum_b=0;
with:
fixed sum_r = 0, sum_g=0, sum_b=0;
-
Go to the
design/makefile
directory. -
Use the following command to run hardware emulation.
make run TARGET=hw_emu STEP=fixedpoint SOLUTION=1 NUM_FRAMES=1
You should see the following results.
Processed 0.02 MB in 108.788s (0.00 MBps) INFO: [SDx-EM 22] [Wall clock time: 21:17, Emulation time: 0.510047 ms] Data transfer between kernel(s) and global memory(s) convolve_fpga_1:m_axi_gmem1-DDR[0] RD = 20.000 KB WR = 20.000 KB convolve_fpga_1:m_axi_gmem2-DDR[0] RD = 0.035 KB WR = 0.000 KB
-
Use the following command to generate the Profile Summary report and Timeline Trace.
cd ../build/fixedpoint vitis_analyzer profile_summary_hw_emu.csv
-
Use the following command to view the Profile Summary report.
vitis_analyzer timeline_trace_hw_emu.csv
Here is the Profile Summary report for hardware emulation. The kernel execution time is now reduced to 0.46 ms. The reason for this significant speedup is that the computation for-loop is pipelined when using fixed point operations. Therefore, the total latency is improved significantly.
Here is the updated table. There is a 4.2x boost on kernel execution time perspective.
Step | Image Size | Time (HW-EM)(ms) | Reads (KB) | Writes (KB) | Avg. Read (KB) | Avg. Write (KB) | BW (MBps) |
---|---|---|---|---|---|---|---|
baseline | 512x10 | 10.807 | 344 | 20.0 | 0.004 | 0.004 | 1.9 |
localbuf | 512x10 | 1.969 (5.48x) | 21 (0.12x) | 20.0 | 0.064 | 0.064 | 10 |
fixedpoint data | 512x10 | 0.46 (4.2x) | 21 | 20.0 | 0.064 | 0.064 | 44 |
In the next section, you will examine how breaking a single function into sub-functions lets you achieve task-level parallelism between the different functions. In this case, you will be optimizing with dataflow.
Copyright© 2019 Xilinx