Hardware-Software Partitioning for Performance
You can extract a sub-graph from the entire network by providing start_layer and end_layer in the xiInit() function. This enables you to perform a forward-run on that sub-graph alone. This tutorial shows how to use this feature to improve inference throughput.
In modern convolutional neural networks, the convolution, pooling, and ReLU layers are the major compute-intensive components, and these are hardware-accelerated by CHaiDNN. CNNs also contain other layers, such as InnerProduct and Softmax, that run on the ARM CPUs. These software layers can reduce the inference throughput depending on their individual latencies.
One way to improve the inference throughput is to run the hardware and software layers in parallel, so that the latency of one is hidden behind the other. This requires splitting the network into two sub-graphs: the first sub-graph contains all the hardware-accelerated layers and the second contains all the subsequent software layers. You can then use threads to run them in parallel across multiple images in a ping-pong style. Ideally, the effective inference time per image becomes the maximum of the inference times of the two sub-graphs, rather than their sum.
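As a rough illustration (the latencies below are assumed purely for the sake of example, not measured CHaiDNN numbers), if the hardware sub-graph takes t_hw milliseconds per image and the software sub-graph takes t_sw milliseconds, serial execution costs t_hw + t_sw per image while the pipelined scheme approaches max(t_hw, t_sw):
#include <algorithm>
#include <cstdio>
int main() {
    //# Assumed example latencies in ms per image (for illustration only)
    double t_hw = 6.0;  //# hardware-accelerated sub-graph
    double t_sw = 2.0;  //# software sub-graph (e.g. InnerProduct + Softmax)
    double serial_fps    = 1000.0 / (t_hw + t_sw);         //# sub-graphs run back to back
    double pipelined_fps = 1000.0 / std::max(t_hw, t_sw);  //# HW and SW overlapped across images
    printf("serial: %.1f fps, pipelined: %.1f fps\n", serial_fps, pipelined_fps);
    return 0;
}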
The input/output of a hardware-accelerated layer is organized as two buffers, each containing a set of 8 feature maps (planes). More details of the data organization are explained below.
Description of IO buffers used for a HW-accelerated layer
The details of the IO buffers for the nth layer can be obtained using the code snippet below.
chaihandle_t *chaihandle_info = (chaihandle_t*)handle;
//# Job queue holding one descriptor per layer
std::vector<xChangeLayer> *hwQueue = chaihandle_info->JobQueue;
//# Input/output buffer pointers of the nth layer
void **in_ptrs = hwQueue[0][n].in_ptrs;
void **out_ptrs = hwQueue[0][n].out_ptrs;
The in_ptrs and out_ptrs are described in the table below.
Buffer Type | Convolution | Convolution + Batch Norm | Pool
---|---|---|---
Input | in_ptrs[0], in_ptrs[1] | in_ptrs[2], in_ptrs[3] | in_ptrs[0], in_ptrs[1]
Output | out_ptrs[0], out_ptrs[1] | out_ptrs[2], out_ptrs[3] | out_ptrs[0], out_ptrs[1]
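For example, the table can be combined with the snippet above to pick out the feature-map input buffers of the nth layer. The helper below is only an illustrative sketch building on that snippet; the is_conv_batchnorm flag is hypothetical and simply encodes which column of the table applies to the layer:
//# Return the two input feature-map buffers of layer n, following the table above.
//# 'is_conv_batchnorm' is a hypothetical flag supplied by the caller.
void getLayerInputBuffers(void *handle, int n, bool is_conv_batchnorm,
                          void **bufA, void **bufB)
{
    chaihandle_t *chaihandle_info = (chaihandle_t*)handle;
    std::vector<xChangeLayer> *hwQueue = chaihandle_info->JobQueue;
    void **in_ptrs = hwQueue[0][n].in_ptrs;
    //# Convolution and Pool layers use in_ptrs[0]/[1];
    //# Convolution + Batch Norm uses in_ptrs[2]/[3]
    int base = is_conv_batchnorm ? 2 : 0;
    *bufA = in_ptrs[base];
    *bufB = in_ptrs[base + 1];
}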
How to enable Layer-wise output dump
To dump the outputs of the hardware-accelerated layers, set the value of LAYERWISE_OUTPUT_WRITE to 1 in the file <CHaiDNN Repo>/software/include/hw_settings.h. This dumps the output of all layers to the <models>/<network>/<6/8-bit>/out directory. The output contains float values printed in NCHW order.
#define LAYERWISE_OUTPUT_WRITE 1
Before starting with an example, keep the following aspects in mind.
- CHaiDNN does not support multiple inputs/outputs for a network. This also holds for its sub-graphs: when extracting a sub-graph, make sure that it has exactly one input and one output.
- This technique works best for classification networks. For optimal performance, split the network into two sub-graphs such that all the initial HW-accelerated layers fall into one sub-graph and the subsequent SW layers fall into the second sub-graph.
- If there are HW-accelerated layers in both sub-graphs, they will try to access the hardware simultaneously, which can cause undefined behavior or a performance penalty.
The tutorial on running a network presents the steps to run and benchmark a network; it is highly recommended to go through that tutorial first. This page shows how the partitioning technique can be used to improve performance, using GoogleNet as an example. The full code can be accessed here. The key code snippets of the example are explained below.
Define structure
pthreads are used for multi-threading, so a structure is defined to pack all the arguments passed to a thread.
typedef struct execStruct_ {
void *chai_handle;          //# handle created by xiInit() for this sub-graph
std::vector<void*> input;   //# input buffers of the sub-graph
std::vector<void*> output;  //# output buffers of the sub-graph
bool is_first_layer;        //# whether this sub-graph contains the first layer of the network
} execStruct;
Define wrapper function
Define a wrapper function for xiExec() that can be passed to pthread_create().
void* execRoutine(void* args_) {
execStruct* args = (execStruct*) args_;
xiExec(args->chai_handle, args->input, args->output, args->is_first_layer);
return NULL;
}
Split the graph
Split GoogleNet into two sub-graphs. The first sub-graph contains all the convolution/activation/pooling layers (the HW-accelerated layers); the second sub-graph contains the InnerProduct and Softmax layers (the SW layers).
//# start/end layer in the graph1
string start_layer_graph1 = "";
string end_layer_graph1 = "pool5/7x7_s1";
//# start/end layer in the graph2
string start_layer_graph2 = "loss3/classifier";
string end_layer_graph2 = "";
Note that, by default, start_layer="" refers to the first layer in the prototxt and end_layer="" refers to the last layer in the prototxt.
Setting flags and initializing the sub-graphs
The actual image data (after mean subtraction) is fed into the first sub-graph, so its first-layer flag is enabled and the second sub-graph's flag is disabled.
bool En_layer1_graph1 = true;
bool En_layer1_graph2 = false;
xiInit() must be called for each sub-graph to obtain a separate job queue, so all the related data structures are replicated.
//# Structs which hold the input/output layer info
io_layer_info io_layer_info_ptr1;
io_layer_info io_layer_info_ptr2;
//# Handles to the two job queues (created by xiInit)
void *chai_handle_1;
void *chai_handle_2;
//# Init call for sub-graph1
xiInit(dirpath, prototxt, caffemodel, chai_handle_1, &io_layer_info_ptr1,
numImg_to_process, En_layer1_graph1, start_layer_graph1, end_layer_graph1);
//# Init call for sub-graph2
xiInit(dirpath, prototxt, caffemodel, chai_handle_2, &io_layer_info_ptr2,
numImg_to_process, En_layer1_graph2, start_layer_graph2, end_layer_graph2);
Loading input image
Load the image as explained in the tutorial on running a network. For benchmarking, the same set of images must be used across multiple xiExec() calls.
//# Create buffers to load the normalized input
vector<void *> normalizeInput;
for(int batch_id = 0; batch_id < numImg_to_process; batch_id++)
{
//# One buffer per image (assumed size: 3-channel float image at the resized resolution)
void *ptr = malloc(resize_h * resize_w * 3 * sizeof(float));
normalizeInput.push_back(ptr);
}
float *mean_ptr = (float*)malloc(3*sizeof(float));
float *var_ptr = (float*)malloc(3*sizeof(float));
//# With input Normalization
if(inp_mode == 1)
{
mean_ptr[0] = 0.485;
mean_ptr[1] = 0.456;
mean_ptr[2] = 0.406;
var_ptr[0] = 0.229;
var_ptr[1] = 0.224;
var_ptr[2] = 0.225;
}
else
{
mean_ptr[0] = 104.0;
mean_ptr[1] = 117.0;
mean_ptr[2] = 123.0;
}
int status = inputNormalization(normalizeInput, resize_h, resize_w, img_path1, img_path2,
inp_mode, mean_ptr, var_ptr, numImg_to_process, io_layer_info_ptr1);
//# Memory size required for input buffers
int in_size = io_layer_info_ptr1.inlayer_sizebytes;
//# Create input buffers
vector<void *> input;
void *ptr;
for(int i = 0; i < io_layer_info_ptr1.num_in_bufs; i++)
{
if(io_layer_info_ptr1.inlayer_exectype.compare("hardware") == 0)
ptr = sds_alloc_non_cacheable(in_size);
else
ptr = malloc(in_size);
input.push_back(ptr);
}
xiInputRead(normalizeInput, input, numImg_to_process, io_layer_info_ptr1);
fprintf(stderr, "Input read done\n");
Output buffer creation
The job queues run in a ping-pong style: while job-queue1 is working on one image, job-queue2 is working on the previous output of job-queue1. Therefore, two output buffers (ping and pong) are needed for each job queue. Output buffer creation for job-queue2 is shown below.
// Allocate output buffers for ping-pong for graph2
vector<void *> output1;
vector<void *> output2;
//# Memory size required for output buffers
int graph2_out_size = io_layer_info_ptr2.outlayer_sizebytes;
for(int i = 0; i < io_layer_info_ptr2.num_out_bufs; i++)
{
if(io_layer_info_ptr2.outlayer_exectype.compare("hardware") == 0)
{
ptr1 = sds_alloc_non_cacheable(graph2_out_size);
ptr2 = sds_alloc_non_cacheable(graph2_out_size);
}
else
{
ptr1 = malloc(graph2_out_size);
ptr2 = malloc(graph2_out_size);
}
output1.push_back(ptr1);
output2.push_back(ptr2);
}
The same should be repeated for job-queue1.
// Allocate output buffers for ping-pong for graph1
vector<void *> tmp_out1;
vector<void *> tmp_out2;
//# Memory size required for output buffers
int graph1_out_size = io_layer_info_ptr1.outlayer_sizebytes;
void *ptr1, *ptr2;
for(int i = 0; i < io_layer_info_ptr1.num_out_bufs; i++)
{
//# Allocate ping/pong buffers of graph1_out_size bytes, exactly as done for job-queue2
}
Structure Initialization for two graphs
All these parameters should be packed into execStruct structures and arranged in ping-pong style.
// image 1 inference
execStruct queue1_args_ping = {chai_handle_1, input, tmp_out1, En_layer1_graph1};
execStruct queue2_args_ping = {chai_handle_2, tmp_out1, output1, En_layer1_graph2};
// image 2 inference
execStruct queue1_args_pong = {chai_handle_1, input, tmp_out2, En_layer1_graph1};
execStruct queue2_args_pong = {chai_handle_2, tmp_out2, output2, En_layer1_graph2};
// Ping-Pong
execStruct queue1_Args[2] = {queue1_args_ping, queue1_args_pong};
execStruct queue2_Args[2] = {queue2_args_ping, queue2_args_pong};
Execution
The execution can now be started. The average time over multiple runs is considered for benchmarking.
int pingpong = 0;
pthread_t queue2_thread;
TIME_STAMP_INIT
//# Prologue: run sub-graph1 on the first image
execRoutine((void*)(&queue1_Args[pingpong]));
for(int i = 0; i < loop_iter; i++)
{
//# Sub-graph2 consumes the previous sub-graph1 output
//# while sub-graph1 works on the next image
pthread_create(&queue2_thread, NULL, execRoutine, (void *)(&queue2_Args[pingpong]));
pingpong = 1 - pingpong;
execRoutine((void*)(&queue1_Args[pingpong]));
pthread_join(queue2_thread, NULL);
}
//# Epilogue: process the last sub-graph1 output
execRoutine((void*)(&queue2_Args[pingpong]));
TIME_STAMP
//# (loop_iter + 1) batches of XBATCH_SIZE images are processed in tot_time milliseconds
double perf_batch = ((double)(1000)/tot_time)*XBATCH_SIZE*(loop_iter+1);
fprintf(fp_lat, "\n[PERFM] Performance with Batching : %lf Images/second", perf_batch);
Unpacking and writing output data
The output from the output buffers should be unpacked and written to a .txt file.
int unpack_out_size = io_layer_info_ptr2.outlayer_sizebytes;
//# Create memory for unpack output data
vector<void *> unpack_output1;
vector<void *> unpack_output2;
for(int batch_id = 0; batch_id < numImg_to_process; batch_id++)
{
void *ptr1 = malloc(unpack_out_size);
void *ptr2 = malloc(unpack_out_size);
unpack_output1.push_back(ptr1);
unpack_output2.push_back(ptr2);
}
//# Loading required params for unpack function
kernel_type_e out_kerType = io_layer_info_ptr2.out_kerType;
int out_layer_size = io_layer_info_ptr2.out_size;
//# unpacks the output data for ping and pong buffers
xiUnpackOutput(output1, unpack_output1, out_kerType, out_layer_size, numImg_to_process);
xiUnpackOutput(output2, unpack_output2, out_kerType, out_layer_size, numImg_to_process);
//# Write the output data to txt file
outputWrite(dirpath, img_path1, unpack_output1, numImg_to_process, io_layer_info_ptr2, 0);
outputWrite(dirpath, img_path1, unpack_output2, numImg_to_process, io_layer_info_ptr2, 1);
Configuration: ZU9 with 1024 DSP @ 250/500 MHz
- Running the full network as explained here gives 151 fps.
- Running the network after HW-SW partitioning gives 207 fps.