VGG16-OpenCL
is an implementation of the inference module of the VGG16 model pre-trained on the CIFAR-10 dataset. The goal of this project is to optimize the model to reduce inference time using the OpenCL framework.
OpenCL
is a framework for writing programs that execute across heterogeneous platforms consisting of various processors and hardware accelerators. It provides programming languages and APIs to execute programs on these compute devices.
VGG16
was proposed in the paper Very Deep Convolutional Networks for Large-Scale Image Recognition by K. Simonyan and A. Zisserman and was originally designed for 224×224 images. This project adapts it to the CIFAR-10 dataset, whose images are 32×32.
You need the following hardware and software installed on your machine:
- An OpenCL compatible processor
- Visual Studio (Optional)
- OpenCL SDK
PLEASE NOTE: This project is written in Visual Studio on Windows. While it is possible to build the project in IDEs other than Visual Studio, you will need to create appropriate build configurations for your IDE and/or platform.
- Clone this repository to your local machine:

```
git clone https://github.com/sjleo1/VGG16-OpenCL.git
cd VGG16-OpenCL
```
- Open the `.sln` file in Visual Studio
- Select `Build Solution` from the `Build` menu
This project includes separate pieces of source code that perform inference serially and in parallel. You can choose between these two operation modes, and the elapsed time of the operation is reported after the result is verified.
In this project, the performance results from the two different operation methods are compared to evaluate the optimization.
- Selecting devices and defining a context: A context is the environment within which kernels are defined and executed.
- Creating command-queues: The host and the OpenCL devices interact with each other through commands posted by the host to a command-queue.
- Building program objects: A program object is compiled and linked to generate kernels for the OpenCL devices.
- Creating memory objects: The host program defines the memory objects it requires and passes them as arguments to the kernels.
- Enqueueing commands: Commands are enqueued to the command-queues to execute the kernels.
All of the steps above happen at runtime, which means the OpenCL program object is built while the application is running. You can find a more detailed explanation here.
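The five steps above can be outlined in host code roughly as follows. This is an abridged, non-compilable sketch using the standard OpenCL C API: error handling is omitted, and names such as `kernel_source`, `bytes`, `host_data`, and the kernel name `convolution` are placeholders, not this project's actual identifiers.

```c
// 1. Select a device and define a context
cl_platform_id platform;
cl_device_id device;
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

// 2. Create a command-queue for host-device interaction
cl_command_queue queue =
    clCreateCommandQueueWithProperties(context, device, NULL, NULL);

// 3. Build a program object and extract a kernel (built at runtime)
cl_program program =
    clCreateProgramWithSource(context, 1, &kernel_source, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "convolution", NULL);

// 4. Create memory objects and pass them as kernel arguments
cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
clEnqueueWriteBuffer(queue, input, CL_TRUE, 0, bytes, host_data,
                     0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);

// 5. Enqueue the kernel command for execution
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size,
                       0, NULL, NULL);
clFinish(queue);
```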
The VGG16 model gained its reputation from its use of small 3×3 convolution filters stacked in depth in place of larger kernels.
The single most important thing to consider when optimizing an OpenCL program is reducing global (off-chip) memory accesses as much as possible. Most of the techniques used in this project are variants of reducing memory accesses.
Compared to global memory latency, which is about 400-600 cycles, local memory latency is up to 100 times lower. Considering that adjacent work-items of a 3×3 convolution read largely overlapping input pixels, each input value may be fetched from global memory up to nine times.
Therefore, tiling input data from global memory into local memory prior to the actual operations removes this redundancy of global memory reads and thus improves operation efficiency.
Just as tiling data from global to local memory reduced redundant memory reads, further optimization is possible by tiling from local memory into private memory (registers).
In the previous technique, each work-item (thread) was assigned a single output pixel/element of a feature map/matrix. However, each work-item still has to read multiple adjacent input pixels/elements.
For example, with a 3×3 filter, a work-item has to read nine input pixels tiled into local memory to produce one output pixel.
What if we assign multiple output pixels to a work-item (more work per thread), while also tiling input pixels into private memory (tiling to registers)?
Assuming four output pixels (a 2×2 block) are assigned to each work-item, a single work-item performs 16 local memory reads for its four output pixels, i.e. four local memory reads per output pixel instead of nine. Roughly speaking, we can expect it to be about twice as fast as the previous version.
The test was performed on three different computers with 3000 images each, unless a different image count is noted in parentheses next to the elapsed time.
| Run Type | Processor | Host Memory | Dedicated Memory | Elapsed Time | ET/Image |
| --- | --- | --- | --- | --- | --- |
| Sequential | Intel i5-10400 | 32 GB DDR4 | - | 810 s | 0.2702 s |
| OpenCL | Intel UHD Graphics 630 | DDR4 | No Dedicated Memory | 133 s | 0.0444 s |
| OpenCL | NVIDIA RTX 3060 | DDR4 | 12 GB GDDR6 | 8.7 s | 0.0029 s |

Performance Improvement: $\times$ 93.1
| Run Type | Processor | Host Memory | Dedicated Memory | Elapsed Time | ET/Image |
| --- | --- | --- | --- | --- | --- |
| Sequential | Intel i5-1240P | 16 GB LPDDR5 | - | 1496 s | 0.4988 s |
| OpenCL | Intel Iris Xe Graphics 80EU | LPDDR5 | No Dedicated Memory | 39 s | 0.0130 s |

Performance Improvement: $\times$ 38.4
| Run Type | Processor | Host Memory | Dedicated Memory | Elapsed Time | ET/Image |
| --- | --- | --- | --- | --- | --- |
| Sequential | Intel m3-6Y30 | 4 GB LPDDR3 | - | 1345 s (500) | 2.6919 s |
| OpenCL | Intel HD Graphics 515 | LPDDR3 | No Dedicated Memory | 774 s (10000) | 0.0774 s |

Performance Improvement: $\times$ 35
TODO