VGG16-OpenCL
is an implementation of the inference module of the VGG16 model pre-trained on the CIFAR-10 dataset. The goal of this project is to optimize the model to reduce inference time using the OpenCL framework.
OpenCL
is a framework for writing programs that execute across heterogeneous platforms consisting of various processors and hardware accelerators. It provides programming languages and APIs to execute programs on these compute devices.
VGG16
was proposed in the paper Very Deep Convolutional Networks for Large-Scale Image Recognition by K. Simonyan and A. Zisserman and was originally designed for 224×224 images. This project adapts it to the CIFAR-10 dataset, whose images are 32×32.
You need the following hardware and software installed on your machine:
- An OpenCL compatible processor
- Visual Studio (Optional)
- OpenCL SDK
PLEASE NOTE: This project is written in Visual Studio on Windows. While it is possible to build the project in IDEs other than Visual Studio, you will need to create appropriate build configurations for your IDE and/or platform.
- Clone this repository to your local machine:

```
git clone https://github.com/sjleo1/VGG16-OpenCL.git
cd VGG16-OpenCL
```
- Open the `.sln` file in Visual Studio
- Select `Build Solution` from the `Build` menu
This project includes separate pieces of source code that perform inference serially and in parallel. You can choose between these two operation modes, and the elapsed time of the operation is reported after the result is verified.
In this project, the performance results from the two different operation methods are compared to evaluate the optimization.
- Selecting devices and defining a context: A context is the environment within which kernels are defined and executed.
- Creating command-queues: The host and the OpenCL devices interact with each other through commands posted by the host to a command-queue.
- Building program objects: A program object is compiled and linked to generate kernels for the OpenCL devices.
- Creating memory objects: The host program defines the memory objects it requires and passes them as arguments to the kernels.
- Enqueueing commands: Commands are enqueued to the command-queues to execute the kernels.
All of the steps above happen at runtime, which means the OpenCL program object is built while the application is running. You can find a more detailed explanation here.
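The five steps above can be outlined in host code roughly as follows. This is an abridged, non-compilable sketch using the standard OpenCL C API: error handling is omitted, and names such as `kernel_source`, `bytes`, `host_data`, and the kernel name `convolution` are placeholders, not this project's actual identifiers.

```c
// 1. Select a device and define a context
cl_platform_id platform;
cl_device_id device;
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

// 2. Create a command-queue for host-device interaction
cl_command_queue queue =
    clCreateCommandQueueWithProperties(context, device, NULL, NULL);

// 3. Build a program object and extract a kernel (built at runtime)
cl_program program =
    clCreateProgramWithSource(context, 1, &kernel_source, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "convolution", NULL);

// 4. Create memory objects and pass them as kernel arguments
cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
clEnqueueWriteBuffer(queue, input, CL_TRUE, 0, bytes, host_data,
                     0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);

// 5. Enqueue the kernel command for execution
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size,
                       0, NULL, NULL);
clFinish(queue);
```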
The VGG16 model gained its reputation from its use of small 3×3 convolution filters stacked in depth in place of larger kernels.
The single most important thing to consider when optimizing an OpenCL program is reducing global (off-chip) memory accesses as much as possible. Most of the techniques used in this project are variants of reducing memory accesses.
Compared to global memory latency, which is about 400-600 cycles, local memory latency is up to 100 times lower. Considering that adjacent work-items of a 3×3 convolution read largely overlapping input pixels, each input value may be fetched from global memory up to nine times.
Therefore, tiling input data from global memory into local memory prior to the actual operations removes this redundancy of global memory reads and thus improves operation efficiency.
Just as tiling data from global to local memory reduced redundant memory reads, further optimization is possible by tiling from local memory into private memory (registers).
In the previous technique, each work-item (thread) was assigned a single output pixel/element of a feature map/matrix. However, each work-item still has to read multiple adjacent input pixels/elements.
For example, with a 3×3 filter, a work-item has to read nine input pixels tiled into local memory to produce one output pixel.
What if we assign multiple output pixels to a work-item (more work per thread), while also tiling input pixels into private memory (tiling to registers)?
Assuming four output pixels (a 2×2 block) are assigned to each work-item, a single work-item performs 16 local memory reads for its four output pixels, i.e. four local memory reads per output pixel instead of nine. Roughly speaking, we can expect it to be about twice as fast as the previous version.
The test was performed on three different computers with 3000 images each, unless a different image count is noted in parentheses next to the elapsed time.
| Run Type | Processor | Host Memory | Dedicated Memory | Elapsed Time | ET/Image |
| --- | --- | --- | --- | --- | --- |
| Sequential | Intel i5-10400 | 32 GB DDR4 | - | 810 s | 0.2702 s |
| OpenCL | Intel UHD Graphics 630 | DDR4 | No Dedicated Memory | 133 s | 0.0444 s |
| OpenCL | NVIDIA RTX 3060 | DDR4 | 12 GB GDDR6 | 8.7 s | 0.0029 s |

Performance Improvement: $\times$ 93.1
| Run Type | Processor | Host Memory | Dedicated Memory | Elapsed Time | ET/Image |
| --- | --- | --- | --- | --- | --- |
| Sequential | Intel i5-1240P | 16 GB LPDDR5 | - | 1496 s | 0.4988 s |
| OpenCL | Intel Iris Xe Graphics 80EU | LPDDR5 | No Dedicated Memory | 39 s | 0.0130 s |

Performance Improvement: $\times$ 38.4
| Run Type | Processor | Host Memory | Dedicated Memory | Elapsed Time | ET/Image |
| --- | --- | --- | --- | --- | --- |
| Sequential | Intel m3-6Y30 | 4 GB LPDDR3 | - | 1345 s (500) | 2.6919 s |
| OpenCL | Intel HD Graphics 515 | LPDDR3 | No Dedicated Memory | 774 s (10000) | 0.0774 s |

Performance Improvement: $\times$ 35
TODO