# Improve openjpeg2000 encoding/decoding time

JPEG 2000 provides superior compression and advanced features such as optional lossless compression, region-of-interest coding, stream decoding, etc. As a result of this complex feature set, JPEG 2000 encoding and decoding are computationally expensive. It has already been demonstrated that many image processing applications achieve a significant speedup on the massively parallel GPU architecture, and previous literature reports GPU speedups for various components of JPEG 2000 such as the DWT and EBCOT. As part of this project we plan to develop a parallel implementation of JPEG 2000 encoding/decoding using the CUDA programming platform for Nvidia GPUs. Decompression is more challenging than compression, and we plan to focus on lossy decoding. At the end of this project, we hope to have a parallel implementation of tier-1/tier-2 decoding, the inverse DWT and the inverse DC level shift, which together comprise the decoding pipeline.

# Code Repository

The code for this project is pushed to the [openjpeg optimization branch](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization).

# Progress

## Compilation

As part of [commit 922a0f3a](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization/commit/922a0f3a8f626b19f27953917a19641faba150a8), the appropriate changes were made to the CMake files to enable compilation of CUDA code within the openjpeg library. The file gpu.cu contains all the CUDA kernels as well as the kernel wrappers. The kernel wrappers can be invoked from the openjpeg library files, and these wrappers then launch the appropriate kernel. Kernel wrapper functions are prefixed with `gpu` and kernel functions are prefixed with `kernel`.

## Inverse DC Level Shift

In [commit e73b9360](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization/commit/e73b93602d9af9d9bb9cd02bbe5241072fe5a6e5), the inverse DC level shift was implemented on the GPU (`gpu_dc_level_shift_decode`). The data is copied component by component to the GPU. The number of threads is equal to the image size (one thread per pixel) and each thread adds the DC level shift value to the corresponding pixel. Once the entire pipeline has been implemented, we can remove the memory transfer overhead for this stage: ideally the image data is transferred to the GPU before the first stage, then stays resident and is modified in place as the decoding stages (tier-1/tier-2, inverse DWT, inverse DC level shift) are performed. Finally, after all stages, the output image is ready in GPU memory and is transferred back to the CPU output array.
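To make the naming convention and the one-thread-per-pixel mapping concrete, here is a minimal sketch of what this stage can look like. It follows the `gpu`/`kernel` prefix convention described above, but the signatures, data types and launch configuration are assumptions rather than the exact code in the commit:

```cuda
#include <cuda_runtime.h>

// Sketch only: one thread per pixel, each thread adds the DC level shift
// to its pixel. Names, signatures and types are illustrative.
__global__ void kernel_dc_level_shift_decode(int *data, int n, int shift)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += shift;   /* e.g. shift = 1 << (prec - 1) for unsigned components */
}

// Wrapper callable from the openjpeg C files ("gpu" prefix, per the convention above).
extern "C" void gpu_dc_level_shift_decode(int *h_comp, int n, int shift)
{
    int *d_comp;
    cudaMalloc((void **)&d_comp, n * sizeof(int));
    cudaMemcpy(d_comp, h_comp, n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    kernel_dc_level_shift_decode<<<(n + threads - 1) / threads, threads>>>(d_comp, n, shift);

    cudaMemcpy(h_comp, d_comp, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_comp);
}
```

The per-component `cudaMemcpy` calls are the memory transfer overhead mentioned above; once the whole decoding pipeline runs on the GPU, they can be dropped for this stage.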
## Inverse Discrete Wavelet Transform

[Commit 84f53565](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization/commit/84f53565c0a5578ad55dc95652a40c335f864e75) has version 1 of the complete implementation of the inverse DWT stage. The CUDA implementation is as follows (a structural sketch is given after this list):

1. Similar to the CPU code, four values are processed together using the `float4` data type. Although a single CUDA core does not have vector computation capability (like MMX instructions on the CPU), `float4` is still beneficial because GPUs provide higher floating-point throughput and memory access is faster, i.e. loading one `float4` is quicker than loading the four floats individually.
2. To process an rh x rw image in, say, the `decode_h` stage, the number of blocks is rh/4 and each block has rw threads. Put simply, the j-th thread of the i-th block processes the four values (4i, j), (4i+1, j), (4i+2, j) and (4i+3, j). If rw is less than a threshold (currently 512), the entire wavelet array of size rw for the block can be stored in shared memory, so whenever the current resolution is below the threshold we use the kernel with the shared-memory optimization.
3. If the size exceeds the threshold, shared memory can no longer be used and a global-memory array holds the wavelet instead. The kernels that handle this shared-memory overflow case have `_global_` in their function names.
4. Note that processing the entire wavelet array of size rw in a single block lets us use the block synchronization primitive `__syncthreads`, so `v4dwt_interleave_(h/v)`, `v4dwt_decode_step1` and `v4dwt_decode_step2` can be fused into a single kernel. Such kernel fusion, wherever possible, is important for good performance.
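As a structural sketch of points 2-4 (assuming the row width fits under the shared-memory threshold), the fused horizontal kernel can be organized as below. The data layout, lifting constants and boundary handling are simplified placeholders; only the block/thread mapping, the shared `float4` wavelet array and the `__syncthreads()`-separated phases reflect the design described above:

```cuda
// Structural sketch only: one block reconstructs four image rows (as float4),
// thread j owns column j. Interleave, step1 (scaling) and step2 (lifting) are
// fused into one kernel and separated by __syncthreads(). Constants, data
// layout and the remaining lifting passes are simplified.
#define SHARED_THRESHOLD 512

__global__ void kernel_v4dwt_decode_h_shared(float4 *tile, int rw, int sn,
                                             float k_even, float k_odd,
                                             float delta)
{
    __shared__ float4 wavelet[SHARED_THRESHOLD];

    int j = threadIdx.x;                       /* column index, 0 <= j < rw     */
    float4 *line = tile + blockIdx.x * rw;     /* rows 4*i .. 4*i+3 as float4   */

    /* interleave: low band -> even slots, high band -> odd slots */
    if (j < sn)
        wavelet[2 * j] = line[j];
    else
        wavelet[2 * (j - sn) + 1] = line[j];
    __syncthreads();

    /* step1: scale even/odd samples */
    float s = (j & 1) ? k_odd : k_even;
    wavelet[j].x *= s; wavelet[j].y *= s;
    wavelet[j].z *= s; wavelet[j].w *= s;
    __syncthreads();

    /* step2: one lifting pass (the real 9/7 code repeats this pattern with
       the remaining lifting constants, each pass followed by a sync) */
    if (j & 1) {
        int l = j - 1;
        int r = (j + 1 < rw) ? j + 1 : j - 1;  /* crude symmetric extension */
        wavelet[j].x += delta * (wavelet[l].x + wavelet[r].x);
        wavelet[j].y += delta * (wavelet[l].y + wavelet[r].y);
        wavelet[j].z += delta * (wavelet[l].z + wavelet[r].z);
        wavelet[j].w += delta * (wavelet[l].w + wavelet[r].w);
    }
    __syncthreads();

    line[j] = wavelet[j];                      /* write the reconstructed line back */
}
```

A launch of the form `kernel_v4dwt_decode_h_shared<<<rh / 4, rw>>>(...)` matches the mapping in point 2; when rw is at or above the threshold, the corresponding `_global_` variant keeps the wavelet array in global memory instead.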
# Performance/Results

## Inverse DWT

These are the performance results for [commit 84f53565](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization/commit/84f53565c0a5578ad55dc95652a40c335f864e75). The timing measurements are performed using `clock_gettime(CLOCK_MONOTONIC, ...)`; this is a monotonically increasing timer without drift adjustments, and it is standard for measuring the execution time of asynchronous events such as CUDA memory transfers or kernel calls. The platform is an Nvidia GeForce GTX 580 GPU and an Intel Core i7 920 (2.67 GHz) CPU. The table below splits the GPU code timings into phases for each test image.

| Parameters (down) / Test images (right) | sintel_2k.j2k | oldtowncross.j2k | crowdrun.j2k | duckstakeoff.j2k |
| --- | --- | --- | --- | --- |
| Memory Setup Time (secs) | 0.054404 + 0.001611 + 0.001551 = 0.057566 | 0.055298 + 0.002517 + 0.001976 = 0.059791 | 0.054851 + 0.001928 + 0.001972 = 0.058751 | 0.055063 + 0.001997 + 0.001954 = 0.059014 |
| Computation Time (secs) | 0.002360 + 0.002307 + 0.002309 = 0.006976 | 0.003153 + 0.003737 + 0.003089 = 0.009979 | 0.003108 + 0.003283 + 0.003057 = 0.009448 | 0.003099 + 0.004512 + 0.003135 = 0.010746 |
| Output Memory Transfer Time (secs) | 0.002965 + 0.002913 + 0.002833 = 0.008711 | 0.003318 + 0.003231 + 0.003295 = 0.009844 | 0.003315 + 0.003684 + 0.003278 = 0.010277 | 0.003285 + 0.003222 + 0.003304 = 0.009811 |
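For reference, a minimal sketch of how such a phase can be timed with `CLOCK_MONOTONIC` is shown below. The key point is that the device must be synchronized before the second timestamp is taken, otherwise only the (asynchronous) launch overhead is measured. The helper names are illustrative, not the instrumentation actually used in the commit:

```cuda
#include <time.h>
#include <cuda_runtime.h>

/* Illustrative helper: difference between two timespec values in seconds. */
static double elapsed_secs(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) * 1e-9;
}

/* Time one GPU phase (e.g. the decode_h/decode_v kernel launches). */
double time_gpu_phase(void (*launch_phase)(void))
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    launch_phase();                /* kernel launches / async copies are only queued here */
    cudaDeviceSynchronize();       /* wait until all queued GPU work has finished         */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return elapsed_secs(t0, t1);
}
```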