# Improve openjpeg2000 encoding/decoding time

JPEG 2000 provides superior compression and advanced features such as optional lossless compression, region-of-interest coding, stream decoding, etc. As a result of this complex feature set, JPEG 2000 encoding and decoding are computationally expensive. It has already been demonstrated that many image processing applications achieve a significant speedup on the massively parallel GPU architecture, and previous literature reports GPU speedups for various components of JPEG 2000 such as the DWT and EBCOT. As part of this project we plan to develop a parallel implementation of JPEG 2000 encoding/decoding using the CUDA programming platform for Nvidia GPUs. Decompression is more challenging than compression, and we plan to focus on lossy decoding. At the end of this project, we hope to have a parallel implementation of tier-1/tier-2 decoding, the inverse DWT and the inverse DC level shift, which together comprise the decoding pipeline.

# Code Repository

The code for this project is pushed to the [openjpeg optimization branch](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization).

# Progress

## Compilation

As part of [commit 922a0f3a](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization/commit/922a0f3a8f626b19f27953917a19641faba150a8), the appropriate changes were made to the CMake files to enable compilation of CUDA code within the openjpeg library. The file gpu.cu contains all the CUDA kernels as well as the kernel wrappers. The kernel wrappers can be invoked from the openjpeg library files, and these wrappers then launch the appropriate kernel. Kernel wrapper functions are prefixed with `gpu` and kernel functions are prefixed with `kernel`.

## Inverse DC Level Shift

In [commit e73b9360](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization/commit/e73b93602d9af9d9bb9cd02bbe5241072fe5a6e5), the inverse DC level shift was implemented on the GPU (`gpu_dc_level_shift_decode`). The data is copied component by component to the GPU. The number of threads is equal to the image size (one thread per pixel) and each thread adds the DC level shift value to the corresponding pixel. Once the entire pipeline has been implemented, we can remove the memory transfer overhead for this stage: ideally the image data is transferred to the GPU before the first stage, then stays resident and is modified in place as the decoding stages (tier-1/tier-2, inverse DWT, inverse DC level shift) are performed. Finally, after all stages, the output image is ready in GPU memory and is transferred back to the CPU output array.
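To make the naming convention and the one-thread-per-pixel mapping concrete, here is a minimal sketch of what this stage can look like. It follows the `gpu`/`kernel` prefix convention described above, but the signatures, data types and launch configuration are assumptions rather than the exact code in the commit:

```cuda
#include <cuda_runtime.h>

// Sketch only: one thread per pixel, each thread adds the DC level shift
// to its pixel. Names, signatures and types are illustrative.
__global__ void kernel_dc_level_shift_decode(int *data, int n, int shift)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += shift;   /* e.g. shift = 1 << (prec - 1) for unsigned components */
}

// Wrapper callable from the openjpeg C files ("gpu" prefix, per the convention above).
extern "C" void gpu_dc_level_shift_decode(int *h_comp, int n, int shift)
{
    int *d_comp;
    cudaMalloc((void **)&d_comp, n * sizeof(int));
    cudaMemcpy(d_comp, h_comp, n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    kernel_dc_level_shift_decode<<<(n + threads - 1) / threads, threads>>>(d_comp, n, shift);

    cudaMemcpy(h_comp, d_comp, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_comp);
}
```

The per-component `cudaMemcpy` calls are the memory transfer overhead mentioned above; once the whole decoding pipeline runs on the GPU, they can be dropped for this stage.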
## Inverse Discrete Wavelet Transform

[Commit 84f53565](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization/commit/84f53565c0a5578ad55dc95652a40c335f864e75) has version 1 of the complete implementation of the inverse DWT stage. The CUDA implementation is as follows (a structural sketch is given after this list):

1. Similar to the CPU code, four values are processed together using the `float4` data type. Although a single CUDA core does not have vector computation capability (like MMX instructions on the CPU), `float4` is still beneficial because GPUs provide higher floating-point throughput and memory access is faster, i.e. loading one `float4` is quicker than loading the four floats individually.
2. To process an rh x rw image in, say, the `decode_h` stage, the number of blocks is rh/4 and each block has rw threads. Put simply, the j-th thread of the i-th block processes the four values (4i, j), (4i+1, j), (4i+2, j) and (4i+3, j). If rw is less than a threshold (currently 512), the entire wavelet array of size rw for the block can be stored in shared memory, so whenever the current resolution is below the threshold we use the kernel with the shared-memory optimization.
3. If the size exceeds the threshold, shared memory can no longer be used and a global-memory array holds the wavelet instead. The kernels that handle this shared-memory overflow case have `_global_` in their function names.
4. Note that processing the entire wavelet array of size rw in a single block lets us use the block synchronization primitive `__syncthreads`, so `v4dwt_interleave_(h/v)`, `v4dwt_decode_step1` and `v4dwt_decode_step2` can be fused into a single kernel. Such kernel fusion, wherever possible, is important for good performance.
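As a structural sketch of points 2-4 (assuming the row width fits under the shared-memory threshold), the fused horizontal kernel can be organized as below. The data layout, lifting constants and boundary handling are simplified placeholders; only the block/thread mapping, the shared `float4` wavelet array and the `__syncthreads()`-separated phases reflect the design described above:

```cuda
// Structural sketch only: one block reconstructs four image rows (as float4),
// thread j owns column j. Interleave, step1 (scaling) and step2 (lifting) are
// fused into one kernel and separated by __syncthreads(). Constants, data
// layout and the remaining lifting passes are simplified.
#define SHARED_THRESHOLD 512

__global__ void kernel_v4dwt_decode_h_shared(float4 *tile, int rw, int sn,
                                             float k_even, float k_odd,
                                             float delta)
{
    __shared__ float4 wavelet[SHARED_THRESHOLD];

    int j = threadIdx.x;                       /* column index, 0 <= j < rw     */
    float4 *line = tile + blockIdx.x * rw;     /* rows 4*i .. 4*i+3 as float4   */

    /* interleave: low band -> even slots, high band -> odd slots */
    if (j < sn)
        wavelet[2 * j] = line[j];
    else
        wavelet[2 * (j - sn) + 1] = line[j];
    __syncthreads();

    /* step1: scale even/odd samples */
    float s = (j & 1) ? k_odd : k_even;
    wavelet[j].x *= s; wavelet[j].y *= s;
    wavelet[j].z *= s; wavelet[j].w *= s;
    __syncthreads();

    /* step2: one lifting pass (the real 9/7 code repeats this pattern with
       the remaining lifting constants, each pass followed by a sync) */
    if (j & 1) {
        int l = j - 1;
        int r = (j + 1 < rw) ? j + 1 : j - 1;  /* crude symmetric extension */
        wavelet[j].x += delta * (wavelet[l].x + wavelet[r].x);
        wavelet[j].y += delta * (wavelet[l].y + wavelet[r].y);
        wavelet[j].z += delta * (wavelet[l].z + wavelet[r].z);
        wavelet[j].w += delta * (wavelet[l].w + wavelet[r].w);
    }
    __syncthreads();

    line[j] = wavelet[j];                      /* write the reconstructed line back */
}
```

A launch of the form `kernel_v4dwt_decode_h_shared<<<rh / 4, rw>>>(...)` matches the mapping in point 2; when rw is at or above the threshold, the corresponding `_global_` variant keeps the wavelet array in global memory instead.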
# Performance/Results

## Inverse DWT

These are the performance results for [commit 84f53565](https://gitorious.org/~aditya12agd5/openjpeg-optimization/aditya12agd5s-openjpeg-optimization/commit/84f53565c0a5578ad55dc95652a40c335f864e75). The timing measurements are performed using `clock_gettime(CLOCK_MONOTONIC, ...)`; this is a monotonically increasing timer without drift adjustments, and it is standard for measuring the execution time of asynchronous events such as CUDA memory transfers or kernel calls. The platform is an Nvidia GeForce GTX 580 GPU and an Intel Core i7 920 (2.67 GHz) CPU. The table below splits the GPU code timings into phases for each test image.

| Parameters (down) / Test images (right) | sintel_2k.j2k | oldtowncross.j2k | crowdrun.j2k | duckstakeoff.j2k |
| --- | --- | --- | --- | --- |
| Memory Setup Time (secs) | 0.054404 + 0.001611 + 0.001551 = 0.057566 | 0.055298 + 0.002517 + 0.001976 = 0.059791 | 0.054851 + 0.001928 + 0.001972 = 0.058751 | 0.055063 + 0.001997 + 0.001954 = 0.059014 |
| Computation Time (secs) | 0.002360 + 0.002307 + 0.002309 = 0.006976 | 0.003153 + 0.003737 + 0.003089 = 0.009979 | 0.003108 + 0.003283 + 0.003057 = 0.009448 | 0.003099 + 0.004512 + 0.003135 = 0.010746 |
| Output Memory Transfer Time (secs) | 0.002965 + 0.002913 + 0.002833 = 0.008711 | 0.003318 + 0.003231 + 0.003295 = 0.009844 | 0.003315 + 0.003684 + 0.003278 = 0.010277 | 0.003285 + 0.003222 + 0.003304 = 0.009811 |
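For reference, a minimal sketch of how such a phase can be timed with `CLOCK_MONOTONIC` is shown below. The key point is that the device must be synchronized before the second timestamp is taken, otherwise only the (asynchronous) launch overhead is measured. The helper names are illustrative, not the instrumentation actually used in the commit:

```cuda
#include <time.h>
#include <cuda_runtime.h>

/* Illustrative helper: difference between two timespec values in seconds. */
static double elapsed_secs(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) * 1e-9;
}

/* Time one GPU phase (e.g. the decode_h/decode_v kernel launches). */
double time_gpu_phase(void (*launch_phase)(void))
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    launch_phase();                /* kernel launches / async copies are only queued here */
    cudaDeviceSynchronize();       /* wait until all queued GPU work has finished         */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return elapsed_secs(t0, t1);
}
```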