Here we present a CUDA-accelerated implementation of the Block Matching and 3D Filtering (BM3D) image denoising method. This project offers a high-performance solution for image denoising, harnessing the computational power of NVIDIA GPUs.
Author: Salma Shaik <[email protected]>
In this project, the state-of-the-art image denoising algorithm, Block Matching and 3D Filtering (BM3D), has been implemented in CUDA on NVIDIA GPUs. The implementation was tested against other open-source alternatives, including OpenCV, achieving roughly a 20% speedup over the latter and demonstrating real-time video denoising capabilities.
The BM3D algorithm is a pioneering approach to image denoising that relies on collaborative filtering in the transform domain. Since its introduction in 2007, it has remained a state-of-the-art reference method. The algorithm consists of two steps, each comprising three main stages:
- Block Matching: Grouping similar patches in the 2D image into a 3D data array, forming what is referred to as a "group."
- Collaborative Filtering: Applying a 3D transform to the group, producing a sparse representation in the transform domain that is then filtered. An inverse transformation is used to convert the filtered data back into the image domain.
- Image Reconstruction: Returning the filtered patches of each group to their original positions and averaging the overlapping estimates, since each pixel may receive multiple updates (a sketch of this aggregation pass follows the list).
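Because overlapping patches yield several estimates of the same pixel, the reconstruction stage is naturally expressed as a weighted accumulation followed by a normalization pass. Below is a minimal CUDA sketch of that aggregation; the kernel names, buffer layout, and per-patch weights are assumptions chosen for illustration, not the kernels used in this project.

```cuda
// aggregation_sketch.cu -- illustrative only; names and layout are assumptions,
// not this project's actual kernels.
#include <cuda_runtime.h>

constexpr int PATCH = 8;   // patch side length used in this project

// One thread per (patch, pixel) pair: scatter the filtered pixel back to its
// original image position, weighted so that overlapping estimates average out.
__global__ void aggregatePatches(const float* __restrict__ filtered,  // [numPatches][PATCH*PATCH]
                                 const int2*  __restrict__ positions, // top-left corner of each patch
                                 const float* __restrict__ weights,   // per-patch aggregation weight
                                 float* numerator,                    // accumulates weight * value
                                 float* denominator,                  // accumulates weight
                                 int numPatches, int width)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numPatches * PATCH * PATCH) return;

    int p      = idx / (PATCH * PATCH);        // which patch
    int within = idx % (PATCH * PATCH);        // which pixel inside the patch
    int px = positions[p].x + within % PATCH;
    int py = positions[p].y + within / PATCH;

    float w = weights[p];
    atomicAdd(&numerator[py * width + px],   w * filtered[idx]);
    atomicAdd(&denominator[py * width + px], w);
}

// Final pass: each pixel of the estimate is the weighted average of all
// patch estimates that covered it.
__global__ void normalizeEstimate(const float* numerator, const float* denominator,
                                  float* estimate, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels && denominator[i] > 0.0f)
        estimate[i] = numerator[i] / denominator[i];
}
```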
The algorithm runs this three-stage procedure twice. In the first run, the noisy image is processed with hard thresholding of the sparse transform coefficients, producing a basic estimated image. In the second run, the same procedure is repeated with a Wiener filter in place of hard thresholding: Wiener coefficients are computed from the basic estimate and applied to the transform coefficients of the original noisy image. The second run treats the energy spectrum of the first output as accurate, which makes the Wiener shrinkage more effective than hard thresholding.
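Elementwise, the Wiener coefficient for each 3D-transform coefficient is the squared basic-estimate coefficient divided by that same quantity plus the noise variance. The sketch below is illustrative only, assuming both groups have already been transformed; the kernel name and signature are placeholders rather than this project's actual code.

```cuda
// wiener_sketch.cu -- illustrative elementwise Wiener shrinkage, assuming the
// 3D-transform coefficients of the basic-estimate group and the noisy group
// are already computed. Not this project's actual kernel.
#include <cuda_runtime.h>

__global__ void wienerShrink(const float* __restrict__ basicCoeffs, // T(basic estimate group)
                             float* noisyCoeffs,                    // T(noisy group), shrunk in place
                             float sigma,                           // noise standard deviation
                             int numCoeffs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numCoeffs) return;

    // Empirical Wiener coefficient: energy of the basic-estimate coefficient
    // divided by that energy plus the noise variance.
    float e = basicCoeffs[i] * basicCoeffs[i];
    float w = e / (e + sigma * sigma);

    // Shrink the noisy coefficient; the inverse 3D transform of the result
    // yields the patch estimates for this group.
    noisyCoeffs[i] *= w;
}
```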
The noisy image is partitioned into a set of overlapping reference patches using a sliding-window approach. Each patch has a size of 8x8 with a default stride of 3. A local window of 64x64 around the reference patch is used to search for patches that closely match the reference patch.
For example, a 512x512 input image yields ((512 - 8)/3 + 1)^2 = 169^2 = 28,561 reference patches. Each CUDA thread is assigned a reference patch, and within each thread a local window of 64x64 is searched for the closest matching patches. The distance metric employed for matching is the L2-distance in pixel space, which simplifies both computation and implementation.
The result is a stack of patches containing the closest matching patches for each reference patch.
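To make the thread-per-reference-patch mapping concrete, here is a minimal block-matching kernel sketch. It assumes a single-channel float image already on the device; the kernel name, the fixed K-best selection, and the output layout are illustrative assumptions, not this project's exact implementation.

```cuda
// block_matching_sketch.cu -- one thread per reference patch, searching a
// local window for the K closest patches by L2 distance. Illustrative only.
#include <cuda_runtime.h>
#include <cfloat>

constexpr int PATCH  = 8;    // patch side length
constexpr int STRIDE = 3;    // stride between reference patches
constexpr int WINDOW = 64;   // search window side length
constexpr int K      = 16;   // number of matches kept per reference patch

__global__ void blockMatch(const float* __restrict__ image, int width, int height,
                           int2* __restrict__ matches)   // [numRef][K] top-left corners
{
    int refsX = (width  - PATCH) / STRIDE + 1;
    int refsY = (height - PATCH) / STRIDE + 1;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= refsX * refsY) return;

    int rx = (tid % refsX) * STRIDE;   // reference patch top-left corner
    int ry = (tid / refsX) * STRIDE;

    float bestDist[K];
    int2  bestPos[K];
    for (int k = 0; k < K; ++k) { bestDist[k] = FLT_MAX; bestPos[k] = make_int2(rx, ry); }

    // Scan every candidate patch inside a WINDOW x WINDOW neighbourhood.
    int x0 = max(0, rx - WINDOW / 2), x1 = min(width  - PATCH, rx + WINDOW / 2);
    int y0 = max(0, ry - WINDOW / 2), y1 = min(height - PATCH, ry + WINDOW / 2);

    for (int cy = y0; cy <= y1; ++cy)
        for (int cx = x0; cx <= x1; ++cx) {
            float d = 0.0f;                       // L2 distance in pixel space
            for (int j = 0; j < PATCH; ++j)
                for (int i = 0; i < PATCH; ++i) {
                    float diff = image[(ry + j) * width + rx + i]
                               - image[(cy + j) * width + cx + i];
                    d += diff * diff;
                }
            // Insert into the sorted list of the K best candidates so far.
            if (d < bestDist[K - 1]) {
                int k = K - 1;
                while (k > 0 && bestDist[k - 1] > d) {
                    bestDist[k] = bestDist[k - 1];
                    bestPos[k]  = bestPos[k - 1];
                    --k;
                }
                bestDist[k] = d;
                bestPos[k]  = make_int2(cx, cy);
            }
        }

    for (int k = 0; k < K; ++k) matches[tid * K + k] = bestPos[k];
}
```

Launching one thread per reference patch with, say, 256 threads per block covers the 28,561 reference patches of a 512x512 image in 112 blocks.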
BM3D rests on the observation that an image has a locally sparse representation in the transform domain, and that this sparsity is enhanced by grouping similar 2D image patches into 3D groups. This implementation draws on the open-source analysis cited below, which discusses the choice of all parameters and confirms their optimality, rewrites the method in a more transparent notation, and provides an index mapping that notation back to the original paper's.
- The original BM3D algorithm paper
- Video denoising by sparse 3D transform-domain collaborative filtering
- An Analysis and Implementation of the BM3D Image Denoising Method
- Adaptive BM3D Algorithm for Image Denoising Using Coefficient of Variation
For comprehensive instructions on the project and how to use BM3D-GPU, please refer to the Documentation.
To view performance benchmarks and comparisons with other implementations, please visit the Benchmarks section.
This project is licensed under the MIT License - see the LICENSE file for details.
Note: If you wish to contribute or report issues, please refer to our Contribution Guidelines.