Two-way partitioning of a recursive Gaussian filter in CUDA
 Chang Won Lee^{1},
 Jaepil Ko^{1} and
Tae-Young Choe^{1}
https://doi.org/10.1186/1687-5281-2014-33
© Lee et al.; licensee Springer. 2014
 Received: 15 November 2013
 Accepted: 10 June 2014
 Published: 30 June 2014
Abstract
Recursive Gaussian filters are more efficient than basic Gaussian filters when the filter window size is large. Since the computation of a point can start only after the computation of its neighborhood points, recursive Gaussian filters are line-oriented; thus, the degree of parallelism is restricted by the number of lines in the image. In order to increase the parallelism of recursive Gaussian filters, we propose a two-way partitioned recursive Gaussian filter. The proposed filter partitions a line into two blocks and a point, which is used to carry the Gaussian blur effect across the two blocks. This partition increases the parallelism because the filter is applied to the two blocks in parallel. Experimental results show that the process time of the proposed filter is half that of a one-way parallel recursive Gaussian filter, while the peak signal-to-noise ratio is maintained within an acceptable range of 26 to 33 dB.
Keywords
 Graphic Processing Unit
 Gaussian Filter
 Global Memory
 Finite Impulse Response Filter
 Boundary Pixel
Introduction
A Gaussian blur filter, or Gaussian filter, is one of the fundamental and most widely used image processing techniques. A typical use of the filter is denoising. It is also used as a preprocessing step for down/up sampling, edge detection [1, 2], or scale-space representation [3]. Contrast enhancement techniques such as Retinex [4–6] or unsharp filters [7] are other uses of Gaussian blur, where it approximates the illumination component of an image at a large scale.
According to the definition of the Gaussian filter, the filtered value of a pixel in a two-dimensional image is computed from nearby pixel values. The range of pixels to be used is determined by the filter window size N × N. It is known that the filtered value of a pixel can be computed separably, using the pixels in the horizontal line and the vertical line that pass through the pixel. Let us call the Gaussian filter that computes the filtered value of a pixel using these two crossing lines the finite impulse response (FIR) filter. The FIR filter takes 4N × width × height computation steps if the size of the given image is width × height.
Although the FIR filter implements the Gaussian blur filter in the exact discrete way, its processing time depends on the filter window size N × N. Recursive Gaussian filters were developed in order to eliminate the effect of the filter window size. A recursive Gaussian filter computes the filtered value of a pixel from the previously filtered values of its neighborhood pixels [8]. Since those values already contain approximated data from the whole range of the filter window, the window size does not appear in the number of computation steps. The recursive Gaussian filter proposed by van Vliet et al. takes about 32 × width × height computation steps. Thus, recursive Gaussian filters are faster than the FIR filter if the filter window size is greater than 8. Unfortunately, recursive Gaussian filters introduce dependences between pixels and restrict the degree of parallelism: pixel p[i][j] must wait until the filtered value of pixel p[i-1][j] has been computed in the row-oriented step.
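The dependence structure is visible in a minimal CPU-side sketch of one forward pass of a third-order recursive approximation. The coefficients b1, b2, b3 below are placeholders, not the values derived by van Vliet et al.; B is chosen only so that the filter has unit DC gain (a constant signal passes through unchanged):

```c
/* One forward pass of a third-order recursive (IIR) Gaussian approximation:
       w[x] = B*in[x] + b1*w[x-1] + b2*w[x-2] + b3*w[x-3].
   Each output depends on the three previous outputs, which is exactly the
   precedence dependency that restricts parallelism within a line.
   The backward pass is the same recursion run right-to-left over w[]. */
void recursive_forward(const double *in, double *w, int n,
                       double b1, double b2, double b3) {
    double B = 1.0 - (b1 + b2 + b3);       /* unit DC gain */
    for (int x = 0; x < n; x++) {
        /* replicate in[0] for the missing left neighbours (boundary rule) */
        double w1 = (x >= 1) ? w[x - 1] : in[0];
        double w2 = (x >= 2) ? w[x - 2] : in[0];
        double w3 = (x >= 3) ? w[x - 3] : in[0];
        w[x] = B * in[x] + b1 * w1 + b2 * w2 + b3 * w3;
    }
}
```

Note that the loop body is independent of the window size: the cost per pixel is a fixed handful of multiply-adds regardless of σ.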
As graphic processing unit (GPU) cores can be used for general-purpose computation, many image processing algorithms have been implemented on general-purpose GPUs (GPGPU). NVIDIA provides the Compute Unified Device Architecture (CUDA) as a GPGPU architecture and development environment. Since recursive Gaussian filters create dependences between pixels, it is conventional to allocate one line to each thread. The basic Gaussian filter, on the other hand, can allocate one pixel per thread, which fully utilizes the available cores. Thus, recursive Gaussian filters show better performance only in the restricted setting where the number of cores is small and the filter window size is large.
We propose a refined recursive Gaussian filter for GPGPU that partitions the working domain in two ways. The proposed filter combines a recursive Gaussian filter and the FIR filter in order to minimize the error that occurs by splitting the working domain. The remainder of this paper is structured as follows: the ‘Problem environment and related work’ section explains the problem environment and reviews related work. The ‘Proposed filter’ section gives details of the proposed refined recursive Gaussian filter. The ‘Experimental results’ section gives the experimental results of the proposed filter. Finally, the ‘Conclusions’ section concludes with future work.
Problem environment and related work
Proposed filter
Two-way partitioning
In order to increase the parallelism of a recursive Gaussian filter, we propose a two-way recursive Gaussian filter. The filter partitions an image line by line and then partitions each line into three blocks. The first and third blocks use the recursive Gaussian filter, while the second block uses a general Gaussian filter. One major problem is that all pixels in a line are related by a precedence dependency, as shown in Equations 3 and 4. However, we note two facts that circumvent the dependencies in a line:

1. When out[x,y] is computed in the backward pass, out[x+1,y] through out[x+3,y] are required. Pixels out[x+1,y] through out[x+3,y] are the results of a row-oriented step. If those pixels have already been computed using the basic Gaussian filter, out[x,y] has no precedence dependency on out[x+k,y] for k > 3.

2. There is no priority between the forward and backward passes; either pass can start first.
These facts motivate us to partition a line into the following three parts:

- B_{l}: the left half of the line, from index 0 to width/2 - 1, where width is the number of horizontal pixels in the picture.

- P_{c}: the pixel located at width/2.

- B_{r}: the right half of the line, from index width/2 + 1 to width - 1.
Since the forward and backward passes reflect the left-side and right-side neighborhood pixels, respectively, the execution order of the two passes does not affect the filtering results. After finishing the first pass, that is, the computation of w[ ], each block starts a pass in the reverse direction. Care must be taken when filtering boundary pixels. Boundary pixels such as w[0,y], w[1,y], and w[2,y] need the unavailable values w[-1,y], w[-2,y], and w[-3,y]; since these are not available, w[0,y] is used instead. When the direction of computation is changed, pixel out[width/2 - 1,y] needs the values out[width/2,y], out[width/2 + 1,y], and out[width/2 + 2,y]. Although these indexes have been computed in B_{r}, they do not hold the same values as in Equation 4, because the needed values are generated by the second pass while the values in B_{r} come from the first pass.
Algorithm 1 shows the detailed process of the row-oriented step of the proposed two-way filter. Three threads are assigned to each row, calculated in line 1. Thus, a block takes charge of adjacent lines in an image. For example, if four lines are allocated to each block (blockDim.x = 4), the third block filters lines 12 ∼ 15. The three threads are identified by the preassigned value threadIdx.y. If a thread has index (blockIdx.x = 2, threadIdx.x = 2, threadIdx.y = 1), the thread takes charge of the central pixel P_{c} of line 10. Neighborhood values for boundary pixels are initialized at lines 3, 9, 14, and 21. The intermediate array w[row][ ] is divided into two subarrays w_{l} and w_{r} in order to prevent a race condition in w[row][width/2 - 2] ∼ w[row][width/2 + 2], which are overlapped and accessed concurrently by two threads.
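A simplified CPU-side sketch may clarify the partition. For brevity it replaces the third-order recursion with a first-order one (smoothing coefficient a) and uses our own function names, so it illustrates only the partitioning idea, not Algorithm 1 itself:

```c
/* Two-way row step (sketch): the centre pixel P_c at mid = width/2 is
   filtered first with a direct FIR sum, so both halves see a consistent
   value there; then the left block B_l = [0, mid-1] and the right block
   B_r = [mid+1, width-1] each run an independent recursive pass.  In the
   CUDA version the three parts run in three separate threads. */
void twoway_row(const double *in, double *out, int width,
                const double *k, int r, double a) {
    int mid = width / 2;

    /* P_c: direct FIR value (k is a normalized kernel of half-width r) */
    double acc = 0.0;
    for (int i = -r; i <= r; i++) {
        int xi = mid + i;
        if (xi < 0) xi = 0;
        if (xi >= width) xi = width - 1;
        acc += k[i + r] * in[xi];
    }
    out[mid] = acc;

    /* B_l: forward pass, left to right, seeded by replicated in[0] */
    double prev = in[0];
    for (int x = 0; x < mid; x++) {
        out[x] = (1.0 - a) * in[x] + a * prev;
        prev = out[x];
    }

    /* B_r: backward pass, right to left, seeded by replicated in[width-1] */
    prev = in[width - 1];
    for (int x = width - 1; x > mid; x--) {
        out[x] = (1.0 - a) * in[x] + a * prev;
        prev = out[x];
    }
}
```

Because B_{l} and B_{r} never write the same index and P_{c} is computed by its own thread, the two recursive loops can run concurrently without a race.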
The time complexity of the proposed filter is determined by two factors: line width and filter window size. Assuming there are enough processors, three processors are allocated to each line. If l = max(width, height), then 3l processors are required in order to maximize performance. The processor for block B_{l} or B_{r} computes the l/2 pixels of its half-line in each pass. Since each pixel requires eight multiplications and four additions, 6l operations are required per pass. The center pixel P_{c} requires 3N operations, where N × N is the filter window size. Thus, max(6l, 3N) + 6l operations are required per step, and max(12l, 6N) + 12l steps are required for the whole filter. In short, the process time of the filter is halved, or the speedup doubled, compared to that of a one-way recursive Gaussian filter if the number of cores is equal to or greater than 3l.
Experimental results
Experimental environment
Item       Description
CPU        Intel Core i5-750 (2.67 GHz)
GPU        GeForce GTX 670 (GK104): compute capability 3.0; 1,344 CUDA cores; graphics clock 915 MHz; processor clock 980 MHz
Compiler   Visual Studio 2010
Library    CUDA 5.0
Image      512 × 512 bitmap
Gaussian filters are usually used to denoise before differential operations in image processing or to achieve image blurring effects. For differential operations, the typical filter window size is 3 × 3. To obtain a proper blurring effect, the filter window size should be chosen carefully according to the image size. One rule of thumb suggests 1% ∼ 5% of the image size for the Gaussian half-width σ and two times σ for the filter window size [25]. If we take 3% for σ on a 512 × 512 image, the filter window size becomes about 30. Bilateral filters require about 10% for σ, so the mask size becomes 102 for the same resolution; larger images require much larger mask sizes. In this paper, we take 3 × 3 and 30 × 30 mask sizes for the Lena image.
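As a hypothetical helper, the rule of thumb above can be written as follows (the function name and the rounding choice are ours, for illustration only):

```c
/* Rule-of-thumb window sizing: sigma is p percent of the image size,
   and the window is about 2*sigma, rounded to the nearest integer. */
int window_from_image(int image_size, double p) {
    double sigma = image_size * p / 100.0;   /* e.g. 3% of 512 is 15.36 */
    return (int)(2.0 * sigma + 0.5);         /* window of about 2*sigma */
}
```

For a 512 × 512 image this gives a window of about 30 at p = 3 and about 102 at p = 10, matching the figures quoted above.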
Validity of the proposed filter
PSNR comparison
Image size   Filter window size   One-way vs. two-way   FIR vs. one-way   FIR vs. two-way
SD           3                    29.4                  26.2              29.7
SD           30                   31.1                  32.8              31.4
SD           90                   29.5                  33.4              29.4
HD           3                    33.0                  30.9              33.6
HD           30                   30.6                  39.7              30.8
HD           90                   24.7                  33.0              26.3
Full HD      3                    30.5                  26.6              29.2
Full HD      30                   32.9                  36.7              33.2
Full HD      90                   30.1                  34.8              31.4
Performance comparison
Measure using cudaEventRecord()
Since parallel CUDA code runs on a GPU, conventional timing functions such as gettimeofday() or clock() cannot measure its process time correctly. CUDA provides the cudaEvent family of functions in order to measure process time on a GPU.
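The usual pattern looks like the following sketch, which uses a trivial placeholder kernel of our own (it requires the CUDA toolkit and a GPU to compile and run; it is not the paper's measurement code):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Placeholder kernel standing in for the filter kernel being timed.
__global__ void dummyKernel(float *p) { p[threadIdx.x] *= 2.0f; }

int main(void) {
    float *d;
    cudaMalloc(&d, 32 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);        // enqueue start marker in the stream
    dummyKernel<<<1, 32>>>(d);        // kernel to be timed
    cudaEventRecord(stop, 0);         // enqueue stop marker in the same stream
    cudaEventSynchronize(stop);       // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in ms
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Because the events are recorded in the GPU stream itself, the measurement excludes host-side overhead that wall-clock functions would include.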
Measure using Nsight
Nsight is a development tool provided by NVIDIA that reports the occupancy of parallel applications on a GPU [27]; thus, Nsight can be used to investigate GPU utilization. Occupancy is the percentage of active warps against the maximum number of active warps, where a warp is a group of 32 threads. The number of active warps is determined by the number of allocated threads per block, the number of available registers, and the amount of shared memory. Allocating more threads per block increases occupancy. Figure 10b shows an occupancy comparison between the one-way and two-way recursive Gaussian filters. Since the total number of threads in the one-way filter is fixed to the image width or height, that is, 512 in the case of the Lena image, many CUDA cores are idle and the CUDA occupancy stays low. In contrast, the two-way recursive Gaussian filter increases its occupancy as the number of lines per block grows beyond 16, as shown in Figure 10b.
Improvement using local memory
Figure 9 shows that the performance of the recursive Gaussian filters degrades when the number of lines per block exceeds 16. Imbalanced block allocation to SMs and sequential global memory access are assumed to be the main reasons for the degradation. Since the block allocation policy varies between graphic card models, we focus on finding efficient memory accesses in order to improve the proposed filter. CUDA provides the following four-level hierarchical memory model: constant memory, per-thread local memory, per-block shared memory, and global memory [28]. Constant memory is used to store constant values; thus, the constants in Equations 3 and 4 are stored in constant memory. Image data is stored in global memory in order to be accessible by all threads. During a forward pass, each pixel in array in[ ] is read only once or twice, depending on the filter window size, so copying it to shared or local memory is not essential. Since array w[ ] is generated and frequently accessed by each thread, it is a good target for placement in local memory. Pixel P_{c} is read six times by two adjacent threads; since P_{c} is cached after being read, its memory location does not strongly affect the filtering performance. As a result, the major variables are placed as follows:
Kernel process time components
Lines per block   Process time (ms)      Data request percentage   Execution and sync (ms)
                  Local      Global      Local      Global         Local      Global
4                 15.2       15.1        1.2%       12.3%          15.0       13.5
8                 8.4        8.9         2.9%       25.2%          8.2        6.8
16                5.1        5.9         6.1%       41.9%          4.8        3.5
32                4.9        8.4         10.2%      48.9%          4.4        4.3
64                5.1        8.5         10.2%      48.9%          4.4        4.3
Conclusions
In this paper, we have observed that the line-oriented recursive Gaussian filter can be partitioned further and have proposed a two-way recursive filter that increases CUDA GPGPU utilization. The proposed filter divides each line into two sub-lines and a central point. The central point compensates for the mismatches caused by dividing a line into two parts. PSNR measurements show that the quality of the filter lies between the non-recursive Gaussian filter and the one-way line-oriented recursive Gaussian filter. The process time of the proposed filter is reduced by half when the number of lines per block is set to 16 or greater.
This research can be extended in several directions. Starting from the proposed two-way recursive Gaussian filter, a line can be partitioned into three or more blocks, where quality and speedup are the major concerns. Another consideration is the central points: the central or boundary points between blocks in a partitioned line can be designed differently, since they are required as boundary points of each block. If the recursive equations are designed carefully, it could be possible to partition a line without any central or boundary point.
Declarations
Authors’ Affiliations
References
1. Canny J: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8(6):679-698.
2. Luo Y, Duraiswami R: Canny edge detection on NVIDIA CUDA. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'08). IEEE, Anchorage, AK, USA; 2008:1-8.
3. Lowe DG: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60(2):91-110.
4. Land EH: The Retinex theory of color vision. Sci. Am. 1977, 237(6):108-128. doi:10.1038/scientificamerican1277-108
5. Jobson DJ, Rahman Z, Woodell GA: Properties and performance of a center/surround retinex. IEEE Trans. Image Process. 1997, 6(3):451-462. doi:10.1109/83.557356
6. Jobson DJ, Rahman ZU, Woodell GA: A multiscale retinex for bridging the gap between colour images and the human observation of scenes. IEEE Trans. Image Process. 1997, 6(7):965-976. doi:10.1109/83.597272
7. Haralick R, Shapiro L: Computer and Robot Vision. Addison-Wesley, Boston, USA; 1992.
8. Young IT, van Vliet LJ: Recursive implementation of the Gaussian filter. Signal Process. 1995, 44:139-151. doi:10.1016/0165-1684(95)00020-E
9. Shapiro LG, Stockman GC: Image smoothing. In Computer Vision. Prentice Hall, Upper Saddle River, NJ, USA; 2001:137.
10. Hale D: Recursive Gaussian filters. CWP-546; 2006. http://www.cwp.mines.edu/Meetings/Project06/cwp546.pdf
11. Podlozhnyuk V: Image convolution with CUDA. NVIDIA Corporation White Paper, June 2007. https://cluster.earlham.edu/trac/bccdng/export/2037/branches/cuda/trees/software/bccd/software/cuda0.2.1221/sdk/projects/convolutionSeparable/doc/convolutionSeparable.pdf
12. Vliet LJV, Young IT, Verbeek PW: Recursive Gaussian derivative filters. In Proceedings of the Fourteenth International Conference on Pattern Recognition. IEEE, Brisbane, Queensland, Australia; 1998:509-514.
13. NVIDIA Corporation: White Paper: NVIDIA GeForce GTX 680. 2012. http://www.geforce.com/Active/en_US/en_US/pdf/GeForceGTX680WhitepaperFINAL.pdf
14. Jaaskelainen P, de La Lama CS, Huerta P, Takala JH: OpenCL-based design methodology for application-specific processors. In 2010 International Conference on Embedded Computer Systems. IEEE, Samos, Greece; 2010:223-230.
15. CUDA parallel computing platform. Jan. 2014. http://www.nvidia.com/object/cuda_home_new.html
16. Foster I, Zhao Y, Raicu I, Lu S: Cloud computing and grid computing 360-degree compared. In Grid Computing Environments Workshop (GCE'08). IEEE; 2008:1-10.
17. Su Y, Xu Z, Jiang X: GPGPU-based Gaussian filtering for surface metrological data processing. In IV. IEEE Computer Society; 2008:94-99. doi:10.1109/IV.2008.14
18. Trabelsi A, Savaria Y: A 2D Gaussian smoothing kernel mapped to heterogeneous platforms. In 2013 IEEE 11th International New Circuits and Systems Conference (NEWCAS). IEEE, Paris, France; 2013:1-4.
19. Ryu J, Nishimura TH: Fast image blurring using lookup table for real-time feature extraction. In IEEE International Symposium on Industrial Electronics (ISIE 2009). IEEE, Seoul, Korea; 2009:1864-1869.
20. Deriche R: Recursively implementing the Gaussian and its derivatives. 1993. http://hal.archivesouvertes.fr/docs/00/07/47/78/PDF/RR1893.pdf
21. Ma Y, Xie K, Peng M: A parallel Gaussian filtering algorithm based on color difference. In IPTC. IEEE; 2011:51-54. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6099690
22. Nehab D, Maximo A, Lima RS, Hoppe H: GPU-efficient recursive filtering and summed-area tables. ACM Trans. Graph. 2011, 30(6):176:1-176:11.
23. Adams A, Gelfand N, Dolson J, Levoy M: Gaussian KD-trees for fast high-dimensional filtering. In ACM Transactions on Graphics (TOG), Volume 28. ACM; 2009:21.
24. Paris S, Durand F: A fast approximation of the bilateral filter using a signal processing approach. In Computer Vision - ECCV 2006. Springer, Graz, Austria; 2006:568-580.
25. Park SK, Kim BS, Chung EY, Lee KH: A new illumination estimation method based on local gradient for Retinex. In IEEE International Symposium on Industrial Electronics (ISIE 2009). IEEE, Seoul, Korea; 2009:569-574.
26. Luitjens J, Rennich S: CUDA warps and occupancy. GPU Computing Webinar; 2011.
27. NVIDIA Nsight Visual Studio Edition. June 2014. https://developer.nvidia.com/nvidiansightvisualstudioedition
28. NVIDIA: Programming model. In CUDA C Programming Guide, Design Guide. NVIDIA; 2013:12-13. http://docs.nvidia.com/cuda/cudacprogrammingguide/index.html#axzz34o3nNnUF
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.