Recursive non-local means filter for video denoising

In this paper, we propose a computationally efficient algorithm for video denoising that exploits temporal and spatial redundancy. The proposed method is based on non-local means (NLM). NLM methods have been applied successfully in various image denoising applications. In the single-frame NLM method, each output pixel is formed as a weighted sum of the center pixels of neighboring patches, within a given search window. The weights are based on the patch intensity vector distances. The process requires computing vector distances for all of the patches in the search window. Direct extension of this method from 2D to 3D, for video processing, can be computationally demanding. Note that the size of a 3D search window is the size of the 2D search window multiplied by the number of frames being used to form the output. Exploiting a large number of frames in this manner can be prohibitive for real-time video processing. Here, we propose a novel recursive NLM (RNLM) algorithm for video processing. Our RNLM method takes advantage of recursion for computational savings, compared with the direct 3D NLM. However, like the 3D NLM, our method is still able to exploit both spatial and temporal redundancy for improved performance, compared with 2D NLM. In our approach, the first frame is processed with single-frame NLM. Subsequent frames are estimated using a weighted sum of pixels from the current frame and a pixel from the previous frame estimate. Only the single best matching patch from the previous estimate is incorporated into the current estimate. Several experimental results are presented here to demonstrate the efficacy of our proposed method in terms of quantitative and subjective image quality.


Introduction
Digital videos are invariably corrupted by noise during acquisition.Digital video tends to have a lower signal-tonoise ratio (SNR) than static images, due to the short integration times needed to achieve desired frame rates [1].Low-light conditions and small camera apertures tend to worsen the problem.In the case of some medical imagery, like x-ray images and video, short integration times are essential to limit the x-ray dose to the patient.While digital video may suffer from lower SNR, it also provides 3D data that often has significant temporal redundancy [2].Video denoising algorithms seek to reduce noise by exploiting the both spatial and temporal correlation in the signal [1].The non-local means (NLM) algorithm [3] for image denoising has received significant attention in the image processing community.This may be, in large part, because of generally good performance, and its intuitive and conceptually simple nature.The standard NLM algorithm is introduced by Buades et al. in [3].The NLM method exploits self-similarity that appears in most natural images for noise reduction.In the single-frame NLM method, each output pixel is formed as a weighted sum of the center pixels of neighboring patches, within a given search window.The weights are based on similarity with respect to the reference patch.The similarity is measured by means of patch intensity vector distances.Pixels from patches with higher similarity (lower vector distances) are given more weight, using a negative exponential weighting.One or more tuning parameters are used to control the weighting.
Many variations of the NLM method have been proposed to reduce the computational complexity and/or improve the denoising performance.In the work of Mahmoudi and Sapiro [4], dissimilar neighborhoods are excluded from the weighted sum.Dissimilar blocks are identified based on mean value and gradient.This may improve performance, and it reduces the computational cost.Wang et al. [5] proposed an efficient summed square image scheme as another means to accelerate the patch similarity computations.A cluster tree arrangement has been used to group similar patches in [6].A method using preselection of the most similar voxels, multithreading, and blockwise implementation is presented in [7].Furthermore, adaptive smoothing neighborhoods are presented in [8], and a kernel regression method is presented in [9].The method in [10] uses a spatially recursive moving-average filter to compute the Euclidean distances.The estimation of the mean square error (MSE) from a noisy image is performed using an analytical expression based on Stein's unbiased risk [11].Another speed enhancement, based on probabilistic early termination, is proposed in [12].Karnati et al. [13] proposed a multiresolution approach requiring fewer comparisons.
Many methods, originally proposed for single-image denoising, have been adapted to video denoising.Among these are the NLM method, which has been applied successfully to image sequences in [14] and [15].Han et al. [16] introduced the dynamic non-local means (DNLM) video denoising method, which is based on Kalman filter theory.The basic idea of this filter is to use information from the past video frames to restore the current frame, combining the NLM and Kalman filtering algorithms.However, the computational complexity is still relatively high with this method.Another example of a 2D denoising method, later extended to 3D, is block-matching and 3D (BM3D) filtering [17].BM3D generally outperforms NLM in single-image denoising but has a higher computational complexity because of the 3D transforms required.The BM3D method, like NLM, uses vector distances between 2D image blocks.The most similar blocks are stacked into a 3D group and then filtered through a transform-domain shrinkage operation.The BM3D filtering has been extended to video denoising in the video BM3D (VBM3D) algorithm described in [18].While there may be many variations of patch-based image denoising algorithms.What they share in common is that they require computing vector distances between each reference patch and neighboring patches.Direct extension of such methods from 2D to 3D, for video processing, can be computationally demanding.Note that the size of a 3D search window is the size of the 2D search window multiplied by the number of frames being used to form the output.Exploiting a large number of frames in this manner can be prohibitive for real-time video processing.
In this paper, we propose a novel temporally recursive NLM (RNLM) algorithm for video processing.Our RNLM method takes advantage of temporal recursion for computational savings, compared with the direct 3D NLM.However, like the 3D NLM, our method is still able to exploit both spatial and temporal redundancy for improved performance, compared with 2D NLM.In our approach, the first frame is processed with single-frame NLM.Subsequent frames are estimated using a weighted sum of pixels from the current frame and a pixel from the previous frame estimate.Only the best-matching patch from the previous estimate is incorporated into the current estimate.This is done to maximize the temporal correlation.Our approach shares its recursive nature with DNLM in [16].However, here, we have opted for a much simpler framework, in keeping with the simplicity of the original NLM.Several experimental results are presented here to demonstrate the efficacy of our proposed method in terms of quantitative and subjective image quality.We show that our approach offers a computationally simple approach to video denoising with a performance that rivals much more complex methods.
The remainder of the paper is organized as follows.Section 2 introduces the observation model and some benchmark methods.The proposed RNLM algorithm is presented in Section 3. Experimental results are presented in Section 4. Finally, we offer conclusions in Section 5.

Observation Model
In this section, we present the observation model and notation for our video restoration methods.Some of the key variables used in this paper are defined in Table 1.We use the standard degradation model, treating the noise as additive and signal independent.This is expressed as for i = 1, 2, . . ., N and k = 1, 2, . . ., K. Note that y k (i) represents a pixel in the observed frame in the video sequence.The index i refers to the specific pixel in the spatial domain, and the k denotes the temporal frame number in the image sequence.The variable x k (i) denotes the corresponding pixel in the ideal input frame.The noise is represented by n k (i) ∼ N (0, σ 2 n ), which is assumed to be samples of a zero-mean independent and identically distributed Gaussian random variable, with variance σ 2 n .

Single-frame non-local means filter
We begin by defining the single-frame NLM (SNLM) [3], upon which our method is built.Processing the frames from an image sequence individually, the SNLM output can be expressed as where xk (i) denotes the estimated image at pixel i in frame k.The set ε(i) contains the indices of the pixels within an M s × M s search window centered about pixel i.The variable w k (i, j) is the weight applied to pixel j, when estimating pixel i in frame k.To normalize the weights, the variable W k,i is used, and this is simply the sum of the individual weights.
The SNLM weights are computed based on patch similarity and spatial proximity.In particular, w k (i, j) is computed as and W k,i can be expressed as Note that the variable y k (i) is a vector in lexicographical form containing pixels from an M p × M p patch centered about pixel i in frame k from the sequence {y k (•)}.The variable d(i, j) 2 is the squared Euclidean distance between pixel i and j.The parameter σ 2 y is a tuning parameter to control the decay of the exponential weight function with regard to patch similarity, and σ 2 d is a tuning parameter controlling the decay with regard to spatial proximity between pixels i and j.It can be seen from Eq. ( 3) that the weight given to pixel y k (j) goes down as y k (i) − y k (j) 2 goes up.The weight also goes down with the spatial distance between pixel i and j.Note that the tuning parameter, σ 2 y , is often set close to the noise variance.Studies of filter parameter selection can be found in [3,[19][20][21].

3D non-local means filter
In this section, we introduce a direct extension of 2D SNLM to 3D, to use as an additional performance benchmark.The 3D NLM uses a spatio-temporal search window to provide improved video denoising.In our approach, the patches remain 2D, but the search window is extended to 3D.This version of the 3D NLM is given by where w k,m (i, j) is the weight for pixel j of frame m, when estimating pixel i of frame k.The temporal search window is defined by ψ(k), which is the set of frame indices used in the 3D search window for the estimation of frame k.In our experimental results, we use a causal temporal window comprised of the most recent L frames, and this is represented as The 3D NLM weights are computed as where a new temporal proximity tuning parameter, σ 2 t is introduced.This extra tuning parameter is a natural extension to the spatial proximity parameter σ 2 d .Samples from frame numbers that are far from the estimation sample will receive less weight using an exponential weighting controlled by σ 2 t .A small σ 2 t gives a large penalty for frame difference.The weights are normalized using Other similar extensions of the SNLM to 3D can be found in [22,23].

The proposed RNLM video denoising algorithm definition
The goal of the proposed RNLM method is to effectively exploit spatio-temporal information, as is done with the 3D NLM in Eq. ( 5) but with a computational complexity more in line with the SNLM in Eq. (2).To do so, RNLM estimate is formed as a weighted sum of pixels from the current frame, like Eq. ( 2), but it also includes a previous frame pixel estimate.That is, the current input frame and the prior output frame are used to form the current output.This type of temporal recursive processing helps to exploit the temporal signal correlation, without significantly increasing the search window size or overall computational complexity.Specifically, the estimate for the proposed RNLM is given by where xk−1 (s k (i)) is the previous frame estimate (i.e., frame k − 1) at pixel s k (i) ∈ {1, 2, . . ., N}. Pixel s k (i) is selected from {x k−1 (•)} based on a standard blockmatching algorithm (BMA) with respect to the block in input frame k centered about pixel i.For the selection of s k (i), we allow for a potentially different block and search size from that used for the within-frame processing.In particular, the block-matching block size is N b × N b , with an N s × N s search window.As in Eq. ( 2) for the SNLM, the set ε(i) in Eq. ( 8) contains the indices of the pixels within an M s × M s search window centered about pixel i.The recursive weight in Eq. ( 8) is w x,k (i), and the non-recursive weights, w y,k (i, j), are similar to that for the SNLM.We shall define and discuss all of the weights shortly.The relationship among the various pixels used in the RNLM estimation process is depicted in Fig. 1.Shown are the raw unprocessed frames in parallel with the processed frames.
Output xk (i) is formed using a weighted sum of the input frames pixels, shown on the left, and the best matching previous processed output, shown on the back right.
Let us now define the weights.The non-recursive weights are defined in a manner similar to SNLM.Specifically, these are given by where h y b and h y n are tuning parameters.Here, we do not use the spatial distance weighting term of the SNLM.This could easily be added, but it did not provide improved performance in out experimental results.The recursive weights are given by where h xb and h xn are tuning parameters, and is the residual noise variance associated with xk−1 (s k (i)).
The vector xk−1 (s k (i)) is the M p ×M p patch of pixels about pixel xk−1 (s k (i)) (shown in Fig. 1) in lexicographical vector form.The weight normalization factor here is given by Finally, assuming the noise is independent and identically distributed, the residual noise variance can be computed recursively as follows Note that Eq. ( 12) is not the error variance.Rather it only accounts for the variance of the noise component of the error associated with the estimate xk−1 (s k (i)).
The RNLM weights in Eqs. ( 9) and ( 10) have a total of four tuning parameters, two to govern the non-recursive weights, and two to govern the recursive weights.In the non-recursive weights in Eq. ( 9), the parameter h y b serves the same role as σ 2 y in the SNLM in Eq. ( 3).We refer to this weight as the non-recursive bias error weight.We view ||y k (i) − y k (j)|| 2 as a measure of the bias error in y k (j) with respect to the true sample x k (i) that we are estimating.That is, underlying signal differences between the Fig. 1 Block diagram showing the pixels in the video sequence used in proposed RNLM algorithm.The left side shows the raw input frames, and the right side shows the processed frames.Because of the recursive processing, samples from the both input and processed frames are used to form the current estimate pixels i and j are being quantified by this term.The tuning parameter h y b controls the exponential decay of the weights as a function of this bias error.The noise error associated with y k (j) is given by the constant noise variance σ 2 n .The noise variance in Eq. ( 9) is scaled by the h y n , to control the weight decay as a function of the noise variance.For the recursive weight, we have similar tuning parameters.The term ||y k (i)− xk−1 (s k (i))|| 2 quantifies the bias error associated with xk−1 (s k (i)), and the residual noise variance associated with this estimate is , and may be computed using Eq. ( 12).We give each of these error quantities a tuning parameter, to balance their impact on the resulting filter weights.The bias error for the recursive sample is scaled by h xb , and the residual noise term is scaled by h xn .
We acknowledge that the proposed RNLM does not have the optimal framework of the Kalman filter [16].However, it can exploit temporal signal correlation effectively, as we shall show in Section 4. By choosing the tuning parameters to balance the bias and noise components of the recursive and non-recursive terms, very good performance can be achieved with a low computational complexity.We believe that the RNLM may provide useful solution for many video denoising applications, and we believe it is in keeping with the spirit and simplicity of the original NLM method.

Computational complexity
In this section, we compare the computational complexity of SNLM, 3D NLM, and RNLM by counting the number of multiplications and additions required to compute one processed output pixel.Beginning with the SNLM, the number of floating point multiplies and adds to compute Eq. ( 2 s adds and multiplies is needed to compute and apply the weight normalization in Eq. ( 4).
The 3D NLM algorithm requires LM 2 s floating point multiplies and adds to compute Eq. ( 5).The weights from Eq. ( 6) require LM 2 s vector distances using M 2 p × 1 vectors.This requires approximately 2LM 2 s M 2 p additions/subtractions and LM 2 s M 2 p multiplications.Another LM 2  s adds and multiplies is needed to compute and apply the weight normalization in Eq. (7).Thus, the complexity of the 3D NLM algorithm increases linearly with the number of frames, L.
The process of the RNLM filter is accomplished in several steps.The output in Eq. ( 8) requires M 2 s + 1 floating point multiplies and adds.The computation of the weights in Eqs. ( 9) and ( 10 (11).Note that fast parallel architectures are available for rapid BMA computation [24][25][26].
Note that if we let N s = M s and N b = M p , then RNLM has a computational complexity comparable to 3D NLM for L = 2 frames.Also, if we simply let s k (i) = i (i.e., no BMA option), the RNLM has a computational complexity that is approximately the same a SNLM.However, we have found that the BMA matching significantly improves performance in video sequences with significant amounts of motion.Furthermore, we have found that it is generally advantageous to choose N b > M p .This helps to provide a better match for the the important sample xk−1 (s k (i)) .The size of the search window N s maybe selected based on the temporal motion expected in the video sequence.
The RNLM method has a relatively low computational complexity compared with many transform-domain patch-based algorithms such as BM3D and VBM3D.Details of the BM3D implementation are presented in [27].These transform-domain methods form a 3D cube from selected patches and perform a transform-domain shrinkage operation based on the fast Fourier transform (FFT), discrete cosine transform (DCT), or Wavelet transform.The patch selection operations are common to all of these methods, including the RNLM.However, the transform-based denoising methods have the additional requirement of computing forward and backward 3D transform on blocks of size M = M p × M p × P for each pixel, where P is the number of patches selected [27].Using row-column decomposition, such transforms are known to have a complexity of order M log(M) [27].
There is no such corresponding processing requirement for the RNLM.

Results and discussion
In order to illustrate the efficacy of the proposed RNLM algorithm, we present a number of experimental results.We compare our method to several state-of-the-art methods including SNLM [3], 3D NLM, BM3D [17], VBM3D [18], and DNLM [16].Our results make use of standard and publicly available video sequences allowing for reproducible and comparative results.These sequences present variations in scene content, lighting conditions, and scene motion.We artifically degrade the imagery with Gaussian noise and compare the restored images to the originals.The metrics we shall use are the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM).The PSNR metric in units of decibels (dB) is defined as where R is the maximum limit of the dynamic range of the image.For our 8-bit images, R = 255.The variable, MSE(k), is the mean squared error for frame k, given by where x k (i) is the true pixel value, and xk (i) is the estimated pixel.The SSIM provides an aditional metric that Fig. 2 PSNR (dB) versus frame number using several methods for the Flower Garden sequence corrupted with a noise level of σ = 40.The RNLM (BMA) method gives the best performance for this sequence some argue is more consistent with subjective perception than PSNR [28][29][30].
In Table 2, we provide PSNR results for five different image sequences, each with eight different noise standard deviations.The proposed RNLM method results are reported for two variations.The results labeled RNLM (BMA) refers to the proposed method where block matching is employed to find the best match for the recursive sample xk−1 (s k (i)).The (no BMA) version simply uses the estimate pixel position, such that s k (i) = i.This version has a reduced computational complexity.However, as shown in Table 2, using BMA gives improved results.
The results in Table 2 show that the RNLM method consistently outperforms the SNLM.Thus, the recursive processing is providing a clear performance benefit.RNLM also outperforms single-frame BM3D.VBM3D does provide higher PSNR values in most cases, but the computational complexity of that method is far greater, and the processing of the video is non-causal.On the other hand, RNLM has a low computational complexity, and the processing is fully causal, allowing for real-time processing.
Figures 2 and 3 show the restored PSNR for individual frames 1 to 150 for the sequences Flower Garden and Salesman, respectively.The noise standard deviation for these results is σ n = 40.The methods shown are VBM3D, BM3D, SNLM, RNLM (No BMA), and RNLM (BMA).The proposed RNLM (BMA) provides the best results in the Flower Garden video sequence, and provides the second best performance on the Salesman sequence.
In Fig. 4, we offer a visual comparison of denoised frame 92 from the Salesman sequence with Gaussian noise at σ n = 40.Figure 4a shows the original frame.The noisy frame is shown in Fig. 4b. Figure 4c is the denoised image  The RNLM method appears to preserve the background details better than the other methods using SNLM.This method struggles to effectively handle the high levels of noise.However, with decreased σ n , it tends to exhibit much better subjective visual performance.Figure 4d is the BM3D denoised frame.While it successfully reduces the noise, here, it produces an oversmooth result and detail such as the mouth, eyes, and texture of the original frame is lost.Figure 4e shows the VBM3D denoised frame.This result does a good job with noise reduction and some detail preservation.However, it also tends to over-smooth texture in this image, such as the plant leaves.Finally, Fig. 4f shows the output of the proposed RNLM (BMA) algorithm.We believe that it produces a visually pleasing result here, with a good balance of noise reduction and detail/texture preservation.

Conclusions
In this paper, we have presented a new temporally recursive NLM algorithm for video denoising.The output of the new RNLM method is a weighted sum of pixels from the current noisy frame, and of a selected pixel from the prior processed frame.This type of recursive processing allows us to exploit temporal correlation with little additional computational cost, compared with a SNLM.We have explored two versions of the RNLM method here, RNLM (BMA) that uses BMA for motion compensation and RNLM (No BMA) that simply uses the previous pixel at the same location to be included in the weighted sum.The results also show that RNLM (BMA) consistently outperformed the RNLM (No BMA).The computational complexity of the RNLM (BMA) method is approximately the same as 3D-NLM with just two frames.Both versions of RNLM method consistently outperform SNLM.Furthermore, our results show that RNLM (BMA) is competitive with much more computationally complex algorithms, such as BM3D and VBM3D.We believe the proposed method offers a simple and practical video denoising solution, capable of balancing noise reduction and detail preservation.Because of its low computational cost, we believe it is well suited for real-time implementation.

Fig. 3
Fig.3PSNR (dB) versus frame number using several methods for the Salesman sequence corrupted with a noise level of σ = 40.With no background motion, RNLM with and without BMA perform about the same.VBM3D gives the best performance for this sequence

Fig. 4
Fig. 4 Comparison of denoised frame 92 from the Salesman sequence with σ n = 40.a Original frame.b Corrupted frame.c SNLM. d BM3D.e VBM3D.f RNLM (BMA).The RNLM method appears to preserve the background details better than the other methods

Table 1
Variable definitions used to describe the proposed video restoration method

Table 2
PSNR comparison with competitive denoising algorithms using five different sequences and eight noise standard deviation levels

Table 3
SSIM comparison with published results for several competitive denoising algorithms using five different sequences and three noise standard deviation levels

Table 4
Additional PSNR performance comparisons with various published results.All PSNR values are reported in decibel