Sparse Signal Subspace Decomposition Based on Adaptive Over-complete Dictionary

This paper proposes a subspace decomposition method based on an over-complete dictionary in sparse representation, called"Sparse Signal Subspace Decomposition"(or 3SD) method. This method makes use of a novel criterion based on the occurrence frequency of atoms of the dictionary over the data set. This criterion, well adapted to subspace-decomposition over a dependent basis set, adequately re ects the intrinsic characteristic of regularity of the signal. The 3SD method combines variance, sparsity and component frequency criteria into an unified framework. It takes benefits from using an over-complete dictionary which preserves details and from subspace decomposition which rejects strong noise. The 3SD method is very simple with a linear retrieval operation. It does not require any prior knowledge on distributions or parameters. When applied to image denoising, it demonstrates high performances both at preserving fine details and suppressing strong noise.


INTRODUCTION
Signal subspace methods (SSM) are efficient techniques to reduce dimensionality of data and to filter out noise [1]. The fundamental idea under SSM is to project the data on a basis made of two subspaces, one mostly containing the signal and the other the noise. The two subspaces are separated by a thresholding criterion associated with some measures of information.
The two most popular methods of signal subspace decomposition are wavelet shrinkage [2] and Principal Component Analysis (PCA) [3]. Both techniques have proved to be quite efficient. However, wavelet decomposition depending on signal statistics is not equally adapted to different data, and requires some knowledge on prior distributions or parameters of signals to efficiently choose the thresholds for shrinkage. A significant advantage of the PCA is its adaptability to data. The separation criterion is based on energy which may be seen as a limitation in some cases as illustrated in the next section.
In recent years, sparse coding has attracted significant interest in the field of signal denoising [4]. A sparse representation is a decomposition of a signal on a very small set of components of an over-complete basis (called dictionary) which is adapted to the processed data. A difficult aspect for signal subspace decomposition based on such a sparse representation is to define the most appropriate criterion to identify the principal components (called atoms) from the learned dictionary to build the principal subspace. The non-orthogonal property of the dictionary does not allow to use the energy criterion for this purpose, as done with PCA.
To solve this problem, we introduce a new criterion to measure the importance of atoms and propose a SSM under the criterion of the occurrence frequency of atoms. We thus make benefit both from the richness of over-complete dictionaries which preserves details of information and from signal subspace decomposition which rejects strong noise.
The remainder of this paper is organized as follows: Section 2 presents two related works to signal decomposition. Section 3 introduce the proposed sparse signal subspace decomposition based on adaptive over-complete dictionary. Some experimental results and analysis are presented in Section 4. Finally, we draw the conclusion in section 6.

Review of PCA and Sparse Coding Methods
We start with a brief description of two well-established approaches to signal decomposition that are relevant and related to the approach proposed in the next section.

PCA based Subspace Decomposition
The basic tool of SSM is principal component analysis (PCA). PCA makes use of an orthonormal basis to capture on a small set of vectors (the signal subspace) as much energy as possible from the observed data. The other basis vectors are expected to contain noise only and the signal projection on these vectors is rejected.
Consider a data set {x m ∈ R N ×1 } M m=1 grouped in a matrix form X of size N × M : The PCA is based on singular value decomposition (SVD) with singular values σ i in descending order obtained from: where U and V are unitary matrices of size N × N and M × M respectively are positive real known as the singular values of X with rank r (r ≤ N ).
Equation (1) can be re-written in a vector form as: where U = {u n ∈ R N ×1 } N n=1 and A = {α m ∈ R N ×1 } M m=1 . Equation (2) means that the data set {x m } M m=1 is expressed on the orthonormal basis {u n } N n=1 as {α m } M m=1 . In the SVD decomposition given in equation (1), the standard deviation σ i is used as the measurement for identifying the meaningful basis vector u i . PCA takes the first P (P < r) components {u n } P n=1 to span the signal subspace, and the remainders {u n } r n=P +1 are considered in a noise subspace orthogonal to the signal subspace. Therefore, projection on the signal subspace will hopefully filter out noise and reveal hidden structures. The reconstructed signalŜ P CA of size N × M is obtained by projecting in the signal subspace aŝ The underlying assumption is that information in the data set is almost completely contained in a small linear subspace of the overall space of possible data vectors, whereas additive noise is typically distributed through the larger space isotropically. PCA, using the standard deviation as a criterion, implies that the components of the signal of interest in the data set have a maximum variance and the other components are mainly due to the noise. However, in many practical cases, some components with low variances might actually be important because they carry information relative to the signal details. On the contrary, when dealing with noise with non-Gaussian statistics, it may happen that some noise components may actually have higher variances. At last, note that it is often difficult to provide a physical meaning to the orthonormal basis {u i } r i=1 of the SVD decomposition (equation 2) although they have a very clear definition in the mathematical sense as orthogonal, independent and normal. It is therefore difficult to impose known constraints on the signal features when they exist after the principal component decomposition.

Sparse Decomposition
Recent years have shown a growing interest in research on sparse decomposition of M observations {x m ∈ R N } M m=1 based on a dictionary D = {d k } K k=1 ∈ R N ×K . When K > N , the dictionary is said over-complete. d k ∈ R N is a basis vector, also called atom since they are not necessarily independent. By learning from data set {x m } M m=1 , the sparse decomposition is the solution of equation (4) [4]: where where the matrix A of size K × M is composed of M sparse column vectors α m . The first term on the right side of equation (4) is a sparsity-inducing regularization that constrains the solution with the fewest number of nonzero coefficients in each of sparse code vectors α m (1 ≤ m ≤ M ). The underlying assumption is that a meaningful signal could be represented by combining few atoms. This learned dictionary adapted to sparse signal descriptions has proved to be more effective in signal reconstruction and classification tasks than PCA method, which is demonstrated in the next section. The second term in equation (4) is the residual of the reconstruction, based on the mean-square reconstruction error estimate in the same way as in PCA method.
On the other hand, we note that the dictionary D, a basis in sparse decomposition, is produced by learning noisy data set {x m } M m=1 , so the basis vectors {d k } K k=1 should be decomposed into a principal subspace and a residual subspace. However, it is impossible to exploit an energy-constrained subspace since {d k } K k=1 are not necessarily orthogonal or independent.

The Proposed Sarse Signal Subspace Decomposition (3SD)
In this section, we introduce a novel criterion to the subspace decomposition over a learned dictionary and a corresponding index of significance of the atoms. Then we propose a signal sparse subspace decomposition (3SD) method under this new criterion.

Weight Vectors of Learned Atoms
At first, we intend to find out the weight of the atoms. In the sparse representation given in (5), coefficient matrix A is composed by M sparse column vectors α m , each α m representing the weight of the observation x m , a local parameter for the m th observation. Let us consider the row vectors {β k } K k=1 of coefficient matrix A : Note that the row vector β k is not necessarily sparse. Then equation (5) can be rewritten as: Equation (7) means that the row vector β k is the weight of the atom d k , which is a global parameter over the data set X. Denoting β k 0 the 0 zero pseudo-norm of β k . β k 0 is the number of occurrences of atom d k over the data set {x m } M m=1 . We call it the frequency of the atom d k denoted by f k : In the sparse decomposition, basis vectors {d k } K k=1 are prototypes of signal segments. That allows us to take them as a signal patterns. Thereupon, some important features of this signal pattern could be considered as a criterion to identify significant atoms. It is demonstrated [5] that f k is a good description of the signal texture. Intuitively, a signal pattern must occur in meaningful signals with higher frequency even with a lower energy. On the contrary, a noise pattern would hardly be reproduced in observed data even with a higher energy.
It is reasonable to take this frequency f k as a relevance criterion to decompose the over-complete dictionary into a principal signal subspace and a remained noise subspace. Here, we use the word "subspace", but in fact these two subspaces are not necessary independent.

Subspace Decomposition Based on Overcomplete Dictionary
Taking vectors {β k } K k=1 , we calculate their 0 -norms { β k 0 } K k=1 and rank them in descending order as follows. The index k of vectors {β k } K k=1 are belonging to the set C = {1, 2, · · · , k, · · · , K}. A one-to-one index mapping function π is defined as: By the permutation π of the row index k of matrix A = β T 1 · · · β T k · · · β T K T , the reordered coefficient matrixÃ becomes With corresponding reordered dictionaryD = {d π(k) } K k=1 , equation (7) can be written as: Then, the span of the first P atoms can be taken as a principal subspace D An estimateŜ P of the underlying signal S embedded in the observed data set X can be obtained on the principal subspace D = d π(1) · · · d π(k) · · · d π(P ) . β T π(1) · · · β T π(k) · · · β T π(P )

Threshold of Atom's Frequency
Determining the number P of atoms spanning the signal subspace D (S) P is always a hard topic especially for wide-band signals. Here, P is the threshold of atom's frequency f k to distinguish a signal subspace and a noise subspace. One of the advantages of 3SD is that this threshold P can be easily chosen without any prior parameter.
For a noiseless signal even with some week details, such as the image example in Fig. 1(a), the atoms' frequencies f image π(k) s shown in Fig. 1(d) (in black line) are almost always high except the zero value. For a signal with strong noise, such as the example in Fig. 1(b), the atoms' frequencies f noise π(k) s shown in Fig. 1(d) (in red line) are almost always equal to 1 without zero and very few with value 2 or 3. It is easy to set a threshold P of f k (dotted line in the Fig. 1(d)) to separate signal's atoms from noise's atoms. By contrast, the index numbers β k 2 s under energy criterion shown in Fig. 1(c) for this example are rather puzzle to identify principal bases. For a noisy signal, such as an image example in Fig. 2(a), its adaptive overcomplete dictionary (Fig. 2(b)) consists of atoms of noiseless signal patterns, pure noise patterns and noisy signal patterns. Signal atoms should have higher frequencies, noise ones lower and noisy ones moderate. Intuitively, the red line (Fig. 2(c)) should be a suitable threshold P of the frequencies f k s. In practical implementation, the value of P could be simply decided relying to the histogram of f k . As shown in Fig. 2(d), one can set the value of f k associated to the maximum point of its histogram to P as follows: In fact, the performances in signal analyses by 3SD method are not sensitive to the threshold P , owed to the dependence of the atoms. To demonstrate this point, we take 3 images, Barbara, Lena and Boat. Their histograms of f k are shown in Fig. 3(a) with the maximum points in dotted lines, 121, 97 and 92 respectively. Fig.  3(b) reports the peak signal-to-noise ratio (PSNR) of the retrieved imagesŜ P on the signal principal subspace D (S) P with respect of P . We can see that PSNRs of the results remain the same in a large range around the maximum points (in dotted lines). Consequently, taking the value of f k associated to the maximum point of its histogram as the threshold P is a reasonable solution.

Signal Decomposition Methods
Taking a part of the noisy Barbara image (Fig. 4(a)), we show an example of the sparse signal subspace decomposition (3SD) and the corresponding retrieved image (Fig. 4(b)). For comparison, the traditional sparse decomposition and the PCA based subspace decomposition are shown in Fig. 4(c) and Fig. 4(d). Let us look at the proposed sparse signal subspace decomposition on the top of Fig. 4(b) The 128 atoms d k s of the learned overcomplete dictionary D are shown in descending order of their energies measured by β k 2 . The 32 principal signal atoms are chosen from the dictionary D under the frequency criterion. They are shown in descending order of their frequencies measured by β k 0 composing a signal subspace D (S) 32 . We can see that some of the principal atoms are not among the first 32 atoms with the largest energy in the overcomplete dictionary D. The retrieved images are shown on the bottom of Fig. 4(b). The image S on D is apparently denoised. The imageŜ on the signal subspace D (S) 32 improves obviously by preserving fine details and at suppressing strong noise. On the other hand, the residual image on noise subspace D (N ) 96 contains some very noisy information. This is because the atoms of the overcomplete dictionary are not independent.
For the same example, the classical sparse decomposition is shown in Fig. 4(c). Using K-SVD algorithm [6] in which the allowed error tolerance ε (in equation (4)) is set to a smaller value to filter out noise. The retrieved image S seems to have a high Signal to Noise Ratio (SNR), but have lost the weak information. This is because signal distortion and residual noise cannot be minimized simultaneously at dictionary learning by equation (4).
In another comparison, the PCA based subspace decomposition is shown in Fig.  4(d). The 64 components are orthonormal and the 32 principal components are of the largest variance. The retrieved image by projecting on the signal subspace is rather noisy. This is because it cannot suppress strong noise and preserve weak details of information only using the variance criterion.

Application to Image Denoising
The application of 3SD to image denoising is presented here. A major difficulty of denoising is to separate underlying signal from noise. The proposed 3SD method could win this challenge. In 3SD method, the important components are selected from the over-complete dictionary relying to their occurrence number over the noisy image set. Evidently, the occurrence numbers would be large for signal, even for weak details, such as edges or textures and so on. On the other hand, the occurrence numbers would be low for different kinds of white Gaussian or non-Gaussian noises, even strong at intensity.
The 3SD algorithm for image denoising is presented as follows: Input: Noisy image X Output: Denoised imageŜ -Sparse representation {D, A}: using K-SVD algorithm [6] by (4) -Identify principal atoms from D based on A : Compute the frequencies of atoms { β k 0 } K k=1 according to (6) and (8) Get the permutation π sorting the index k of { β k 0 } K k=1 by (9) Compute the threshold P by (14) -Obtain the signal principal atoms {d π(k) } P k=1 by (12) -Reconstruct imageŜ P by (13) In this application, we intend to preserve faint signal details under a situation of strong noise. We use the peak signal-to-noise ratio (PSNR) to assess the noise removal performance: and the structural similarity index metric (SSIM) between the denoised image and the pure one to evaluate the preserving details performance: where u x is the average of x, σ 2 x is the variance of x, σ xy is the covariance of x and y, and c 1 and c 2 are small variables to stabilize the division with weak denominator.
In the experiments, dictionaries used Ds are of size 64 × 256 (K = 256 atoms), designed to handle image patches x m of size N = 64 = 8 × 8 pixels.

Image Denoising
A noisy Lena image X = S + V with an additive zero-mean white Gaussian noise V is used. The standard deviation of noise is σ = 35. A comparison is made with 3SD method and K-SVD method [6] which is one of the best denoising methods reported in the recent literatures. From the results shown in Fig. 5, the 3SD method outperforms the K-SVD method by about 1dB in PSNR and by about 1% in SSIM (depending on how much details in images and how faint the details). In terms of subjective visual quality, we can see that the corner of mouth and the nasolabial fold with weak intensities are much better recovered by 3SD method.

SAR Image Despeckling
In the second experiment, a simulated SAR image with speckle noise is used. Speckle is often modeled as multiplicative noise as x(i, j) = s(i, j)v(i, j) where x, s and v correspond to the contaminated intensity, the original intensity, and the noise level, respectively. Fig. 6 shows the despeckling results of simulated one-look SAR scenario with a fragment of Barbara image. A comparison is made with 3SD method and a probabilistic patch based (PPB) filter based on nonlocal means approach [7] which can cope with non-Gaussian noise. We can see that PPB can well remove speckle noise. However, it also removes the low-intensity details. The 3SD method shows advantages at preserving fine details and at suppressing strong noise.

Conclusion
We proposed a method of sparse signal subspace decomposition (3SD). The central idea of the proposed 3SD is to identify principal atoms from an adaptive overcomplete dictionary relying the occurrence frequency of atoms over the data set (equation (8)). The atoms frequency is measured by zero pseudo-norms of weight vectors of atoms (equation (6) and (8)). The principal subspace is spanned by the maximum frequency atoms (equation (12)).
The 3SD method combines the variance criterion, the sparsity criterion and the component's frequency criterion into a uniform framework. As a result, it can identify more effectively the principal atoms with the three important signal features. On the contrary, PCA uses only variance criterion and sparse coding method uses the variance and the sparsity criterions. In those ways, it is more difficult to distinguish weak information from strong noise.
Another interesting assert of 3SD method is that it takes benefits from using an over-complete dictionary which reserves details of information and from subspace decomposition which rejects strong noise. On the contrary, some undercomplete dictionary methods [8] and some sparse shrinkage methods [9,10] might lose week information when suppressing noise.
Moreover, the 3SD method is very simple with a linear retrieval operation (equation (13)). It does not require any prior knowledge on distribution or parameter to determine a threshold (equation (14)). On the contrary, some sparse shrinkage methods, such as [9], necessitate non-linear processing with some prior distributions of signals.
The proposed 3SD could be interpreted as a PCA in sparse decomposition, so it admits straightforward extension to applications of feature extraction, inverse problems, or machine learning.