We consider two scenarios: (I) single pixel errors appearing with high density due to bad reception, and (II) error blocks due to channel outages, where significant parts of a frame are missing.
Scenario I: spatial inpainting
The drawback of PPM modulation is that it may introduce large decoding errors, known as anomalous errors, if the channel deteriorates from the optimal operating point (any two symbols may then be exchanged with equal likelihood) [19, p. 627]. However, in images, anomalous errors will resemble salt and pepper noise, which can be efficiently reduced by known spatial inpainting methods utilizing intra-frame correlation.
Uncompressed frames
We apply median filtering (MF) [20], which can cope with a salt and pepper noise density of up to 50% [21, p. 200]. A median filter runs over the entire image, replacing each pixel with the median of the pixel values in a certain neighborhood of that pixel [20]. Here we consider a square (n×n) neighborhood of pixels for the median computation.
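For illustration, a minimal sketch of this filtering step using SciPy (the window size and array handling are assumptions; this is not the implementation evaluated in the paper):

```python
import numpy as np
from scipy.ndimage import median_filter

def conceal_salt_and_pepper(frame: np.ndarray, n: int = 5) -> np.ndarray:
    """Replace each pixel with the median of its n x n neighborhood.

    frame : 2-D array holding one (luminance) channel of the frame.
    n     : side length of the square neighborhood (e.g., 4 or 5).
    """
    return median_filter(frame, size=(n, n))

# Example usage on a corrupted 8-bit grayscale frame:
# restored = conceal_salt_and_pepper(noisy_frame, n=5)
```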
Figure 2 shows the original image, the same image corrupted by errors from the relevant simulation model, and the median filtered versions using 4×4 and 5×5 pixel blocks. The reconstruction is quite good, which is also confirmed by the SSIM values: 0.93 for 4×4 blocks and 0.96 for 5×5 blocks. These values are in line with the average SSIM provided in Section 4.2.1.
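The SSIM values quoted throughout can be reproduced with any standard implementation; a small sketch using scikit-image (an assumed toolbox choice, with a toy usage example):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def frame_ssim(reference: np.ndarray, restored: np.ndarray) -> float:
    """Mean SSIM between a reference frame and its restored version."""
    return ssim(reference, restored, data_range=255)

# Toy usage with synthetic 8-bit frames (replace with real frames):
ref = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
rec = ref.copy()
print(frame_ssim(ref, rec))  # identical frames give SSIM = 1.0
```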
Compressed frames
For compressed frames, the salt and pepper-like noise is added in the compression domain; the corresponding distortions appear once the image is decompressed.
Figure 4a shows the reconstruction of a DPCM compressed version of Fig. 2a. The algorithm [4] introduces certain artifacts into the frames. Since about 96% of the original data has been removed, the algorithm performs quite well (note also that it was designed for HD colonoscopy images). However, a great deal of these artifacts likely appear because the available video streams for WCE are already compressed (blocking artifacts are observed when the image is magnified, and these coincide with the locations of the DPCM-related artifacts). This is reflected in the SSIM between the original and the compressed frame, which is around 0.8–0.85. In comparison, the same algorithm provides an SSIM of about 0.9–0.95 for HD colonoscopy images. This implies that the compression algorithm does not have a good representative of the original, uncompressed WCE image to work on. A fair subjective test is therefore hard to obtain in this case.
Since the DPCM decoder is a recursive filter [4], errors will have “tails” in each image dimension, resulting in the corner-like artifacts shown in Fig. 4b. As shown in [11], when the density of errors is low, they can be fully concealed by first applying a corner detector, such as the Harris detector [22], to the decompressed image, and then going back to the compression domain and inserting one of the neighboring pixels at the corresponding (corrupted) pixel location. With numerous errors, as in Fig. 4b, this method mostly fails, as seen in Fig. 4c. A median filter will also fail, as it smooths the compressed image, leading to severe decompression errors.
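For low error densities, the concealment step of [11] can be sketched roughly as follows with OpenCV's Harris detector; the threshold, the left-neighbor replacement rule, and the assumption that the compression-domain array shares the pixel grid of the decoded frame are ours, for illustration only:

```python
import cv2
import numpy as np

def detect_error_corners(decoded: np.ndarray, thresh: float = 0.05) -> np.ndarray:
    """Return a boolean mask of likely error-induced corners in the decoded frame."""
    response = cv2.cornerHarris(np.float32(decoded), blockSize=2, ksize=3, k=0.04)
    return response > thresh * response.max()

def conceal_in_compression_domain(compressed: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Replace flagged samples in the compression domain by a neighboring pixel."""
    out = compressed.copy()
    rows, cols = np.nonzero(mask)
    out[rows, cols] = compressed[rows, np.maximum(cols - 1, 0)]
    return out
```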
A way to cope with a high density of errors is through total variation (TV) inpainting [23] in the compression domain, as the noise there is close to salt and pepper noise. Figure 3 depicts our approach to spatial inpainting of compressed frames. As suggested in [21, pp. 201–202], TV inpainting can reduce such errors without smoothing other parts of the image as follows: with Ωc, the compressed image domain, and Dc the inpainting domain (the set of noisy pixels given in (2)), let v0 denote the compressed noisy image on Ωc. We seek the image v on Ωc that is the minimizer of [23]
$$ E[v|v_{0},D_{c}] = \int_{\Omega_{c}} |\nabla v | \mathrm{d} \mathbf{x} + \frac{\lambda}{2} \int_{\Omega_{c} \setminus D_{c}} |v - v_{0} |^{2} \mathrm{d} \mathbf{x}, $$
(1)
where λ controls the degree of noise reduction in v0 outside the inpainting domain Dc, which is given by
$$ D_{c}=\{\mathbf{x} | v_{0}(\mathbf{x}) \geq C_{1} \vee v_{0}(\mathbf{x}) \leq C_{2}\}. $$
(2)
For salt and pepper noise, C1= max(v0) and C2= min(v0). Since the noise resulting from PPM modulation is not exactly salt and pepper noise, we set C1= max(v0)−ε1 and C2= min(v0)+ε2, where ε1 and ε2 are determined for a relevant set of images. ε1 and ε2 cannot be chosen large enough for all noisy pixels to be contained within Dc without introducing blur in the compressed frame. This will be most problematic in very light or dark areas of the image. A “blob detection” algorithm (like “difference of Gaussians”) [24] can be applied to detect which sets of pixels have the lightest and darkest values, and ε1 and ε2 can then be adjusted accordingly. As the output from the DPCM coder has a Laplace-like distribution, this method works quite well, as we will see in Section 4.2.2 where more examples are provided. Since errors residing outside Dc are small and any blur introduced in the compressed frame leads to a bad reconstruction, a large λ should be chosen in Eq. (1). One may obtain further quality enhancement through the algorithm in [11] described above after TV inpainting, that is, by corner detection in the reconstructed image followed by pixel adjustment in the compression domain.
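A minimal numerical sketch of minimizing Eq. (1), with Dc built as in Eq. (2), via explicit gradient descent (the discretization, step size, and iteration count are assumptions; this is not necessarily the solver used for the results below):

```python
import numpy as np

def tv_inpaint(v0, eps1, eps2, lam=10.0, n_iter=500, dt=0.1, tiny=1e-6):
    """Gradient-descent minimization of the TV inpainting functional in Eq. (1).

    v0          : compressed noisy frame (2-D array).
    eps1, eps2  : thresholds defining the inpainting domain Dc, cf. Eq. (2).
    lam         : fidelity weight outside Dc (chosen large, per the text).
    """
    v0 = np.asarray(v0, dtype=float)
    Dc = (v0 >= v0.max() - eps1) | (v0 <= v0.min() + eps2)   # Eq. (2)
    v = v0.copy()
    for _ in range(n_iter):
        # Forward differences approximate the gradient of v.
        vx = np.roll(v, -1, axis=1) - v
        vy = np.roll(v, -1, axis=0) - v
        mag = np.sqrt(vx ** 2 + vy ** 2) + tiny
        # Curvature term: divergence of the normalized gradient.
        px, py = vx / mag, vy / mag
        curv = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        # Data fidelity acts only outside the inpainting domain Dc.
        fidelity = lam * (~Dc) * (v - v0)
        v += dt * (curv - fidelity)
    return v
```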
The result is shown in Fig. 4d. Although most of the prominent corners are removed and coarse details in the image are enhanced, there are still some false artifacts present due to smaller errors residing outside Dc in Eq. (2). These false artifacts are likely the reason why the SSIM is not larger than 0.87. We will provide a more thorough analysis of SSIM for this inpainting method in Section 4.2.2.
Scenario II: temporal inpainting
The method proposed here is the same for compressed and uncompressed frames. We consider uncompressed frames.
If significant parts of a frame are missing, then large inpainting errors are unavoidable with spatial inpainting, since the inpainting domain becomes too wide [25]. We utilize interframe correlation in a temporal inpainting strategy to cope with this situation: if neighboring frames are close enough content-wise, then missing regions can be inserted from one of them. The advantage of this approach is that possibly malign tissue that becomes invisible due to an error block will become visible in the corrected frame, as information is inserted from a neighboring frame. That is, information about malign tissue is not lost, and no false artifacts should be introduced.
The proposed scheme is depicted in Fig. 5. First, corrupted parts of a frame are detected using the Harris detector. Due to capsule movement, the same features will seldom be located at the same coordinates and perspective on the screen in different frames. To align the two images so that their common features are located at the same set of coordinates, one can use a homography transform \(\mathcal {H}\). That is, pixel coordinates of (past or future) frames In+1 or In−1, denoted x, are warped onto the coordinates of image In as \(\tilde {\mathbf {x}} = \mathcal {H} \mathbf {x} \). Past frames can often cover the whole inpainting region at the cost of some blur, as the WCE often moves closer to the background scene as it progresses through the digestive system. Future frames may not cover the whole inpainting region, but can be made as sharp as the original frame. We provide examples using past frames in the following.
\(\mathcal {H}\) has to be estimated from the relevant frames. There are two main ways to do this: (I) a direct (pixel-based) method, described in [26], and (II) estimating common features using the scale-invariant feature transform (SIFT) algorithm [27], then selecting the best matches (inliers) and finding the best fit to \(\mathcal {H}\) using the random sample consensus (RANSAC) algorithm [28]. Method (I) is likely the least complex. However, we use method (II) here since it can determine an accurate \(\mathcal {H}\) even from small overlapping regions of two images [26, pp. 15–33]. This implies that \(\mathcal {H}\) can be found even when large parts of a frame are missing due to outage.
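A compact sketch of method (II) using OpenCV (the paper uses the VLFeat MATLAB implementation, see below; the matcher, ratio test, and RANSAC threshold here are assumptions):

```python
import cv2
import numpy as np

def estimate_homography(frame_ref: np.ndarray, frame_nb: np.ndarray) -> np.ndarray:
    """Estimate H mapping coordinates of a neighboring frame onto the reference frame."""
    sift = cv2.SIFT_create()
    kp_r, des_r = sift.detectAndCompute(frame_ref, None)
    kp_n, des_n = sift.detectAndCompute(frame_nb, None)

    # Ratio-test matching (Lowe's criterion) keeps only distinctive matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des_n, des_r, k=2)
            if m.distance < 0.75 * n.distance]

    src = np.float32([kp_n[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC discards outliers and fits H to the inliers.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

def fill_from_neighbor(frame_ref, frame_nb, mask, H):
    """Warp the neighboring frame onto the reference grid and copy pixels inside the mask."""
    warped = cv2.warpPerspective(frame_nb, H, frame_ref.shape[1::-1])
    out = frame_ref.copy()
    out[mask] = warped[mask]
    return out
```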
We applied the MATLAB implementation of SIFT, as well as other supporting functions, from the VLFeat library [29] to perform the computations. Since certain artifacts due to compression and noise may be mistaken for features, it is important to make the SIFT algorithm favor larger features. Therefore, we set a large “WindowSize” (the variance of the Gaussian window), namely 4 units of spatial bins [29] (other parameters were kept at their defaults). Good matches were then found, as illustrated in Fig. 6.
Due to luminance differences between the original image and the inpainted part, edges may appear (see Fig. 7c). These can be removed through Poisson editing [30]: with Ω, the image domain, and D the inpainting domain with boundary ∂D, let u0 denote the available image information on Ω−D and \(\vec {v}\) be some “guiding” vector field on D. We seek the image u on D that is the minimizer of [30]
$$ \min_{u}\int_{D} |\nabla u - \vec{v}|^{2}\mathrm{d} \mathbf{x}, \ \ u_{|\partial D}=u_{0|\partial D}. $$
(3)
The last condition ensures continuity over the boundary of D. Now let \(f_{D} = \{I_{n-1}(\tilde {\mathbf {x}}) | \tilde {\mathbf {x}} = \mathcal {H} \mathbf {x} \in D \}\), i.e., the part inside D which is mapped from the neighboring image. Then, we can set \(\vec {v}=\nabla f_{D}\).
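With \(\vec {v}=\nabla f_{D}\), Eq. (3) is the seamless-cloning formulation of [30], for which OpenCV provides a solver; a minimal sketch (the mask handling and the choice of center point are assumptions):

```python
import cv2
import numpy as np

def poisson_blend(frame_ref, warped_nb, mask):
    """Insert the warped neighbor into the masked region with gradient-domain blending.

    frame_ref : reference frame containing the error block (uint8, 3 channels).
    warped_nb : neighboring frame already warped onto the reference grid.
    mask      : uint8 mask, 255 inside the inpainting domain D, 0 elsewhere.
    """
    ys, xs = np.nonzero(mask)
    center = (int(xs.mean()), int(ys.mean()))  # center of D, required by OpenCV
    return cv2.seamlessClone(warped_nb, frame_ref, mask, center, cv2.NORMAL_CLONE)
```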
Figure 7 shows the original image, an image with large error blocks in the Y, U, and V channels, as well as the reconstructed image. We have estimated the homography from the Y channel as depicted in Fig. 6. The SSIM for the noisy image is around 0.55, whereas the corrected image has an SSIM of 0.93. In comparison, applying (spatial) TV inpainting within the same noisy frame yields an SSIM of 0.83. These values are in line with the average SSIM presented in Section 4.2.
One may also use the chrominance channels, U or V (from the RGB to YUV transform in Fig. 1), to estimate \(\mathcal {H}\) if the frame from the luminance channel Y is destroyed. This yields additional noise protection. However, since the energy in U and V is significantly lower, the accuracy of \(\mathcal {H}\) may be lower than that obtained with the Y channel.
It is important to note that \(\mathcal {H}\) can only compensate for the WCE’s movement, or rigid motion in general. When there are movements in the background due to muscle contractions, etc., there will be distortions in the reconstructed frame. One may use optical flow [31] computed from neighboring frames to compensate for such motions, or techniques developed for so-called non-rigid structure from motion [32]. Still, it will be hard to obtain stable transforms between images if the correlation (i.e., similarity of image content) is too low, which will be the case when the WCE undergoes rapid movements. However, future WCEs will likely have higher frame rates, making the above algorithm perform better in general.
Simultaneous occurrence of single pixel errors and error blocks
Single pixel errors and error blocks may both occur in the same image. There are two approaches to this problem: (i) deal with single pixel errors first and (ii) remove error blocks first.
Experiments clearly showed that approach (i) is the only functioning option: although SIFT followed by RANSAC is very robust to noise in the images (noisy features are singled out as outliers by RANSAC), problems arise when deciding which area of the image should replace the error blocks, since the corner/line detector becomes confused by the salt and pepper-like characteristics of the single pixel errors.
The result of approach (i) is shown in Fig. 8. One can observe that the combined algorithm is capable of coping with both scenarios simultaneously. The SSIM is about the same as for block errors in isolation, treated in the previous section. This implies that our approach is quite robust.
For compressed frames, one would first remove all corners in the image using the method in Fig. 3, and then remove block errors in the decompressed image. In this way, one avoids the DPCM decoder introducing a new set of false artifacts due to the slight mismatch between the original and the temporally inpainted image.
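A sketch of the resulting order of operations for approach (i) on compressed frames, reusing the helper functions sketched earlier; dpcm_decode is a hypothetical stand-in for the DPCM decoder of [4]:

```python
def conceal_combined(compressed, neighbor, eps1, eps2, block_mask):
    """Approach (i): remove single pixel errors first, then error blocks.

    compressed : received frame in the compression domain.
    neighbor   : a decoded neighboring frame used for temporal inpainting.
    block_mask : boolean mask marking the error blocks (outage regions).
    """
    # 1. Spatial inpainting in the compression domain (Fig. 3).
    cleaned = tv_inpaint(compressed, eps1, eps2)
    decoded = dpcm_decode(cleaned)  # hypothetical decoder, cf. [4]

    # 2. Temporal inpainting of the remaining error blocks (Fig. 5).
    H = estimate_homography(decoded, neighbor)
    filled = fill_from_neighbor(decoded, neighbor, block_mask, H)

    # Optionally, apply poisson_blend (Eq. (3)) to hide luminance seams.
    return filled
```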