Side-Information Generation for Temporally and Spatially Scalable Wyner-Ziv Codecs
© Bruno Macchiavello et al. 2009
Received: 1 May 2008
Accepted: 15 January 2009
Published: 25 March 2009
The distributed video coding paradigm enables video codecs to operate with reversed complexity, in which the complexity is shifted from the encoder toward the decoder. Its performance is heavily dependent on the quality of the side information generated by motion estimation at the decoder. We compare the rate-distortion performance of different side-information estimators for both temporally and spatially scalable Wyner-Ziv codecs. For the temporally scalable codec, we compare an established method with a new algorithm that uses a linear-motion model to produce side information. For the spatially scalable codec, as a continuation of previous work, we propose a super-resolution method that upsamples the nonkey frames using the key frames as reference. We verify the performance of the spatially scalable WZ coding using the state-of-the-art video coding standard H.264/AVC.
1. Introduction
The paradigm of distributed source coding (DSC) is based on two information theory results: the theorems by Slepian and Wolf  and Wyner and Ziv  for lossless and lossy codings of correlated sources, respectively. It has recently become the focus of different video coding schemes [3–12]. A review on DSC applied to video coding, that is, distributed video coding (DVC) can be found elsewhere . Even though it is believed that a DVC algorithm will never outperform conventional video schemes in rate-distortion performance , DVC is a promising tool in creating reversed complexity codecs for power-constrained devices. Currently, digital video standards are based on predictive interframe coding and discrete cosine block transform. In those, the encoder typically has high complexity  mainly due to the need for mode search and motion estimation in finding the best predictor. Nevertheless, the decoder complexity is low. On the other hand, DVC enables reversed complexity codecs, where the decoder is more complex than the encoder. This scheme fits the scenario where real-time encoding is required in a limited-power environment, such as mobile hand-held devices.
A common DVC architecture is a transform domain Wyner-Ziv codec [4, 13], where periodic key frames are encoded with a conventional intraframe encoder, while the remaining frames, called Wyner-Ziv (WZ) frames, are encoded with a channel coding technique after applying the discrete cosine transform and quantization. This codec can be seen as DVC with temporal scalability because the WZ-encoded frames can represent a temporal enhancement layer. At the decoder, the key frames are used to generate a prediction of the current WZ frame, called side information (SI), which is fed to the channel decoder. The SI for the current WZ frame can be generated using motion analysis on neighboring key frames and previously decoded WZ frames, thus exploiting temporal correlations. As in much of the prior work [4, 11, 15], this architecture uses a feedback channel to implement a Slepian-Wolf codec.
A different approach is a mixed resolution framework that can be implemented as an optional coding mode in any existing video codec standard, as proposed in previous works [16–18]. In that framework, the encoding complexity is reduced by lower resolution encoding, while the residue is WZ encoded. That spatially scalable framework does not use a feedback channel and considers more realistic usage scenarios for video communication using mobile power-constrained devices. First, it is not necessary for the video encoder to always operate in a reversed complexity mode. Thus, this mode may be turned on only when available battery power drops. Second, while complexity reduction is important, it should not be achieved at a substantial cost in bandwidth. Hence, the complexity reduction target may be reduced in the interest of a better rate-distortion trade-off. Third, since the video communicated from one mobile device may be received and played back in real time on another mobile device, the decoder in a mobile device must support a mode of operation where at least a low-quality version of the reversed complexity bit stream can be decoded and played back immediately, with low complexity. Off-line processing may be carried out for retrieving the higher quality version. In the temporal scalability approach, the only way to achieve this is to drop the WZ frames, resulting in unnecessarily low frame rates.
It is well known that the performance of these or any other WZ codecs is heavily dependent on the quality of the SI generated at the decoder. In this work, we compare the performance and complexity reduction of different SI estimators. For a temporally scalable codec, we introduce a new SI generator that models the motion between two key frames in order to predict the motion between a key frame and a WZ frame. We compare our results with a common SI estimator for a WZ codec with temporal scalability [19, 20]. That estimator tries to model the motion vectors of the current WZ frame using the next and previous decoded key frames.
A more accurate SI generator for a DVC codec with temporal scalability was presented elsewhere [21, 22]. There, the SI generator uses forward and bidirectional motion estimation, motion vector refinement, and spatial motion smoothing techniques, and it adapts the motion vectors to fit the frame grid. Compared to that technique [21, 22], the SI generation proposed in this paper is less complex and does not modify the reference block or the motion vector. It is, however, less efficient, being outperformed by the more complex algorithm [21, 22]. Tools such as spatial motion smoothing and motion vector refinement, as described elsewhere [21, 22], can be used along with the proposed technique in order to increase the overall performance at the cost of more complex decoding. For the mixed resolution framework, we improve SI generation, as a continuation of previous work . This method is based on superresolution using key frames . The main idea is to restore the high-frequency information of an interpolated block of the low-resolution encoded frame. This SI generation can be done iteratively, using the SI generated in a previous iteration to improve the quality of the current frame being generated. Other works have used iterative SI generation techniques [24–26]. All of them assume that key frames are intracoded and that the intermediate frames are entirely WZ coded. In , previously decoded bit planes are used to improve SI. In , a motion-based algorithm is presented; however, the SI is generated by aggressively replacing low-resolution (LR) blocks by blocks from the key frames.
Here, the rate-distortion (RD) performance of the proposed SI generation methods is presented along with a coding time comparison. We also present the RD performance of the spatially scalable coder and compare it to conventional coding. This coder is based on previous studies of optimal coding parameter selection  and correlation statistics estimation . The results of the temporally scalable coder in the transform domain are known: it normally outperforms simple intracoding but underperforms zero-motion-vector coding, depending on the sequence . All tests were implemented using the state-of-the-art standard H.264/AVC as the conventional codec.
The paper is organized as follows. The WZ architectures are described in Section 2. In Section 3, the different schemes for generation of the side information are detailed, and in Section 4 simulation results are presented. Finally, Section 5 contains the conclusions of this work.
2. Wyner-Ziv Coding Architectures
In order to compare SI generation methods, we consider two different Wyner-Ziv coding architectures: a transform domain Wyner-Ziv (TDWZ) codec and a spatially scalable Wyner-Ziv (SSWZ) codec.
2.1. Transform Domain Wyner-Ziv Codec
Complexity reduction is initially obtained with temporal downsampling, since only the key frames are conventionally encoded. If the key frames are encoded as I-frames, a more significant complexity reduction can be achieved, since there is no motion estimation at the encoder side. Note that if the key frames are selected as the reference frames and the WZ frames are the nonreference frames, then the key frames can be coded as conventional I-, P-, or reference B-frames without drifting errors. This improves the RD performance, but it also increases the encoder complexity, since motion estimation may then be used for the key frames as well.
At the decoder, the SI generator uses stored key frames in order to create its best estimate for the missing WZ frames. Motion estimation and temporal interpolation techniques are typically used. Typically, the previous and next key frames of the current WZ frame are used for SI generation, although some works use two previously decoded frames . This SI is used for channel decoding and frame reconstruction in the decoding process of the WZ frame. A better SI means fewer errors, thus requesting fewer bits from the encoder. Therefore, the bit rate may be reduced for the same quality. Hence, a more accurate SI can potentially yield a better performance of the TDWZ codec.
2.2. Spatially Scalable Wyner-Ziv Codec
The mixed resolution framework [16–18] used by the SSWZ codec can be implemented as an optional coding mode in any existing video codec standard (results using H.263+ can also be found in previous works [16–18, 28]).
3. Side-Information Generation
The second SI generation method proposed in this work creates a superresolved version of an LR frame for an SSWZ codec. This new method outperforms previous works .
3.1. Motion-Modeling Side-Information Estimator
it is uniquely defined by a single motion vector;
it is defined by more than one motion vector (an overlapping occurred);
it is not defined by any motion vector (it is left blank).
In order to perform motion compensation, we need to assign a motion vector or filling process for every pixel. The first case is trivial. For the second case, when more than one option for a pixel exists, a simple average might solve the problem. The last case is more challenging, since no motion vector points to a pixel. One could use the colocated pixel in the previous frame. However, it may not be very efficient since it might be that the motion vector of that block is not zero.
Note that, in the proposed method, the reference block found by the motion estimation process is kept and translated to the SI frame by a motion vector that is half the original motion vector. In SE-B, the reference block is changed while the motion vector is kept. In another technique , the reference block is also kept, but, in order to prevent uncovered and overlapping areas, the motion vectors are changed to point to the middle of the current block in the SI frame. In the proposed method, however, both the motion vector and the reference block are kept. Also, the proposed algorithm focuses on improving the motion estimation based on the key frames. This technique can be used along with spatial motion smoothing and motion vector refinement [21, 22].
In the unlikely case of blocks wherein most or all of the pixels are blank, one can, for example, use colocated pixels for compensation. These cases are rare and can be avoided with careful choices of the sizes of the blocks and of the motion vector search window.
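The linear-motion projection described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the integer-pel full search, the tie-break toward the shorter vector, and the tiny block and search sizes are all hypothetical choices. Overlaps are resolved by a simple average and blank pixels fall back to the co-located pixel of the previous key frame, which the text mentions as the simplest option.

```python
def sad(a, ax, ay, b, bx, by, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(a[ay + j][ax + i] - b[by + j][bx + i])
               for j in range(n) for i in range(n))


def linear_motion_si(prev_key, next_key, n=2, search=2):
    """Build an SI frame halfway between two key frames (GOP size 2)."""
    h, w = len(prev_key), len(prev_key[0])
    acc = [[0.0] * w for _ in range(h)]  # sum of values projected onto each pixel
    cnt = [[0] * w for _ in range(h)]    # how many blocks landed on each pixel

    for by in range(0, h - n + 1, n):
        for bx in range(0, w - n + 1, n):
            # Full search: best reference block in prev_key for this next_key block.
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    rx, ry = bx + dx, by + dy
                    if 0 <= rx <= w - n and 0 <= ry <= h - n:
                        key = (sad(next_key, bx, by, prev_key, rx, ry, n),
                               abs(dx) + abs(dy))  # ties go to the shorter vector
                        if best is None or key < best[0]:
                            best = (key, dx, dy)
            _, dx, dy = best
            # The reference block sits at (bx + dx, by + dy) and reaches (bx, by)
            # in next_key; under linear motion it has covered half of that path.
            tx = bx + dx - int(dx / 2)
            ty = by + dy - int(dy / 2)
            for j in range(n):
                for i in range(n):
                    acc[ty + j][tx + i] += prev_key[by + dy + j][bx + dx + i]
                    cnt[ty + j][tx + i] += 1

    # Pixels hit by one vector or by several (overlap) are averaged;
    # blank pixels fall back to the co-located pixel of prev_key.
    return [[acc[y][x] / cnt[y][x] if cnt[y][x] else prev_key[y][x]
             for x in range(w)] for y in range(h)]
```

On a synthetic pair of frames in which the content shifts by two pixels, interior pixels of the result correspond to a one-pixel shift, while the frame border shows both the overlap and the blank cases.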
3.2. Super-Resolution Using Key Frames
At the decoder, in the SSWZ codec, the SI is iteratively generated. However, the first iteration is different from the other ones and represents an important contribution of this paper. In the first iteration, similar to an example-based algorithm , we seek to restore the high-frequency information of an interpolated block by searching previously decoded key frames for a similar block and adding the high-frequency content of the chosen block to the interpolated one.
Note that the original sequence of frames at a high resolution has both key frames and nonkey frames (WZ frames). The framework encodes the WZ frames at a lower resolution and the key frames at regular resolution. At the decoder, the video sequence is received at mixed resolution.
First, we interpolate the WZ frames to the spatial resolution of the key frames to obtain all the decoded frames at the desired resolution.
Then, the key frames are filtered with a low-pass filter and the high frequency content is obtained as a difference between the original key frames and their filtered version.
A block matching algorithm is used, with the interpolated nonkey frame as source and the filtered key frames as reference, in order to find the best predictor for each block of the nonkey frame.
The corresponding high frequency content of the predictor block is added to the block of the WZ frame, after scaling it by a confidence factor.
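The four steps above can be sketched as follows. This is a simplified illustration rather than the codec's actual implementation: it uses a single key frame as reference, assumes 2x2 averaging as the decimator and pixel replication as the interpolator (frames with even dimensions), and takes the confidence factor as a caller-supplied function of the matching cost. All function and parameter names are hypothetical.

```python
def downsample(frame):
    """Assumed decimator: average each 2x2 block."""
    return [[(frame[y][x] + frame[y][x + 1]
              + frame[y + 1][x] + frame[y + 1][x + 1]) / 4.0
             for x in range(0, len(frame[0]), 2)]
            for y in range(0, len(frame), 2)]


def upsample(frame):
    """Assumed interpolator: replicate each pixel into a 2x2 block."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(2)]
        out += [wide, list(wide)]
    return out


def sad(a, ax, ay, b, bx, by, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(a[ay + j][ax + i] - b[by + j][bx + i])
               for j in range(n) for i in range(n))


def super_resolve(lr_wz, key, n=2, search=2, confidence=lambda cost: 1.0):
    """Add key-frame high-frequency content to an interpolated WZ frame."""
    interp = upsample(lr_wz)                # step 1: WZ frame at key resolution
    key_lp = upsample(downsample(key))      # step 2: low-pass filtered key frame
    h, w = len(key), len(key[0])
    key_hf = [[key[y][x] - key_lp[y][x] for x in range(w)] for y in range(h)]
    out = [row[:] for row in interp]
    for by in range(0, h - n + 1, n):
        for bx in range(0, w - n + 1, n):
            # Step 3: match the interpolated block against the filtered key frame.
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    rx, ry = bx + dx, by + dy
                    if 0 <= rx <= w - n and 0 <= ry <= h - n:
                        key_cost = (sad(interp, bx, by, key_lp, rx, ry, n),
                                    abs(dx) + abs(dy))
                        if best is None or key_cost < best[0]:
                            best = (key_cost, rx, ry)
            (cost, _), rx, ry = best
            # Step 4: add the predictor's high frequencies, scaled by confidence.
            c = confidence(cost)
            for j in range(n):
                for i in range(n):
                    out[by + j][bx + i] += c * key_hf[ry + j][rx + i]
    return out
```

For a static scene, where the WZ frame is the decimated key frame, a confidence of 1 restores the key frame exactly and a confidence of 0 leaves the interpolated frame unchanged.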
The past and future reference frames in the frame store of the current WZ frame are low-pass filtered. The low-pass filter is implemented through downsampling followed by upsampling (using the same decimator and interpolator applied to the WZ frames). At this point, we have both key and nonkey frames interpolated from an LR version. Next, a block-matching algorithm is applied using the interpolated decoded frame. The block-matching algorithm works as follows.
Unlike previous works [16–18], we are adding high-frequency content. We want to avoid adding noise in cases where a match is not very close. Hence, we use a confidence factor to scale the high-frequency content before it is added to the LR block.
where is a Lagrange multiplier. Note, that if SAD = = 0 then c = 1. This means that all the high-frequency content will be added. On the other hand, if (SAD + ) = T then c = 0, so no high frequency is added. The values of T and can be empirically found using different test sequences. In our implementation , where i indicates the number of the iteration as we will describe next. And the factor depends on the QP used. For example, we used in our H.264/AVC implementation.
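The defining expression for c is not reproduced in this text, so the following is only one form consistent with the two boundary cases stated above: c = 1 when the total matching cost is zero and c = 0 when it reaches T. The linear ramp, the clamping outside that range, and the name `penalty` (a placeholder for the term that the Lagrange multiplier weights) are assumptions, not the paper's definition.

```python
def confidence(sad, penalty, threshold, lam):
    """Hypothetical confidence factor: linear ramp from 1 (cost 0) to 0 (cost T)."""
    cost = sad + lam * penalty          # total matching cost
    c = 1.0 - cost / threshold          # 1 at cost 0, 0 at cost == threshold
    return max(0.0, min(1.0, c))        # assumed clamp for costs outside [0, T]
```

With this form, a perfect match adds all of the high-frequency content, and a match whose cost reaches or exceeds the threshold adds none.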
4. Results and Simulations
All the SI generation methods were implemented on the KTA software implementation of H.264/AVC . In all our tests, we use fast motion estimation, the CAVLC entropy coder, no rate-distortion optimization, a 16 × 16-pixel search range, and spatial direct mode type for B-frames.
For the TDWZ codec, we set the coder to work in two different modes: IZIZI and IZPZP. In the first mode, all the key frames are coded as conventional I-frames. In the second mode, the key frames are coded as P-frames, with the exception of the first frame. In both cases, Z refers to the WZ frame. Since the goal is SI comparison, the WZ layer for the TDWZ is not actually generated. For the WZ frames, the DCT, quantization, and bit plane creation are computed only so that they are included as overhead in the coding time. The SSWZ codec was set to work in IbIbI, IbPbP, and IpPpP modes, where b represents a nonreference B-frame coded at quarter resolution and p is a disposable nonreference P-frame  also encoded at quarter resolution.
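As a small illustration of how these mode strings map to frame types, the sketch below expands a five-letter mode name into a per-frame type sequence. The helper name and the reading of the mode string (first symbol for the first frame, then alternating nonkey and key symbols) are assumptions drawn from the description above, not from the codec software.

```python
def frame_types(mode, num_frames):
    """Expand a mode name such as 'IZPZP' into one type symbol per frame."""
    first, nonkey, key = mode[0], mode[1], mode[2]
    return "".join(
        first if i == 0 else (nonkey if i % 2 else key)
        for i in range(num_frames)
    )


# Odd-indexed frames are the WZ (or low-resolution) frames in every mode.
print(frame_types("IZPZP", 7))
print(frame_types("IbIbI", 6))
```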
Average encoding time for the temporally scalable WZ codec (TDWZ), spatial scalable codec (SSWZ) and conventional H.264/AVC. TOTAL = total coding time in seconds, ME = motion estimation coding time in seconds.
Table 1 also shows results for the conventional H.264/AVC codec working in IBPBP and modes without rate-distortion optimization. The B-frames are nonreference frames and indicates a disposable nonreference P-frame. It can be seen that all WZ frameworks spend less encoding time than conventional coding. As expected, the TDWZ codec with the key frames encoded as I-frames yields the fastest encoding.
Average SI generation time in frames per second.
In Figure 12, a similar comparison is made for the superresolution process, also using intra key frames (IbIbI mode). In this case, the semi-super resolution process outperforms previous techniques at a cost of higher encoding complexity. Note that the Soccer sequence presents high motion; therefore, it is harder to make an accurate temporal interpolation of the frame. In these cases, the SI generated by the superresolution process should potentially achieve better results.
It can be seen that the WZ coding mode is competitive. The SSWZ codec outperforms conventional coding with zero motion vectors at most rates. The gap between conventional coding and WZ coding, with similar encoding settings, is larger at high rates. However, as can be seen in the Mother and Daughter CIF sequence, the WZ mode may outperform conventional H.264 at low rates. In fact, the SSWZ can potentially yield better results than conventional coding at low rates in low-motion sequences. This can be explained by the fact that the SSWZ uses multiresolution encoding, which can be seen as an interpolative coding scheme; such schemes are known for their good performance at low bit rates. Other interpolative coding schemes have been used in image compression with better performance than conventional compression for low rates . Therefore, it is possible to have a WZ codec operating with a 40%–50% reduction in encoding complexity (see encoding time for conventional IBPBP coding mode and SSWZ IbPbP coding mode in Table 1) and still produce better results than conventional coding for certain rates. Also, since the SSWZ does not use a feedback channel, the correlation statistics are estimated . Thus, a more robust estimation may significantly improve the performance. A specially designed entropy codec could also encode the cosets more efficiently.
5. Conclusions
In this work, we have introduced two new SI generation methods, one for a temporally scalable Wyner-Ziv coding mode and another for a spatially scalable Wyner-Ziv coding mode. The first SI generation method, proposed for the temporally scalable codec, models the motion between two key frames as linear. Thus, the motion between one key frame and the current WZ frame, with a GOP size of 2, will be half of the motion between the key frames. An algorithm for solving the problem of overlaps and blanks was proposed. The results show that this SI method performs better than the SE-B estimator , while being significantly simpler than frame interpolation with spatial motion smoothing and motion vector refinement . However, the latter outperforms the proposed technique. Nevertheless, spatial motion smoothing and motion vector refinement tools can also be incorporated in the present framework, potentially increasing its performance. The SI generation for the spatially scalable codec uses a confidence value to scale the amount of high-frequency content that is added to the block to be superresolved. It works better than the previous techniques [16–18]. This SI method helps a spatially scalable Wyner-Ziv codec achieve competitive results.
Also, a complexity comparison using coding time as benchmark was presented. The temporally scalable codec with key frames coded as "intra" frames is considerably less complex than any other WZ codec. However, it has the worst RD performance (considering key frames and SI). The WZ coding mode with spatial scalability is about 20% more complex than the temporally scalable codec, using P-frames as key frames in both cases. On the other hand, the spatially scalable coder is more competitive and may outperform a conventional codec for low-motion sequences at low rates. Thus, in certain conditions, the spatially scalable framework allows reversed complexity coding without a significant cost in bandwidth.
We can conclude that a spatial scalable WZ codec produces RD results closer to conventional coding than the temporal scalable WZ codec. However, a complete WZ codec may be able to have both coding modes, since the temporal scalable mode can achieve lower complexity.
This work was supported by Hewlett-Packard Brasil.
- Slepian J, Wolf J: Noiseless coding of correlated information sources. IEEE Transactions on Information Theory 1973, 19(4):471-480. 10.1109/TIT.1973.1055037
- Wyner A, Ziv J: The rate-distortion function for source coding with side information at the decoder. IEEE Transactions on Information Theory 1976, 22(1):1-10.
- Pradhan SS, Ramchandran K: Distributed source coding using syndromes (DISCUS): design and construction. Proceedings of the Data Compression Conference (DCC '99), March 1999, Snowbird, Utah, USA 158-167.
- Aaron A, Rane SD, Setton E, Girod B: Transform-domain Wyner-Ziv codec for video. Visual Communications and Image Processing 2004, January 2004, San Jose, Calif, USA, Proceedings of SPIE 5308: 520-528.
- Puri R, Ramchandran K: Prism: a new robust video coding architecture based on distributed compression principles. Proceedings of the 40th Annual Allerton Conference on Communication, Control, and Computing, October 2002, Allerton, Ill, USA 1-10.
- Xu Q, Xiong Z: Layered Wyner-Ziv video coding. Visual Communications and Image Processing, January 2004, San Jose, Calif, USA, Proceedings of SPIE 5308: 83-91.
- Xu Q, Xiong Z: Layered Wyner-Ziv video coding. IEEE Transactions on Image Processing 2006, 15(12):3791-3803.
- Wang H, Cheung N-M, Ortega A: A framework for adaptive scalable video coding using Wyner-Ziv techniques. EURASIP Journal on Applied Signal Processing 2006, 2006:-18.
- Tagliasacchi M, Majumdar A, Ramchandran K: A distributed-source-coding based robust spatio-temporal scalable video codec. Proceedings of the 24th Picture Coding Symposium (PCS '04), December 2004, San Francisco, Calif, USA 435-440.
- Wang X, Orchard MT: Design of trellis codes for source coding with side information at the decoder. Proceedings of Data Compression Conference (DCC '01), March 2001, Snowbird, Utah, USA 361-370.
- Aaron A, Girod B: Compression with side information using turbo codes. Proceedings of the Data Compression Conference (DCC '02), April 2002, Snowbird, Utah, USA 252-261.
- Ouaret M, Dufaux F, Ebrahimi T: Codec-independent scalable distributed video coding. Proceedings of the IEEE International Conference on Image Processing (ICIP '07), September 2007, San Antonio, Tex, USA 3: 9-12.
- Girod B, Aaron AM, Rane S, Rebollo-Monedero D: Distributed video coding. Proceedings of the IEEE 2005, 93(1):71-83.
- Wiegand T, Sullivan GJ, Bjøntegaard G, Luthra A: Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 2003, 13(7):560-576.
- Aaron AM, Rane S, Zhang R, Girod B: Wyner-Ziv coding for video: applications to compression and error resilience. Proceedings of the Data Compression Conference (DCC '03), March 2003, Snowbird, Utah, USA 93-102.
- Mukherjee D: A robust reversed complexity Wyner-Ziv video codec introducing sign-modulated codes. HP Labs, Palo Alto, Calif, USA; May 2006. HPL-2006-80.
- Mukherjee D, Macchiavello B, de Queiroz RL: A simple reversed-complexity Wyner-Ziv video coding mode based on a spatial reduction framework. Visual Communications and Image Processing 2007, January 2007, San Jose, Calif, USA, Proceedings of SPIE 6508: 1-12.
- Macchiavello B, de Queiroz RL, Mukherjee D: Motion-based side-information generation for a scalable Wyner-Ziv video coder. Proceedings of IEEE International Conference on Image Processing (ICIP '07), September 2007, San Antonio, Tex, USA 6: 413-416.
- Li Z, Delp EJ: Wyner-Ziv video side estimator: conventional motion search methods revisited. Proceedings of IEEE International Conference on Image Processing (ICIP '05), September 2005, Genova, Italy 1: 825-828.
- Li Z, Liu L, Delp EJ: Rate distortion analysis of motion side estimation in Wyner-Ziv video coding. IEEE Transactions on Image Processing 2007, 16(1):98-113.
- Brites C, Ascenso J, Pedro JQ, Pereira F: Evaluating a feedback channel based transform domain Wyner-Ziv video codec. Signal Processing: Image Communication 2008, 23(4):269-297. 10.1016/j.image.2008.03.002
- Ascenso J, Brites C, Pereira F: Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding. Proceedings of the 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, June-July 2005, Smolenice, Slovakia 1-6.
- Brandi F, de Queiroz RL, Mukherjee D: Super resolution of video using key frames and motion estimation. Proceedings of the IEEE International Conference on Image Processing (ICIP '08), October 2008, San Diego, Calif, USA 321-324.
- Artigas X, Torres L: Iterative generation of motion-compensated side information for distributed video coding. Proceedings of IEEE International Conference on Image Processing (ICIP '05), September 2005, Genova, Italy 1: 833-836.
- Ascenso J, Brites C, Pereira F: Motion compensated refinement for low complexity pixel based distributed video coding. Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS '05), September 2005, Como, Italy 593-598.
- Weerakkody WARJ, Fernando WAC, Martínez JL, Cuenca P, Quiles F: An iterative refinement technique for side information generation in DVC. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '07), July 2007, Beijing, China 164-167.
- Mukherjee D: Optimal parameter choice for Wyner-Ziv coding of laplacian sources with decoder side-information. HP Labs, Palo Alto, Calif, USA; 2007.
- Macchiavello B, Mukherjee D, de Queiroz RL: A statistical model for a mixed resolution Wyner-Ziv framework. Proceedings of the 26th Picture Coding Symposium (PCS '07), November 2007, Lisbon, Portugal.
- Artigas X, Ascenso J, Dalai M, Klomp S, Kubasov D, Ouaret M: The discover codec: architecture, techniques and evaluation. Proceedings of the 26th Picture Coding Symposium (PCS '07), November 2007, Lisbon, Portugal 1-4.
- Natário L, Brites C, Ascenso J, Pereira F: Extrapolating side information for low-delay pixel-domain distributed video coding. Proceedings of the 9th International Workshop on Visual Content Processing and Representation (VLBV '05), September 2005, Sardinia, Italy 16-21.
- Freeman WT, Jones TR, Pasztor EC: Example-based super-resolution. IEEE Computer Graphics and Applications 2002, 22(2):56-65. 10.1109/38.988747
- Jung J, Tan TK: KTA 1.2 software manual. VCEG-AE08, January 2007.
- Zeng B, Venetsanopoulos AN: A JPEG-based interpolative image coding scheme. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '93), April 1993, Minneapolis, Minn, USA 5: 393-396.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.