A packet-layer video quality assessment model with spatiotemporal complexity estimation
© Liao and Chen; licensee Springer. 2011
Received: 1 November 2010
Accepted: 22 August 2011
Published: 22 August 2011
A packet-layer video quality assessment (VQA) model is a lightweight model that predicts the video quality impacted by network conditions and coding configuration for application scenarios such as video system planning and in-service video quality monitoring. It is under standardization in ITU-T Study Group (SG) 12. In this article, we first differentiate the requirements that the two application scenarios place on a VQA model, and argue that the dataset for evaluating a quality monitoring model should be more challenging than that for a system planning model. Correspondingly, different criteria and approaches are used for constructing the test datasets for system planning (dataset-1) and for video quality monitoring (dataset-2). Further, we propose a novel video quality monitoring model based on estimating the spatiotemporal complexity of video content. The model takes into account the interactions among content features, error concealment effectiveness, and error propagation effects. Experimental results demonstrate that the proposed model achieves a robust performance improvement over the existing peer VQA metrics on both dataset-1 and dataset-2. On the more challenging dataset-2 for video quality monitoring, the Pearson correlation increases from 0.75 to 0.92 and the modified RMSE decreases from 0.41 to 0.19.
1. Introduction
With the development of video service delivery over IP networks, there is a growing interest in low-complexity no-reference video quality assessment (VQA) models for measuring the impact of transmission losses on the perceived video quality. A no-reference VQA model generally uses only the received video, with its compression and transmission impairments, as model input to estimate the video quality. Such a model fits the real-world situation better, because customers usually watch IPTV or streaming video without access to the original video as a reference.
A media-layer model operates on the pixel signal. Thus, it can easily obtain content-dependent features that influence video quality, such as texture-masking effects and motion-masking effects. However, a media-layer model usually needs special solutions (e.g., ) for locating the impaired parts in the distorted video because it lacks information on packet loss.
A packet-layer model (e.g., P.NAMS) utilizes various packet headers (e.g., RTP header, TS header), network parameters (e.g., packet loss rate (PLR), delay), and codec configuration information as input to the model. Obviously, this type of model can roughly locate the impaired parts by analyzing the packet headers. However, how to take the content-dependent features into account is a big challenge to this model.
A bitstream-level model (e.g., P.NBAMS, ) uses the compressed video bitstream in addition to the packet headers as input. Thus, it is not only aware of the location of the loss-impaired parts of the video, but also has access to video content features and the detailed encoding parameters by parsing the video bitstream. It is expected to be more accurate than a packet-layer model at the cost of slightly higher computational complexity. However, when the video bitstream is encrypted, only a packet-layer model can be used.
A hybrid model uses the pixel signal in addition to the bitstream and the packet headers to further improve video quality prediction accuracy. Because the various error concealment (EC) artifacts become available only after decoding the video bitstream into the pixel signal, a hybrid model can in principle provide the most accurate quality prediction. However, it has much higher computational complexity.
The packet-layer model, which primarily estimates the video quality impairment caused by unreliable transmission, is studied in this article.
Two use cases of packet-layer VQA models have been identified in ITU-T SG12/Q14: video system planning and in-service video quality monitoring.
As a video system planning tool, a parametric packet-layer model can help to determine the proper video encoder parameters and network quality of service (QoS) parameters. This avoids over-engineering the applications, terminals, and networks while still guaranteeing a satisfactory user QoE. ITU-T G.OMVS and G.1070 for videophone service are examples of video system planning models.
For the video quality monitoring application, operators or service providers usually need to ensure the video quality service level agreement by monitoring and diagnosing video quality degradation caused by network issues. Since a packet-layer model is computationally lightweight, it can be deployed at large scale along the media service chain. The video quality model of ITU-T standard P.NAMS (Non-intrusive parametric model for the Assessment of performance of Multimedia Streaming) is specifically designed for this purpose.
In general, two approaches can be followed in packet-layer modeling. One is the parameter-based modeling approach [6–9] and the other is the loss-distortion chain-based modeling approach. The parameter-based approach estimates perceptual quality by extracting the parameters of a specific application (e.g., coding bitrate, frame rate) and the transmission packet loss, and then building a relationship between these parameters and the overall video quality. The parametric packet-layer model is therefore by nature consistent with the requirements of system planning. However, it predicts the average video quality over different video contents, and its coefficient table needs to change with the codec type and configuration, the EC strategy of the decoder, the display resolution, and the video content type. Notably, the models in [6, 8, 9] were claimed to achieve a very high Pearson correlation above 0.95 and an RMSE lower than 0.3 on the 5-point rating scale, or 7 on the 0-100 rating scale, even though video content features were not considered in the models. This motivated us to verify the results and to look into how the training and evaluation datasets, on which the model performance directly depends, are set up.
The loss-distortion chain-based approach has the merit of accounting for error propagation, content features, and EC effectiveness. Since it generally involves an iterative process, it is suitable for quality monitoring rather than for system planning. One challenge for this approach is keeping the computational complexity low, which is very important for in-service monitoring. Another challenge is estimating the video content and compression information at the packet layer. Our proposed model follows this approach and addresses both challenges.
The main contributions of this article are twofold. First, we differentiate the requirements that two application scenarios, video system planning and video quality monitoring, place on a packet-layer model. We design the respective criteria and methods to select the processed video sequences (PVSs) for subjective evaluation when setting up the subjective mean opinion score (MOS) database. This helps to explain why the abovementioned parametric packet-layer models achieved a high performance even though video content features were not taken into consideration. Furthermore, we argue that the dataset for evaluating a video quality monitoring model should be more challenging than that for a video system planning model.
Second, we propose a novel quality monitoring model, which has low complexity and fully utilizes video spatiotemporal complexity estimation at the packet layer. In contrast to the parametric packet-layer models, it takes into consideration the interaction among video content features, the EC effect, and the error propagation effect, and thus improves estimation accuracy.
The rest of the article is organized as follows. In Section 2, we review several studies that motivated this work and then discuss its novelty. In Section 3, two different criteria and methods are used to set up the respective datasets for the monitoring and planning scenarios. In Section 4, the proposed VQA model is described. Experimental results are discussed in Section 5. Conclusions and future work are discussed in Section 6.
2. Related work
The recent studies [10–13] are closely related to the idea of our proposed model. In [10, 11], the factors contributing to the visibility of artifacts caused by lost packet(s) were studied; video quality metrics based on the visibility of packet loss were developed in [12, 13].
The factors to the visibility of a single packet loss were studied in  for MPEG-2 compressed video. The top three most important factors were the magnitude of overall motion which is the average across all macroblocks (MBs) initially affected by loss, the type (I, B, or P) of the frame (FRAMETYPE) in which packet loss occurred, and the initial MSE (IMSE) of the error-concealed pixels. Further, the visibility of multiple packet losses in H.264 video was studied in . Again, the IMSE and the FRAMETYPE are identified as the most important factors to the visibility of losses. Besides, it was shown that the IMSE is very different because of the different concealment strategies . It can be seen that the accurate detection of the initial visible artifacts (IVA) and the error propagation effects are two important aspects to be considered in a packet-layer VQA model. Furthermore, the different EC effects should be considered when estimating the annoyance level of IVA.
Yamada et al. developed a no-reference hybrid video quality metric based on the count of the MBs for which the EC algorithm of a decoder is identified as ineffective. Classifying lost MBs based on the error-concealment effectiveness can essentially be regarded as an operation that classifies the visibility of the artifacts caused by packet loss(es). Suresh reported that the simple metric of mean time between visible artifacts has an average correlation of 0.94 with subjective video quality.
There are two major novel points in our proposed model. First, the IVA of a frame suffering from packet loss and EC is estimated based on the EC effectiveness. Unlike , the EC effectiveness is determined based on the spatiotemporal complexity estimation with packet-layer information; and the different EC effects are considered. Second, the IVA is incorporated into an error propagation model to predict the overall video quality. The estimate of spatiotemporal complexity is employed to modulate the propagation of the IVA in the error propagation model. The performance gain resulting from the spatiotemporal complexity-based IVA assessment and from using the error propagation model is analyzed in the experiment section.
3. Subjective dataset and analysis
As described above, the packet-layer video QoE assessment model has two typical application scenarios, video system planning and in-service video quality monitoring, each of which has different requirements. The video system planning model is used for planning network QoS parameters and video coding parameters, given a target video quality. It predicts the average perceptual quality degradation, ignoring the impact of different distortion and content types on the perceived quality. Therefore, it should predict well the quality of the loss-affected sequences that occur with high probability. In contrast, a VQA model for monitoring purposes is expected to raise quality degradation alarms with high accuracy and should estimate as accurately as possible the quality of each specific video sequence distorted by packet losses. Correspondingly, the subjective datasets for training and evaluating the planning model and the monitoring model should be built differently. Further analysis of the PVSs in Sections 3.3 and 3.4 illustrates that the different EC effects and the different error propagation effects are two of the most important factors affecting the perceptual quality of packet-loss distorted videos.
There are mutual influences between the perception of coding artifacts and that of transmission artifacts, especially at low coding bitrates. In our subjective database, visible coding artifacts are avoided by setting the quantization parameter (QP) to a sufficiently small value. Only the video quality degradation caused by transmission impairments is discussed in this article.
3.1 Subjective test
Video QoE assessment is both application-oriented and user-oriented. A viewer's individual interests, quality expectations, and service experience are among the factors contributing to the perceived quality. To compensate for the subjective variance of these factors, the MOS averaged over a number of viewers (called subjects hereafter) is usually used as the quality indication of a video sequence. Moreover, to minimize the variance of the subjects' opinions caused by these factors, the subjective test should be conducted in a well-controlled environment, and subjects should be well instructed about the task and the video application scenario, which influences their expectation of video quality.
- Imperceptible: "no artifact (or problematic area) can be perceived during the whole video display period".
- Perceptible but not annoying: "artifact can be perceived occasionally, but it does not influence the interested content, or it appears in the background for an instant moment".
- Slightly annoying: "the noticeable artifact appearing in the region of interest (ROI) is identified, or noticeable artifacts are detected for several instant moments even if they do not appear in the ROI".
- Annoying: "noticeable artifact appears in ROI for several times or many noticeable artifacts are detected and last for a long time".
- Very annoying: "video content cannot be understood well due to artifacts and the artifacts spread all over the sequence".
3.2 Select PVSs for dataset
Six CIF format video contents, which cover a wide range of spatial complexity (SC) index and temporal complexity (TC) index , are used as original sequences, namely Foreman, Hall, Mobile, Mother, News, and Paris. The six sequences are encoded using H.264 encoder with two sequence structures, namely, IBBPBB and IPPP. Group of picture (GOP) size is 15 frames. A proper fixed QP is used to prevent the compressed video from visible coding artifacts. Each row of MBs is encoded as an individual slice, and one slice is encapsulated into an RTP packet. To simulate transmission error, the loss patterns generated at five PLRs (0.1, 0.4, 1, 3, and 5%) in  are used. For each nominal PLR, 30 channel realizations are generated by starting to read the error pattern file at a random point. Thus, for each original sequence, there are 150 realizations of packet loss corrupted sequences. Before subjective evaluation test, we must choose some typical PVSs from the large numbers of realizations.
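The way channel realizations are drawn can be sketched as follows. This is a minimal illustration, not the procedure used to build the datasets: the toy loss patterns, the sequence length `NUM_FRAMES`, the wrap-around at the end of a pattern, and all function names are assumptions introduced here. Only the five nominal PLRs, the 30 realizations per PLR, and the one-slice-per-MB-row packetization (18 packets per CIF frame) come from the text.

```python
import random

PACKETS_PER_FRAME = 18   # CIF has 288 / 16 = 18 MB rows, one slice (packet) per row
NUM_FRAMES = 150         # hypothetical sequence length, not stated in the text

def channel_realization(pattern, num_packets, rng):
    """Cut num_packets loss flags out of a longer pattern, starting at a
    random offset. Wrapping around at the end of the pattern is an assumption."""
    start = rng.randrange(len(pattern))
    return [pattern[(start + k) % len(pattern)] for k in range(num_packets)]

def make_realizations(patterns_by_plr, num_packets, per_plr=30, seed=0):
    """Return {nominal PLR: list of per_plr loss traces} (1 = packet lost)."""
    rng = random.Random(seed)
    return {
        plr: [channel_realization(pattern, num_packets, rng) for _ in range(per_plr)]
        for plr, pattern in patterns_by_plr.items()
    }

# Toy stand-ins for the error pattern files: independent losses at each nominal PLR.
toy_rng = random.Random(1)
nominal_plrs = [0.001, 0.004, 0.01, 0.03, 0.05]
toy_patterns = {
    plr: [1 if toy_rng.random() < plr else 0 for _ in range(10000)]
    for plr in nominal_plrs
}

realizations = make_realizations(toy_patterns, PACKETS_PER_FRAME * NUM_FRAMES)
num_traces = sum(len(traces) for traces in realizations.values())
print(num_traces)  # 5 nominal PLRs x 30 realizations per original sequence
```

Because each trace is a random segment of a longer pattern, the loss rate measured on a trace generally differs from the nominal PLR of the pattern it was cut from.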
For each video content, select the PVSs that are representatives of the dominant MOS-PLR distribution as done in ;
For each video content, select the PVSs that cover the MOS-PLR distribution widely by including the PVSs of the best and the poorest quality at a given PLR level, in addition to those representing the dominant MOS-PLR distribution.
In Figure 4c, the PLR-PSNR distributions for the six video contents spread away from each other, whereas in Figure 4a the PLR-MOS distributions for most of the video contents are mixed together. This phenomenon partially illustrates that PSNR is not a good objective measurement of video quality, because it fails to take into consideration the impact of video content features on human perception of video quality.
Figure 4b shows that PVSs present very different perceptual qualities in dataset-2 even under the same PLR. Taking the PLR of 0.86% as an example, the MOSs vary from Grade 2 to Grade 4. PLR treats all lost data as equally important to perceived quality, ignoring the influence of content and compression on perceived quality. It may be an effective feature on dataset-1, as shown in Figure 4a, but it is not an effective feature on dataset-2 for quality monitoring applications.
Unlike [6, 8, 9], our proposed objective model targets the video quality monitoring application. An objective model for monitoring purposes should be able to estimate as accurately as possible the video quality of each specific sequence distorted by packet loss. Correspondingly, the dataset for evaluating the model performance should be more challenging than that for a planning model, i.e., the proposed model should work well not only on dataset-1 but also on dataset-2.
3.3 Impact of EC
Both the duration and the annoyance level of the visible artifacts contribute to the perceived video quality degradation. The annoyance level of artifacts produced by packet loss depends heavily on the EC scheme of the decoder. The goal of EC is to estimate the missing MBs in a compressed video bitstream with packet losses, in order to keep the perceptual quality degradation to a minimum. The EC methods that have been developed roughly fall into two categories: spatial EC and temporal EC. In the spatial EC class, the spatial correlation between local pixels is exploited; missing MBs are recovered by interpolation from neighboring pixels. In the temporal EC class, both the coherence of the motion field and the spatial smoothness of pixels along edges across block boundaries are exploited to estimate the motion vector (MV) of a lost MB. In the H.264 JM reference decoder, the spatial approach is applied to conceal lost MBs of an intra-coded frame (I-frame) using a bilinear interpolation technique; the temporal approach is applied to conceal lost MBs of inter-predicted frames (P-frame, B-frame) by estimating the MV of the lost MB from the MVs of the neighboring MBs. A minimum boundary discontinuity criterion is used to select the best MV estimate.
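To make the dependence of spatial EC on content concrete, the following sketch fills a lost band of pixel rows by linear interpolation between the row above and the row below it. It is a deliberately simplified, one-dimensional stand-in for the interpolation idea and is not the JM decoder's algorithm; the function name and the list-of-rows frame layout are assumptions made for illustration.

```python
def conceal_rows(frame, first_row, last_row):
    """Fill rows first_row..last_row (inclusive) of a frame, given as a list
    of pixel rows, by vertical linear interpolation between the nearest
    intact rows above and below the lost band."""
    height = len(frame)
    top = frame[first_row - 1] if first_row > 0 else None
    bottom = frame[last_row + 1] if last_row + 1 < height else None
    if top is None and bottom is None:
        raise ValueError("no intact neighbor row to interpolate from")
    # At the frame border only one neighbor exists, so it is simply copied.
    if top is None:
        top = bottom
    if bottom is None:
        bottom = top
    span = last_row - first_row + 2
    for r in range(first_row, last_row + 1):
        w = (r - first_row + 1) / span  # 0 near the top row, 1 near the bottom row
        frame[r] = [(1 - w) * t + w * b for t, b in zip(top, bottom)]
    return frame

# A smooth vertical gradient is recovered exactly ...
smooth = [[0, 0], [None, None], [None, None], [None, None], [8, 8]]
conceal_rows(smooth, 1, 3)
print(smooth[1:4])

# ... whereas fine texture inside the lost band (here a bright line) is lost.
original_textured = [[0, 0], [0, 0], [100, 100], [0, 0], [0, 0]]
textured = [row[:] for row in original_textured]
conceal_rows(textured, 1, 3)
print(textured[2], "vs original", original_textured[2])
```

The two toy frames illustrate why the same lost slice can be nearly invisible in smooth content and clearly visible in detailed content.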
3.4 Impact of error propagation
In general, an I-frame packet loss results in artifact duration of GOP length, or even longer if open GOP structure is used in compression configuration. The more intra-coded MBs exist in inter-coded frames, the more easily the video quality recovers from error, and the shorter the artifact duration is. In general, the artifact duration caused by P-frame packet loss is less than that by I-frame packet loss. However, the impact of a P-frame packet loss can be significant, if large motion exists in the packet and/or the packets temporally adjacent to it. The artifacts caused by a B-packet loss, if noticeable, look like an instant glitch, because there is no error propagation from B-frame and the artifacts last merely for 1/30 s. When the motion in a lost B slice is low, there are no visible artifacts at all.
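The dependence of artifact duration on frame type can be sketched by propagating a loss through the prediction structure. The sketch below assumes a closed GOP of 15 frames in display order, no intra-coded MBs in inter-coded frames, and B-frames that are never used as references; the reference lists and function names are illustrative assumptions, not the encoder configuration used for the datasets.

```python
def affected_frames(refs, lost):
    """refs[i] lists the frames that frame i predicts from. Returns all frames
    that contain artifacts when frame `lost` is hit, found by propagating the
    error along prediction dependencies until nothing changes."""
    affected = {lost}
    changed = True
    while changed:
        changed = False
        for i, frame_refs in enumerate(refs):
            if i not in affected and any(j in affected for j in frame_refs):
                affected.add(i)
                changed = True
    return sorted(affected)

def ippp_refs(gop=15):
    """I P P P ...: every P-frame predicts from the previous frame."""
    return [[]] + [[i - 1] for i in range(1, gop)]

def ibbp_refs(gop=15):
    """I B B P B B P ... in display order: P-frames predict from the previous
    anchor, B-frames from the surrounding anchors (closed GOP assumed, so the
    trailing B-frames only use the preceding anchor)."""
    refs = []
    for i in range(gop):
        if i == 0:
            refs.append([])
        elif i % 3 == 0:
            refs.append([i - 3])
        else:
            prev_anchor = i - i % 3
            next_anchor = prev_anchor + 3
            refs.append([prev_anchor] + ([next_anchor] if next_anchor < gop else []))
    return refs

FRAME_RATE = 30  # frames per second, matching the 1/30 s frame duration in the text
for name, refs, lost in [("I-frame loss, IPPP", ippp_refs(), 0),
                         ("late P-frame loss, IPPP", ippp_refs(), 10),
                         ("B-frame loss, IBBP", ibbp_refs(), 1),
                         ("P-frame loss, IBBP", ibbp_refs(), 12)]:
    frames = affected_frames(refs, lost)
    print(name, len(frames), "frames,", len(frames) / FRAME_RATE, "s")
```

Under these assumptions an I-frame loss affects the whole GOP, a P-frame loss affects the rest of the GOP plus the B-frames that predict from it, and a B-frame loss stays confined to a single frame.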
4. VQA model with spatiotemporal complexity estimation
Both the effects of EC and the effects of error propagation are closely related to the spatiotemporal complexity of the lost packets and of their spatiotemporally adjacent packets. To improve the prediction accuracy of a packet-layer VQA model in the quality monitoring case, the influence of video content properties, EC strategy, and error propagation should be taken into consideration as much as possible. The proposed objective quality assessment model is based on video spatiotemporal complexity estimation.
4.1 Spatiotemporal complexity estimation
For a video frame indexed as $i$, the parameter set $\pi_i$, which includes the frame size $s_i$, the total number of packets $N_{i,\mathrm{total}}$, the number of lost packets $N_{i,\mathrm{lost}}$, and the location of the lost packets in the frame, is calculated or recorded. The location of lost packets in a video frame is detected with the assistance of the sequence number field of the RTP header. To identify different frames, the timestamp in the RTP header is used. The frame size includes both the lost packet size and the received packet size. For a lost I-frame packet, its size is estimated as the average of the two spatially adjacent I-frame packets that are correctly received, or as the size of the spatially adjacent I-frame packet if only one spatially adjacent I-frame packet is correctly received. For a lost P-frame packet, its size is estimated as the average size of the two temporally adjacent collocated P-frame packets that are correctly received. A similar method is used to estimate the size of a lost B-frame packet.
For an I slice, if its size is smaller than $thrd_{\mathrm{smooth}}$, then the slice is classified as a smooth-SC slice; otherwise, it is classified as an edged-SC slice. The threshold $thrd_{\mathrm{smooth}}$ is a function of the coding bitrate. In our experiment, $thrd_{\mathrm{smooth}}$ is set to 200 bytes for CIF format sequences coded with an H.264 encoder and QP equal to 28.
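The bookkeeping described above can be sketched as follows. The sketch assumes packets are listed in arrival order without reordering and that the per-slice sizes of a frame are already grouped by RTP timestamp; the function names and the use of `None` for a lost packet are conventions introduced here for illustration. The case where no adjacent packet is received is not covered by the text, so the sketch simply returns `None` for it.

```python
RTP_SEQ_MODULO = 1 << 16  # the RTP sequence number is a 16-bit field

def find_lost_sequence_numbers(received_seq):
    """Sequence numbers missing between consecutive received packets,
    handling the 16-bit wrap-around. Assumes no packet reordering."""
    lost = []
    for prev, cur in zip(received_seq, received_seq[1:]):
        gap = (cur - prev) % RTP_SEQ_MODULO
        lost.extend((prev + k) % RTP_SEQ_MODULO for k in range(1, gap))
    return lost

def estimate_lost_i_packet_size(slice_sizes, index):
    """slice_sizes holds the packet sizes (bytes) of one I-frame in spatial
    order, with None for lost packets. The lost packet at `index` gets the
    average size of its correctly received spatial neighbors."""
    neighbors = [slice_sizes[j] for j in (index - 1, index + 1)
                 if 0 <= j < len(slice_sizes) and slice_sizes[j] is not None]
    if not neighbors:
        return None  # not specified in the text; left undecided here
    return sum(neighbors) / len(neighbors)

def estimated_frame_size(slice_sizes):
    """Frame size = received bytes + estimated bytes of lost packets."""
    total = 0.0
    for index, size in enumerate(slice_sizes):
        if size is None:
            size = estimate_lost_i_packet_size(slice_sizes, index) or 0.0
        total += size
    return total

print(find_lost_sequence_numbers([65534, 65535, 1, 2]))
print(estimate_lost_i_packet_size([300, None, 500], 1))
print(estimated_frame_size([300, None, 500]))
```

The same neighbor-averaging pattern applies to lost P-frame and B-frame packets, with temporally adjacent collocated packets taking the place of the spatial neighbors.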
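The I-slice classification rule amounts to a single threshold test, sketched below. The 200-byte default is the value reported above for CIF sequences coded with H.264 at QP 28; the function name is hypothetical, and a different threshold would be needed for other bitrates or formats.

```python
THRD_SMOOTH_BYTES = 200  # value from the text: CIF, H.264, QP = 28

def classify_i_slice(size_bytes, thrd_smooth=THRD_SMOOTH_BYTES):
    """Classify the spatial complexity (SC) of an I slice from its size."""
    return "smooth-SC" if size_bytes < thrd_smooth else "edged-SC"

for size in (150, 200, 640):
    print(size, classify_i_slice(size))
```

The rule relies on the observation that, at a fixed QP, a slice covering smooth content compresses into fewer bytes than a slice covering edges or texture.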
4.2 Objective assessment model
The value of depending on EC method and SC/TC classification
Spatial EC method
Temporal EC method
M-TC & IPPP structure
M-TC & IBBP structure
where b is the weight for the error propagated from the respective reference frames. For our datasets, b = 0.75 for P frames and b = 0.5 for B frames.
The value of depending on TC classification
L-TC & M-TC
where $M$ is the total number of frames in $t$ seconds, and $f_x$ is the frame rate of the video sequence.
5. Experimental results
First, we compare the correlation between the subjective MOS and several parameters that are used in the existing packet-layer models. These parameters include PLR, burst loss frequency (BLF), and invalid frame ratio (IFR). In the existing work, these parameters are modeled together with video coding parameters such as coding bitrate and frame rate. To compare fairly the performance of the above parameters, which reflect transmission impairment, coding artifacts are prevented in our datasets by properly setting the QP.
where the index i denotes the video sample, N denotes the number of samples, and d denotes the number of degrees of freedom. The degree of freedom d is set to 1 because we did not apply any fitting method to the predicted MOS score before comparing it with the subjective MOS.
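The equation for the modified RMSE is not reproduced in this text. The sketch below therefore assumes the confidence-interval-aware form commonly used in ITU-T model evaluation, in which each absolute prediction error is first reduced by the 95% confidence interval of the corresponding subjective MOS and the sum of squares is normalized by N - d. The use of the confidence interval, the function name, and the toy numbers are all assumptions; only N, d, and d = 1 come from the text.

```python
import math

def modified_rmse(mos, predicted, ci95, d=1):
    """Assumed form: sqrt( sum_i max(0, |MOS_i - MOSp_i| - ci95_i)^2 / (N - d) ),
    where N is the number of samples and d the number of degrees of freedom."""
    n = len(mos)
    if not (n == len(predicted) == len(ci95)):
        raise ValueError("all inputs must have the same length")
    if n <= d:
        raise ValueError("need more samples than degrees of freedom")
    total = 0.0
    for subjective, objective, ci in zip(mos, predicted, ci95):
        error = max(0.0, abs(subjective - objective) - ci)
        total += error * error
    return math.sqrt(total / (n - d))

# Toy scores, not taken from the datasets in this article.
toy_mos = [4.0, 3.0, 2.0]
toy_predicted = [3.5, 3.0, 3.0]
toy_ci95 = [0.2, 0.2, 0.2]
print(modified_rmse(toy_mos, toy_predicted, toy_ci95))
```

With this form, a prediction that falls within the confidence interval of the subjective score contributes no error, so the metric does not penalize a model for the uncertainty of the subjective test itself.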
The correlation and modified RMSE between different artifact features and subjective MOS
Quantitative analysis of the contribution from EC effectiveness estimation and EP model
Second, the contributions of two factors, namely EC effectiveness and the EP model, are quantified on dataset-2. If we set the weights for EC effectiveness to one in Equation 4 and ignore the second term, the propagated artifacts, by setting it to zero in Equation 3, then the MLoVA reduces to PLR, where all data losses are regarded as equally important to perceptual quality. As described in Section 3, the EC strategy employed at the decoder can hide the visible artifacts caused by packet loss to a degree that depends on the spatiotemporal complexity of the lost content. When the complexity estimation-based EC weights are applied to calculate the IVA and the propagated error term is still ignored, Figure 10b shows that the correlation of the mean IVA (MIVA) with the subjective MOS is 0.86, and the modified RMSE is reduced to 0.27. The performance is significantly improved compared with PLR. Further, the improvement brought by incorporating the error propagation model of Equation 5 was evaluated. Because inter-frame prediction is used in video compression, the influence of an I-packet loss, or of a P-packet loss appearing early in a GOP, is quite different from that of a B-packet loss or of a P-packet loss appearing later in a GOP. By setting β1 = β2 = 1, we did not consider the error attenuation effects during propagation, and denoted the corresponding result of Equation 7 as MLOVA0. It can be seen that introducing the EP model and the complexity estimation-based EP attenuation weight further improves the prediction accuracy on dataset-2.
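The reduction to PLR described above can be checked with a small sketch. Equations 3 and 4 are not reproduced in this text, so the per-frame form used here, an EC weight multiplied by the ratio of lost to total packets and averaged over frames with the propagated term dropped, is an assumption consistent with the description. The weight values and the `tc` labels are hypothetical placeholders, not the values of the model.

```python
def mean_iva(frames, weight_fn=lambda frame: 1.0):
    """Mean initial visible artifact level over frames, assuming
    IVA_i = w_i * N_lost_i / N_total_i and no propagated error."""
    return sum(weight_fn(f) * f["lost"] / f["total"] for f in frames) / len(frames)

def packet_loss_rate(frames):
    """Plain PLR: lost packets divided by total packets."""
    return sum(f["lost"] for f in frames) / sum(f["total"] for f in frames)

# Toy frames with 18 packets each (one packet per MB row of a CIF frame).
toy_frames = [
    {"lost": 0, "total": 18, "tc": "low"},
    {"lost": 2, "total": 18, "tc": "high"},
    {"lost": 1, "total": 18, "tc": "low"},
    {"lost": 0, "total": 18, "tc": "high"},
]

# With unit weights the mean IVA coincides with PLR (equal packets per frame).
unit_weight_value = mean_iva(toy_frames)
plr_value = packet_loss_rate(toy_frames)

# Hypothetical weights: concealment is assumed to work better for low-TC content.
hypothetical_weights = {"low": 0.3, "high": 1.0}
weighted_value = mean_iva(toy_frames, lambda f: hypothetical_weights[f["tc"]])

print(unit_weight_value, plr_value, weighted_value)
```

The equality with PLR holds exactly only when every frame carries the same number of packets, which is the case for the one-slice-per-MB-row packetization used in this article.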
6. Conclusion and future work
In this study, the different requirements of two application scenarios of a parametric packet-layer model are discussed. We provide the insight that different criteria and methods should be used to select processed sequences for subjective evaluation when setting up the evaluation dataset. It is shown that the parameters PLR/BLF/IFR used in existing models are effective for video system planning modeling, but are not effective for video quality monitoring applications.
Further, a model is proposed for the video quality monitoring scenario, taking into consideration the interaction among video content features, EC effects, and error propagation effects. It achieves much better performance on both types of datasets, for planning and for monitoring applications. The results also show that, for the encoding configuration given in this article, a packet-layer model that takes packet header information and encoder configuration information as inputs is able to estimate video quality with enough accuracy for practical use. However, H.264 offers many error-resilience tools (e.g., flexible MB order) to combat video quality degradation in case of transmission losses, and different EC strategies may be employed at a decoder. A packet-layer model must therefore be tailored to the specific video application configuration. In future work, the influence of the distribution of visible packet losses on the overall perceived quality will be studied.
References
- Takahashi A, Hands D, Barriac V: Standardization activities in the ITU for a QoE assessment of IPTV. IEEE Commun Mag 2008, 46(2):78-84.
- ITU-T document, Draft terms of reference (ToR) for P.NAMS. 2009. [http://www.itu.int/md/meetingdoc.asp?lang=en%26;parent=T09-SG12-091103-TD-GEN-0146]
- ITU-T document, Draft Terms of Reference (ToR) for P.NBAMS. 2009. [http://www.itu.int/md/T09-SG12-110118-TD-GEN-0521]
- Rui H, Li C, Qiu S: Evaluation of packet loss impairment on streaming video. J Zhejiang Univ Sci 2006, A7: 131-136.
- Reibman AR, Vaishampayan VA, Sermadevi Y: Quality monitoring of video over a packet network. IEEE Trans Multimedia 2004, 6(2):327-334. 10.1109/TMM.2003.822785
- Yamagishi K, Hayashi T: Video-quality planning model for videophone services. Inf Media Technol 4(1):1-9.
- Mohamed S, Rubino G: A study of real-time packet video quality using random neural networks. IEEE Trans Circ Syst Video Technol 2002, 12(12):1071-1083. 10.1109/TCSVT.2002.806808
- Yamagishi K, Hayashi T: Parametric packet-layer model for monitoring video quality of IPTV services. IEEE International Conference on Communications 2008, 110-114.
- Raake A, Garcia M-N, Moller S, Berger J, Kling F, List P, Johann J, Heidemann C: T-V-model: parameter-based prediction of IPTV quality. Proc ICASSP 2008, 1149-1152.
- Kanumuri S, Cosman PC, Reibman AR, Vaishampayan VA: Modeling packet loss visibility in MPEG-2 video. IEEE Trans Multimedia 2006, 8(2):341-355.
- Reibman AR, Poole D: Predicting packet-loss visibility using scene characteristics. Proceedings of the International Workshop in Packet Video 2007, 308-317.
- Yamada T, Miyamoto Y, Serizawa M: No-reference video quality estimation based on error-concealment effectiveness. IEEE Packet Video Workshop 2007, 288-293.
- Suresh N: Mean time between visible artifacts in visual communications. PhD thesis, Georgia Institute of Technology 2007.
- Winkler S, Dufaux F: Video quality evaluation for mobile applications. Proc VCIP 2003, 593-603.
- Winkler S, Mohandas P: The evolution of video quality measurement: from PSNR to hybrid metrics. IEEE Trans Broadcast 2008, 54(3):660-668.
- ITU-T Rec. P.910, Subjective video quality assessment methods for multimedia applications. Geneva; 2008.
- Simone FD, Naccari M, Tagliasacchi M, Dufaux F, Tubaro S, Ebrahimi T: Subjective assessment of H.264/AVC video sequences transmitted over a noisy channel. Proc International Workshop on Quality of Multimedia Experience (QoMEx) 2009, 204-209. [http://mmspl.epfl.ch/]
- ITU-T document, Qualification test plan for P.NAMS. [http://www.itu.int/md/meetingdoc.asp?lang=en%26;parent=T09-SG12-091103-TD-GEN-0150]
- ITU-R Rec. BT.500-10, Methodology for the subjective assessment of the quality of the television pictures. 2000.
- Clark A: Method and system for viewer quality estimation of packet video streams. 2009. [http://www.freepatentsonline.com/y2009/0041114.html]
- Hayashi T, Masuda M, Tominaga T, Yamagishi K: Non-intrusive QoS monitoring method for realtime telecommunication services. NTT Tech Rev 2006, 4(4):35-40.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.