Visual rhythms for qualitative evaluation of video stabilization

Recent technological advances have enabled the development of compact and portable cameras for the generation of large volumes of video content. Several applications have benefited from such significant growth of multimedia data, such as telemedicine, surveillance and security, entertainment, teaching, and robotics. However, videos captured by amateurs are subject to unwanted motion or vibration while handling the camera. Video stabilization techniques aim to detect and remove glitches or instabilities caused during the acquisition process to enhance visual quality. In this work, we introduce and analyze a novel representation based on visual rhythms for qualitative evaluation of video stabilization methods. Experiments on different video sequences demonstrate the effectiveness of the visual representation as a qualitative measure for evaluating video stability. In addition, we present a proposal to calculate an objective metric extracted from the visual rhythms.


Introduction
The popularization of mobile devices in recent years has contributed to making video acquisition possible for a variety of applications. Handling such devices generally causes unwanted motion during the video generation, which inevitably affects the quality of the final video.
Techniques and metrics for quality evaluation must be well established so that video stabilization approaches can be developed, refined, and compared in a consistent manner. Therefore, ineffective evaluation measures may lead to the development of inadequate techniques, compromising the advance of state-of-the-art video stabilization approaches.
Most of the quantitative techniques for the evaluation of video stabilization available in the literature are inaccurate and, in some cases, incompatible with human visual perception. Moreover, the techniques used to evaluate and report the results subjectively are

Background
Different categories of stabilization systems have been proposed to improve the quality of videos. The three most common types are mechanical stabilization, optical stabilization, and digital stabilization.
Mechanical video stabilization typically uses sensors to detect camera shifts and compensate for undesired movements. A common way is to use gyroscopes to detect motion and send signals to motors connected to small wheels, such that the camera can move in the opposite direction of motion.
Optical video stabilization [19] is widely used in photographic cameras and consists of a mechanism to compensate for the angular and translational movements of the cameras, stabilizing the image before it is recorded on the sensor. A form of optical stabilization introduces a gyroscope to measure velocity differences at distinct instants to distinguish between normal and undesired motion.
Digital video stabilization is implemented in software without the use of special devices. Digital video stabilization methods are commonly categorized into two-dimensional (2D) and three-dimensional (3D) approaches. In the first category, techniques estimate camera motion from two consecutive frames and apply 2D transformations to stabilize the video. In the second category, techniques reconstruct the camera trajectories from 3D transformations [20,21], such as scaling, translation, and rotation.
In the context of image and video processing, the evaluation can be classified as (i) objective, when obtained through functions applied between two images [22] or video frames, and (ii) subjective, when the analysis is performed by human observers. In both cases, a desired goal is to assess stabilization based on criteria in agreement with the perception of the human visual system.

Objective evaluation
Criteria for measuring the amount and nature of the camera displacement have been proposed to evaluate the quality of video stabilization in an objective manner [23]. Unintentional motion is decomposed into divergence and jitter through low-pass and high-pass filters, respectively. The amount of jitter from the stabilized and original video is compared. The divergence is also verified, which indicates the amount of expected displacement. For an overall assessment, the blurring caused by the stabilization process is considered.
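The low-pass/high-pass decomposition described above can be illustrated on a one-dimensional camera path. The following is only a minimal sketch: the moving-average filter, the window radius, and the mean-squared summary (`jitter_energy`) are assumptions chosen for illustration and are not taken from [23].

```python
def moving_average(signal, radius):
    """Low-pass filter: mean over a window of up to 2*radius+1 samples (clipped at the borders)."""
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - radius)
        hi = min(len(signal), i + radius + 1)
        window = signal[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def decompose_motion(path, radius=2):
    """Split a 1D camera path into a low-frequency part (divergence) and a high-frequency residual (jitter)."""
    divergence = moving_average(path, radius)
    jitter = [p - d for p, d in zip(path, divergence)]
    return divergence, jitter


def jitter_energy(jitter):
    """Mean squared jitter, a hypothetical scalar summary used only in this sketch."""
    return sum(j * j for j in jitter) / len(jitter)


# Toy paths: a steady pan of 0.5 px/frame, with and without an alternating +-1 px shake.
steady = [0.5 * t for t in range(20)]
shaky = [0.5 * t + (1.0 if t % 2 else -1.0) for t in range(20)]

_, jitter_steady = decompose_motion(steady)
_, jitter_shaky = decompose_motion(shaky)
print(jitter_energy(jitter_shaky) > jitter_energy(jitter_steady))
```

Comparing the jitter of the original and stabilized paths in this way mirrors the comparison described in the text, although a real evaluation would first need an estimate of the camera path.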
Most of the video stabilization approaches found in the literature have adopted the interframe transformation fidelity (ITF) [24][25][26][27][28], which can be expressed as the peak signal-to-noise ratio (PSNR) of video frames. More recent approaches have considered the structural similarity (SSIM) [29] as an alternative to PSNR [28]. Liu et al. [30] employed the amount of energy present in the low-frequency portion of the 2D motion estimated as a stability metric. The rate of frame cropping and distortion are used to assess the stabilization process more generally.
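As an illustration, ITF is commonly computed as the mean PSNR between consecutive frames. The sketch below assumes 8-bit grayscale frames stored as nested lists and a peak value of 255; the function names are illustrative and do not come from the cited works.

```python
import math


def mse(frame_a, frame_b):
    """Mean squared error between two grayscale frames of equal size."""
    total, count = 0.0, 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


def psnr(frame_a, frame_b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(frame_a, frame_b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / err)


def itf(frames):
    """Mean PSNR over all pairs of consecutive frames."""
    values = [psnr(frames[k], frames[k + 1]) for k in range(len(frames) - 1)]
    return sum(values) / len(values)


def flat_frame(value):
    """Helper for the toy example: a 2x2 frame with constant intensity."""
    return [[value, value], [value, value]]


# Toy sequences: larger interframe differences give a lower ITF.
shaky = [flat_frame(100), flat_frame(110), flat_frame(100)]
steady = [flat_frame(100), flat_frame(101), flat_frame(100)]
print(itf(steady) > itf(shaky))
```

Because the measure compares pixel values directly, any intentional camera motion also lowers the score, which is consistent with the weaknesses of ITF in motion videos noted in the literature.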
Synthesizing unstable videos from stable videos has been proposed for the evaluation of video stabilization [31] in order to provide the ground-truth of the stable videos. The methods are evaluated according to two aspects: (i) the distance between the stabilized frame and the reference frame and (ii) the average of the SSIM between each pair of consecutive frames.
Due to the weaknesses of ITF in motion videos, an evaluation method based on the variation of the intersection of angles between the global motion vectors, calculated from the scale-invariant feature transform (SIFT) keypoints [32], was proposed to evaluate the video stabilization process [33]. In fixed-camera videos, the ITF is still considered, but only over the overlapping frame background instead of the entire frame.

Subjective evaluation
Several methods found in the literature briefly describe and analyze the trajectories made by the camera and the trajectories of the stabilized video [34][35][36][37][38]. These trajectories are usually related to the different factors that compose the estimated 2D motion. For instance, the approaches present the camera path for horizontal and vertical translations and rotations. Figure 1 shows an example of path for horizontal translation estimated from the original (blue) and smoothed (green) trajectory.
From the trajectory, it is possible to identify when a motion occurs and its intensity in the original video, as well as the same motion after smoothing. This type of visualization can be very useful to analyze the behavior of the motion smoothing step used in a certain method. However, its result depends on the technique used to estimate the motion, so the trajectory does not reliably represent the video motion. Thus, the trajectory may not be a good alternative for evaluating the stabilization quality, nor an adequate visualization for videos with spatially distinct motion.
Some approaches in the literature present frame sequences, usually superimposed with horizontal and vertical lines [25, 28, 35-37, 39, 40]. Thus, it is possible to check the alignment of a small set of consecutive frames. Figure 2 illustrates an example of this type of visualization, where objects intersected by the lines are more aligned in the stabilized video.
From the sequence of frames, the displacement of each frame is noticeable, in addition to the amount of pixels lost due to the transformation applied to each frame. However, this technique becomes impractical when a large number of frames is considered, compromising the analysis of the entire video.
Furthermore, there are approaches that summarize a video in a single image calculated through the average gray levels of the frames [41,42], as shown in Fig. 3. Better-defined images are expected for more stable videos. From this representation, it is possible to check whether the video contains more motion; however, it is difficult to determine the nature of the video motion.
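The average-frame summary can be sketched in a few lines. This is only an illustration with grayscale frames stored as nested lists; the function name `mean_frame` is hypothetical.

```python
def mean_frame(frames):
    """Pixel-wise average of the gray levels over all frames of a video."""
    height, width = len(frames[0]), len(frames[0][0])
    n = len(frames)
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]


# A vertical edge that stays in place versus one that shifts between frames.
stable = [[[0, 0, 255, 255]], [[0, 0, 255, 255]], [[0, 0, 255, 255]]]
unstable = [[[0, 0, 255, 255]], [[0, 255, 255, 255]], [[0, 0, 0, 255]]]

print(mean_frame(stable))    # the edge stays sharp
print(mean_frame(unstable))  # the edge is smeared over intermediate gray levels
```

The smeared edge shows that motion is present, but the same blur would result from shifts in either direction, which illustrates why the nature of the motion is hard to recover from this summary.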
In a broader context, video visualization is concerned with the creation of a new visual representation, obtained from an input video, capable of indicating its characteristics and important events [43]. Video visualization techniques can generate different types of output data, such as another video, a collection of images or a single image. Borgo et al. [43] reported a review of several video visualization techniques proposed over the last years.
In order to help users find scenes with specific motion characteristics in the context of video browsing, motion histograms were proposed in the HSV color space [44]. Motion histograms are obtained by means of motion vectors contained in H.264/AVC codecs. Figure 4 presents an example of the visualization, where each frame of the video is represented by a vertical line, such that the motion direction is mapped by different colors and the motion intensity by brightness values. As a disadvantage, this technique suffers from the presence of noise in the motion vectors, introduced by the motion estimation algorithm [44].
Visual rhythm [45] (VR) corresponds to a summary of temporal information of a video represented as a single image. This is done by concatenating portions of information from each frame of the video. Visual rhythms have been generally applied in the context of video identification and classification, for instance, location of video subtitles, recognition of human actions, detection of video shot boundaries, and detection of face spoofing, among others [46][47][48][49][50]. Unlike these approaches, the visual rhythms are used in this work to create a representation of temporal information that allows the evaluation of the video stabilization by humans.
Typically, two different paths for constructing the visual rhythms are considered when traversing each video: horizontal and vertical. Such representations differ according to the information that is extracted from the video frames. The vertical rhythm extracts the information from the columns of each frame, whereas the horizontal rhythm is constructed from the rows of each frame.
A single column or row (or a small set of them) of each frame is usually used to construct the visual rhythm. Figure 5 illustrates the construction of a horizontal visual rhythm, as commonly described in the literature. However, the construction of a visual rhythm is very susceptible to different strategies for video traversal, for instance, a zigzag path, where an alternating direction might extract patterns from the video frames more appropriately for a certain problem.

Methods
In this work, the visual rhythms are constructed by traversing the video in the vertical and horizontal directions. However, as opposed to using a single row or column (or a small set of rows or columns), we use the average of the columns for the vertical rhythm and the average of the rows for the horizontal rhythm.
For both path directions, the rhythm is obtained from the sequential concatenation of the intensity values, such that the jth column of the visual rhythm image corresponds to the intensity values in the jth frame. In the horizontal rhythm, a rotation is performed on the rows in order to obtain the columns in the final image. The width of a visual rhythm corresponds to the number of video frames, whereas its height corresponds to the height or width of the frames for the vertical or horizontal rhythm, respectively. Figure 6 shows the relations between the pixels of the neighborhood in a visual rhythm image, from which we can see that the visual rhythm maintains the temporal and spatial information of the video. Thus, the temporal behavior of the gray levels in a certain region can be easily visualized. This provides information on how and when movements occur in the video, that is, in addition to being able to distinguish the direction, the intensity, and the way that the movements are spatially arranged, we can verify the frequency of a certain type of movement and determine the moments of its occurrence. A stable video is expected to have a more uniform visual rhythm, with fewer twitches and better-defined curves.

Fig. 6
Patterns for pixel neighborhood in the visual rhythm

We refer to "neighbor i − 1" as the pixel that is on the row immediately above the row of pixel i in the column that represents information extracted from a frame, whereas "neighbor i + 1" corresponds to the pixel immediately below the row of pixel i. Figure 7 shows the construction of a horizontal rhythm for two 3×3 frames. At the transition between frames A and B, the camera moves from right to left, causing the pixels to be to the right of their original position. Thus, when obtaining the horizontal rhythm, the pixels of the column corresponding to frame B are below the equivalent pixels of frame A, thereby forming a declination.
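The construction described above can be sketched as follows, using grayscale frames stored as nested lists so that the example has no dependencies (an array library would be the natural choice in practice). The function names and the toy frames are assumptions made for illustration; the frames imitate the right-to-left camera movement discussed for two 3×3 frames.

```python
def average_column(frame):
    """Mean over x for every row y: a vector of length H, used by the vertical rhythm."""
    return [sum(row) / len(row) for row in frame]


def average_row(frame):
    """Mean over y for every column x: a vector of length W, used by the horizontal rhythm."""
    height = len(frame)
    return [sum(col) / height for col in zip(*frame)]


def visual_rhythm(frames, direction):
    """Build a rhythm image whose jth column summarizes the jth frame."""
    extract = average_row if direction == "horizontal" else average_column
    slices = [extract(frame) for frame in frames]  # one vector per frame
    # Transpose so that each per-frame vector becomes a column of the image.
    return [list(values) for values in zip(*slices)]


# Frame A has a bright column at x = 0; in frame B the content sits one pixel
# to the right, as when the camera moves from right to left.
frame_a = [[9, 0, 0], [9, 0, 0], [9, 0, 0]]
frame_b = [[0, 9, 0], [0, 9, 0], [0, 9, 0]]

horizontal = visual_rhythm([frame_a, frame_b], "horizontal")
vertical = visual_rhythm([frame_a, frame_b], "vertical")
for line in horizontal:
    print(line)
```

In the horizontal rhythm, the bright value moves one row down from the first column to the second, which is the declination described in the text, while the vertical rhythm stays constant because no vertical motion occurs.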
The separation of the vertical and horizontal visual rhythms is important to thoroughly detect and evaluate problems in the video stabilization process. From the vertical rhythm, we can analyze the characteristics of the motion in the y axis. Thus, inclined rhythm lines indicate camera movements from bottom to top, whereas declined lines indicate camera movements from top to bottom. From the horizontal rhythm, in turn, we have the characteristics of the motion in the x axis. Thus, inclined lines indicate camera movements from left to right, whereas declined lines indicate camera movements from right to left.
The use of only one column or row in the extraction of information from each frame may be inadequate since it considers little information of the frame. In addition, it makes horizontal and vertical separation less accurate. This problem can be seen in Fig. 8, where a vertical movement of the camera occurs, which can influence the horizontal rhythm, depending on the difference of the pixels between the rows. Thus, the average of the columns or rows is adopted in our work to compensate for this difference, making the horizontal rhythm less sensitive to vertical movements, and the vertical rhythm less sensitive to horizontal movements.
In Fig. 8, both columns of the horizontal rhythm should have either the same or very close values. However, with a single row in each frame, the direction of the rhythm is uncertain.
As post-processing, we apply contrast-limited adaptive histogram equalization (CLAHE) [51]. This is done to improve the contrast of the visual rhythm, facilitating human perception.
The construction of the visual rhythms is not based on motion estimation, as occurs in other visualizations, shown in Section 2. Therefore, their performance is not dependent on any motion estimation technique, which makes the representation of the video motion more reliable. In the context of video stabilization, such independence of methods for motion estimation is crucial to allow a more unbiased assessment of the results.
The complexity of constructing a visual rhythm depends on three main factors: width W of the video frames, height H of the frames, and number N of frames in the video. To calculate an average row in the construction of a horizontal visual rhythm, we need to compute W averages. The calculation of each mean considers H values. Thus, we have Θ(WH) as the asymptotic complexity of constructing an average row. The same complexity is taken for the computation of an average column in a vertical visual rhythm. Since either a row or a column should be calculated for each frame of the video, we have Θ(WHN) as the final complexity for constructing a visual rhythm.
Among the good practices in the construction of visual rhythms for the evaluation of video stabilization results, we recommend the following:
• Crop the frames of the stabilized video so that there are no pixels with null information (since null information may imply inadequate row or column averages);
• Preserve the frame rate of the video in order to not change its number of frames or generate visual rhythms of different sizes;
• Rescale the video frames to the original size in order for the visual rhythms to have the same size.

Insights into objective metrics
This subsection provides some insights into the calculation of objective metrics from the visual rhythms for the evaluation of the video stabilization process. It is important to mention that we do not intend to replace existing objective metrics in the literature with the proposed objective metric, but to show that a metric can be extracted to distinguish unstable from stable videos.
In the visual rhythm, the behavior of the movement present in the video is represented by the shapes of the curves. A more stable video has rhythms with smoother curves. As shown in Fig. 7, the directions of the visual rhythm can be observed in each column pair of pixels. Objective metrics can be calculated from the texture of visual rhythms. We conjecture that a softer visual rhythm has more regular directions, with less abrupt changes in the near directions. Thus, to obtain a new objective metric from the visual rhythm, the directions and their changes must be computed. Figure 9 illustrates the strategy for calculating the metric.
Initially, we calculate the visual rhythm gradients in order to obtain the directions of each pixel of the rhythm. This was implemented through the Sobel filter [52]. The gradients are decomposed into magnitude and angle information.
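A minimal sketch of this step is given below. It applies the standard 3×3 Sobel kernels by direct correlation and returns per-pixel magnitude and angle. Leaving the border pixels at zero and folding the angles into the range [0°, 180°) at this stage are assumptions of the sketch, not details confirmed by the text.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_gradients(image):
    """Return (magnitude, angle in degrees) for every interior pixel of a grayscale image."""
    h, w = len(image), len(image[0])
    magnitude = [[0.0] * w for _ in range(h)]
    angle = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    p = image[y + dy][x + dx]
                    gx += SOBEL_X[dy + 1][dx + 1] * p
                    gy += SOBEL_Y[dy + 1][dx + 1] * p
            magnitude[y][x] = math.hypot(gx, gy)
            # Unsigned orientation: opposite gradient directions map to the same angle.
            angle[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, angle


# A vertical step edge produces a purely horizontal gradient (angle 0),
# whereas a horizontal step edge produces a vertical gradient (angle 90).
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
horizontal_edge = [[0] * 4, [0] * 4, [10] * 4, [10] * 4]

mag_v, ang_v = sobel_gradients(vertical_edge)
mag_h, ang_h = sobel_gradients(horizontal_edge)
print(mag_v[1][1], ang_v[1][1], mag_h[1][1], ang_h[1][1])
```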
A thresholding with the Otsu algorithm [53] is applied to the magnitude values to determine the edges of the visual rhythm. This is done in order to consider only the edge angles in the following calculations. Then, a co-occurrence representation is calculated based on the gray level co-occurrence matrix (GLCM) [54]. However, it considers the co-occurrence of the angles of the edges in the direction of the angles themselves.
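The thresholding step can be sketched as below. Otsu's method selects the threshold that maximizes the between-class variance of a histogram; since gradient magnitudes are real-valued, this sketch first quantizes them into 256 bins, which is an assumption, as are the function names.

```python
def otsu_threshold(values, bins=256):
    """Threshold that maximizes the between-class variance of the quantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return hi  # degenerate case: all values equal, no separation possible
    scale = (bins - 1) / (hi - lo)
    hist = [0] * bins
    for v in values:
        hist[int((v - lo) * scale)] += 1
    total = len(values)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(bins):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_bin = between, t
    # Values falling in bins above best_bin are at or above this threshold.
    return lo + (best_bin + 1) / scale


def edge_mask(magnitude):
    """Boolean mask marking the pixels whose gradient magnitude reaches the Otsu threshold."""
    flat = [v for row in magnitude for v in row]
    threshold = otsu_threshold(flat)
    return [[v >= threshold for v in row] for row in magnitude]


# Toy magnitudes: weak responses near 0 and strong responses near 40.
magnitude = [[0, 1, 40], [2, 41, 39], [1, 0, 40]]
mask = edge_mask(magnitude)
for row in mask:
    print(row)
```

Only the pixels marked in the mask would contribute their angles to the subsequent co-occurrence computation.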
First, we eliminate the sign from the angles, leaving them in the range of 0° to 180°. For the calculation of the co-occurrence matrix M, we consider n directions D = {d_0, d_1, ..., d_{n−1}}, resulting in a matrix of size n × n. The angles are then quantized into the possible directions. For each pixel i belonging to the edge, we have its angle θ_i ∈ D, from which we find the closest pixel j in the direction of θ_i. This counts as a co-occurrence, that is, an increment at position M[θ_i, θ_j]. For cases where θ_i differs from the important angles, we have two pixels j_1 and j_2. Thus, the two corresponding positions of the matrix are incremented proportionally to the distances of the angles.
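The co-occurrence computation can be sketched in a simplified form. The sketch assumes n = 4 directions (0°, 45°, 90°, 135°), rounds every angle to the nearest direction instead of splitting the increment proportionally between two pixels, takes the immediate neighbor along the quantized angle as the closest pixel, and counts a pair only when that neighbor is also an edge pixel. All of these are simplifying assumptions, and the final normalization by the sum of the elements is included for convenience.

```python
DIRECTIONS = [0.0, 45.0, 90.0, 135.0]
# (dx, dy) step for each direction, with the image y axis pointing down.
OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 1)]


def quantize(angle):
    """Index of the nearest direction, treating 180 degrees as equivalent to 0."""
    a = angle % 180.0

    def circular_distance(direction):
        diff = abs(a - direction)
        return min(diff, 180.0 - diff)

    return min(range(len(DIRECTIONS)), key=lambda k: circular_distance(DIRECTIONS[k]))


def angle_cooccurrence(angles, edges):
    """Normalized n x n matrix counting which direction follows which along the edges."""
    n = len(DIRECTIONS)
    m = [[0.0] * n for _ in range(n)]
    h, w = len(angles), len(angles[0])
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            i = quantize(angles[y][x])
            dx, dy = OFFSETS[i]
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and edges[ny][nx]:
                j = quantize(angles[ny][nx])
                m[i][j] += 1.0
    total = sum(sum(row) for row in m)
    if total > 0:
        m = [[v / total for v in row] for row in m]
    return m


# All edge pixels share the same direction: every pair lands on M[0][0].
uniform = angle_cooccurrence([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                             [[True, True, True], [True, True, True]])
# A direction change from 0 to 90 degrees lands off the diagonal, at M[0][2].
changing = angle_cooccurrence([[0.0, 90.0, 0.0]], [[True, True, True]])
print(uniform[0][0], changing[0][2])
```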
Finally, the matrix is normalized by the sum of its elements. Thus, the value of the matrix at position M[θ_i, θ_j] indicates the probability that θ_j is the next direction of the visual rhythm, given that the previous one was θ_i. From the co-occurrence matrix generated, we can extract features such as homogeneity. The homogeneity feature, when calculated from the co-occurrence matrix of the edge angles, will assume larger values the closer the angles of consecutive directions are.
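Since the exact expression is not reproduced here, the sketch below assumes the usual GLCM homogeneity, in which each entry M[i][j] is weighted by 1 / (1 + d²), with d the distance between the indices i and j. Measuring d circularly, so that the first and last directions count as neighbors, is a further assumption made because angles wrap around at 180°.

```python
def homogeneity(m):
    """Assumed GLCM-style homogeneity of a normalized co-occurrence matrix."""
    n = len(m)
    total = 0.0
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            d = min(d, n - d)  # circular index distance between directions
            total += m[i][j] / (1.0 + d * d)
    return total


# Consecutive directions always equal: all mass on the diagonal.
smooth = [[0.25 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Consecutive directions unrelated: mass spread uniformly.
erratic = [[1.0 / 16.0] * 4 for _ in range(4)]

print(homogeneity(smooth), homogeneity(erratic))
```

Under these assumptions, a matrix concentrated on the diagonal reaches the maximum value of 1, and the value decreases as consecutive directions become less related.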
Several other measures could be developed to extract useful information to qualify the stabilization from their visual rhythms. For this, a thorough investigation is necessary to identify which aspects are important to characterize an unstable motion and how to obtain such aspects through visual rhythm. These tasks may involve both handcrafted features and machine learning techniques.

Results and discussion
This section describes and evaluates the experimental results obtained with two datasets. All the videos considered in our experiments were obtained from two publicly available databases: GaTech VideoStab [55] and the database proposed by Liu et al. [30]. Table 1 reports a summary of the first database with videos in alphabetical order. We will refer to the videos in this database through the identifiers assigned to each of them. Table 2 presents the database proposed by Liu et al. [30], which is divided into six categories, containing a total of 139 videos. We will refer to the videos in this dataset by the name of the category followed by the identifier of each video, attributed by the authors. Due to space limitations, we report only a few visual rhythms that illustrate the results obtained from these databases, which have been confirmed in the other videos.
Figure 10 presents the visual rhythms generated for the video #12 before and after the video stabilization process. In order to obtain the stabilized version of the video, we submitted it to YouTube, which applies one of the state-of-the-art digital video stabilization approaches [55]. The width of all the images presented in this section was kept constant for better organization.
From the horizontal visual rhythm of the unstable video, shown in Fig. 10a, we can notice the twitches and irregularities present in the lines. On the other hand, in the horizontal visual rhythm of the stabilized video, shown in Fig. 10b, there are more continuous, well-defined and softer lines. Analogously, the vertical visual rhythm of the unstable video, shown in Fig. 10c, has twitches and irregularities that are eliminated in the visual rhythm of the stabilized video, shown in Fig. 10d. We can also observe that vertical and horizontal rhythms are not influenced by each other, since certain motion regions occur in one but not in the other.
For the video Regular 8, we present a comparison of the visual rhythms obtained through the average of the rows or columns, and through the central row or column. In this case, we present the horizontal and vertical visual rhythms only for the unstable video.
It can be seen from Fig. 11a that the visual rhythm with only one row can be negatively influenced by the vertical motion of the video, with artifacts that do not correspond to the horizontal motion, such as the discontinuities present in the rhythm, whereas the visual rhythms obtained from the average are more consistent with the motion present in the video. An analogous behavior can be seen in the vertical rhythm shown in Fig. 11c.
Figure 12 presents the visual rhythms of the unstable video #1. For this video, we present the rhythms obtained after the stabilization of YouTube, in addition to a stabilization with inferior performance. Figure 13 shows the horizontal and vertical rhythms for both versions of the stabilized video. By comparing the visual rhythms for the unstable video and the rhythms for the stabilized videos, it is possible to confirm the validity of using visual rhythms to compare versions of stable and unstable videos. In addition, from the visual rhythms of the two different methods, illustrated in Fig. 13, we can observe the occurrence of fewer twitches and smoother lines throughout the entire rhythm, both for the horizontal and vertical rhythm. This shows that the visual rhythm can be used in the comparison of two different video stabilization methods.
The horizontal and vertical rhythms for the original and stabilized video QuickRotation 0 are shown in Fig. 14. In this case, the video was stabilized with the method proposed by Liu et al. [30]. The version of the video QuickRotation 0 stabilized with YouTube is not shown here since the method modified its frame rate, reducing the number of frames and making the visual rhythm of the stabilized video considerably smaller than that of the original video. Besides confirming the smoother lines obtained in the visual rhythm for the stabilized video, it is possible to observe totally vertical lines in the horizontal visual rhythms, which indicates a very fast horizontal movement of the camera. It is also possible to see that the horizontal lines are inclined in their origin, which indicates that the displacement is from left to right.
In Fig. 15, we present the horizontal and vertical visual rhythms for the original and stabilized video Zooming 0. The video was stabilized through the method proposed by Liu et al. [30].
In the visual rhythms for video Zooming 0, it is also possible to see the presence of well-defined, regular lines in the visual rhythm of the stabilized video. In addition, it is possible to observe inclined and declined lines in the horizontal visual rhythms present simultaneously in the beginning of the video, which indicates the existence of zoom.
Figure 16 shows visual rhythms for a video where there is a low-texture background and a moving object. This scenario can be challenging for the proposed representation, since we do not separate the background from the objects in the construction of the rhythm. Nevertheless, the visual rhythm representation makes it possible to distinguish the unstable from the stable videos.
Table 3 reports the results of the homogeneity extracted from the horizontal and vertical visual rhythms for the video sequences listed in Table 1, where the original videos are stabilized by the YouTube method [55]. We can observe that the obtained results are able to distinguish original and stabilized videos. However, further investigation is needed regarding the extraction of other features from the co-occurrence matrix, which may be complementary to the homogeneity information. In addition, the results from the proposed metrics will be compared to objective metrics available in the literature.

Conclusions and future work
This work presented the use of visual rhythms for the subjective evaluation of video stabilization. The vertical visual rhythm is constructed from the average of the columns of each frame, whereas the horizontal visual rhythm is constructed from the average of the rows of each frame. We were able to characterize and separate the horizontal and vertical movements of the video, determining how and when they happen. The stability of a video can be determined from the regularity and smoothness of the curves of each visual rhythm. In addition, the presence of more complex movements, such as zoom, can be verified in the visual rhythm.
As directions for future work, we intend to thoroughly investigate objective evaluation metrics for the stabilization of videos, calculated from the visual rhythms.