Perceptual hashing method for video content authentication with maximized robustness

Perceptual video hashing represents the perceptual content of a video by a compact hash. The binary hash is sensitive to content-distorting manipulations but robust to content-preserving operations. Currently, the boundary between sensitivity and robustness is often ambiguous and is decided by an empirically defined threshold. This may result in large false positive rates when a received video is to be judged similar or dissimilar in some circumstances, e.g., video content authentication. In this paper, we propose a novel perceptual hashing method for video content authentication based on maximized robustness. The idea of maximized robustness means that robustness is maximized on condition that the security requirement of the hash is met first. We formulate video hashing as a constrained optimization problem, in which the coefficients of features offset and robustness are to be learned. We then adopt a stochastic optimization method to solve the problem. Experimental results show that the proposed hashing is well suited to video content authentication in terms of security and robustness.

A number of perceptual video hashing methods have been proposed to date. Because a video can be seen as a sequence of images, it is instructive to examine image hashing methods when studying video hashing. Image hashing can be roughly categorized into three types: (1) image descriptor based methods, which extract perceptually important features from images and quantize them into final hashes, e.g., histograms [2] and Canny descriptors [3]; (2) image matrix transformation based methods, which decompose the image matrix into components and select the most important part to form hashes, e.g., the Discrete Cosine Transform (DCT) [4] and the Radon transform [5]; (3) machine learning based methods, which try to find correlations between features within a high-dimensional space, e.g., locally linear embedding [6] and core alignment [7].
Compared to image hashing methods, video hashing methods can incorporate the temporal relationship between sequential frames. They can be divided into two categories, i.e., spatial domain based and temporal-spatial domain based. The former usually chooses or generates representative/key frames for the video; the final hash is a concatenation of the hashes of those frames. Yang et al. [8] developed a video hashing method based on the speeded-up robust features (SURF) descriptor. However, it targeted video copy detection and was quite sensitive to frame content operations. Xiang et al. [9] used the mean value of the luminance histogram to infer the hash value. It catered for geometric distortion tolerance and showed strong robustness against content preserving operations.
Perceptual video hashing methods based on the temporal-spatial domain generate hashes from both inter- and intra-frame information. Pioneering work of this kind was the 3D-DCT method [10]. It exhibited good robustness against noise addition, luminance enhancement, etc., but its discrimination of content-changing manipulations was not satisfactory. On representative frames, Esmaeili et al. [11] proposed the TIRI (Temporally Informative Representative Images) construction, which effectively fuses a series of consecutive frames: the value of a pixel on the TIRI frame is a weighted combination of the values of the pixels at the corresponding positions of those frames. Compared to the 3D-DCT method, the TIRI method is less time consuming and captures more semantic information. Many works have utilized TIRI for video hashing, for instance, saliency video hashing [12] and visual attention based hashing [13].
Some perceptual video hashing methods adopt matrix decomposition or machine learning. For example, Song et al. [14] chose a quantization based hashing, where the quantization error is minimized using iterative methods. It targeted video retrieval and paid more attention to robustness. Another line of video hashing is based on deep learning, e.g., multimodal stochastic recurrent neural networks for video hashing [15], a binary encoder-decoder architecture for self-supervised video hashing [16], unsupervised hashing based on semantic structure [17] and cross-modal deep hashing for video [18,19]. Yang et al. [20] addressed the security issue of Hamming space search, which improved robustness against the vulnerability of deep learning. However, these methods are mainly used for video retrieval, video captioning and visual recognition [21][22][23]. They are better suited to video semantics processing, whereas our work is primarily focused on video perceptual representation.
Perceptual video hashing is widely accepted for its two main characteristics, i.e., robustness and sensitivity. Let V denote the video to be hashed, and let H(·) denote the hash function. Symbol V_sim represents videos that are the results of perceptual content preserving operations on V. Likewise, symbol V_dif represents those that are the results of perceptual content distorting operations. The robustness can be expressed as

Pr( D(H(V), H(V_sim)) ≤ τ ) ≥ 1 − θ_1,   (1)

where Pr represents probability, D(·,·) is the hash distance, and parameters τ and θ_1 are small numbers near zero. Robustness requires that the hash distance between V and V_sim be as small as possible. Similarly, the sensitivity characteristic can be expressed as

Pr( D(H(V), H(V_dif)) ≤ τ ) ≤ θ_2,   (2)

where parameter θ_2 should be as close to zero as possible. Sensitivity requires a large hash distance between V and V_dif. Thus a video with changed content can easily be detected by comparing its hash to that of the original video.
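As a concrete illustration of the decision rule in (1) and (2), the following sketch compares normalized Hamming distances between binary hashes against a threshold τ. The hash values and the τ = 0.2 setting are hypothetical, for illustration only.

```python
import numpy as np

def hamming_distance(h1: np.ndarray, h2: np.ndarray) -> float:
    """Normalized Hamming distance between two binary hash vectors."""
    return float(np.mean(h1 != h2))

def is_authentic(h_orig, h_recv, tau=0.2):
    """Judge the received video authentic if its hash distance stays within tau."""
    return hamming_distance(h_orig, h_recv) <= tau

h = np.array([0, 1, 1, 0, 1, 0, 0, 1])
h_sim = h.copy(); h_sim[0] ^= 1   # one flipped bit: distance 1/8
h_dif = 1 - h                     # all bits flipped: distance 1.0

print(is_authentic(h, h_sim))  # small distance -> judged authentic
print(is_authentic(h, h_dif))  # large distance -> judged tampered
```

In practice the threshold τ would be chosen per application; the point of the proposed method is precisely that a single scalar τ is not enough.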
Different applications have different requirements on the strength of robustness and sensitivity. For example, when perceptual hashing is designed for video content retrieval, it is better to allow much more robustness than sensitivity, because retrieval should return as many similar videos as possible. On the contrary, if video hashing is used for content authentication, sensitivity should be preferred. Since video authentication decides whether video content has been deliberately manipulated, the hash should be sensitive to such operations. As can be observed from (1) and (2), the distance between H(V) and H(V_sim), or between H(V) and H(V_dif), is compared to a threshold τ. The threshold thus serves as the boundary between robustness and sensitivity. However, using only a scalar value τ to determine the result is not enough, because it neglects the semantics and a priori information of a specific circumstance.
In this paper we propose a novel perceptual video hashing with maximized robustness for content authentication. The idea is illustrated in Fig. 1, where the red hollow triangle represents the feature of the original video V, and the small circles and rectangles represent the features of V_sim and V_dif, respectively. When we design a video hashing for authentication, we keep in mind that we should detect as many V_dif as possible, but we also need some robustness for V_sim. In regard to video security, sensitivity has higher priority. Therefore, we only allow robustness flexibility after we have ensured that V_dif can be correctly spotted. If we treat the hollow triangle as the center for comparison, we can find a maximized robustness radius ε_1, which defines a boundary (i.e., the red dotted circle) between sensitivity and robustness. However, if we move the center to the solid green triangle by a feature adjustment λ, we can find a new robustness radius ε_2, which is much larger than ε_1. Thus we obtain improved robustness without loss of sensitivity. The solid green triangle is the new center for comparison, and the new boundary (i.e., the green dotted circle) extends the allowable feature space. The robustness is therefore maximized without cost to sensitivity. The remaining task is to tactically find the parameters λ and ε_2.

Fig. 1 Conceptual illustration of video hashing for authentication with maximized robustness. The red hollow triangle represents the feature of the original video; small circles and rectangles represent features of similar and dissimilar videos. The feature offset λ defines a new boundary, shown as the green dotted circle, with the green triangle as its new feature center and the new radius ε_2 as its robustness. Compared with the former robustness ε_1, the robustness is maximized but without loss of sensitivity
In our previous paper, we proposed core alignment for image hashing [7]. Although there seem to be connections between the two, several main differences exist. First, the motivations are quite different. In our previous work, we tried to find the largest discrimination between similar and dissimilar contents by minimizing hash distances. In this paper, we endeavor to obtain the largest robustness when security is preferred first. This approach is thus more conservative regarding the robustness requirement for content authentication. Second, the mathematical formulations and problem solving methods are quite different. The method here optimizes two coefficients simultaneously, while the previous method found the coefficients sequentially. Third, the simulations and results are distinct in terms of open data sets and performance.
The novelty of this paper is that we incorporate the new idea of maximized robustness into the mathematical formulation and solution of perceptual video hashing for content authentication. Instead of treating the robustness and sensitivity properties as equally important, we make sure that sensitivity is met first before we find the robustness. The proposed video perceptual hashing method with maximized robustness is shown in Fig. 2, where hashing is divided into two parts, i.e., a learning part and a hashing part. In the learning part, the video sets are composed of original videos together with their variants under perceptual content preserving and changing operations. After preprocessing, we construct the feature space, including features of the original and modified videos. We learn from those features to achieve maximized robustness, which results in the features offset. In the video hashing part, when a video is to be perceptually hashed, the same preprocessing and feature extraction operations are conducted. The hash function considers the learned adjustment and takes the video's feature as input to produce the final hash.

Fig. 2 Block diagram of video perceptual hashing with maximized robustness. In the learning part, videos, together with their variants under perceptual content preserving and distorting operations, are preprocessed to form the feature space. The maximum feature offset is learned and regarded as the maximized robustness. In the hashing part, a particular video is processed to obtain its feature, and the feature is adjusted by the feature offset. The hash function is applied to the feature to generate the final hash
The rest of this paper is organized as follows. In Sect. 2 we formulate the mathematical problem of video hashing with maximized robustness as a constrained optimization, where the variables of features offset and robustness are explained. We solve the problem in Sect. 3 by the fish school search algorithm, where the two variables are learned simultaneously. In Sect. 4 we present simulation results and discussion. Finally, we conclude the paper in Sect. 5.

Video perceptual hashing problem formulation
Although various video formats are available, we consider only raw video for perceptual hashing research. A raw video is a temporal sequence of frames that has not undergone compression. If the video to be hashed is compressed, we use standard multimedia processing tools to transform it into raw format. Considering that video may be presented in different color spaces and that the luminance component carries the most significant perceptual information, we use only the luminance component for perceptual hashing.
We normalize the input video in terms of frame size and frame rate. Each frame is rescaled to width W and height H, and the frame rate is resampled to F frames per second. Then we use the luminance difference between adjacent frames to group frames into sets, where each set denotes a scene [24]. Note that these preprocessing operations, i.e., frame size scaling, rate resampling and scene grouping, improve robustness to a certain extent.
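The luminance-difference scene grouping can be sketched as follows. The threshold value and the synthetic frames are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def group_scenes(frames, thresh=12.0):
    """Split a luminance frame sequence into scene groups wherever the
    mean absolute difference between adjacent frames exceeds thresh."""
    groups, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if np.mean(np.abs(cur.astype(float) - prev.astype(float))) > thresh:
            groups.append(current)   # scene boundary: close current group
            current = []
        current.append(cur)
    groups.append(current)
    return groups

# two synthetic "scenes": dark frames followed by bright frames
dark = [np.full((240, 320), 20, dtype=np.uint8) for _ in range(5)]
bright = [np.full((240, 320), 200, dtype=np.uint8) for _ in range(5)]
scenes = group_scenes(dark + bright)
print(len(scenes))  # 2
```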
Let ϒ denote a preprocessed video with ϒ = {v_1, ..., v_K}, where v_k represents the kth group. The kth group consists of a frame set, denoted by {f_k1, ..., f_kl}. In order to efficiently obtain the hash of each group, we adopt the TIRI method to represent the frames within a group. The TIRI representative frame is a linear combination of frames, defined as

T_k(x, y) = Σ_{p=1}^{l} w_p f_{kp}(x, y),   (3)

where f_{kp}(x, y) is the luminance value at pixel (x, y) on the pth frame within the kth group. Coefficient w_p is the weight for the pth frame and is determined by γ, i.e., w_p = γ^p. According to the empirical study in [11], the TIRI frame shows the best representative performance when γ is set to 0.6. In our method, the video hash is made up of the representative frames' hashes: the hash of a video is the concatenation of the hashes of all its TIRI frames. Note that videos with different numbers of representative frames have different hash code lengths, but the hash length of each representative frame is the same.
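A minimal sketch of the TIRI construction above, assuming the exponential weights w_p = γ^p are normalized to sum to one (the normalization is our assumption, made so the representative frame stays in the same intensity range):

```python
import numpy as np

def tiri(frames, gamma=0.6):
    """Temporally informative representative image: exponentially
    weighted combination of the luminance frames of one group."""
    frames = np.asarray(frames, dtype=float)        # shape (l, H, W)
    w = gamma ** np.arange(1, len(frames) + 1)      # w_p = gamma^p
    w = w / w.sum()                                 # normalize (our assumption)
    return np.tensordot(w, frames, axes=1)          # sum_p w_p * f_p

group = [np.random.randint(0, 256, (240, 320)) for _ in range(8)]
rep = tiri(group)
print(rep.shape)  # (240, 320)
```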
To find the best features offset for all videos, we consider all TIRI frames of the whole training set. We denote the representative frames by a set T = {T_1, ..., T_i, ..., T_n}, meaning that in total n TIRI frames are generated for the n groups of original videos in the training part. For each frame, we extract its feature. Then, for each video, we conduct content preserving and content distorting operations, obtaining a number of versions of each video. To keep the notation simple, we let the number of operations on each group be the same. Assume the numbers of content preserving and distorting operations are P and Q, respectively; then for each original representative frame's feature there are P features under preserving operations and Q features under distorting operations. We denote the features under content preserving operations of the ith frame by Φ_A^i and those under content distorting operations by Φ_B^i. We denote the variables of features offset and robustness by λ and ε, respectively. Then the problem of finding the maximized robustness can be written as

max_{λ, ε} ε
s.t. D(φ, T_i + λ) > ε, ∀φ ∈ Φ_B^i, i = 1, ..., n,   (4)

where the objective is to maximize the robustness and D(·,·) is the feature distance. Variables λ and ε define a feature space that accommodates the consequences of allowable perceptual content preserving operations. The first constraint requires that every feature belonging to a content distorting video maintain a distance larger than the robustness, where the distance is measured between the feature and the improved feature T_i + λ; thus by comparing the distance we are able to detect content distorting videos. The second constraint requires the two variables to be valid for every video group, so that the resulting robustness is the consensus of all circumstances.
An interesting point is that we do not include in the constraints the distance between the feature of a content preserving video and the improved feature. Two cases of this distance exist:

D(φ, T_i + λ) > ε, φ ∈ Φ_A^i,   (5)
D(φ, T_i + λ) ≤ ε, φ ∈ Φ_A^i.   (6)

The first case implies that for a certain feature of the space Φ_A, its distance to the improved feature is larger than the robustness. This means that although the feature is indeed an allowable feature produced by content preserving operations, it would still be judged dissimilar because the security requirement is enforced by the newly defined boundary. We call the circumstance of (5) unachievable robustness. On the contrary, the circumstance of (6) says that the feature of space Φ_A remains in the allowable feature space; this is called achievable robustness. Therefore, we claim that our perceptual video hashing is quite conservative and emphasizes security, which is the requirement of our purpose, video content authentication.
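The distinction between the constraint of (4) and the two cases (5) and (6) can be checked numerically. The sketch below uses random synthetic features and a hypothetical candidate pair (λ, ε); it verifies the distorting-feature constraint and counts how many preserving features fall into the achievable case.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=16)                      # original TIRI feature T_i (synthetic)
phi_a = t + 0.1 * rng.normal(size=(5, 16))   # content preserving features (small perturbations)
phi_b = t + 2.0 * rng.normal(size=(5, 16))   # content distorting features (large perturbations)

lam = np.zeros(16)     # hypothetical features offset lambda
eps = 1.0              # hypothetical robustness epsilon

center = t + lam                               # the "improved feature"
d_b = np.linalg.norm(phi_b - center, axis=1)   # distances of distorting features
d_a = np.linalg.norm(phi_a - center, axis=1)   # distances of preserving features

feasible = np.all(d_b > eps)         # first constraint of (4)
achievable = int(np.sum(d_a <= eps)) # preserving features satisfying (6)
unachievable = int(np.sum(d_a > eps))# preserving features falling under (5)
print(feasible, achievable, unachievable)
```

A learning procedure would search over (λ, ε) so that `feasible` holds while ε is as large as possible, which is exactly the optimization of (4).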

Optimization solving method
Recall that in the constrained optimization problem (4), two variables are to be found, i.e., the features offset and the robustness. The robustness variable exists both in the objective function and in the first constraint, while the features offset appears only in the constraint. The maximized robustness is obtained once we find the appropriate features offset. Traditional deterministic methods are not applicable to our constrained problem, since the derivatives of the objective and the inequality constraint are hard to obtain. We therefore look at these two variables from a stochastic perspective. The features offset arises from the human visual system's tolerance of different types of allowable operations with different strengths applied to the original video content. To some extent the adjustment can be thought of as the result of various noises on the original feature, where a noise represents a kind of operation on the content. Therefore, we solve the problem in a stochastic way.
Considering the specific characteristics of the problem, we adopt the fish school search algorithm to tackle the optimization. The fish school search algorithm imitates the food finding behavior of fish schools [25]. This population based algorithm has been widely used to find the best solution in various optimization applications. Each fish in the school is treated as a potential solution to the problem, and fish swim based on both local and collective information, following the positive gradient of the fitness function. Because we need to find two variables, we define two kinds of fish school, which are correlated by the first constraint. We define individual fish λ_j and ε_j for the features offset and the robustness, respectively. The sizes of both schools are equal to M. During each iteration, fish first move individually as follows:

λ_j(t + 1) = λ_j(t) + rand(−1, 1) · step_ind1,   (7)
ε_j(t + 1) = ε_j(t) + rand(−1, 1) · step_ind2,   (8)

where rand(−1, 1) generates a random number within the range (−1, 1), and step_ind1 and step_ind2 are two coefficients used to control the displacement of the movement. Symbols t + 1 and t represent the counts after and before the individual movement, respectively. The movement is valid on condition that the first constraint holds; otherwise, the fish remains at its position of iteration t.
Then the fish are updated through the collective-instinctive movement. Each fish is moved by an average displacement, calculated as

I_1(t) = Σ_j Δε_j(t) Δε•_j(t) / Σ_j Δε•_j(t),   (9)

where symbol Δε•_j stands for the fitness enhancement achieved, Δε•_j = ε_j(t) − ε_j(t − 1), and symbol Δε_j stands for the movement displacement of the jth fish, Δε_j = ε_j(t) − ε_j(t − 1). Each fish is updated by the average movement as follows:

ε_j(t + 1) = ε_j(t) + I_1(t).   (10)

Accordingly, we define the average movement for the features offset fish school as

I_2(t) = Σ_j Δλ_j(t) Δε•_j(t) / Σ_j Δε•_j(t),   (11)

where Δλ_j = λ_j(t) − λ_j(t − 1). Each fish λ_j is updated as

λ_j(t + 1) = λ_j(t) + I_2(t).   (12)

We also need the collective-volitive movement for both schools. The barycenter of the robustness fish school is calculated as

B_1(t) = Σ_j ε_j(t) w_j(t) / Σ_j w_j(t),   (13)

where B_1 stands for the barycenter of the robustness fish school and w_j(t) stands for the feeding weight of the jth fish. The weight is calculated as

w_j(t + 1) = w_j(t) + Δε•_j(t) / max_j(|Δε•_j(t)|),   (14)

where max_j(|Δε•_j|) represents the maximum variation of the fitness function. Note that the weight is bounded and varies from 1 to W_scale. Initial values of all weights are set to W_scale. If the total weight of the robustness fish school has improved since the last iteration, each fish is moved towards the barycenter according to (15); otherwise, each fish is moved away from the barycenter according to (16):

ε_j(t + 1) = ε_j(t) − step_vol1 · rand(0, 1) · (ε_j(t) − B_1(t)),   (15)
ε_j(t + 1) = ε_j(t) + step_vol1 · rand(0, 1) · (ε_j(t) − B_1(t)),   (16)
where coefficient step_vol1 controls the displacement of the movement like step_ind1. Similarly, we calculate the barycenter of the features offset fish school as

B_2(t) = Σ_j λ_j(t) w_j(t) / Σ_j w_j(t).   (17)

The features offset fish are moved towards or away from the barycenter on the same condition as the robustness fish. The movement displacements are

λ_j(t + 1) = λ_j(t) − step_vol2 · rand(0, 1) · (λ_j(t) − B_2(t)),   (18)
λ_j(t + 1) = λ_j(t) + step_vol2 · rand(0, 1) · (λ_j(t) − B_2(t)).   (19)

After a certain number of iterations, the algorithm stops and the features offset with the maximized robustness is returned as the optimal solution. Note that in our hashing optimization, the features of Φ_A are not included in the constraints; however, they provide a valuable clue for the initialization of the fish. In practice, we choose the minimal difference between the original feature and an allowable feature as the initial value of the features offset fish λ_j, and the minimum distance is used to set the initial value of the robustness fish ε_j:

λ_j(0) = min_{φ ∈ Φ_A^i} (φ − T_i),   (21)
ε_j(0) = min_{φ ∈ Φ_A^i} D(φ, T_i).   (22)

The overall searching algorithm is described in Fig. 3, where the inequality constraint of (4) is enforced upon every fish update. The output is the optimal result of features offset and robustness. Of the two schools, the robustness fish with the maximum fitness value is what we are looking for: this fish is regarded as the maximized robustness, and its corresponding features offset fish is chosen as the optimal features offset. The optimal features offset is then used for feature adjustment in the video perceptual hashing part.
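For readers unfamiliar with fish school search, the following is a simplified single-school sketch on a toy maximization problem. The step sizes, school size, improving-move rule and the contraction/expansion test are illustrative simplifications of the canonical algorithm [25], not the paper's exact two-school, constraint-coupled variant.

```python
import numpy as np

def fish_school_search(fitness, dim=2, n_fish=20, n_iter=200,
                       step_ind=0.1, step_vol=0.05, seed=1):
    """Simplified fish school search maximizing `fitness` (a sketch)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2, 2, size=(n_fish, dim))   # fish positions
    f = np.array([fitness(p) for p in x])
    w = np.ones(n_fish)                          # feeding weights
    for _ in range(n_iter):
        # individual movement: keep only moves that improve fitness
        cand = x + step_ind * rng.uniform(-1, 1, size=x.shape)
        fc = np.array([fitness(p) for p in cand])
        improved = fc > f
        dx = np.where(improved[:, None], cand - x, 0.0)
        df = np.where(improved, fc - f, 0.0)
        x = x + dx
        f = np.where(improved, fc, f)
        # feeding: weights grow with relative fitness gain
        if df.max() > 0:
            w = np.clip(w + df / df.max(), 1.0, 500.0)
        # collective-instinctive movement toward successful fish
        if df.sum() > 0:
            x = x + (dx * df[:, None]).sum(0) / df.sum()
        # collective-volitive movement around the weighted barycenter
        bary = (x * w[:, None]).sum(0) / w.sum()
        direction = -1.0 if df.sum() > 0 else 1.0   # contract or expand
        x = x + direction * step_vol * rng.uniform(0, 1, (n_fish, 1)) * (x - bary)
        f = np.array([fitness(p) for p in x])
    return x[np.argmax(f)], f.max()

# toy problem: maximum at (1, 1)
best_x, best_f = fish_school_search(lambda p: -np.sum((p - 1.0) ** 2))
print(np.round(best_x, 1))  # near [1, 1]
```

In the paper's variant, two coupled schools (λ_j and ε_j) are run, and every movement is additionally vetted against the inequality constraint of (4).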

Results and discussion
We validate our hashing method on videos downloaded from an open video database, the Open Video Project. The database is maintained by the Interaction Design Laboratory of the University of North Carolina at Chapel Hill. It contains various types of video in terms of content, format and duration. We downloaded 50 videos for our simulation. The contents include education, history, speech and documentary; the formats include MPEG-1 and MPEG-2; and the durations range over under one minute, one to two minutes, two to five minutes and five to ten minutes.
Videos are first preprocessed before they are input to the hashing method. We normalize the video frame size to 320 × 240 and the frame rate to 10 frames per second. By comparing luminance differences, we divide the video frames into groups, where each group corresponds to a certain perceptual understanding for humans. Then we calculate the TIRI frame of every group. We randomly choose 25 videos for training and use the rest for testing. We implement the simulation in Matlab R2012a on a computer with 8 GB memory and a 3.9 GHz CPU.
In order to mimic the operations a video may undergo over the network, we conduct content preserving operations and content distorting attacks on the videos, and TIRI frames are generated for the changed videos. The content preserving operations include rotation, scaling, translation, Gaussian noise, salt & pepper noise, intensity changing, and average and median filtering, each applied with several coefficient settings (see Table 1).
The searching procedure of Fig. 3 is as follows.
Input: Features space.
Step 1. Initialize the features offset fish λ_j and robustness fish ε_j according to (21) and (22); set the iteration variable t = 0.
Step 2. Refresh the individual fish movement. Update the features offset fish and robustness fish according to (7) and (8). If for a certain fish the inequality constraint does not hold, the update is invalid and the fish keeps its value of the last iteration.
Step 3. Refresh the collective-instinctive fish movement. Calculate the average fish movements I_1 and I_2 according to (9) and (11), and update the fish according to (10) and (12). If for a certain fish the inequality constraint does not hold, this round of update is invalid and the fish keeps its value of the last iteration.
Step 4. Refresh the collective-volitive fish movement. Calculate the barycenters according to (13) and (17). Update the robustness fish according to (15) and (16), and the features offset fish according to (18) and (19). If for a certain fish the inequality constraint does not hold, the update is invalid and the fish keeps its value of the last iteration.
Step 5. Update the feeding weights according to (14) and set t = t + 1.
Step 6. If t < T, go to Step 2; otherwise stop the fish school search.
Output: Optimal features offset λ_j with the maximum value of ε_j.
Attacks that deliberately change the perceptual contents of video are as follows.
• Block overlaying on original frames with three types of block, i.e., white, black and mosaic; each type uses two blocks of 50 × 50, two blocks of 100 × 100 and one block of 200 × 200, respectively.
• Pasting a totally different image onto original frames, with randomly chosen position and with the pasted content covering 10%, 20%, 30% and 40% of the frame.
• Block shuffling, dividing a frame into blocks of equal size and randomly rearranging them to form a new frame, with the number of blocks being 2, 4 and 16.
In regard to frame feature representation, we choose the Radon transform to form the feature vector of each TIRI frame. The Radon transform has a distinct advantage in describing image features, showing strong robustness against rotation, scaling and translation operations [5, 26]. We apply the discrete Fourier transform to the Radon coefficients and take the magnitudes of the transform as the TIRI frame feature. In our experiment, the angle is chosen from 0 to 179 degrees with a one degree step and the order is chosen from one to six. The length of our TIRI feature vector is 546, with each element being a real number. We then adopt a median filtering method to generate the binary hash value; therefore, our hash length for a TIRI frame is 546 bits. We choose the metrics of true positive rate (P_T) and false positive rate (P_F) to compare the performance of the methods. The definitions of P_T and P_F are given in (23) and (24) below.
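The binarization step can be sketched as follows. The random vector stands in for the 546 Radon-moment coefficients of one TIRI frame (the actual Radon extraction is omitted), and thresholding the DFT magnitudes at their median is our reading of the "median filtering method" above.

```python
import numpy as np

def binarize_feature(feature):
    """Median-threshold binarization of a real-valued feature vector,
    giving one hash bit per element (546 bits for a 546-d feature)."""
    mags = np.abs(np.fft.fft(feature))      # DFT magnitudes of the feature
    return (mags > np.median(mags)).astype(np.uint8)

# stand-in for the 546 Radon-moment coefficients of one TIRI frame
rng = np.random.default_rng(7)
radon_moments = rng.normal(size=546)
bits = binarize_feature(radon_moments)
print(bits.size, int(bits.sum()))  # 546 bits, roughly half set
```

Thresholding at the median guarantees a balanced hash, which maximizes the entropy of each frame's 546-bit code.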
Claimed authentic or unauthentic TIRI frames are frames judged secure or insecure by the hashing methods. Correctly claimed authentic and incorrectly claimed unauthentic TIRI frames are those frames judged secure or insecure that are actually secure, as determined by a priori knowledge. P_T numerically reflects the robustness to some extent, while P_F correspondingly reflects the security of the hashing methods. In the simulation we adopt the ROC (receiver operating characteristic) curve to demonstrate the robustness and security performances simultaneously.
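Under the definitions above, P_T and P_F can be computed from per-frame decisions and ground-truth labels; the toy decisions below are hypothetical.

```python
def rates(decisions, truths):
    """Compute true positive rate P_T and false positive rate P_F from
    per-frame decisions ('authentic'/'unauthentic') and ground truth."""
    claimed_auth = [t for d, t in zip(decisions, truths) if d == 'authentic']
    claimed_unauth = [t for d, t in zip(decisions, truths) if d == 'unauthentic']
    # P_T: correctly claimed authentic / claimed authentic
    p_t = sum(t == 'authentic' for t in claimed_auth) / len(claimed_auth)
    # P_F: incorrectly claimed unauthentic / claimed unauthentic
    p_f = sum(t == 'authentic' for t in claimed_unauth) / len(claimed_unauth)
    return p_t, p_f

decisions = ['authentic'] * 8 + ['unauthentic'] * 4
truths = (['authentic'] * 7 + ['unauthentic']      # claimed authentic frames
          + ['authentic'] + ['unauthentic'] * 3)   # claimed unauthentic frames
print(rates(decisions, truths))  # (0.875, 0.25)
```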
For performance comparison we choose three related hashing methods, i.e., Radon-based [5], TIRI + DCT [11] and LSH-core [7]. Since our hashing method adopts the Radon feature as the TIRI feature vector, we chose a hashing method also based on the Radon feature for evaluation. The chosen Radon-based method utilized the third order moment of the Radon transform to obtain a frame's statistical feature and adopted the first 15 DFT coefficients to construct the final hash; its hash length was 150 bits. We include the TIRI + DCT method here because it also used TIRI representative frames to deal with the large redundancy among video frames. It differs from ours in that DCT was conducted on TIRI frame blocks and two coefficients of each block were concatenated to form a final hash of 640 bits. The LSH-core method is distinguished in that its objective function neglected the strict first priority of security adopted in the proposed maximized robustness. It used learning to find the best feature core and applied LSH to reduce the dimensionality, resulting in a hash length of 350 bits.

Experimental results
We show the performance comparison in Fig. 5, where the ROC curves of all four methods are drawn. The x-axis and y-axis denote P_F and P_T, respectively. Note that although 32 similar and 16 dissimilar versions of the TIRI frames exist, we show the values averaged over all operations in Fig. 5 to assess the overall performance. In an ROC comparison, the higher the curve, the better the performance. For example, when we set P_T to 0.95, the values of P_F for our method, LSH-core, Radon-based and TIRI + DCT are 0.03, 0.06, 0.07 and 0.10, respectively. When we constrain P_F to 0.05, our method achieves a P_T of 0.97; the values for the others are 0.93, 0.91 and 0.90, respectively. This demonstrates that our method has maximized robustness when security is constrained.
In order to analyze the effects of the various perceptual content preserving operations on robustness and sensitivity, we present summary results in Table 1, showing the true positive rate and false positive rate of the four methods. For each type of operation, we show the results averaged over the different coefficient settings; the values of P_F and P_T are obtained with the optimal threshold for each type. Recall the metric definitions:

P_T = (number of correctly claimed authentic TIRI frames) / (number of claimed authentic TIRI frames),   (23)
P_F = (number of incorrectly claimed unauthentic TIRI frames) / (number of claimed unauthentic TIRI frames).   (24)

It can be noted that our method has a much higher true positive rate than the other three methods for every operation, and moreover a lower false positive rate. This implies that our method achieves more robustness while still preserving strong sensitivity. Because video content authentication is the primary goal of the hashing method, it is meaningful to compare the false positive rate under various attacks. We show the results in Fig. 6, where the values of P_F are obtained with the optimal thresholds for all methods. Figure 6d is for pasting attacks, where the x-axis denotes the pasting type, i.e., No. 1-4 representing pasted content covering 10%, 20%, 30% and 40% of the TIRI image, respectively. As for the block shuffling attacks, all four methods detect the manipulations with a P_F of zero.

Table 1 Robustness and sensitivity performances under the perceptual content preserving operations. Shown are the false positive rates and true positive rates of all four methods; each row gives the averaged results of one operation for each method. The higher the true positive rate, the better the robustness; likewise, the lower the false positive rate, the better the sensitivity

It is observed that our method has the lowest value of P_F among all methods, which shows that our method is superior when video content security is required. For instance, our method has a false positive rate of 0.06 for overlaying two 100 × 100 white blocks, while LSH-core, Radon-based and TIRI + DCT yield 0.11, 0.14 and 0.16, respectively. From Fig. 6a-c it is seen that the curves for the three types of block overlaying are not fundamentally different, which means that the type of block does not have a distinct effect on the false positive rates.

Discussion
From Fig. 5 it can be seen that our method is the most secure under the same robustness requirements; in other words, robustness is maximized in our hashing method compared with the other methods. Note that TIRI + DCT has the lowest ROC curve in Fig. 5, which means it has the highest false positive rates for a given robustness requirement. The reason may be that this method keeps only the low frequency DCT coefficients, which capture quite coarse perceptual information. When this method is used for content authentication, much useful information cannot be embedded in the hash value, leading to higher false positive rates. It can also be observed that our method has a 3% security improvement over the LSH-core method when P_T is 0.95. This is because our hashing is more conservative on the security criteria: as stated in the constrained optimization problem, the security requirement is the first priority. However, it can be seen from Fig. 5 that when we allow no false positives, we obtain nearly the same P_T for our method and the LSH-core method. This shows that although we strengthen the security requirement of the perceptual hashing, no loss of robustness occurs; on the contrary, since we seek the best feature offset to adjust the final hash, we obtain slightly improved robustness.
With regard to the comparison of perceptual content preserving operations in Table 1, it is meaningful to analyze the different types of operations in terms of robustness and sensitivity. Overall, the rotation, scaling and translation operations exhibit much better true positive rates than the remaining operations for all four methods. These superior performances result from the Radon feature, which has strong robustness against these three operations. However, the false positive rates differ much more obviously across the four methods. For instance, for translation, our method achieves improvements of about 7%, 16% and 24% over LSH-core, Radon-based and TIRI + DCT, respectively, meaning that our method does not lose much sensitivity compared with the other methods. The Gaussian noise and salt & pepper operations have similar effects on the performance, because they cause similar consequences on the pixel values of the TIRI frames. Likewise, all four methods show good results for intensity changing. For average and median filtering, our method and LSH-core show much better results than the other two. This is because the feature considered as the center indeed needs to be adjusted in order to maintain the best differentiation of the feature space.
From Fig. 6, it can be seen that our method is more effective when it comes to content altering manipulations: it detects more unauthentic TIRI frames than the other methods. Some interesting phenomena can be observed. First, as the sizes of the blocks or pasted contents grow, the false positive rates decrease for all four methods. When the overlaid blocks become larger, more of the frame's perceptual content is distorted and it is much easier for the hashing methods to detect the manipulation; in other words, when attacks on frames become fiercer, their effects exceed the hash comparison threshold and unauthentic results are triggered. Second, our method has relatively lower false positive rates. The reason is that we take the robustness and security prior knowledge into consideration when we formulate the perceptual hashing problem, and we put security in the constraint condition, dividing robustness into achievable and unachievable parts as in (5) and (6). Thus our method performs much better when authenticating video content.
Note that among the four methods, only ours and LSH-core need training to obtain optimal coefficients. In our simulation, we recorded the training and testing times of all four methods over all videos. The average testing time per TIRI frame is 1.41 s, 1.35 s, 1.14 s and 1.09 s for our method, LSH-core, Radon-based and TIRI + DCT, respectively. The average training time is 2.84 s for our method and 2.45 s for LSH-core.
Although our method performs well in terms of security and robustness, it is slightly more time consuming, which matters especially when real-time hashing is required.

Conclusion
In this paper we have proposed a video perceptual hashing method for authenticating video content based on the idea of maximized robustness. Maximized robustness means that the video hashing obtains its largest robustness only after the security requirement is met. We have addressed the ambiguous boundary between the security and robustness properties in video hashing. First, we formulated the mathematical problem as a constrained optimization, where two coefficients, i.e., the features offset and the robustness, are to be decided. The optimization utilizes a priori knowledge of how similar or dissimilar videos under content preserving or manipulating operations should be. The constraints strictly define the security characteristic of the video hashing, which determines how much robustness the hashing can achieve. The optimization is solved by a population based method, and the two coefficients are learned simultaneously. We evaluated the proposed method on a video set and conducted comparisons in terms of robustness and security. Experimental results show the superiority of our hashing method for video content authentication. Future work will focus on video perceptual hashing based on deep learning, which takes into consideration the relationship between low-level and high-level semantics to enhance robustness and security.