Fast distributed video deduplication via locality-sensitive hashing with similarity ranking

The exponentially growing amount of video data being produced has created tremendous challenges for video deduplication technology. Many different deduplication approaches are being developed, but they are generally slow and their identification processes are somewhat inaccurate. To date, few studies have examined generic hash-based distributed frameworks or efficient similarity ranking strategies for video deduplication. This paper proposes a flexible and fast distributed video deduplication framework based on hash codes. It supports hash table indexing using any existing hashing algorithm in a distributed environment and can efficiently rank the candidate videos by exploring the similarities among the key frames over multiple tables using a MapReduce strategy. Our experiments with a popular large-scale dataset demonstrate that the proposed framework can achieve satisfactory video deduplication performance.


Introduction
Due to the increasing popularity of mobile devices and social networks, huge numbers of videos are being created and shared online. This explosive growth in the amount of video data being produced has made storing and rapidly searching it all very challenging. In practice, many videos are duplicates, or near-duplicates, so detecting these copies has become a very important technique for reducing the storage and computation required.
In recent years, many content-based duplicate detection techniques have been developed that aim to identify such copies automatically in massive datasets. For example, a million-video-scale near-duplicate video retrieval system [1] has been developed that quantizes the key frames' features into visual words and uses an inverted file index to implement rapid search. In contrast to prior research, Song et al. [2] presented a near-duplicate video retrieval method based on compact hash codes learnt from multiple visual features, a promising solution that enables fast signature generation based on binary codes from multiple views.
*Correspondence: luojie@nlsde.buaa.edu.cn. State Key Lab of Software Development Environment, Beihang University, Beijing, China.
Most existing hashing methods have been proposed to handle popular vector data such as images. They can also be applied directly to indexing gigantic video collections by treating frames as images [11, 22-25]. Song et al. [2] proposed multiple feature-based hashing to capture different aspects of visual content. Cao et al. [23] selected a number of informative features to characterize the video content under a submodular hashing framework. Xia et al. and Wang et al. [26,27] further employed a subspace representation to generate frame-wise hash codes by considering the local structure of consecutive frames. However, these video hashing solutions only treat the video frames as still images and generate their binary codes independently. In practice, it is well known that videos are quite different from images, with more complex semantic and temporal information.
Video deduplication should therefore take the temporal order of the frames into consideration. Despite the progress made by these highly developed hashing techniques, little attention has been paid to methods of building efficient indexes with hash codes or generating good ranking lists by aggregating results from multiple indexes. A few studies have attempted to address image indexing by learning a number of complementary hash tables [19,28,29]. However, these techniques rely heavily on specific and usually expensive learning algorithms, which are hardly compatible with generic scenarios and existing hashing algorithms.
In this paper, we address this problem by proposing a generic, yet fast video deduplication framework based on hash codes. This framework supports hash table indexing and searching based on any existing binary hashing algorithm, and it can adaptively combine ranking results from multiple tables by considering key frame similarities. To the best of our knowledge, this is the first attempt to study a general hashing-based video deduplication framework that can support large-scale video databases.
To handle the large-scale duplication problem, distributed computing is a popular and successful technique in the literature. Kumar et al. [30] proposed a technique whereby a chunking algorithm divides the data stream into fixed-size chunks, from which hash values are generated via the MD5 algorithm and then used by a MapReduce (MR) model to identify duplicates. Moise et al. [31] proposed using MR for efficient index creation and search, enabling billions of descriptors to be indexed and large batches of queries to be processed. Following this prior research, in this paper, we further enhance our hashing-based video deduplication method with a distributed framework, which simultaneously exploits both the computing power of the distributed nodes and the naturally distributed storage of today's gigantic video collections.
Note that this paper extends a previous conference publication [32] with additional exploration and experiments on the general distributed computing framework for hash-based video deduplication. The rest of this paper is organized as follows. Section 2 introduces the proposed hashing-based video deduplication framework. Section 3 elaborates our approach to fine-grained ranking over multiple table indexes, which attempts to capture the videos' similarities. Section 4 describes how we use MapReduce to process the video data both online and offline. Section 5 presents the results of experiments on a popular benchmark, demonstrating the proposed method's effectiveness. Finally, Section 6 concludes the paper.

Methods: hash-based video deduplication framework
In this section, we first outline our framework and its main components, then introduce the hashing-based video indexing process used by the framework. Figure 1 gives an overview of our framework for large-scale video deduplication. Given a query video, the framework can efficiently find matches between that video and those in the database, rapidly identifying duplicate videos. The framework consists of four main components: video hashing, index construction, video archiving, and video deduplication. These components perform the following functions.

Video deduplication framework
Video hashing: In this step, we process the query video by first extracting key frames, then generating visual features (via the Color and Edge Directivity Descriptor (CEDD) approach in this paper) for each one. Based on these features, we represent each key frame by a set of binary hash codes from different hashing functions. Many different hash functions could be used here, such as projection-based [16,33] or prototype-based [34] functions.
Index construction: Using multiple hash tables has been found to be very helpful for achieving high recall performance for big data search [35,36]. In this step, we therefore build multiple hash tables based on the binary codes obtained above. Again, there are many possible strategies for this, including multi-index hashing [37] and complementary hash tables [19,28].
Video archiving: Using the above two steps, all the videos in the database are represented as binary codes and imported into multiple hash tables. Figure 2 shows the structure of these indexes, where each unique hash code corresponds to a bucket containing similar key frames. This step is carried out offline.
Video deduplication: The given query video is first hashed to generate a set of binary codes for each key frame. Then, for each hash code, we check all the buckets in the corresponding hash table within a small Hamming distance of it, considering the videos containing the key frames in these buckets as candidate results. We then rank them by similarity (Section 3), enabling duplicate videos to be easily detected.
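The bucket probing described in the video deduplication step can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the function names, the toy 4-bit table, and the (video ID, frame index) bucket layout are all assumptions made for the example.

```python
from itertools import combinations


def codes_within_radius(code: int, n_bits: int, radius: int):
    """Yield every n_bits-wide code at Hamming distance <= radius from `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            probe = code
            for p in positions:
                probe ^= 1 << p  # flip bit p
            yield probe


def candidate_videos(table: dict, code: int, n_bits: int, radius: int) -> set:
    """Collect the IDs of videos that own key frames in the probed buckets."""
    videos = set()
    for probe in codes_within_radius(code, n_bits, radius):
        for video_id, _frame_idx in table.get(probe, ()):
            videos.add(video_id)
    return videos


# Hypothetical 4-bit table: bucket code -> list of (video_id, frame_index).
table = {0b1010: [("v1", 0)], 0b1011: [("v2", 3)], 0b0101: [("v3", 1)]}
print(candidate_videos(table, 0b1010, n_bits=4, radius=1))
```

Enumerating bit flips keeps the lookup exact, but the number of probed buckets grows quickly with the radius, which is why only small radii are practical.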

Hash table indexing
One of the most important parts of the above framework is the hash table indexing step, as it guarantees low memory consumption and satisfactory deduplication performance. To achieve the desired performance, we combine several efficient techniques, including multiple-table indexing, multi-probe search, and Hamming distance-based ranking. Suppose that, for each key frame x, we generate a set of B binary hash codes, giving a binary code y. Many hashing algorithms can be used for this, of which the simplest is random projection. Then, we build L hash tables by evenly partitioning the code y into L subcodes of length M/L, which is usually less than 32 in practice. These subcodes can then be used to build a set of hash tables {T_l, l = 1, ..., L}, each based on hashes of length M/L. In these hash tables, each bucket contains key frames from videos in the database. Figure 2 shows an example hash table based on 4-bit codes, where a total of 2^4 buckets each store the key frames extracted from different videos that share the same hash code. Using these tables, we perform separate hash
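To make the indexing step concrete, the following sketch hashes feature vectors by random projection, splits each code into L equal subcodes, and fills one table per subcode. It is an illustrative sketch only: the helper names, the Gaussian projections, the zero threshold, and the toy dimensions are assumptions, and any other binary hashing algorithm could replace `hash_frame`.

```python
import random
from collections import defaultdict


def make_projections(dim, n_bits, seed=0):
    """Draw one random Gaussian direction per hash bit (assumed LSH variant)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_frame(feature, projections):
    """One bit per projection: 1 if the dot product is non-negative, else 0."""
    return [1 if sum(w * x for w, x in zip(row, feature)) >= 0 else 0
            for row in projections]


def split_code(bits, n_tables):
    """Evenly partition an M-bit code into L subcodes of M/L bits each."""
    if len(bits) % n_tables:
        raise ValueError("code length must be divisible by the number of tables")
    step = len(bits) // n_tables
    return [tuple(bits[i * step:(i + 1) * step]) for i in range(n_tables)]


def build_tables(frames, projections, n_tables):
    """frames: iterable of (video_id, frame_index, feature_vector)."""
    tables = [defaultdict(list) for _ in range(n_tables)]
    for video_id, frame_idx, feature in frames:
        subcodes = split_code(hash_frame(feature, projections), n_tables)
        for table, subcode in zip(tables, subcodes):
            table[subcode].append((video_id, frame_idx))  # bucket = subcode
    return tables


# Toy usage: 16-bit codes split over 4 tables, 8-dimensional features.
projections = make_projections(dim=8, n_bits=16)
tables = build_tables([("v1", 0, [1.0] * 8), ("v2", 5, [1.0] * 8)], projections, 4)
```

Because the two toy frames have identical features, they fall into the same bucket in every table, which is the behaviour the archiving step relies on for duplicates.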

Similarity ranking over multiple tables
The multiple-hash-table lookup process returns several different result sets, so one critical problem is how to combine these to form the final ranking list. This section describes our similarity ranking strategy.

Frame similarity
During the online search phase, we first divide each key frame's hash code into L equal parts, then look up each subcode in its corresponding table. The most common table lookup strategy is to search all buckets within a small radius. Therefore, given an allowed lookup radius R, the maximum Hamming distance for each table is R/L. To compute the similarity of two videos, we should first capture the similarity relationships among their key frames. For the i-th key frame x_qi of the query video and the j-th key frame x_dj of the d-th video in the database, we combine the Hamming distance-based similarities for each hash table to derive the overall similarity of the key frame pair in the natural way: where r_l is the Hamming distance between the l-th subcodes of the query and database frames and α is a weighting parameter that controls the contribution of the Hamming distance. Essentially, when r_l ≤ R/L, the videos are highly likely to be duplicates, so we set α = 1; otherwise, they are unlikely to be duplicates, so we set α = 0.
In summary, the above similarity definition is based on the fact that when the subcode distances are smaller and more key frames match, the videos are more likely to be duplicates.
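The gating role of α can be illustrated with a short sketch. The per-table decay used below, 1 - r_l / (bits per table), is an assumption chosen only for illustration and should not be read as the similarity function defined in this paper; only the rule that α = 1 when r_l ≤ R/L and α = 0 otherwise is taken from the text.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two subcodes stored as integers."""
    return bin(a ^ b).count("1")


def frame_similarity(query_subcodes, db_subcodes, radius, bits_per_table):
    """Sum per-table similarities, keeping only tables with r_l <= R/L."""
    per_table_radius = radius / len(query_subcodes)  # R / L
    score = 0.0
    for q, d in zip(query_subcodes, db_subcodes):
        r_l = hamming(q, d)
        alpha = 1 if r_l <= per_table_radius else 0  # gate from the text
        # ASSUMPTION: linear decay in r_l; the paper's exact form may differ.
        score += alpha * (1.0 - r_l / bits_per_table)
    return score


# Two tables of 4-bit subcodes, overall radius R = 2, so R/L = 1 per table.
print(frame_similarity([0b1010, 0b1111], [0b1011, 0b1111], radius=2, bits_per_table=4))
```

Under this assumed decay, a table whose subcodes differ by more than R/L bits contributes nothing, while closer subcodes contribute more.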

Video similarity
The above similarity definition ignores the videos' temporal information, but prior research has demonstrated that including such information can improve performance [38]. Therefore, we also consider the temporal consistency between the matched frames. However, considering high-order temporal sequences is quite complex and time-consuming, so we simplify the problem by focusing only on the temporal orders of pairs of frames, instead of the full list.
Specifically, we define the order preservation ratio (OPR) as follows: Here, we consider two pairs of matched frames, namely, the k_i-th and k_j-th frames of the query video and the corresponding k_i-th and k_j-th frames of the d-th database video, denoting the order of the k_i-th frame in the d-th video by I_{dk_i} and defining the other variables likewise. This captures the idea that the key frames of two duplicate videos will be consistently ordered, in the sense that the OPRs of their matched frame pairs will remain constant. We can therefore use this to filter out false positive video matches. Based on this intuition, we refine the similarity metric between the query and d-th database videos by considering the temporal order of all possible pairs of two matched frames and simply summing the similarities of pairs with the same OPR. This means that if more frames match in a consistent order, they will contribute more to the similarity. If we have two sequences of matched frames, I_{qk_1}, ..., I_{qk_m} from the query video and I_{dk_1}, ..., I_{dk_m} from the database video, we calculate the similarity matrix G^d = (g^d_ij), 1 ≤ i, j ≤ m, as follows. Each entry g^d_ij sums the similarities of the individual frames in a given matched pair: where v^d_i is an OPR value. Figure 3 shows two cases, one where multiple matched-frame pairs share the same OPR and another where they do not. Here, we can see that sequences where more frames match in a consistent order will have higher similarities due to the summation process, helping us to distinguish true duplicates from false positives with less consistent frame matches.
Based on the resulting histogram of summed similarities over OPR values, we obtain the final similarity between the query and database videos by taking the histogram's maximum value, which reveals the dominant matching order. To eliminate the effect of video length, we normalize the similarity as follows: where m is the number of matched key frames for the query and database videos. Based on this similarity metric, we can easily rank all candidates in descending order.
Since the candidate set is quite small relative to the size of the database, by efficiently computing the Hamming distance and OPR, we can generate this similarity ranking quite quickly.
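The temporal ranking idea can be sketched in a few lines. Several details here are assumptions made for the example and are not taken from this paper: the OPR of a pair is computed as the ratio of frame-order differences (database over query), each pair contributes the sum of its two frame similarities, and the peak of the histogram is normalized by the number of frame pairs, m(m - 1)/2.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations


def video_similarity(matches):
    """matches: list of (query_frame_order, db_frame_order, frame_similarity)."""
    m = len(matches)
    if m < 2:
        return 0.0  # no pair of matched frames, so no temporal evidence
    histogram = defaultdict(float)
    for (qi, di, si), (qj, dj, sj) in combinations(matches, 2):
        if qi == qj:
            continue  # ratio undefined when both matches use one query frame
        opr = Fraction(di - dj, qi - qj)  # ASSUMED form of the OPR
        histogram[opr] += si + sj  # sum similarities of pairs sharing an OPR
    if not histogram:
        return 0.0
    n_pairs = m * (m - 1) // 2  # ASSUMED normalizer to offset video length
    return max(histogram.values()) / n_pairs


consistent = [(0, 5, 1.0), (1, 6, 1.0), (2, 7, 1.0)]  # same order in both videos
scrambled = [(0, 7, 1.0), (1, 5, 1.0), (2, 6, 1.0)]   # order not preserved
print(video_similarity(consistent), video_similarity(scrambled))
```

In this toy case, every pair in the consistent sequence shares one OPR value and accumulates in a single histogram bin, whereas the scrambled sequence spreads its pairs over three bins and therefore scores lower.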

Distributed deduplication
The distributed framework proposed in this paper is based on the MR model, which distributes tasks evenly across the Hadoop DataNodes. In this section, we elaborate on how MR is used for distributed video processing, during both the offline and online stages.

Offline video data processing
First, all the videos in the database are processed, with these tasks distributed evenly across M DataNodes, one per video. When the initial preprocessing step is complete on each DataNode, hash codes are generated for each key frame and L DataNodes are allocated to build hash tables. Next, the key frame hash codes generated for the M videos are looked up in the L hash tables by MR, and the matching results are obtained. Finally, the video storage process is completed based on the matching results.
The video preprocessing and hash table creation processes are described in detail above, so now we discuss how MR is used to perform the matching operations. Early in the map processing phase, the input data is split into groups by the InputSplit method and parsed into intermediate key/value pairs, which are then used as input to the reduce method in order to obtain the final results.
During the map phase, we generate key/value pairs with a subcode as the key and the ID number of the corresponding key frame (VF) as the value (<subcode, VF> in Fig. 4).
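The <subcode, VF> flow can be mimicked in plain Python to show what each phase produces. This is a single-process sketch of the map, shuffle, and reduce steps, not Hadoop code; the function names and the string-valued subcodes and frame IDs are assumptions for the example.

```python
from collections import defaultdict


def map_phase(records):
    """records: iterable of (frame_id, subcode). Emit <subcode, VF> pairs."""
    for frame_id, subcode in records:
        yield subcode, frame_id


def shuffle(pairs):
    """Group values by key, as the MR runtime does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Each subcode becomes one bucket listing the key frames that share it."""
    return {subcode: sorted(frames) for subcode, frames in grouped.items()}


records = [("v1_f0", "1010"), ("v2_f3", "1010"), ("v3_f1", "0101")]
buckets = reduce_phase(shuffle(map_phase(records)))
print(buckets)
```

Keying on the subcode means that all key frames destined for the same bucket arrive at the same reducer, so each bucket can be written without coordination between nodes.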

Online video data processing
First, we preprocess the input video to generate key frame hash codes, then we look up these hash codes in the hash tables and obtain the matching results using MR (Fig. 5). Here, the keys and values are the hash table T_L and the corresponding hash code for the input video.
As shown in Fig. 6, the input data is first parsed into a series of key/value pairs <K1, V1>, where K1 represents the hash table T_L and V1 represents the key frame's subcode. The map phase creates a sequence of L key/value pairs based on K1, namely <K2, List(V2)>, which give the key frame IDs V2 corresponding to the subcode K2. The reduce phase traverses the list of key frames V2 corresponding to subcode K2, as follows:
    for (V2 = first; V2 != NULL; V2 = V2.next)
        if (V2 == lookup(K2))
            emit <K2, V2>;
Finally, we obtain key/value pairs <K2, V2> of hash table entries (buckets) and key frames that give, for each subcode K2, the similar key frames stored in the same bucket. Figure 6 illustrates this MapReduce-based hash table lookup, with the hash table T_L as the key and the corresponding query hash code as the value.
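The online lookup can likewise be sketched as two generator functions. This is an illustrative single-process sketch, not Hadoop code: the table layout (one dictionary per table, mapping a subcode to a list of key frame IDs) and the function names are assumptions for the example.

```python
def map_lookup(query_pairs, tables):
    """query_pairs: (table index K1, query subcode V1) pairs.

    Emit <K2, List(V2)>: the subcode and the key frame IDs in its bucket.
    """
    for table_idx, subcode in query_pairs:
        yield subcode, tables[table_idx].get(subcode, [])


def reduce_lookup(mapped):
    """Walk each list and emit one <K2, V2> pair per matching key frame."""
    for subcode, frame_ids in mapped:
        for frame_id in frame_ids:
            yield subcode, frame_id


# Hypothetical index with L = 2 tables.
tables = [{"1010": ["v1_f0", "v2_f3"]}, {"0101": ["v3_f1"]}]
query = [(0, "1010"), (1, "1111")]  # one subcode of the query frame per table
matches = list(reduce_lookup(map_lookup(query, tables)))
print(matches)
```

A subcode with no bucket simply yields an empty list, so only tables that actually contain similar key frames contribute candidates to the ranking stage.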

Results and discussions
In this section, we evaluate the proposed framework on a large-scale video deduplication task.
Datasets and protocols: Here, we adopted the widely used UQ_VIDEO dataset [2], which is a combined video dataset created from CC_WEB_VIDEO [39] by adding videos downloaded from YouTube. The YouTube videos were selected based on the most popular queries from the Google Zeitgeist Archives from 2004 to 2009. The UQ_VIDEO dataset contains a total of 169,952 videos, making it the largest web video dataset designed for experiments. In addition, it provides 3,305,525 key frames extracted from these videos. For testing, CC_WEB_VIDEO contains 24 manually defined near-duplicate web videos for use as queries. For each query video, there are several ground truth videos that are identical or nearly identical, but different in terms of features such as the file format, encoding parameters, editing operations, or length.
With regard to performance metrics, we employed the common precision and recall metrics for hash table lookup. Essentially, we retrieved all key frames that fell into the buckets of any table within a given Hamming radius of the query hash code. In most experiments, we used the popular LSH algorithm to generate the hash codes. However, we also investigated the effect of using different hashing algorithms, including iterative quantization (ITQ) [16]. We tried building several different numbers of hash tables from the hash codes, each using codes of different lengths.
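For reference, precision and recall for a single lookup can be computed as below. This is a generic sketch of the standard definitions; the function name and the toy item IDs are assumptions, and returning 0.0 for an empty set is a convention chosen here, not something specified in this paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four items retrieved, two of which are among three true duplicates.
print(precision_recall(["a", "b", "c", "d"], ["a", "b", "e"]))
```

Sweeping the number of top-ranked candidates kept, or the lookup radius, and recording these two values at each setting traces out a precision-recall curve.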
Code length: First, we studied the effect of changing each table's code length. In this experiment, we built four tables, with code lengths of 16, 20, and 24, using 64, 80, and 96 LSH functions, respectively. Figure 7 shows their precision-recall performance for lookup radii of D = 0, l = 1, ..., 4. These precision-recall curves show the overall video duplicate detection performance, with larger areas under the curve indicating better performance. Here, we can see that 96-bit hash codes achieved the best performance in both respects, which is consistent with the fact that the optimal code length should be close to log_2 n, where n is the number of frames [37,40] (3,305,525 in this case). Even though we obtained the best performance with 96-bit functions, the results were also satisfactory with 64 and 80 bits. However, using fewer hash bits for each table means more frames fall into each bucket, leading to increased computation costs for the similarity ranking process, due to the larger number of candidate matched-frame pairs. Therefore, in practice, the ideal code length should be a balance between high performance and fast execution. Table 1 lists the computation times in all three cases, showing that, as the code length increased, the time required for similarity ranking decreased. In addition, increasing the search radius increased the time taken significantly. This is mainly because both short codes and large radii increase the collision probability, leading to more candidates for similarity ranking. After balancing the precision-recall performance with the computational cost, we chose to use 80-bit hash functions for all other experiments.
Lookup radius: In addition to the hash length, we also considered the effect of changing the lookup radius. Figure 8 shows the results of using different lookup radii (D) with different hash lengths. Here, we can easily see that increasing the search range improves the overall performance, especially the recall performance. This is because more candidates participate in the similarity ranking process, enabling our ranking method to distinguish duplicates more easily from this larger candidate set.
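The trade-off between code length, lookup radius, and candidate count can be made tangible with some back-of-the-envelope arithmetic. The sketch below assumes that frames spread uniformly over the buckets, which real hash codes only approximate, and that the radius is applied per table; both are simplifications introduced for this example.

```python
import math

N_FRAMES = 3_305_525  # key frames in UQ_VIDEO, as reported above


def expected_bucket_load(bits_per_table, n_frames=N_FRAMES):
    """Mean frames per bucket if codes were uniformly spread (idealization)."""
    return n_frames / 2 ** bits_per_table


def buckets_probed(bits_per_table, radius):
    """Number of buckets within Hamming distance `radius` of one subcode."""
    return sum(math.comb(bits_per_table, k) for k in range(radius + 1))


print(math.log2(N_FRAMES))  # between 21 and 22
for bits in (16, 20, 24):
    print(bits, expected_bucket_load(bits), buckets_probed(bits, 2))
```

Under these assumptions, 16-bit subcodes leave roughly fifty frames in each bucket while 24-bit subcodes leave most buckets empty, and each unit increase in radius multiplies the number of probed buckets, which matches the trend in ranking time described above.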
Hashing algorithms: In the literature, many well-designed hashing algorithms have been proposed and shown to be more powerful than the basic LSH approach in many applications. We therefore compared the LSH-based video deduplication results above with those of state-of-the-art hashing algorithms. Here, we examined the most successful of them, namely, iterative quantization hashing (ITQ), to demonstrate the effect of different hashing algorithms on our task. Figure 9 shows the relative performance of LSH and ITQ. We can clearly see that LSH yields better performance than ITQ, indicating that, when building multiple tables, LSH may actually be better even than well-designed hashing algorithms like ITQ. We believe this is mainly because these methods were not originally designed for multiple tables, and thus ignore table complementarity [19,28]. The computational costs of using LSH and ITQ also emphasize this point, as LSH only took 1.77 s, while ITQ required 12.81 s due to a lack of discriminative power when building multiple tables.
Fig. 9 Performance comparison for LSH and ITQ hashing
Similarity ranking: Next, we evaluated the proposed similarity ranking method. We compared it with a naive baseline approach that scores the candidate videos based on a basic voting strategy, without considering the temporal order or Hamming distance. Figure 10 shows the results, demonstrating that the proposed method significantly outperformed the naive solution, which has been widely used in hash-based applications. Since our method can be applied to arbitrary data sequences, it could be beneficial in many similar applications in the future.
Computational cost: Compared with the processing time required by a single machine, using MR for large-scale video data processing is more efficient. In this study, we created an MR test environment consisting of four physical machines, one NameNode and three DataNodes, each configured as shown in Table 2.
The MR model must copy the data from disk to the Hadoop Distributed File System (HDFS) when dealing with video data, which delays processing by the time required to copy the data across the network. We therefore carried out the preprocessing steps on disk instead of copying the data to the HDFS, which greatly reduced this copying overhead. Figure 11 shows that MR had a clear processing speed advantage when dealing with more than 100 million video frames, and that this advantage steadily increased with the amount of video. For example, with 900 million video frames, MR was 44.2% faster than a single machine.

Conclusions
To achieve fast deduplication of large-scale video datasets, this paper proposed a distributed framework based on locality-sensitive hashing, which is generic and flexible enough to use any existing hashing algorithm to build multiple hash table indexes. Based on these indexes, we then developed an efficient similarity ranking method that combines the search results from multiple tables by considering both the Hamming distances between key frames and the temporal order of the frames. By further introducing a distributed computing strategy based on MapReduce, the efficiency of hash-based deduplication is improved at both the offline indexing and online search stages. We conducted several experiments on large-scale video datasets to evaluate the different aspects of our method, and the results indicate that the proposed method is robust and efficient for large-scale video deduplication.

Fig. 11
Comparison between single machine and MR processing performance