A video coverless information hiding algorithm based on semantic segmentation

Because coverless information hiding can effectively resist detection by steganalysis tools, it has attracted increasing attention in the field of information hiding. At present, most coverless information hiding schemes use text or images as transmission carriers, while few studies address emerging popular media such as video, which has richer content. Taking natural video as the carrier is more secure and can avoid the attention of attackers. In this paper, we propose a coverless video steganography algorithm based on semantic segmentation. Specifically, to establish the mapping between secret information and video files effectively, this paper uses a deep semantic segmentation network to calculate the statistical histogram of semantic information. To quickly index the sender's secret message to the corresponding video frame, we build a three-digit index structure. The receiver can locate the valid video frame from the three-digit index information and restore the secret information. On the one hand, the neural network in this scheme is trained on both original and noisy images; therefore, it can not only resist the interference of noise, but also accurately extract robust deep features of the image. After the semantic information statistics are computed, the video frames yield a robust mapping to the secret information. On the other hand, semantic segmentation is a pixel-level task that depends heavily on the network parameters, so it is difficult for attackers to decrypt and recover the secret information. Since this scheme does not modify the original video data, it can effectively resist steganalysis tools. The experimental results and analysis show that the video coverless information hiding scheme has a large capacity and a certain resistance to noise attacks.


Pan et al. EURASIP Journal on Image and Video Processing (2020) 2020:23
information disclosure and cracking to prevent the theft, abuse, and infringement of video files. 2. Verify integrity. During long-distance transmission and storage, video files are vulnerable to compression noise or other attacks, resulting in the loss or deletion of parts of the video. Verifying integrity means being able to judge whether a video file is complete.
Information hiding is an effective way to meet the above requirements. Early secure communication technologies are based on cryptography, such as DES (data encryption standard) and RSA (public-key encryption algorithm). The model is shown in Fig. 1.
The sender encrypts the confidential information with an encryption algorithm, and the receiver decrypts the received ciphertext with the corresponding key to obtain the information. However, encryption produces unreadable, "graffiti"-like ciphertext, which explicitly tells the attacker which information is important and arouses the attacker's suspicion. Besides, with the continuous improvement of computer performance and of the ability to process large amounts of data, ciphertext is increasingly likely to be cracked. Once the ciphertext is deciphered, information security suffers a devastating blow.
For the problems discussed above, information hiding technology has been put on the agenda.
Compared with traditional cryptography, information hiding makes it difficult for attackers to find the important information. Information hiding technology [1,4] uses the redundancy of the carrier signal itself and the limits of human visual sensitivity to embed secret information in a carrier that exists among numerous other media, which makes it hard for the attacker to find the target to attack and decipher. Exploiting the redundancy of the carrier means that the carrier is modified to some extent. However, against the background of big data and cloud computing, and especially of general and blind steganalysis methods, traditional information hiding methods are also vulnerable to some types of steganalysis. Once an attacker detects the presence of private information, he can attack these carriers even though he cannot decrypt the hidden information. The reason why traditional hiding algorithms cannot resist steganalysis is the modification caused in the process of hiding. To resist steganalysis fundamentally, Sun and other scholars proposed coverless information hiding. "Coverless" does not mean that no carrier is required. Instead, the scheme directly establishes the mapping between the secret information and the carrier according to the characteristics of the carrier rather than modifying the carrier. Since coverless information hiding does not modify the carrier, an attacker cannot obtain the secret information even if he gets the original carrier that contains it. Therefore, coverless information hiding has a unique anti-steganalysis ability. Both the theoretical research and the maturity of coverless information hiding still have a long way to go; it remains a relatively new field with potential value.
In the early stage, some scholars proposed zero-watermark algorithms similar to coverless information hiding. Huang et al. [5] proposed a VQ-based robust multi-watermarking algorithm. To send secret information, the sender converts it into a binary hash sequence and divides it into several equal-length segments. For each segment, the sender searches the inverted index structure for an image whose hash sequence equals that segment. The resulting series of stego-images is collected and transmitted. At the receiver, the hash sequences of the received images are generated by the same hash algorithm. The existing coverless information hiding schemes are mainly based on text and images; little of the relevant literature is based on video. Zhou et al. proposed coverless information hiding steganography based on gray images [6]. Luo et al. [7] proposed a coverless real-time image information hiding algorithm based on image block matching and a dense convolutional network. Yi et al. [8] proposed a coverless information hiding algorithm based on Web text. Zhang et al. [9] proposed a robust coverless image steganography scheme based on DCT and LDA topic classification. Liu [10] proposed a coverless steganography method based on image retrieval with DenseNet features and DWT sequence mapping. Ruan et al. [11] proposed a GIF-based coverless information hiding method. This method quantifies each GIF image in the existing carrier image library and extracts the attribute value of its extension to hide secret information. Zhou et al. [12] proposed a coverless information hiding method based on HOG hashing, in which the hash sequence is generated by a HOG-based hash algorithm. Zheng et al. proposed a coverless information hiding method based on robust image hashing [13]. Duan et al. [14] proposed a coverless information hiding method based on a generative model.
At present, no coverless information hiding algorithm for video has been proposed, although some researchers have proposed zero-watermark algorithms for video copyright protection. This technology constructs a watermark by extracting video features without modifying any video data. Jiang et al. proposed an improved zero-watermarking algorithm for video in the pseudo-3D DCT domain [15]. The Euclidean distance between frames is used to select key frames, and the three-frame difference method is used to obtain the moving target. Bu et al. proposed a video zero-watermark algorithm [16] based on the contourlet transform. This algorithm applies the contourlet transform to each frame of the original video, takes the low-frequency coefficients, and uses them to construct the zero-watermark. It is worth mentioning that digital watermarking focuses on guarding against data modification and deletion and on improving robustness to various attacks. These schemes therefore cannot meet the requirement of resisting steganalysis. Video not only has rich semantic features such as image texture, shape, and color, but also has continuity in time and space. Semantic segmentation is an important method for the content analysis of video images. In this work, we study the semantic information of video images and map it to the hidden information to realize information hiding. The major contribution of this work is a coverless information hiding framework based on video semantic segmentation, which introduces the MobilenetV2 [17] convolutional neural network for video encoding and feature extraction. Because MobilenetV2 is a lightweight neural network, it handles video files efficiently. At the same time, the video is decoded and segmented by upsampling and a 1 × 1 convolution module. The 1 × 1 convolution module obtains the classification scores of pixels at coarse positions, and upsampling produces the final pixel-dense output. Meanwhile, this paper proposes an information mapping algorithm based on the statistical histogram of semantic information. Finally, we analyze the influence of different parameters on capacity and robustness.
The structure of this paper is as follows: Section 2 introduces the basic content, Section 3 introduces the proposed methods, and Section 4 gives the experimental results and comparison. We conclude this article in Section 5.
In this paper, a neural network is used to understand and segment the image semantically, and the extracted semantic features are used to hide secret information. Semantic segmentation aims to understand the image at the pixel level and to label every pixel in the image with its target category. The MobilenetV2 convolutional neural network classifies images at the pixel level, which solves the segmentation problem at the semantic level. Because of these characteristics, semantic segmentation offers high security when applied to information hiding. From the attacker's perspective, it is difficult to crack the secret information at the semantic level, and it is difficult to restore the secret information with a different neural network or different parameters.
Each layer of the convolutional network used for semantic segmentation is a three-dimensional array of size h × w × d, where h and w are spatial dimensions and d is the feature or channel dimension. The first layer is the image, with h × w pixels and d color channels. Locations in higher layers correspond to the locations in the image to which they are connected, called their receptive fields.
Convolutional networks are built on translation invariance. Their basic layers (convolution, pooling, and activation functions) operate on local input regions and depend only on relative spatial coordinates. Writing $x_{ij}$ for the data vector at location (i, j) in a particular layer and $y_{ij}$ for the corresponding output of the following layer, the computation is as follows:

$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i,\, \delta j \le k}\right)$$

where k is the convolution size, s is the step size or downsampling factor, and $f_{ks}$ determines the type of layer, such as matrix multiplication, average pooling, maximum pooling, or a nonlinear activation function. MobilenetV2, adopted in this paper, is a lightweight network. It not only reduces the number of parameters, but also addresses the collapse of low-dimensional data caused by the ReLU layer and the question of how to reuse features, by using linear bottlenecks with inverted residual blocks as the basic network structure. Data collapse is a problem encountered in deconvolution. In addition, if the weight of a convolution node becomes 0 during training, the output of the node is 0 for any input, the gradient through the ReLU layer is 0, and the node never recovers; feature reuse effectively mitigates this problem. Figure 2 shows the basic structure of MobilenetV2.

Fig. 2 Linear bottlenecks + inverted residual block
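The layer operation described above, in which a function determined by the layer type is applied to a local window of size k moved with step size s, can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions: a single channel, no padding, and a hypothetical helper name `layer_op`; it is not the implementation used in the paper.

```python
import numpy as np

def layer_op(x, k, s, f):
    """Apply f to every k x k window of the 2-D array x, moving with stride s."""
    h, w = x.shape
    out_h = (h - k) // s + 1
    out_w = (w - k) // s + 1
    y = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Window anchored at (s*i, s*j): only relative coordinates matter.
            window = x[s * i : s * i + k, s * j : s * j + k]
            y[i, j] = f(window)
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = layer_op(x, k=2, s=2, f=np.max)   # maximum pooling
avg_pooled = layer_op(x, k=2, s=2, f=np.mean)  # average pooling
print(max_pooled)
print(avg_pooled)
```

Swapping `f` changes the layer type while the indexing stays the same, which is the sense in which the layer depends only on relative spatial coordinates.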
The complete parameters of MobilenetV2 are shown in Table 1. The bottleneck is the most basic unit of MobilenetV2. Each bottleneck module consists of expansion, convolution, and compression, and MobilenetV2 uses bottleneck modules to simplify the computation of the network. In the 1 × 1 convolution module adopted by the decoder, a 1 × 1 convolution with 150 channels is added at the end to predict the score of each class, followed by a deconvolution layer that bilinearly upsamples the coarse output to a pixel-dense output.
As shown in the structure diagram, the encoder and decoder are trained extensively, and the most representative features of each frame are then extracted and selected layer by layer. It is this training that makes the secret information obtained from the mapping of carrier features robust as well.

Statistics of semantic information
As mentioned in Section 2.1, semantic segmentation means understanding the image at the pixel level. The process weighs the relationships between pixels and then divides the pixels, or even the whole picture, according to a given threshold. Because a shot is continuous, consecutive frames that differ only slightly, for example in scene category or target location, can be considered to carry the same semantic information. Based on these considerations, statistics of the semantic information are computed to improve the robustness of the algorithm. Figure 3a shows the original frame. Figure 3b shows the segmentation map, which has been divided into M × M sub-blocks, where M is 3. Figure 3c shows the maximum semantic proportion in each sub-block. The color represents the semantic type, and the value represents the ratio of the area of that semantic type to the size of the sub-block. Figure 3d is the semantic histogram of the sub-blocks. Suppose that each frame has row × col pixels and that each pixel belongs to one category. We use MobilenetV2 for pixel-level semantic segmentation. The semantic segmentation map of every frame is divided into 3 × 3 semantic blocks of equal size. The semantic information histogram is obtained by collecting the highest semantic proportion of each block.
The steps of semantic information statistics are as follows:
1. Divide the semantic segmentation map into sub-blocks; each sub-block contains (row × col)/9 pixels.
2. Determine the semantic types present in each sub-block.
3. Count the number of semantic types in each sub-block.
4. Record the maximum semantic proportion of every sub-block.
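The statistics steps above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the segmentation map is a 2-D integer array of class labels; the function name `block_max_proportions` and the choice to ignore leftover border pixels when the frame size is not divisible by M are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def block_max_proportions(seg_map, M=3):
    """Return, for each of the M x M sub-blocks, the dominant class label
    and the proportion of the sub-block that it occupies."""
    rows, cols = seg_map.shape
    bh, bw = rows // M, cols // M  # sub-block size; leftover border pixels are ignored
    labels = np.empty((M, M), dtype=int)
    ratios = np.empty((M, M), dtype=float)
    for r in range(M):
        for c in range(M):
            block = seg_map[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            classes, counts = np.unique(block, return_counts=True)
            top = np.argmax(counts)              # most frequent semantic type
            labels[r, c] = classes[top]
            ratios[r, c] = counts[top] / block.size
    return labels, ratios

# Toy 6 x 6 "segmentation map" with three classes (0, 1, 2).
seg = np.zeros((6, 6), dtype=int)
seg[0, 0] = 1
seg[4:6, 4:6] = 2
labels, ratios = block_max_proportions(seg, M=3)
print(labels)
print(ratios)
```

In the toy example, the top-left sub-block contains three pixels of class 0 and one of class 1, so its maximum semantic proportion is 0.75.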

Methods
In this section, we describe the process of hiding and extracting secret information. The proposed video coverless information hiding scheme is shown in Fig. 4. First, we set up a video database consisting of videos on various topics, which is shared by the sender and receiver and stored on a cloud platform, thus saving local storage space. Second, the frames of each video file are segmented to obtain the statistical histogram of semantic information. The histogram feature is mapped to a hash sequence, and the video index database is established. For the sender, the secret message is sliced into segments, and each bit group can be mapped to a histogram of the semantic information.
By searching the index database, the sender selects an appropriate video as the carrier and sends its mapping index records to the receiver as auxiliary information. The receiver uses the auxiliary information to locate the video frame and restores the secret information by calculating the hash sequence from the video carrier. Throughout the process, the carrier video remains in its original form without any modification. Therefore, it can resist detection by steganalysis.

Hash sequence mapping based on semantic statistics
As described in Section 2.2, semantic information reflects the content of a video frame, and the hash sequence of semantic information is generated as shown in Fig. 5. First, each frame of the video is extracted. Second, we carry out semantic segmentation. Finally, we divide the semantic segmentation map into M × M sub-blocks of the same size and compute the semantic statistics block by block.
Through the process described in Section 2.2, the statistical histogram of semantic information of each block is obtained. The histogram is composed of the semantic types and their proportions: if one of the M × M sub-blocks of the segmentation map contains n semantic types, the semantic proportions of these types in the sub-block are $H = \{h_1, h_2, \ldots, h_n\}$. We keep only the largest semantic proportion of every sub-block and arrange these maxima in regular scanning order, which gives M × M maximum semantic proportions. Finally, we compare each maximum with the one that follows it, which yields a hash sequence of length M × M − 1. In Fig. 5, we assume that M is equal to 3, so the length of the hash sequence is 8. The bars of the semantic proportion histogram are compared from left to right to generate the hash sequence {01000100}.
1. For a given video file, the frames are expressed as $PV = \{p_1, p_2, \ldots, p_m\}$. A semantic segmentation network is used to obtain the semantic images. Different semantic segmentation networks perform differently, so the network can be selected according to the actual situation:

$$PM_i = \mathrm{Segmentation}(p_i), \quad 1 \le i \le m$$

where $PM_i$ is the semantic image of frame i in the video file PV, m is the number of frames of the video, and Segmentation() is the semantic segmentation function, which is selected according to the requirements.
2. Assuming that the size of each frame is row × col, the semantic segmentation image $PM_i$ is uniformly divided into M × M semantic segmentation blocks, each of size (row/M) × (col/M), where M can be any positive integer greater than 1. The semantic segmentation blocks $B_i$ are arranged by a customized scan, from top to bottom and from left to right.
3. For each semantic image sub-block $B_i$, calculate all the semantic proportions h in that block.
4. Compute the maximum semantic proportion HM of each segmentation sub-block, where max is a function that returns the maximum value. Collecting the maximum semantic proportion of all segmented sub-blocks $B_i$ gives the semantic statistical information H of $PM_i$.
5. Compare adjacent bars of the semantic information histogram to obtain the K-bit hash sequence $B_s$.
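The final comparison step can be sketched as follows. The text states only that neighbouring bars of the histogram are compared; the bit convention used here (1 when a bar is strictly larger than the next one, 0 otherwise) and the function name `hash_from_proportions` are assumptions made for illustration.

```python
def hash_from_proportions(ratios):
    """Turn an M x M grid of maximum semantic proportions into a bit list
    of length M*M - 1 by comparing each value with its successor."""
    # Flatten in raster order: top to bottom, left to right.
    flat = [float(v) for row in ratios for v in row]
    # Assumed convention: bit is 1 if the current bar is larger than the next.
    return [1 if flat[i] > flat[i + 1] else 0 for i in range(len(flat) - 1)]

example = [
    [0.5, 0.9, 0.4],
    [0.4, 0.7, 0.2],
    [0.3, 0.3, 0.8],
]
bits = hash_from_proportions(example)
print("".join(str(b) for b in bits))
```

For M = 3 the grid has nine values, so the sequence has eight bits, matching the length given in the text. Sender and receiver must use the same convention for the comparison and for ties.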

Establishment of video index database
The key step of the method in this paper is to search the video database for a qualified video and transmit it. Each index entry consists of an index ID, a video ID, and a frame ID. The index ID is the ID number of the index entry, which is incremental and denoted by w. The video ID is the folder path and file name of the video, denoted by p. The frame ID, denoted by v, indicates that frame v of the video with path p hides the secret information. To ensure efficient and accurate search, we build the index structure shown in Fig. 6. For example, if the hash sequence of a frame in the video Walking.avi after image segmentation is {00000001}, its index ID is 2, that is, w is 2. The video ID p is the path recorded under index ID 2. If this frame is frame 25 of the video, then v is 25. The corresponding auxiliary information of this frame is (w:2, p:. . . \video\Walking.avi, v:25), which is also included in the index information table.
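A dictionary keyed by hash sequence is one simple way to realize such an index. The sketch below is hypothetical: the entry layout, the function name `build_index`, and the rule w = (decimal value of the hash) + 1 are assumptions, the last one chosen only because it agrees with the example in which {00000001} has index ID 2. The file name stands in for the full path, which the text does not give.

```python
from collections import defaultdict

def build_index(video_hashes):
    """video_hashes maps a video path to the list of hash strings of its
    frames (frame 1 first). Returns hash string -> list of (w, p, v) entries."""
    index = defaultdict(list)
    for path, frame_hashes in video_hashes.items():
        for v, bits in enumerate(frame_hashes, start=1):
            w = int(bits, 2) + 1  # assumed numbering: 00000000 -> 1, 00000001 -> 2
            index[bits].append({"w": w, "p": path, "v": v})
    return index

# Placeholder library with one video and three frames.
library = {"Walking.avi": ["00000000", "00000001", "00000001"]}
index = build_index(library)
print(index["00000001"])
```

Several frames can map to the same hash sequence, which is why each key holds a list of candidate entries.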

Coverless video steganography algorithm
In the process of hiding secret information, this paper effectively solves the problem of mapping secret information to video. The process is as follows:
(1) Establish a public video library, which is shared by the sender and receiver.
(2) For each video in the library, carry out semantic segmentation according to Section 2.1.
(3) For each semantic segmentation image, compute the semantic information statistics.
(4) Build the histogram from the semantic statistics, apply the hash coding method described in Section 3.1, and obtain the hash sequence.
(5) Establish the video index database, as described in Section 3.2.
(6) Match the secret information to carrier videos as follows. First, the secret information S with length g is divided into M binary information segments (this segment count M is distinct from the block parameter M):

$$M = \lceil g / N \rceil$$

where N is the length of each binary information segment. When g is not divisible by N, zeros are appended to the last segment to form a sequence of length N, and the number of appended zeros is recorded.
(7) For each information segment, we search the index database for an entry whose hash sequence equals that segment. Multiple index entries may map to the same hash sequence. To improve the efficiency of information extraction, we try to select index entries from the same video file. Within the same video file, we select the entry with the smallest index ID that is larger than the one chosen for the previous information segment. Algorithm 1 describes the sending process in detail.
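Steps (6) and (7) can be sketched as below. This is an illustrative simplification, not Algorithm 1: the index layout (hash string mapped to a list of entries with keys `w`, `p`, `v`), the function names, and the selection rule (prefer the video used for the previous segment, then the smallest frame number) are assumptions.

```python
def split_secret(bits, N=8):
    """Split a bit string into segments of length N, zero-padding the last one.
    Returns the segments and the number of padding zeros."""
    pad = (-len(bits)) % N
    padded = bits + "0" * pad
    return [padded[i:i + N] for i in range(0, len(padded), N)], pad

def select_carriers(segments, index):
    """Pick one index entry per segment, preferring the previously used video."""
    aux, prev_path = [], None
    for seg in segments:
        entries = index.get(seg)
        if not entries:
            raise LookupError(f"no frame in the library maps to {seg}")
        same_video = [e for e in entries if e["p"] == prev_path]
        chosen = min(same_video or entries, key=lambda e: e["v"])
        aux.append(chosen)
        prev_path = chosen["p"]
    return aux

# Hand-made index with placeholder file names.
index = {
    "10101010": [{"w": 171, "p": "a.avi", "v": 3}, {"w": 171, "p": "b.avi", "v": 1}],
    "11000000": [{"w": 193, "p": "b.avi", "v": 7}, {"w": 193, "p": "a.avi", "v": 9}],
}
segments, pad = split_secret("1010101011", N=8)
aux = select_carriers(segments, index)
print(segments, pad)
print(aux)
```

The padding count must travel with the auxiliary information so that the receiver can strip the appended zeros.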

Information extraction
At the receiver end, the hidden information can be extracted by computing the hash sequences of the carrier video. The secret information extraction process is as follows:
(1) Find the corresponding frame according to the video ID and frame ID, and obtain its semantic segmentation map as described in Section 2.1.
(2) Obtain the semantic information histogram according to Section 2.2.
(3) From the histogram, obtain the hash sequence of the frame using the hash method described in Section 3.1.
(4) Repeat the above steps until all hash sequences are obtained. Concatenating them in order and removing the zeros appended by the sender restores the secret information.
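The extraction loop can be sketched as follows. The callable `frame_hash` is a hypothetical stand-in for the whole chain of segmentation, statistics, and hashing of one frame; here it is replaced by a lookup table so that the sketch runs without a video library. The entry layout and the `pad` argument (number of zeros the sender appended) are likewise assumptions.

```python
def recover_secret(aux, frame_hash, pad):
    """Rebuild the secret bit string from the auxiliary information.

    aux        -- list of entries with keys "p" (video path) and "v" (frame number)
    frame_hash -- callable (path, frame_number) -> hash bit string of that frame
    pad        -- number of zero bits appended to the last segment by the sender
    """
    bits = "".join(frame_hash(entry["p"], entry["v"]) for entry in aux)
    return bits[:len(bits) - pad] if pad else bits

# Stand-in for recomputing the hash from the shared video library.
table = {("b.avi", 1): "10101010", ("b.avi", 7): "11000000"}
aux = [{"p": "b.avi", "v": 1}, {"p": "b.avi", "v": 7}]
secret = recover_secret(aux, lambda p, v: table[(p, v)], pad=6)
print(secret)
```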

Experimental results and discussions
Experimental environment: an Intel(R) Core(TM) i7-7800X CPU @ 3.50 GHz, 64.00 GB of RAM, and two Nvidia GeForce GTX 1080 Ti GPUs were used in the experiment. The deep learning models were built with the PyTorch framework. All experiments were completed in MATLAB 2016a and PyCharm. Training and validation data set: the data set used for training in this paper is the ADE20K data set published by MIT. ADE20K has more than 25,000 images that are densely annotated with open-dictionary labels. The annotated images cover a rich variety of scene categories and are suitable for object segmentation and part segmentation. Among them, the training set has 20,210 images, the validation set has 2000 images, and the test set has over 2000 images. Figure 7 shows part of the ADE20K training data set.
A neural network has a strong ability to tolerate noise, and it can learn this ability from a large amount of data. Therefore, to improve the robustness of segmentation on noisy images, a new training data set is built in this paper, which includes both the original data and noisy images. On the basis of the ADE20K data set, images with different levels of Gaussian noise, salt and pepper noise, speckle noise, and compression were added and trained together; Fig. 8 shows the original image and the noisy images. Starting from the pre-trained model, 30 epochs of training were conducted on the new training data set, with 5000 iterations per epoch. Finally, the training accuracy reaches 80% and the validation accuracy reaches 75%.
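The three additive noise types named above can be generated with NumPy as in the following sketch. The function names, the noise parameters, and the assumption that images are floating-point arrays scaled to [0, 1] are illustrative choices; the paper does not state the noise levels it used, and compression is omitted because it needs an image codec.

```python
import numpy as np

def add_gaussian(img, sigma, rng):
    """Additive zero-mean Gaussian noise."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_salt_pepper(img, amount, rng):
    """Set roughly amount/2 of the pixels to 0 (pepper) and amount/2 to 1 (salt)."""
    noisy = img.copy()
    mask = rng.random(img.shape[:2])      # one random draw per pixel
    noisy[mask < amount / 2] = 0.0
    noisy[mask > 1 - amount / 2] = 1.0
    return noisy

def add_speckle(img, sigma, rng):
    """Multiplicative noise: each pixel is perturbed in proportion to its value."""
    return np.clip(img + img * rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
frame = np.full((4, 4, 3), 0.5)           # toy RGB image
variants = [
    frame,
    add_gaussian(frame, 0.05, rng),
    add_salt_pepper(frame, 0.1, rng),
    add_speckle(frame, 0.05, rng),
]
print([v.shape for v in variants])
```

Training on the original image together with its noisy variants, all sharing the same label map, is what ties the noisy inputs to the clean segmentation target.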
The MobilenetV2 network was tested on the DAVIS-2017 data set, samples of which are shown in Fig. 9. DAVIS is a high-quality, full high-definition data set that includes 90 video sequences of different scenes and a total of more than 6000 frames with full pixel-level annotations. In this paper, we analyze and test the scheme on the DAVIS-2017 data set from three aspects: capacity, robustness, and security.

Capacity
This section examines the capacity of the scheme. The capacity of information hiding depends on the length of the hash sequence, which is related to the semantic segmentation. Capacity C refers to the total length of the bitstream that can be generated for a video file. It is easy to see that the length of the hash sequence is positively correlated with the number of blocks n, and the capacity increases with the increase of j. Since there is currently no coverless steganographic scheme based on video, this scheme is compared with existing single-image steganographic schemes. The results are shown in Table 2.
The capacity of the algorithm depends on the length of the hash sequence. According to the hash algorithm in Section 3.1, the hash length of the image depends on the number of semantic segmentation blocks and the parameter j. Compared with other image steganography algorithms, our method supports a wider range of capacities, and the specific results are shown in Table 3.
Ideally, the maximum capacity C is m × (J − 1) bits. This maximum is reached only if the (J − 1)-bit information mapped by each frame is different, but in practice the semantic information changes slowly between frames, depending on the frame rate of the camera. Consecutive frames may therefore map to the same bitstream. On the one hand, this wastes carrier resources. On the other hand, if the carrier loses key frames during transmission, we can extract the required secret information from the adjacent frames to ensure the integrity of the secret information. Based on the above, we calculate the effective capacity $C_E$ of the carrier. For a given video file, the frames are expressed as $PV = \{p_1, p_2, \ldots, p_m\}$. After segmentation, we obtain a bitstream $B_i = \{b_1, b_2, \ldots, b_8\}$ for every frame. Let $D_{B_i}$ denote the decimal value of the bit sequence of frame i, so that the whole video yields the decimal numbers $D = \{D_{B_1}, D_{B_2}, \ldots, D_{B_m}\}$. The effective capacity $C_E$ is the number of different mapping sequences that can be generated from the entire video file, that is, the number of distinct values in D. The effective capacity is related to the block parameter M and the parameter j, and it is positively related to j. The maximum number of different 8-bit secret sequences is 256, and the effective capacity on the DAVIS data set is shown in Fig. 10. Here, we assume M is equal to 4.
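Counting the distinct mapping sequences of a video is a one-line set operation, as the sketch below shows. It assumes that each frame's bitstream is given as a list of 0/1 values with the first bit as the most significant one; this bit order and the function name `effective_capacity` are assumptions, and the order does not affect the count.

```python
def effective_capacity(frame_bits):
    """Return the number of distinct per-frame bit sequences in a video,
    together with the decimal value of each frame's sequence."""
    # Assumed bit order: the first bit of each sequence is the most significant.
    decimals = [int("".join(str(b) for b in bits), 2) for bits in frame_bits]
    return len(set(decimals)), decimals

frames = [
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1],   # identical to the previous frame
    [0, 1, 0, 0, 0, 1, 0, 0],
]
capacity, decimals = effective_capacity(frames)
print(capacity, decimals)
```

The two identical frames contribute a single mapping sequence, which illustrates why slowly changing video content lowers the effective capacity below the ideal value.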

Robustness
This section describes the robustness to noise. The robustness of the algorithm largely depends on the ability of the segmentation network to resist noise. In this experiment, we studied the robustness of different partition methods against salt and pepper noise, Gaussian noise, speckle noise, and JPEG compression attacks with different quality factors. The quality factor Q was 70% and 90%, respectively. Given that the video frames are represented as $PV = \{p_1, p_2, \ldots, p_m\}$ and the bitstream generated by each frame is $B_i = \{b_1, b_2, \ldots, b_8\}$, the accuracy rate is computed by comparing the bitstreams recovered after an attack with the original ones. The results of different semantic segmentation networks against noise attacks in this scheme are shown in Table 4. It can be seen that our scheme is robust. Different partitioning methods filter the semantic information differently and therefore have different effects on robustness. If video frames are unavoidably deleted, then, according to the earlier analysis, we can extract the required secret information from the consecutive frames before and after the index frame, which greatly improves the robustness. At present, there is no research related to video coverless information hiding. This paper therefore compares the scheme with recent and classical image coverless information hiding schemes. The specific results of the experiment are shown in Table 5. Under compression attacks, the scheme has a certain robustness. However, it is weak against and sensitive to other noises. The reason for this result is analyzed in Section 4.4.
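One plausible way to score robustness is the fraction of frames whose hash sequence survives the attack unchanged. The sketch below implements that frame-level measure; it is an assumption made for illustration, since the exact accuracy formula is not reproduced in the text, and the function name `hash_accuracy` is hypothetical.

```python
def hash_accuracy(original, attacked):
    """Fraction of frames whose hash string is identical before and after an attack."""
    if len(original) != len(attacked):
        raise ValueError("both lists must contain one hash per frame")
    if not original:
        raise ValueError("at least one frame is required")
    unchanged = sum(1 for a, b in zip(original, attacked) if a == b)
    return unchanged / len(original)

clean = ["01000100", "01000100", "00000001", "11110000"]
noisy = ["01000100", "01000101", "00000001", "11110000"]  # one frame flipped a bit
accuracy = hash_accuracy(clean, noisy)
print(accuracy)
```

A bit-level variant that counts matching bits instead of matching frames would give a more lenient score for the same data.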

Security
The database adopted in this scheme is a video database uploaded to the cloud, which is shared by the sender and the receiver. Moreover, the carrier is not modified, so steganalysis of the video data finds no trace of modification. Videos can be uploaded to this database in real time, and the codebook can be updated.
To improve security, we can encrypt the auxiliary information. For this purpose, a chaotic sequence encryption algorithm, RC4, SEAL, or another stream cipher can be used, chosen according to the needs of users.
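As an illustration of encrypting the auxiliary information with one of the stream ciphers named above, the following sketch implements the standard RC4 key scheduling and keystream generation in pure Python. The key and the serialized auxiliary record are placeholders, and the serialization format is an assumption. RC4 is shown only because the text mentions it; it is regarded as weak by current standards, so a modern cipher would be the safer choice in practice.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """RC4 stream cipher. Encryption and decryption are the same operation."""
    # Key-scheduling algorithm: permute S under control of the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation: XOR each data byte with a keystream byte.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

key = b"shared-secret"                      # placeholder key
aux_record = b"w:2,p:Walking.avi,v:25"      # placeholder serialization
ciphertext = rc4(key, aux_record)
print(ciphertext.hex())
print(rc4(key, ciphertext))
```

Because the cipher is symmetric, the receiver applies the same function with the same key to recover the auxiliary record.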

Discussion
In this section, we further discuss some problems encountered in the experiment and give analysis and suggestions. On the one hand, applying deep learning and semantic segmentation to coverless information hiding is a new idea. This paper adopts a lightweight semantic segmentation network, MobilenetV2, to train on and segment all kinds of noisy images. The trained neural network is like a tool that can provide a great deal of information. Its output depends first on the training data, second on the depth of the network, and finally on how it classifies the extracted information. This is like a heavy copper lock, which gives information hiding a high level of security. Semantic segmentation exposes the rich semantic information of the video frame and provides capacity. The scheme avoids steganalysis through mapping rules known only to the two parties. When coverless information hiding is applied to video data, the carrier offers a large capacity and is not vulnerable to attack. The scheme has good robustness against compression attacks on video data. On the other hand, the resistance of the scheme to some noises is not strong. One reason may be the insufficient depth and breadth of the segmentation network. This paper uses Mobilenet, which is a lightweight network, because the purpose of this paper is to apply it to video carriers. A second reason is that the training data set is not comprehensive enough to cover all scenes of daily life. This leads to low segmentation accuracy and robustness on the test data set.

Conclusion
In this paper, a video coverless information hiding scheme based on semantic segmentation is proposed. In the direction of coverless information hiding, we make a preliminary attempt that combines video, neural networks, and semantic segmentation. Using video as the transmission carrier has the advantages of large capacity and of being hard for attackers to detect. We conduct semantic segmentation on video and obtain the statistical histogram of semantic information. The hidden bit sequence is mapped to the hash sequence through the histogram, which provides a new idea for coverless steganography. The receiver can extract the secret information by calculating the semantic information from the carrier videos. In the whole process of secret information transmission, the carrier videos are not modified. Therefore, the scheme can effectively resist steganalysis. The scheme applies deep learning to information security and can resist noise attacks to a certain extent, but its robustness largely depends on the network. In the future, we will try to improve the capacity and robustness. How to give full play to the advantages of neural networks to enhance the robustness is also the focus of our future work.