Steganographic visual story with mutual-perceived joint attention

Social media plays an increasingly important role in providing information and social support to users. Because content on social networks is easy to disseminate and difficult to track, we are motivated to study how to conceal sensitive messages in this channel with high confidentiality. In this paper, we design a steganographic visual story generation model that enables users to automatically post stego status updates on social media without any direct user intervention, and we use mutual-perceived joint attention (MPJA) to maintain the imperceptibility of the stego text. We demonstrate our approach on the visual storytelling (VIST) dataset and show that it yields high-quality steganographic texts. Because the proposed work realizes steganography by automatically generating visual stories with deep learning, it allows steganography to move to real-world online social networks through intelligent steganographic bots.


Introduction
Steganography aims to hide the existence of secret information. It can hide secret messages in videos, images, texts, and so on. For instance, imagine a scenario where two users want to exchange prohibited ideas or secret information while being monitored by a third party; in a world where most communication takes place in a transparent environment, both sides of the communication easily fall under suspicion.
One traditional way of transmitting secret messages is to publish seemingly normal news or advertisements in newspapers, which only a real spy with the right key can decode. This manual method was later replaced by algorithmic methods, but the development of natural language processing technology and social networks makes it possible to use the traditional approach again. With this motivation in mind, we aim to design a system that enables two users to exchange encrypted messages openly and transparently on a social network platform by posting status updates.
Extensive research [1][2][3][4] has been carried out on image processing and image steganography. Furthermore, text-based information hiding methods [5][6][7][8] have attracted the interest of a growing number of researchers in recent years. Fang et al. design a text information hiding method that divides the dictionary containing all words in advance and then encodes the words with a fixed-length code [7]. Yang et al. present a steganography framework that embeds secret information into text by constructing a Huffman tree based on the probability distribution of words [8]. However, in real scenarios, information on social media usually contains content in multiple modalities. Following this trend, some works [9,10] have explored cross-modal steganography tasks. The method in [10] hides information word by word with a fixed length of secret bits. In contrast to [10], the improved steganographic scheme SSH proposed in [9] is based on beam search and embeds the secret data at the sentence level. Our approach also uses the word-by-word hiding method, but we embed a variable number of secret bits in each word. This makes the embedding process more efficient in terms of text quality. In addition, previous works focus on generating steganographic image descriptions based solely on image captioning. One of the main problems of the image captioning task [11] is that it recognizes the event in an image only simply and mechanically; it cannot tell the stories of photos in the user's voice and share them with others by posting them to social networks.
To fill this gap, we propose the task of steganographic visual story (SVS) automatic posting, which aims to generate steganographic visual stories from selected photos in local or online albums. Huang et al. [12] proposed the task of visual storytelling and constructed the VIST dataset, and several works such as [13,14] build on it. To overcome the problem of stego text deviating from the image themes, we propose mutual-perceived joint attention (MPJA) to generate a text-aware visual representation and a vision-aware textual representation, so that the generated steganographic stories are more natural and more readable. Once the neural network has fully understood the image content, the story words carrying secret messages can be generated in a natural, human-like manner according to MPJA. Our method does not need to modify the images, and there is no original text to compare against, so it can be more difficult to detect than traditional methods.
The rest of the paper is organized as follows. Section 2 describes the structure of the steganographic visual story generation model with MPJA and its adaptive information hiding and extracting algorithm. In Section 3, we report the results of experiments with our proposed method when generating with both MPJA and adaptive data embedding, and we compare its performance with prior work. Finally, concluding remarks are given in Section 4.

Proposed method
The whole process of posting secret information on a social network and extracting it is shown in Fig. 1. In the embedding process, shown on the left side of Fig. 1, we use photo album clustering [15] to select images suitable for uploading from local or online albums and automatically exclude low-quality photos [16], such as blurred ones. Then, we choose a certain number of photos from the clustered albums in time order and feed them into the pretrained steganographic visual story generation model (see Fig. 2). The story sentences generated for each photo are linked together as the text of a photo post and finally published on social network sites such as Facebook and Twitter. The proposed work uses text as the cover, and the images are used to train a language model to generate the stego text. The goal of using the images is to make sure that the stego text and the images carry the same semantic information, so that the stego text will not arouse suspicion when the data hider posts it together with the images.
Fig. 1 The process of posting secret information on social network
We can see the extraction process on the right side. All the receivers only need to have permission to access the photos and text posted by the sender. So, there is no need for direct contact between senders and receivers; the sender can even be a robot. All the data receivers receive the same images and text description. Therefore, they are actually trying to recover the same secret message. As long as they hold the neural network model and can download the media files, they are all able to reconstruct the embedded information. It is worth mentioning that the data hider and the data receiver should share the image order before feeding the images into the neural network, which can be controlled by a secret key. This indicates that, though different secret keys correspond to different orders of the images, once they are used for data hiding, the order should be fixed and shared between the data hider and data receiver.

Photo encoder
As shown in Fig. 2, our encoder module is composed of two separate encoders: one models the content of the image sequence, and the other models the relationship between the input images. To model the content of the image sequence, we use feature vectors extracted by ResNet [17] to describe the images. We chose ResNet over other convolutional neural networks because it offers a good balance between computational cost and precision. Every image is resized to 256 × 256 with respect to its aspect ratio. In addition, we crop the image (if needed) from the center region to fit the ResNet input layer, because we assume that the important information in the image is located in the center.
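The resize-and-crop step can be sketched as plain geometry. This is a minimal sketch under two assumptions that the text does not confirm: "with respect to its ratio" is read as scaling the shorter side to 256 pixels, and the crop size of 224 is a commonly used ResNet input size rather than a value given by the authors. The function names are hypothetical.

```python
def resize_keep_ratio(width, height, target=256):
    """Scale so the shorter side equals `target`, preserving the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop x crop window.

    crop=224 is an assumed input size, not a value stated in the paper.
    """
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# A 640 x 480 photo: the shorter side becomes 256, then the center is cropped.
new_w, new_h = resize_keep_ratio(640, 480)
box = center_crop_box(new_w, new_h)
```

The returned box uses the (left, top, right, bottom) convention that most image libraries accept for cropping.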
To simulate the user's thoughts and feelings about the uploaded photos accurately in a few words, we should consider the visual information of the photos themselves. On seeing the first image, we start the story with a sentence that describes and estimates the context of that particular image. For the next image in the sequence, we not only analyze the current image but also consider the influence of the previous and the following image, because this is the only way to preserve the temporal correlation between the events in the images, so that the text we generate is more in line with the logic of human narrative. Organizing the content expressed in the pictures is a logical process.
To achieve this, it is important to keep the temporal dependence between the story sentence generated for the current image in the sequence and the story sentences generated for the images before and after it. Recurrent neural networks [18] have achieved great success in processing sequence data because they can learn the latent dependencies between sequence elements, and they have also proved suitable for modeling sequences of image feature vectors. We experimented with different types of recurrent neural network [18] to model the relationship between all input images and found that we could achieve a better story flow by using bidirectional long short-term memory networks (Bi-LSTM) [19,20] to obtain the context information from the input images. In addition, we apply the idea of concatenated coding to obtain better aggregated representations. Our initial visual representation v_i is the concatenation of the image features and the output of the Bi-LSTM.

Mutual-perceived joint attention
The soft attention mechanism [21] applies additional weights to the interrelated outputs of the nodes, which improves the performance of the basic encoder-decoder model in machine translation. In the task of story generation from image sequences, however, each sentence should be visually grounded not only on its own image but also on the overall context. To represent the relationship between the input images and the generated text, we design a scheme based on the attention mechanism, called mutual-perceived joint attention. We implement it by calculating a similarity matrix that focuses sequentially on the images and the generated words when generating story-like sentences.
We use V = {v_1, v_2, ..., v_N} to express the visual representation of the input photo sequence, where v_i ∈ R^{1×k} is a one-dimensional feature vector generated by our photo encoder and V ∈ R^{N×k} contains the N visual representations of the individual photos. Each v_i (1 ≤ i ≤ N), as the visual representation of the ith photo, is then used to decode the ith story sentence. We use the mutual-perceived joint attention (MPJA) mechanism to capture the internal relationship between visual content and textual content and to expose the part that they mutually perceive. MPJA helps us focus on which part of the image controls the text generation and which word best corresponds to the characteristics of the given image.
To make the two modalities aware of each other, the mutual perception of the images and the generated text is measured by a similarity matrix D ∈ R^{N×N}. D is computed from the visual representation V ∈ R^{N×k} produced by the photo encoder, the textual representation L ∈ R^{N×k} formed by the N generated word embeddings, and a learnable weight matrix W ∈ R^{k×k}, using the matrix transpose operation T. Here k is the dimension of the word embeddings and N is the number of visual representations. It is worth noting that we normalize the similarity weights via softmax normalization, which tends to help the model focus on the most relevant concepts. The task-relevant part is added to the original visual representation; hence, we obtain the new visual representations after textual perception, V^l. Similarly, the new textual representations after visual perception, L^v, are obtained. The new visual and textual representations after MPJA provide the generation model with more necessary and useful information.
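The text gives the shapes of D, V, L and W and states that softmax-normalized similarity weights are added back to the original representations, but the exact expressions are not reproduced here. The following is therefore only a sketch of one form that is consistent with those shapes, assuming D = softmax(V W L^T) with row-wise normalization, V^l = V + D L, and L^v = L + D^T V. These three formulas, the helper names, and the toy values are assumptions, not the authors' confirmed definitions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Normalize each row into a probability distribution."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mpja(V, L, W):
    """Assumed form: D = softmax(V W L^T); V^l = V + D L; L^v = L + D^T V."""
    D = softmax_rows(matmul(matmul(V, W), transpose(L)))  # N x N
    V_l = add(V, matmul(D, L))                            # N x k, text-aware visual
    L_v = add(L, matmul(transpose(D), V))                 # N x k, vision-aware textual
    return D, V_l, L_v


# Toy input with N = 2 and k = 3; W is the identity (hypothetical values).
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
L = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
D, V_l, L_v = mpja(V, L, W)
```

In a trained model W would be learned jointly with the encoder and decoder; the identity matrix is used here only to keep the toy example easy to follow.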

Steganographic story generation
The steganographic story generation module aims to generate a reasonable and coherent story with hidden secret messages based on sentence-level decoding. Figure 2 illustrates its decoding process. Specifically, when the decoder is generating the ith sentence, the source information includes two parts: the text-aware visual representation v^l_i of the ith image, and the vision-aware textual representation l^v_{i,t−1} of the previously generated word in the ith steganographic sentence. Our sequence decoder employs a unidirectional LSTM layer [19], because the unidirectional LSTM-based models outperformed the Bi-LSTM-based [20] models as the decoder. This does not show that the unidirectional LSTM is better than the Bi-LSTM for story generation in general; it only indicates that, in the current experimental settings, the unidirectional LSTM-based model outperforms the bidirectional one. Apart from the general state update, the tth hidden state s_{i,t} is further designed to take the two mutual-perception representations into consideration by combining them with ⊕, the vector concatenation operator (with the same meaning as in Fig. 2), which allows the decoder to pay different attention to different parts of the generated text. Following previous work [22], we add a softmax classifier to the output layer to calculate the probability of each word, which facilitates the embedding of secret information.

Adaptive information hiding
Information hiding and extraction are inverse operations, and the two processes follow essentially the same steps.
Different from the traditional method, which embeds fixed-length secret bits in each word, we propose an adaptive information hiding algorithm. Embedding a fixed number of secret bits in each word is relatively simple, but not all the words in a sentence are suitable for carrying the same number of secret bits. How to select the embedding strategy that affects text quality the least therefore becomes an important technical problem. We use the variance of the probability distribution over the candidate list to reveal the dispersion of the selection probabilities. We assume that if the variance of the probability distribution of a candidate list of length M is less than a certain threshold T, then no matter what secret data is embedded, it will not cause too much deviation in the semantics of the whole sentence. The embedding length is reset to its initial value M once the final number of secret bits for the current word has been selected (possibly after several loops) and we start preparing the next word. The threshold T is a constant. If we need to embed more bits of secret information, we can set a larger threshold value, but this may reduce the text quality. Conversely, lowering this value makes the content of the generated sentences more in line with the content of the input images, at the cost of a lower embedding rate. Hence, with a variable capacity of secret bits per word, we can embed the maximum number of secret bits per word while ensuring the quality of the steganographic text.
Algorithm listing (information hiding), recoverable steps only: calculate the variance s^2 based on the candidate list l_c; append w to Text; 12: update b by removing the top l bits from b; 13: break; return Text.
Algorithm listing (information extraction), recoverable steps only: if s^2 < T then 10: extract the decimal number corresponding to the location of w in l_c; 11: convert it to l binary bits; 12: append the l bits to b; 13: break; return the extracted secret bitstream b.
Since the average length of the bit string carried by each word is at most 4, the overall time complexity is extremely low (as the sentence is short).

Embedding process
We illustrate the embedding process of adaptive information hiding with an initial size of selected bits M = 4, where the first M secret bits to be embedded are "0010". At time 0, the image representation extracted by our photo encoder is fed to the LSTM to get the probability distribution p_0 of the current word. According to p_0, we sort all the words by predicted probability in descending order and select the top 2^M words to form the candidate list. Next, the variance s^2 of the probabilities of the entire candidate list is compared with a threshold T, which is agreed on by the sender and the receiver. If the variance s^2 is less than the threshold T, we choose the third word in the candidate list according to the embedded secret bits "0010". If the variance s^2 exceeds the predefined threshold, embedding M bits is considered unreasonable. The system then takes corrective action by reducing the size of selected bits M until an appropriate length of secret bits can reasonably be embedded into the current word. In extreme cases, we can even choose not to embed secret information, although this does not happen in practice. Next, once the word carrying the first secret bits is fixed, the probability distribution p_1 of the next word is computed from the word generated at time 0 and the input image. As a result, each word adaptively changes the amount of embedded secret information based on the current probability distribution.
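The single-word embedding step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the word distribution is a toy dictionary standing in for the LSTM softmax output, the threshold of 0.01 is applied to raw probabilities and is a hypothetical value, and the function name `embed_word` is invented for this example.

```python
from statistics import pvariance


def embed_word(probs, bits, max_bits=4, threshold=0.01):
    """Choose the next word so that it encodes a prefix of `bits`.

    probs: dict word -> predicted probability (toy stand-in for the model output).
    bits: remaining secret bitstream as a string of '0' and '1'.
    Returns (word, number_of_bits_embedded).
    """
    ranked = sorted(probs, key=probs.get, reverse=True)  # descending probability
    for m in range(max_bits, 0, -1):                     # shrink M until it fits
        if len(ranked) < 2 ** m:
            continue                                     # vocabulary too small for m bits
        candidates = ranked[:2 ** m]
        if pvariance([probs[w] for w in candidates]) < threshold:
            chunk = bits[:m].ljust(m, "0")               # zero-pad a short tail (assumption)
            return candidates[int(chunk, 2)], m
    return ranked[0], 0                                  # embed nothing, emit the top word


# Flat distribution: all 16 candidates are nearly equally likely.
flat = {f"w{i}": 0.070 - 0.001 * i for i in range(16)}
flat_choice = embed_word(flat, "0010")

# Peaked distribution: one word dominates, so no bits can be embedded safely.
peaked = {"the": 0.9}
peaked.update({f"x{i}": 0.1 / 15 for i in range(15)})
peaked_choice = embed_word(peaked, "0010")
```

With the flat distribution, the bits "0010" select the third-ranked word, matching the worked example in the text; with the peaked distribution the loop falls through and the most probable word is emitted without carrying any bits.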

Extraction process
Information hiding and extraction are inverse operations. The method in [8] needs the first word of each sentence as a key, which is fed into the network to calculate the probability distribution. Our method uses the input images as the key instead of words. The receiver needs to know the initial size of secret bits M and the threshold T, and has to follow the extraction process with the same trained steganographic visual story model used by the sender to recover the embedded information. Specifically, our extractor requires the images downloaded from the SNS, the network structure, and the network parameters to extract the secret data. If the model and parameters change during the embedding process, the receiver should be informed of the new model and parameters. We first determine the specific value of M for each word by comparing the variance with the threshold T, and then decode the secret bits according to the position, in the candidate list, of each word generated by the sender. For example, suppose we get the first word "I" in the steganographic stories posted by the sender and obtain M = 3 by calculation. If the word "I" is also the first word in the candidate list, then the extracted secret bits are "000". If we get M = 0, no secret bits are embedded in this word.
It is worth mentioning that photos are compressed by the SNS when they are uploaded to social networks, which usually leads to a slight decline in their quality. After testing the photos uploaded to Twitter, we can still obtain the same story sentences and decode the embedded secret information successfully. This is because our approach does not rely on modifying the image pixels to embed secret data; it only generates the steganographic stories for the images. Therefore, slight image compression by the SNS does not affect the text generation or the extraction process.
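The per-word extraction rule can be sketched in the same spirit. This is a minimal sketch assuming the receiver can reproduce the sender's probability distribution for the current word; the toy distribution, the threshold of 0.001 on raw probabilities, and the name `extract_word` are hypothetical and chosen so that M = 3 is selected, as in the example with the word "I".

```python
from statistics import pvariance


def extract_word(probs, word, max_bits=4, threshold=0.001):
    """Recover the bits carried by `word`, given the same distribution the sender saw.

    Returns a bit string of length M, or "" when no bits were embedded (M = 0).
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    for m in range(max_bits, 0, -1):
        if len(ranked) < 2 ** m:
            continue
        candidates = ranked[:2 ** m]
        if pvariance([probs[w] for w in candidates]) < threshold:
            # The word's position in the candidate list is the embedded number.
            return format(candidates.index(word), f"0{m}b")
    return ""


# Eight likely words with similar probabilities, followed by eight rare ones.
top = ["I", "we", "the", "my", "this", "today", "our", "a"]
probs = {w: 0.124 - 0.001 * i for i, w in enumerate(top)}
probs.update({f"rare{i}": 0.0045 for i in range(8)})

bits_for_i = extract_word(probs, "I")
bits_for_today = extract_word(probs, "today")
```

Here the variance over the top 16 words exceeds the threshold while the variance over the top 8 does not, so three bits are read from each word; a word outside the candidate list would raise `ValueError`, which signals that sender and receiver are out of sync.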

Results and discussion
To verify the proposed scheme, we conducted extensive experiments on the VIST dataset. We use random binary sequences as secret data. We test the text quality of steganographic stories generated from photo albums chosen randomly from the dataset and carry out a security analysis. Figures 3 and 4 show examples after embedding secret bits for single and multiple input images. Compared with the original text, we can observe that the steganographic story sentences still retain the core semantic features of the image content under different embedding rates. Thus, we consider the stego text to be indistinguishable from text written by humans for the same images.

Dataset
We conduct experiments on the VIST dataset [12], which consists of 10,117 Flickr albums and 210,819 unique photos. The stories were created by workers on Amazon Mechanical Turk, where the workers were instructed to choose five images from the album and write a story about them. Every story has five sentence stories and every sentence story is paired with its appropriate image. We think that such visual stories in the dataset may be closer to the real environment.

Evaluation metrics
On the VIST dataset, we evaluate our models in terms of perplexity [23] on the validation set. We then pick the model with the best perplexity on the validation set and compute BLEU [24] and METEOR [25] on the test set to evaluate the overlap between outputs and references.
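For reference, perplexity in its standard form is the exponential of the mean negative log-probability that the model assigns to the reference tokens. The snippet below is a minimal sketch of that standard definition; the function name and the toy probabilities are illustrative and do not come from the experiments.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the reference tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that assigns probability 0.25 to every reference token is, on average,
# as uncertain as a uniform choice among four words.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower values are better, which is why the model with the lowest validation perplexity is the one selected for the BLEU and METEOR evaluation.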

Network training
We train our models using the Adam optimizer [26]. Each word is embedded into a vector of 256 dimensions. The batch size is 128, and the training set is shuffled between epochs. The learning rate is initially 0.001 and is divided by 10 when the validation accuracy stops improving. We also apply batch normalization and dropout layers to prevent overfitting and improve performance. Training the final model took around 48 h on a single Titan RTX GPU (24 GB).
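The learning-rate rule can be sketched as a small helper. This is an assumption-laden illustration: the text does not say how "stops improving" is detected, so the sketch treats any epoch whose validation score fails to beat the best earlier score as a plateau, and the function name is hypothetical.

```python
def next_learning_rate(lr, val_scores, factor=10.0):
    """Divide `lr` by `factor` when the newest validation score is not a new best.

    val_scores: validation accuracy per epoch, oldest first.
    The plateau test (no improvement over the best earlier epoch) is an assumption.
    """
    if len(val_scores) >= 2 and val_scores[-1] <= max(val_scores[:-1]):
        return lr / factor
    return lr


lr_still_improving = next_learning_rate(0.001, [0.50, 0.55])
lr_after_plateau = next_learning_rate(0.001, [0.50, 0.55, 0.55])
```

Deep learning frameworks usually ship a scheduler with a patience setting for this purpose; the helper above only shows the decision rule.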

Results
We show our results in Table 1. Huang et al. [12] proposed a baseline approach that consists of a sequence-to-sequence model, where the encoder takes the sequence of images as input and the decoder takes the last state of the encoder as its first state to generate the story. Different from our method, they use gated recurrent units (GRUs) [27] for both the image encoder and the story decoder. Yu et al. [28] proposed a model composed of three hierarchically attentive RNNs to encode the album photos and compose the story. It is worth noting that they used an additional RNN to select representative photos. Kim et al. [13] proposed a deep learning network model that generates visual stories by combining global-local (glocal) attention and context cascading mechanisms. Their model got the highest score in the human evaluation of the Visual Storytelling Challenge 2018. Overall, our VS model with MPJA obtains the best performance on perplexity, BLEU-1 and BLEU-2 (bilingual evaluation understudy) [24], and METEOR [25]. In contrast to these baselines, with the help of MPJA our model can effectively utilize the relevant parts of the images and text and is thus capable of generating better text for the given images. Further, we conducted incremental experiments to study the effect of the proposed mechanisms by adding them one at a time, as shown in Table 1. The results verify the effectiveness of the proposed mutual-perceived joint attention mechanism in modeling context representations for generating appropriate story sentences. Comparing the performance with and without MPJA, we find that, for the generation of each story sentence of a single image, the model with MPJA, which estimates the importance of each sentence in context, performs better than the approach without MPJA. The MPJA mechanism thus helps to improve the quality of visual story generation.
Note that the quality of the generated steganographic text shows a sharp drop when we start embedding the secret bitstream. It is easy to understand why the quality of the generated text decreases as the embedding rate (ER) increases: words in some positions in the sentences are not suitable for carrying too much secret information, and forced embedding only degrades the quality of the generated text. Hence, the adaptive information hiding algorithm performed better than embedding the secret bitstream directly. Because the values in the candidate list lie between 0 and 1, we usually make the threshold T 100 times larger to facilitate the comparison between the variance and the threshold. In our experiments, we usually set M = 4 and T = 250. As with the normal visual story (VS) generation model, MPJA also helps the steganographic visual story (SVS) model to achieve better performance.
Finally, Fig. 3 shows some examples of steganographic visual stories generated from a single photo with different embedding rates. Colored words represent the core words that express the content of the photos. As shown in Fig. 3, the story sentence is still fluent after embedding at a relatively high embedding rate, but as the embedding rate rises, the semantic focus of some steganographic sentences clearly deviates from the content of the images. When we use the adaptive information hiding method, the story sentence still retains the core semantic features of the image content. In Fig. 4, we choose four different albums to generate steganographic visual story sentences, and each album contains 5 photos. It can be seen that our method still generates fluent and consistent steganographic sentences with multiple input images. When the embedding process has multiple input images, they do not need to be in a fixed order. Our model automatically finds the relationship between the images, but a sequential order of input images (for example, according to their subjects and dates of execution) helps us to generate a more readable and more coherent steganographic story.

Security analysis
The security analyses are conducted from both subjective and objective aspects.
For the subjective aspect, we analyze the semantic difference between the cover texts and the stego texts by opinion scoring. Five candidates are chosen to take part in the human evaluation tests. Each candidate evaluates a text file that contains 200 sentences from two sources: one is generated by our steganography algorithm, and the other is written manually by workers based on the same image. We apply the adaptive information hiding method in this experiment, since we find that the quality of the text generated with this method is the best. We examine three aspects of the correct classification ratio: precision, recall, and accuracy. Their formulas are as follows.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP (true positives) refers to the number of normal story sentences written by humans that are correctly labeled by the volunteer. TN (true negatives) refers to the number of steganographic sentences that are correctly labeled by the volunteer. FP (false positives) is the number of steganographic sentences that are incorrectly labeled as non-stego. FN (false negatives) is the number of normal sentences that are mislabeled as sentences carrying secret data. Accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases. Smaller accuracy means higher security performance. The results are shown in Table 2. The accuracy of correct discrimination of steganographic sentences is 0.58, which is close to the result of random guessing. The steganographic story sentences generated by our method are so similar to the normal ones that they can hardly be distinguished.
To test the objective security, it is important to see whether our method is able to resist attacks from text classifiers based on semantic similarity. We conduct a series of experiments to test the difficulty of distinguishing stego texts from cover texts under different embedding rates. As shown in Fig. 5, we use the algorithm in [30] for semantic space mapping and then use the t-SNE [31] algorithm to visualize the result. The distribution of the words in two-dimensional space reflects the statistical characteristics of the texts from different aspects. We can see that there are still some deviations of the stego texts, but the overall distribution is still in the same area as the cover texts. In our experiments, the distribution of the stego texts deviates seriously from that of the cover texts as the embedding rate rises, and our adaptive information hiding method reduces this deviation. This indicates that the sentences generated by our method are almost indistinguishable in semantic space.
We also use a novel universal text steganalysis model based on a convolutional neural network. For more details about the text steganalysis methods, refer to [29]. We generate around 10,000 story sentences based on 2000 albums. Secret bits are embedded in half of these sentences with the adaptive information hiding method. Eight thousand sentences are randomly selected to form the training set, on which the text steganalysis model is built. Two thousand samples are used as the testing set to calculate the accuracy of detecting stego texts. We can see from Table 3 that the adaptive information hiding method has the lowest detection rate compared with the same language model embedded at different embedding rates. For the SVS model, a lower embedding rate yields better anti-steganalysis performance. Overall, the experiments show that the proposed schemes achieve high text quality and anti-steganalysis performance.
Fig. 5 The distribution difference between the stego text generated by SVS and the cover text without embedding in the semantic space under different embedding rates
Moreover, our proposed framework can resist existing threats from image downsampling. For example, some attackers may compress or resize the pictures. Because the secret data is carried by the text produced through cross-modal information processing, and slight image compression hardly influences the generated text, attackers cannot destroy the hidden information in the stego texts in this way.

Conclusion
In this paper, we introduce the task of automatically generating steganographic visual stories. We also propose mutual-perceived joint attention (MPJA) to model the latent relationship between the photos and the generated stories. The MPJA model also helps to improve the quality of the generated text. In addition, the proposed model employs adaptive information hiding to effectively select the most suitable words for hiding secret information of different lengths and enhances the coherence of the output via vision-aware decoding. Evaluation results show that SVS with MPJA outperforms baseline models. And the steganographic visual stories generated by our scheme are proved to be hard to be