### A review of steganography embedding techniques and principles of DCT domain steganography

There are two types of information embedding used in steganography: embedding in the spatial domain and embedding in the frequency (transform) domain [1, 4]. Spatial domain techniques modify the intensity of image pixels directly to encode the bits of the secret message. Among these methods, least-significant bit (LSB) embedding [1] is the oldest and the most popular, owing to its simplicity of implementation. A wide variety of spatial domain steganography methods has been developed; a recent example uses texture synthesis to create a cover for spatial embedding [5]. The main weakness of spatial techniques is their sensitivity to common image operations, such as geometric transformation and lossy compression.

In frequency (transform) domain steganography, the cover medium is first transformed into another domain. In image processing, "frequency" typically refers to the Fourier transform, but in steganography it usually refers to the discrete cosine transform (DCT). Proper use of DCT coefficients as information containers provides good visual quality and security [1]. One example of such techniques is the development of a uniform embedding distortion function that finds the codeword with the lowest distortion [6]. Spreading the embedded information across DCT coefficients leads to fewer image distortions and lower detection accuracy [6, 7].

Both types are combined in adaptive (or model-based) steganography, which builds on both the spatial and frequency domains with an additional layer of a mathematical model [4]. Here, data hiding may be accomplished in different domains, producing less disturbance to the cover image. The regions of the cover image used for embedding are determined by a special condition. For example, in [8], the edge regions of the image were detected and used for spatial embedding. In [9], the parameters of the edge detector were determined automatically, making spatial embedding more flexible. Similarly, local image complexity was analyzed in [10] to highlight the regions most suitable for embedding. In [11], an appropriate treatment of image pixels improved steganographic security.

A recently developed example of adaptive steganography is a technique based on the curvelet transform [12]. Low-frequency curvelet coefficients were used to provide high-quality stego-objects. In [13], the authors proposed a steganographic technique resistant to image compression, achieved by carefully analyzing the relationships between DCT coefficients. In [14], the concept of using two steganographic containers simultaneously was presented; the authors claimed that the security level was improved by analyzing two images of the same scene. Analyses of local image patches and different embedding strategies were performed in [15, 16].

The principle of DCT embedding is quite simple. DCT steganography starts by partitioning the cover image into 8 × 8 pixel blocks (Fig. 2a). For each block, the pixels are transformed into the frequency domain. This results in an 8 × 8 frequency energy matrix (also known as a coefficient matrix) that describes the block (Fig. 2b). The frequency increases from the top-left corner to the bottom-right corner of the matrix (Fig. 2c). The top-left coefficient is referred to as the zero-frequency (DC) coefficient, and it contains the average intensity of the block. The human visual system (HVS) is very sensitive to low frequencies and to distortions in them, which is why lossy compression techniques usually discard the information stored in high frequencies. Thus, to keep the secret message from being destroyed by compression while limiting its effect on visual quality, the information may be encoded using the relative values of DCT coefficients corresponding to middle frequencies. In general, DCT domain steganography techniques are more robust and less detectable than spatial techniques.
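As a minimal illustration of this principle, the sketch below embeds one bit in an 8 × 8 block by ordering a pair of mid-frequency DCT coefficients. The coefficient positions `c1`, `c2` and the separation `margin` are illustrative assumptions, not parameters taken from the cited methods.

```python
import numpy as np
from scipy.fft import dctn, idctn

def embed_bit(block, bit, c1=(3, 2), c2=(2, 3), margin=2.0):
    """Embed one bit in an 8x8 pixel block by enforcing an order on a
    pair of mid-frequency DCT coefficients (c1, c2 are assumptions)."""
    coeffs = dctn(block.astype(float), norm="ortho")
    a, b = coeffs[c1], coeffs[c2]
    hi, lo = max(a, b) + margin, min(a, b)
    # bit == 1: coefficient at c1 is the larger one; bit == 0: at c2.
    coeffs[c1], coeffs[c2] = (hi, lo) if bit == 1 else (lo, hi)
    return idctn(coeffs, norm="ortho")

def extract_bit(block, c1=(3, 2), c2=(2, 3)):
    """Recover the bit from the relative order of the coefficient pair."""
    coeffs = dctn(block.astype(float), norm="ortho")
    return 1 if coeffs[c1] > coeffs[c2] else 0
```

Because only the relative order of the two coefficients carries the bit, perturbations smaller than the chosen margin do not destroy the message, which is the basis of the robustness noted above.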

### Cover image analysis

In this subsection, the cover selection problem is analyzed. Novel procedures for efficiently selecting the best image container are described, including the use of global image characteristics to pre-filter cover image candidates, manipulations of local spatial blocks, new features, and distance metrics for block matching.

#### Global image features and complexity measures

Since the cover selection problem imposes a significant computational burden, it is a good idea to perform initial filtering of the cover image database. To do this, a set of image complexity measures and global characteristics can be evaluated. One commonly applied characteristic is the DCT complexity measure [1], shown in Eq. (1):

$$ {\mathrm{IC}}_{\mathrm{DCT}}=\sum \limits_{\left(i,j\right)\in A}\mid \mathrm{DCT}\left(i,j\right)\mid $$

(1)

where set *A* corresponds to the lowest frequency DCT cells [1]. A more sophisticated metric was proposed in [17]. The idea is to split the input image into blocks based on a predefined condition; such an approach is called a quad-tree decomposition. It was demonstrated that the most efficient measure of image complexity in this case can be evaluated as shown in Eq. (2):

$$ {\mathrm{IC}}_{\mathrm{quad}-\mathrm{tree}}=\sum \limits_{i=1}^n{\left(2{x}_i\right)}^i $$

(2)

where *n* is the number of tree levels, and *x*_{i} is the number of pixels on each level. Other commonly applied complexity measures include homogeneity (a metric related to the high-frequency content of an image), the number of corners and edges [18], and uniformity [19].
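As a sketch, the DCT complexity measure of Eq. (1) can be computed as follows, assuming the set *A* is the top-left *k* × *k* corner of the coefficient matrix (the exact shape of *A* in [1] may differ):

```python
import numpy as np
from scipy.fft import dctn

def ic_dct(image, k=8):
    """DCT image complexity measure of Eq. (1): the sum of coefficient
    magnitudes over the set A, assumed here to be the top-left
    k x k low-frequency cells of the coefficient matrix."""
    coeffs = dctn(image.astype(float), norm="ortho")
    return float(np.abs(coeffs[:k, :k]).sum())
```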

Based on the values of the image complexity measures, the most complex cover images may be selected for further analysis. However, this filtering is ineffective when the secret image is fairly uniform and has a low number of gray levels [20]: for such a secret image, filtering based on the global complexity of the cover alone is not optimal.

To make the cover image selection more generic, we propose choosing a cover image whose entropy [1] is higher than that of the secret image, as shown in Eq. (3). The entropy of an image reflects the diversity of its pixel intensities, so the preferred container is the image with the greatest entropy:

$$ \varDelta {E}_I={E}_1-{E}_2\ge 0,\kern1em {E}_k=-\sum \limits_i{p}_i\log {p}_i $$

(3)

where *p*_{i} represents the normalized bins (probabilities) of the image histogram, and *E*_{1} and *E*_{2} are the entropies of the cover and secret images, respectively. In this case, Eq. (3) implies that the amount of information required to encode the cover image is larger than that for the secret image. For more stringent filtering of the cover candidates, we propose analyzing the histogram bins as well. The following condition is checked in this case:

$$ H{F}_i\left({I}_1,{I}_2\right)=\left(H{\left({I}_1\right)}_i-H{\left({I}_2\right)}_i\right)\ge 0 $$

(4)

where *H*(*I*)_{i} is the *i*th histogram bin and *I*_{1}, *I*_{2} are the images to be compared; in our case, these are the cover and secret images, respectively. Equation (4) implies that, for each gray level, the cover image contains at least as many pixels as are needed to encode the secret image pixels of the same gray level.
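The two filtering conditions of Eqs. (3) and (4) can be sketched as follows; the use of 256 gray levels and a base-2 logarithm is an implementation assumption:

```python
import numpy as np

def entropy(image, levels=256):
    """Shannon entropy of the image histogram (Eq. (3), base-2 log)."""
    hist = np.bincount(image.ravel(), minlength=levels)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def passes_filter(cover, secret, levels=256):
    """Keep a cover only if its entropy is not lower than the secret's
    (Eq. (3)) and every histogram bin of the cover is at least as
    populated as the corresponding bin of the secret (Eq. (4))."""
    h_cover = np.bincount(cover.ravel(), minlength=levels)
    h_secret = np.bincount(secret.ravel(), minlength=levels)
    return (entropy(cover, levels) >= entropy(secret, levels)
            and bool(np.all(h_cover - h_secret >= 0)))
```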

The proposed filtering algorithm based on image entropy and histograms was compared with the DCT complexity measure and the quad-tree-based measure. For this purpose, 1000 images from the BOSS image database [21] were used. With each measure, the 100 best covers were selected. Three secret images with different average brightness and complexity levels were embedded in two steps: local block embedding in the spatial domain and hiding of the block positions in the DCT domain. The embedding principle was described in Section 2.1. The peak signal-to-noise ratio (PSNR) was used to estimate the quality of the obtained stego-images, as stated in Eq. (5):

$$ \mathrm{PSNR}=10\lg \frac{L^2}{\mathrm{MSE}} $$

(5)

$$ \mathrm{MSE}=\frac{1}{N_x\cdot {N}_y}\sum \limits_{i=1}^{N_x}\sum \limits_{j=1}^{N_y}{\left(C\left(i,j\right)-S\left(i,j\right)\right)}^2 $$

(6)

where MSE in Eq. (6) is the mean square error, *N*_{x} and *N*_{y} describe the image size, *L* is the maximal gray level of the image, and *C*(*i*, *j*) and *S*(*i*, *j*) are the pixel values of the cover image and the stego-image, respectively.
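Eqs. (5) and (6) translate directly into code; a small sketch, assuming 8-bit images (*L* = 255):

```python
import numpy as np

def psnr(cover, stego, L=255):
    """PSNR of the stego-image with respect to the cover, Eqs. (5)-(6)."""
    mse = np.mean((cover.astype(float) - stego.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(L ** 2 / mse)
```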

Figure 3 shows the series of PSNR values obtained after embedding the secret image blocks for each of the three secret images using the three filters. One can observe that using a gray-level distribution based on the entropy and histogram leads to good and stable results. In contrast, the image complexity measures are sensitive to the type of the input secret image. For example, the DCT metric demonstrates comparable visual quality for the most complex secret image (the third image) but does not provide stable results for the other secret images. The quad-tree complexity measure was suitable for secret images with a large amount of small-scale detail. Thus, the proposed entropy and histogram filtering ensures that the amount of information required to encode the cover image is larger than that for the secret image while also resulting in a high PSNR level.

#### Local spatial block analysis

Selection of the features evaluated in local blocks is important for the overall cover analysis procedure. One of the goals is to ensure that the PSNR remains high after the embedding process. According to Eq. (5), maximization of the PSNR corresponds to minimization of the MSE in Eq. (6). Since data embedding in the spatial domain is performed by local block replacement, the contribution to the MSE is determined by the pixel intensities of these blocks. Thus, the MSE in Eq. (6) may be simplified as:

$$ \mathrm{MSE}=\frac{1}{N_{\mathrm{blocks}}^{\mathrm{cover}}}\sum \limits_{iB=1}^{N_{\mathrm{blocks}}^{\mathrm{secret}}}{\mathrm{MSE}}_{iB}, $$

(7)

$$ {\mathrm{MSE}}_{iB}=\frac{1}{N_b^2}\sum \limits_{i=1}^{N_b}\sum \limits_{j=1}^{N_b}{\left(C\left(i+{i}_{i\mathrm{CB}},j+{j}_{i\mathrm{CB}}\right)-S\left(i+{i}_{i\mathrm{SB}},j+{j}_{i\mathrm{SB}}\right)\right)}^2, $$

(8)

where \( {N}_{\mathrm{blocks}}^{\mathrm{secret}}={N}_{\mathrm{pixels}}^{\mathrm{secret}}/{N}_b^2 \), \( {N}_{\mathrm{blocks}}^{\mathrm{cover}}={N}_{\mathrm{pixels}}^{\mathrm{cover}}/{N}_b^2 \), \( {N}_{\mathrm{pixels}}^{\mathrm{secret}} \) and \( {N}_{\mathrm{pixels}}^{\mathrm{cover}} \) are the numbers of pixels in the secret and cover images, *N*_{b} is the block size, MSE_{iB} is the local block MSE (local MSE), and (*i*_{iCB}, *j*_{iCB}) and (*i*_{iSB}, *j*_{iSB}) are the coordinates of the local block corner in the cover and secret images, respectively. The definition of the local MSE in Eq. (8) leads directly to using pixel intensities as the components of the local feature vector. Minimizing the Euclidean distance between the feature vectors of the cover and secret blocks then maximizes the PSNR.

Such a “direct” approach is quite different from that proposed in [3], where the authors used the mean, variance, and skewness in 2 × 2 sub-blocks of the 4 × 4 local block to form the feature vector. In the experimental section, both techniques are compared.

In order to improve the visual quality of the stego-image, the local blocks of the secret image were rotated and flipped before embedding. The orientation that provided the minimal local MSE was chosen. The possible image manipulations are defined by the following expressions, Eqs. (9)-(15):

$$ {\mathrm{dst}}_{i,j}={\mathrm{src}}_{N_b-i,j}\kern1em \left(\mathrm{vertical}\ \mathrm{flip}\right) $$

(9)

$$ {\mathrm{dst}}_{i,j}={\mathrm{src}}_{i,{N}_b-j}\kern1em \left(\mathrm{horizontal}\ \mathrm{flip}\right) $$

(10)

$$ {\mathrm{dst}}_{i,j}={\mathrm{src}}_{j,i}\kern1em \left(\mathrm{flip}\ \mathrm{over}\ \mathrm{the}\ \mathrm{main}\ \mathrm{diagonal}\right) $$

(11)

$$ {\mathrm{dst}}_{i,j}={\mathrm{src}}_{N_b-j,{N}_b-i}\kern1em \left(\mathrm{flip}\ \mathrm{over}\ \mathrm{the}\ \mathrm{anti}\hbox{-}\mathrm{diagonal}\right) $$

(12)

$$ {\mathrm{dst}}_{i,j}={\mathrm{src}}_{N_b-j,i}\kern1em \left(90{}^{\circ}\ \mathrm{rotation}\right) $$

(13)

$$ {\mathrm{dst}}_{i,j}={\mathrm{src}}_{N_b-i,{N}_b-j}\kern1em \left(180{}^{\circ}\ \mathrm{rotation}\right) $$

(14)

$$ {\mathrm{dst}}_{i,j}={\mathrm{src}}_{j,{N}_b-i}\kern1em \left(270{}^{\circ}\ \mathrm{rotation}\right) $$

(15)

here, src and dst denote the initial and resulting blocks, respectively, and (*i*, *j*) is the pixel index inside the block.
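With NumPy's 0-based indexing, the 1-based index *N*_{b} - *i* in Eqs. (9)-(15) becomes *N*_{b} - 1 - *i*, and all seven manipulations reduce to flips, rotations, and transposition. A sketch:

```python
import numpy as np

def orientations(block):
    """The identity plus the seven manipulations of Eqs. (9)-(15),
    written 0-based: N_b - i in the 1-based formulas is N_b - 1 - i here."""
    return [
        block,                 # identity (no manipulation)
        np.flipud(block),      # Eq. (9):  dst[i,j] = src[Nb-1-i, j]
        np.fliplr(block),      # Eq. (10): dst[i,j] = src[i, Nb-1-j]
        block.T,               # Eq. (11): dst[i,j] = src[j, i]
        np.rot90(block.T, 2),  # Eq. (12): dst[i,j] = src[Nb-1-j, Nb-1-i]
        np.rot90(block, -1),   # Eq. (13): dst[i,j] = src[Nb-1-j, i]
        np.rot90(block, 2),    # Eq. (14): dst[i,j] = src[Nb-1-i, Nb-1-j]
        np.rot90(block, 1),    # Eq. (15): dst[i,j] = src[j, Nb-1-i]
    ]
```

For an asymmetric block, all eight variants are distinct, so three extra bits per block suffice to code the chosen orientation.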

The important point here is that the orientation of each local block is coded in the DCT domain together with its position. Thus, the PSNR gain from improved spatial embedding comes at the cost of an increased amount of hidden information (and hence increased distortion) in the DCT domain. The required capacity (in bits) for this case is determined as shown in Eq. (16):

$$ {N}_{\mathrm{bits}}={\log}_2\left({N}_{\mathrm{blocks}}^{\mathrm{cover}}\right){N}_{\mathrm{blocks}}^{\mathrm{secret}}\left(1+\frac{1}{\left\lfloor {\log}_{10}\left({N}_{\mathrm{blocks}}^{\mathrm{cover}}\right)\right\rfloor}\right){K}_H, $$

(16)

where *K*_{H} is the Hamming coding multiplier, calculated as the ratio of the codeword length to the message length per block. In this study, *K*_{H} = 7/4.

The second term in the brackets of Eq. (16) accounts for image block manipulation. The rotation/flipping descriptors of several blocks are grouped into one decimal number with the same bit length as the numbers used for coding the hidden block indices (\( {\log}_2\left({N}_{\mathrm{blocks}}^{\mathrm{cover}}\right) \)). The number of blocks whose orientation descriptors are grouped together is determined by the denominator of the second term in the brackets of Eq. (16), where the floor function rounds down to the nearest integer. The number of DCT coefficient pairs required for embedding is easily calculated from Eq. (16) for a given cover image size.
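Eq. (16) can be evaluated directly; a small sketch, with the study's *K*_{H} = 7/4 as the default:

```python
import math

def required_bits(n_cover_blocks, n_secret_blocks, k_h=7 / 4):
    """DCT-domain capacity required by Eq. (16): block-index bits,
    the grouped orientation-descriptor overhead, and the Hamming
    coding multiplier K_H."""
    index_bits = math.log2(n_cover_blocks)
    group_size = math.floor(math.log10(n_cover_blocks))
    return index_bits * n_secret_blocks * (1 + 1 / group_size) * k_h
```

For example, a 512 × 512 cover and a 32 × 32 secret image with 4 × 4 blocks give \( {N}_{\mathrm{blocks}}^{\mathrm{cover}}=16384 \) and \( {N}_{\mathrm{blocks}}^{\mathrm{secret}}=64 \), i.e., 14 × 64 × 1.25 × 1.75 = 1960 bits.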

The effect of image block manipulation on the PSNR of the stego-image was illustrated with an experiment on 50 cover images taken from the BOSS database. The images were chosen by filtering 1000 images with the image entropy and histogram analysis (Section 2.2.1). The image shown in Fig. 2a (resized to 32 × 32 pixels) was used as the secret image. Figure 4 illustrates the PSNR before and after embedding in the DCT domain, with and without image block manipulations. The block size was fixed at 4 × 4. The PSNR before DCT embedding when using rotation/flipping of the local blocks (green bars) was always the highest. For most cover images (92%), the PSNR after DCT embedding with image block manipulation (red bars) was also higher than without it (cyan bars). Visual analysis (comparing Fig. 4 e and f with Fig. 4d) also showed a noticeable improvement. Based on this result, manipulation of local blocks is proposed as a stable improvement to secret image embedding in the combined technique.

The block size *N*_{b} also had a noticeable impact on the embedding quality. The visual quality of the stego-image after spatial embedding depended on the block size. In Eq. (16), \( {N}_{\mathrm{blocks}}^{\mathrm{secret}} \) and \( {N}_{\mathrm{blocks}}^{\mathrm{cover}} \) depend on *N*_{b}, so the amount of information to be encoded in the DCT domain is also determined by the block size. As in the case of image block manipulation, the two factors influence the PSNR in opposite directions. Thus, an experiment similar to the PSNR experiment above was conducted with 100 cover images randomly collected from the BOSS database and a single secret image (Fig. 2a) of size 32 × 32. Blocks of 2 × 2, 4 × 4, and 8 × 8 were tested. Figure 5a demonstrates that the PSNR values after embedding in the spatial domain alone decreased as *N*_{b} increased. However, after embedding in the DCT domain, the situation was not as clear-cut (Fig. 5b). Comparing the mean values and standard deviations (Fig. 5), the 2 × 2 size demonstrated the worst performance, as most of the information was encoded in the DCT domain. For the 4 × 4 blocks, higher mean PSNR values were obtained, along with a higher variance. For the 8 × 8 blocks, both the variance and the mean PSNR value were lower. The 4 × 4 blocks were used for further analysis in order to enable comparison with [3].

Now, the actual embedding procedure will be described. The distance in Eq. (8) was calculated, over the different local block manipulations, for all the blocks of the cover and secret images in search of the minimum. This yielded the coordinates of the local blocks for embedding along with the orientation of each block of the secret image. Spatial embedding was performed using these data, and the position and orientation of all the blocks were then coded in the DCT domain [1,2,3]. The number of bits required for coding the index of each block in the cover image was calculated as \( {\log}_2\left({N}_{\mathrm{blocks}}^{\mathrm{cover}}\right) \). The rotation and flipping indices were grouped together so as to be described with the same number of bits (see the description of Eq. (16)).
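A minimal sketch of this search, assuming the direct intensity features described above; for brevity only two of the eight orientations are tried, and the brute-force scan is left unoptimized:

```python
import numpy as np

def blocks(img, nb):
    """Non-overlapping nb x nb blocks with their top-left corners."""
    h, w = img.shape
    return [((i, j), img[i:i + nb, j:j + nb])
            for i in range(0, h - nb + 1, nb)
            for j in range(0, w - nb + 1, nb)]

def match_blocks(cover, secret, nb=4):
    """For each secret block, pick the cover block and orientation
    minimizing the local MSE of Eq. (8); returns (corner, orientation)
    per secret block, i.e., the data later hidden in the DCT domain."""
    cover_blocks = blocks(cover.astype(float), nb)
    plan = []
    for _, s in blocks(secret.astype(float), nb):
        best = None
        for corner, c in cover_blocks:
            for oid, s_o in enumerate((s, np.rot90(s, 2))):
                mse = np.mean((c - s_o) ** 2)
                if best is None or mse < best[0]:
                    best = (mse, corner, oid)
        plan.append((best[1], best[2]))
    return plan
```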

#### Similarity measure of local blocks

Local block similarity is calculated based on the distance between feature vectors. A commonly used metric is the Euclidean distance as stated in Eq. (17):

$$ {d}_{ij}={\left[\sum \limits_l{\left({u}_l-{v}_l\right)}^2\right]}^{1/2}, $$

(17)

where *i*, *j* are the indices of the block in the cover and the secret image, *u*, *v* are the corresponding feature vectors, and *l* is the index of the feature vector component. In [3], the authors used statistical moments of sub-blocks to form the local block’s feature vector. The most suitable container was chosen by the maximum number of most similar blocks. However, such an approach does not guarantee that all blocks of the secret image have similar blocks in the chosen cover image. Thus, the PSNR may become quite low in some situations. Instead of calculating the number of the most similar blocks, a novel distance metric is proposed as in Eq. (18):

$$ {\overline{d}}_k=\frac{1}{N_{\mathrm{blocks}}^{\mathrm{secret}}}\sum \limits_j{d}_{\mathrm{min}}^j $$

(18)

where \( {d}_{\mathrm{min}}^j \) is the minimal distance evaluated for the *j*th block of the secret image in the analyzed cover image *k*. Equation (18) takes all the blocks of the secret image into account, so it provides more stable results.

The proposed measure in Eq. (18) directly corresponds to maximizing the PSNR for spatial embedding (compare Eqs. (18) and (8)) but does not consider DCT embedding. To overcome this, the following algorithm is proposed. Each \( {d}_{\mathrm{min}}^j \) yields the corresponding block index *i* in the current cover *k*. With all the indices available, the bit sequence for embedding in the DCT domain can be formed. This makes it possible to simulate the embedding and to calculate the mean MSE over all the DCT blocks used for embedding, denoted \( {\overline{S}}_k \). The resulting complex measure is the product of \( {\overline{d}}_k \) and \( {\overline{S}}_k \):

$$ {D}_k={\overline{d}}_k{\overline{S}}_k $$

(19)
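The measures of Eqs. (17)-(19) can be sketched as follows, with block feature vectors stored as array rows; the DCT-side mean MSE \( {\overline{S}}_k \) is assumed to be computed elsewhere and passed in as a scalar:

```python
import numpy as np

def mean_min_distance(cover_feats, secret_feats):
    """Eq. (18): average over secret blocks of the minimal Euclidean
    distance (Eq. (17)) to any cover block's feature vector."""
    d = np.linalg.norm(
        secret_feats[:, None, :] - cover_feats[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def complex_measure(d_mean, s_mean):
    """Eq. (19): product of the mean minimal distance and the mean
    MSE of the DCT blocks used for embedding in cover k."""
    return d_mean * s_mean
```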

The effectiveness of the proposed measure was confirmed experimentally. A set of 100 cover image candidates from the BOSS database was used along with one secret image of size 64 × 64. Cover selection was performed using the algorithm based on the complex measure of Eq. (19) (Fig. 6a), the algorithm based on the maximum number of most similar blocks [2, 3] (Fig. 6b), and random selection (Fig. 6c). The results are presented as distributions of the distance (Eq. (17)) over all the blocks of the secret image. To allow a simple comparison, the distributions were normalized by the maximum distance over all three distributions (found for the randomly chosen cover, see Fig. 6c). Comparison of Fig. 6 a and b reveals that the proposed measure, based on the mean distance and the PSNR, provided a smoother distribution with fewer peaks.

#### Most suitable cover image (MSCI)

The procedures above have been combined into a cover selection framework called the most suitable cover image (MSCI) framework. As its name suggests, the framework adaptively (for a given secret image) picks the best cover image from an image database. Figure 7 illustrates the main components and flow of MSCI. The database processing step is carried out only once for each database image in order to minimize the execution time. All necessary features (global and local) are extracted from all the cover candidates and saved in the feature database. Global features are represented by the entropy and the histogram (Section 2.2.1). The pixel intensities of each block are utilized as local features (Section 2.2.2).

When a secret image is to be embedded, the same features are extracted from it. Global filtering is accomplished by removing the images that yield negative values of the entropy and histogram metrics in Eqs. (3) and (4) (Section 2.2.1). Cover images that pass this step are further referred to as cover candidates. Their blocks are analyzed using the proposed intensity-based local features. The cover that provides the lowest complex measure of Eq. (19) (Section 2.2.3) is chosen as the most suitable cover image.

The computational complexity of the MSCI framework can be estimated as follows: matching a single pair of blocks involves \( 2{N}_b^2 \) operations. The total number of operations required for matching the given secret image against all cover candidates is determined as shown in Eq. (20):

$$ {N}_{\mathrm{operations}}=\sum \limits_k2{N}_b^2{N}_{\mathrm{blocks}}^{\mathrm{secret}}{N}_{\mathrm{blocks}}^{k\hbox{-}\mathrm{th}\ \mathrm{cover}}=\frac{2{N}_{\mathrm{pixels}}^{\mathrm{secret}}}{N_b^2}\sum \limits_k{N}_{\mathrm{pixels}}^{k\hbox{-}\mathrm{th}\ \mathrm{cover}} $$

(20)

here, \( 2{N}_b^2 \) is the number of operations required to match a single pair of blocks, and *k* indexes the cover images. Assuming all cover images contain the same number of pixels, Eq. (20) simplifies to Eq. (21):

$$ {N}_{\mathrm{operations}}=\frac{2{N}_{\mathrm{pixels}}^{\mathrm{secret}}{N}_{\mathrm{pixels}}^{\mathrm{cover}}{N}_{\mathrm{cover}\mathrm{s}}}{N_b^2} $$

(21)

The number of operations required for local block matching is proportional to the number of cover images, *N*_{covers}. Therefore, the complexity of MSCI is \( O\left({N}_{\mathrm{cover}\mathrm{s}}{N}_{\mathrm{pixels}}^{\mathrm{secret}}{N}_{\mathrm{pixels}}^{\mathrm{cover}}\right) \). The advantage of MSCI is that this complexity can be reduced by tuning the parameters of the global filtering step, i.e., by controlling the number of cover candidates. Thus, a balance between the algorithm performance and the visual quality and security of the resulting stego-images can be found.
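Eq. (21) is a simple product; a sketch, treating the result as an integer operation count:

```python
def n_operations(n_secret_pixels, n_cover_pixels, n_covers, nb=4):
    """Eq. (21): total block-matching operations, assuming all covers
    contain the same number of pixels."""
    return 2 * n_secret_pixels * n_cover_pixels * n_covers // nb ** 2
```

For a 32 × 32 secret image, 100 cover candidates of 512 × 512 pixels, and *N*_{b} = 4, this gives about 3.4 × 10^9 elementary operations, which motivates the pre-filtering step.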

### Related work

Cover image selection was first proposed in [2]. The cover image was divided into a sequence of local blocks of size 4 × 4. The sample mean and variance were calculated for each block. This operation was applied to images decomposed with Gabor filters of different scales and orientations. Thus, each local block was described by a 24-dimensional feature vector (two features for each of the 12 Gabor images). For each block of the secret image, the closest block from a set of cover images was found (based on the Euclidean distance between feature vectors). The cover image with the maximum number of closest blocks was chosen as the most suitable container for a given secret image.

An improvement to this technique was proposed in [3]. The authors used a different approach to feature vector construction. The mean, variance, and skewness were calculated for the 2 × 2 sub-blocks within each 4 × 4 block. Mean pixel values from the local neighborhood were used to improve visual quality. As a result, a 16-dimensional feature vector (three features for each of the four sub-blocks plus four features from the neighborhood) was constructed for each block. The cover selection procedure was similar to that described above. Using the information from the local neighborhood improved the visual quality of the stego-image.

An interesting analysis of cover image embedding capacity was conducted in [20]. Here, the authors studied the resistance of different image containers to steganalysis attacks. Specifically, the relation between image complexity and embedding capacity was analyzed, and a safe capacity was determined for each image container. Recent steganalysis algorithms utilize machine-learning approaches for the classification of cover and stego-images. Support vector machines (SVM) are the tool of choice for such problems [22]. However, in [23], it was demonstrated that using ensemble classifiers based on random forests provides comparable performance with significantly lower training complexity.

In contrast to the methods introduced above, there is a group of algorithms based on cover image analysis that do not consider a specific secret image. In [24], the authors introduced an agent-based system for image feature analysis. Images with the highest contrast and high entropy were chosen as the most suitable candidates; however, no steganalysis tests were conducted in that work. The dependence of steganography performance on the global characteristics of the cover image was studied in [25]. A specific cover selection technique based on the analysis of the correlation coefficient within the cover image was proposed in [26]. The authors represented the cover image as a Gauss-Markov process and demonstrated that images with smaller correlation coefficients led to lower detectability. Unfortunately, only embedding rates up to a maximum of one were considered, and the technique was limited to spatial domain steganography.

A series of experiments on cover selection and steganalysis was performed in [27]. Steganographic security was analyzed for three different scenarios: the cover image selector having no knowledge, partial knowledge, or full knowledge of the applied steganalyzer type and its classification principle. A set of different image quality metrics was tested together with different steganalysis tools. Proper use of knowledge about the steganalyzer type decreased the detection rate.

The steganalysis technique in [28] used features based on intra- and inter-block DCT correlations, while [29] used a fusion of DCT-based and Markov-based features. In [30], the authors obtained receiver operating characteristic (ROC) curves to evaluate classifier performance. Furthermore, two recent steganalysis algorithms were proposed in [31, 32]. The authors of [31] proposed a new feature set for the steganalysis of JPEG images with lower complexity, lower dimensionality, and better performance than other JPEG domain steganalysis features. In [32], a steganalysis feature extraction technique based on 2D Gabor filters was proposed for adaptive JPEG steganography; its detection performance was shown to be effectively enhanced compared with other methods.

In [33], the authors proposed a cover selection method that is secure against both single-object steganalysis and pooled steganalysis at the same time. Additionally, a detailed explanation of steganography security at the image level and the individual level was given, pointing out the theoretical weakness of existing cover selection methods at the individual level. The authors reduced the difference between the selected cover images and the whole set of possible images by using the maximum mean discrepancy (MMD) distance during cover selection, thereby enhancing security at the individual level. Experimental results demonstrated that security at both the individual level and the image level is assured.

Other interesting recent works address text localization and web image understanding [34, 35]. In [34], a novel solution was proposed for fast text localization in complex backgrounds, which is considered a challenging problem. The solution was based on two effective algorithms: the stroke-specific FASTroke keypoint detector and a component similarity clustering algorithm. Performance results showed that this method outperformed existing solutions: FASTroke generates less than twice the number of components, and at least 10% more characters are recognized. The problem of image understanding was addressed in [35] by proposing a cross-modality bridging dictionary. Images were treated as probability distributions of semantic groups over visual appearances. Moreover, the probability distributions were transferred to related categories via the proposed knowledge-based semantic propagation. Experimental results showed the effectiveness of this method, which outperformed state-of-the-art methods on four public datasets.

A new spatial-temporal attention model (STAT) was introduced in [36] to address erroneous recognition in videos and missing description details. The STAT model was proposed within an encoder-decoder neural network for video captioning. It attends to both the spatial and temporal structure of a video, selecting the significant regions within a subset of frames, rather than merely a subset of frames, for accurate word prediction. Performance results showed that STAT generated detailed and accurate descriptions of videos and achieved state-of-the-art performance on two well-known benchmarks.