Semi-structured data protection scheme based on robust watermarking

Semi-structured data is a widely used text format for data interchange and storage. This paper proposes a robust watermarking scheme for protecting semi-structured data, using the JSON format as an illustrative example. We first parse the JSON file into a data structure of distinct pairs. We then generate a transfer matrix to obtain intermediate sequences, which are encoded using error-correction codes and embedded into the pairs. A private key is shared by the data hider and the recipient to resist collusion attacks. On the recipient's side, data extraction can be carried out successfully even if the received stego data have been tampered with. Imperceptibility is achieved by embedding data into the less significant digits of numeric values in the cover file. The proposed scheme can be extended to several other formats. Experimental results show that the proposed scheme is robust to typical attacks such as contextual truncation, modification, and redundancy addition.


Introduction
Semi-structured data are widely used for information storage and data interchange because of their convenience and efficiency in data transmission. They are a form of data that contains tags to separate semantic elements and allows hierarchies of records and fields within the data. Typical formats for semi-structured data include Extensible Markup Language (XML) and JavaScript Object Notation (JSON). Nowadays, the dissemination of digital content and data has brought about a series of problems, for example, contextual tampering and illegal duplication. The prevalence of semi-structured data highlights the importance of protecting the data and their copyrights.
Digital watermarking has been used in copyright protection and content authentication for various kinds of data formats. Secret data are embedded into the cover data in an imperceptible way and can be successfully extracted regardless of attacks during data transmission. Robustness and fidelity are two basic requirements for digital watermarking. The embedded watermark must be robust against a variety of possible attacks, such as contextual truncation or tampering, that can destroy the watermark. Meanwhile, the watermarked data need to be close enough to the original data that users can barely distinguish between them. Generally, there is a trade-off between robustness and fidelity: stronger robustness usually causes lower fidelity, and vice versa.
In the last decades, researchers have mainly focused on advancing watermarking technology for multimedia, including algorithms for audio [1][2][3], image [4][5][6][7][8], and video [9][10][11]. Researchers have also proposed many watermarking schemes for texts and databases. For text watermarking, a typical method embeds data based on the document structure by shifting rows and columns [12]. Among semantic schemes, He et al. [13] use synonym substitution and a chaotic map to embed the watermark. Among syntactic schemes, Mali et al. [14] propose a technique that uses grammatical rules, such as conjunctions and pronouns, to generate the watermark. Image-based schemes [15] first convert the text into an image. For specific document formats, Zhang et al. [16] define new characters using TrueType for Microsoft Word to embed the watermark into Word documents, and Kuribayashi et al. [17] propose a scheme that uses the spaces between words and lines in the PDF format. Database watermarking schemes can be classified into two types: distortion-based watermarking and distortion-free watermarking [18,19]. Distortion-free watermarking uses the data itself to generate watermarks and is generally fragile. In contrast, distortion-based watermarking schemes modify the data with minimal impact to embed the watermark. Watermarking technology that utilizes contextual redundancy [20] embeds the watermark in redundant bits or meaningless strings. Technology based on data classification [21,22] finds the classification data in the database, i.e., data that take a limited set of values, and uses them for embedding. Recently, some schemes based on reversible watermarking have been proposed. Zhang et al. [23] use expansion on the data error histogram. Jawad et al. [24] propose a reversible scheme based on difference expansion, which estimates the distortion using the mean and standard deviation of the watermark relationship.
However, little effort has been devoted to copyright protection for semi-structured data. The abovementioned schemes for text watermarking are generally inapplicable to, or ineffective on, semi-structured data. Also, these schemes are generally not robust enough against some typical attacks such as deletion. In this paper, we propose a robust watermarking scheme for semi-structured data protection. For simplicity, we take the JSON format as a representative to present the scheme.
The cover data are preprocessed ahead of data embedding, which generates the valid embedding pairs for the file. We generate pseudorandom sequences using the hashes of the keys of the pairs as seeds. We then segment the watermark data and construct a transfer matrix to transfer the segments into intermediate sequences. The sequences are further encoded using error-correction coding and embedded into the pairs. A private key is shared by the data hider and the recipient to resist collusion attacks. On the recipient's side, data extraction can be carried out successfully even if the received watermarked data have been modified. The experimental results show that the proposed scheme is robust to various kinds of typical attacks. The proposed scheme can be easily extended to other typical formats, e.g., sheet-formatted data such as Microsoft Excel or comma-separated values (CSV). Figure 1 illustrates an application of the proposed scheme. The data hider embeds different watermarks for different users before sharing the cover files. The watermarks are imperceptible and do not affect normal usage of the file. If a recipient illegally uploads the data to the Internet, the data hider can locate the source of the information leakage by extracting the watermark from the leaked file.
The rest of the paper is organized as follows. In Section 2, we present the embedding and extraction procedures of the proposed scheme. Experimental results are presented in Section 3, and Section 4 concludes the paper.
We now describe the proposed watermarking scheme for semi-structured data. An overview of the embedding and extraction procedures is shown in Fig. 2.
The right-going flows represent the procedures of watermark embedding, and the left-going flows represent the procedures of watermark extraction. Note that matrices and sets are written in boldface, and sequences are represented in italic.

Preprocessing
Denote the watermark sequence as $W$ with length $l_w$. We divide $W$ into $k$ segments, $W = \{w_1, w_2, w_3, \dots, w_k\}$. The length of each $w_i$ is denoted as $p$, where $p = l_w / k$. Figure 3a provides an example of a cover JSON file. A JSON file employs a hierarchical or parallel format for data storage. The elements in a JSON file can be categorized into three types, namely, JSON objects that hold data in a hierarchical way, JSON arrays that hold elements in a parallel way, and JSON primitives that directly hold data. JSON objects and JSON primitives contain pairs of keys and values, while JSON arrays carry parallel data without having their own keys. We use a tree structure to represent the hierarchies of data in JSON, as shown in Fig. 3b, where the received JSON data is considered the root JSON object, and data are stored in all the leaves.
We parse the cover JSON file, denoted as $J$, into pairs $P$, which are the basic units for data embedding in the scheme. Each pair consists of a key and a value. We first initialize $P$ as an empty set and build the tree structure $T$ for $J$ to obtain the hierarchical relation. Then we start from the root of $T$ and go through all nodes in the tree. If we meet a JSON object during traversal, we append its key to the end of prefix, where prefix is initially an empty string. Then we append a "_" to represent the hierarchical relation and continue to visit the children of this node. If we meet a JSON array, we simply visit its elements in parallel. If we meet a JSON primitive, we append its key to prefix. The pair of the corresponding leaf node is $p_i = \{\text{prefix}, \text{value}\}$, where value is the value of that leaf node. Then we backtrack and continue traversing until all the nodes are visited. After parsing, $P = \{p_0, \dots, p_q\}$ consists of all pairs constructed from the leaf nodes of the tree, where $q$ denotes the number of leaf nodes. Fig. 3c illustrates the parsing result of the file in Fig. 3a.
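The traversal above can be sketched in Python as follows. This is only an illustrative sketch: the function name `flatten` is hypothetical, and the handling of arrays (elements reuse the prefix of their parent, so sibling elements may yield identical keys) is an assumption, since the text does not say how keys of array elements are made distinct.

```python
import json

def flatten(node, prefix=""):
    """Yield (key_path, value) pairs for every leaf of a parsed JSON tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            if isinstance(child, (dict, list)):
                # Non-leaf child: extend the prefix and mark the hierarchy with "_".
                yield from flatten(child, prefix + key + "_")
            else:
                # JSON primitive: its key completes the path.
                yield prefix + key, child
    elif isinstance(node, list):
        # JSON array: elements are visited in parallel and add no key (assumption).
        for child in node:
            yield from flatten(child, prefix)
    else:
        # Primitive stored directly inside an array.
        yield prefix, node

doc = json.loads('{"name": "x", "pos": {"lat": 1.5, "lon": 2.5}, "tags": [{"id": 7}]}')
pairs = list(flatten(doc))
```

For this small document the leaves are reported under the keys `name`, `pos_lat`, `pos_lon`, and `tags_id`.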
To avoid semantic changes, we only hide data in the pairs $P_{num}$ whose values contain only numeric data. Other pairs are excluded from data embedding, and their values remain unchanged. Also, the data hider can define a set of pairs $P_{ban}$ that are excluded from data embedding by specifying their keys. For example, telephone numbers, IDs, or timestamps should be excluded since they do not tolerate any modification. Therefore, we only embed data into the remaining pairs $P' = P_{num} - P_{ban}$. We denote the valid pairs as $x'_i \in P'$. Further, to alleviate embedding distortion, the data hider can define a maximum accepted modification tolerance $T$. Denote the value of $x'_i$ as $V(x'_i)$ and the number of leading digits excluded from embedding as $b$, which is determined from $T$ by (1), where $\lfloor x \rfloor$ denotes the largest integer not greater than $x$. For example, when the tolerance is set as 0.1, $b = 2$, so we exclude the leading two digits from the embedding procedure.
Denote the required embedding length as $s$. We do not embed data into a pair if $\mathrm{len}(V(x'_i)) - b \le s$, where $\mathrm{len}(x)$ represents the length of string $x$. Such pairs are therefore excluded from $P'$ as well. The calculation of $s$ will be discussed in Section 2.2.
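A minimal sketch of this filtering step is given below. The function name `valid_pairs` is hypothetical, $b$ is passed in as a parameter rather than derived from $T$, and the length of a value is taken to be the length of Python's default string form of the number; all three are assumptions made only for illustration.

```python
def valid_pairs(pairs, banned_keys, b, s):
    """Keep pairs that are numeric, not banned, and long enough to carry s bits."""
    kept = []
    for key, value in pairs:
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # non-numeric values are never modified
        if key in banned_keys:
            continue  # excluded by the data hider (e.g., IDs, timestamps)
        if len(str(value)) - b <= s:
            continue  # too short once the b leading digits are protected
        kept.append((key, value))
    return kept

pairs = [("lat", 31.230416123), ("id", 123456789012), ("n", 42),
         ("name", "x"), ("ok", True)]
kept = valid_pairs(pairs, banned_keys={"id"}, b=2, s=7)
```

Here only `lat` survives: `id` is banned, `n` is too short, and the remaining values are not numeric.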
We further apply a hashing algorithm to the key of each $x'_i$ so that each pair has a unique hash $h'_i$. The hash is computed by (2), where $(x'_i)_k$ denotes the ASCII code of the $k$-th character of the key of $x'_i$, and $\sigma$ is a bias parameter, which is set as the private key $K$ shared between the data hider and the recipient. $h'_i$ is ensured to be unique in the cover data. Afterwards, we generate a set of sequences $S_{rand} = \{S_1, S_2, \dots, S_n\}$, where $S_i$ is pseudorandomly generated using $h'_i$ as the seed, and the elements of $S_i$ lie within the range $[0, 1]$. Finally, the pairs in $P'$ are sorted in ascending lexicographical order.
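The following sketch illustrates how a keyed hash of a pair's key can seed a reproducible pseudorandom sequence. The polynomial hash used here is a stand-in and not the hash of (2); the names `key_hash` and `pseudo_sequence`, the multiplier 31, and the modulus are hypothetical, and this stand-in does not by itself guarantee unique hashes.

```python
import random

def key_hash(key, sigma, modulus=2**32):
    """Stand-in keyed hash: sigma (the private key) biases a polynomial hash."""
    h = sigma
    for ch in key:
        h = (h * 31 + ord(ch)) % modulus  # ord() gives the character code
    return h

def pseudo_sequence(key, sigma, length):
    """Sequence S_i with elements in [0, 1), seeded by the keyed hash."""
    rng = random.Random(key_hash(key, sigma))
    return [rng.random() for _ in range(length)]

s_embed = pseudo_sequence("pos_lat", sigma=1234, length=5)
s_extract = pseudo_sequence("pos_lat", sigma=1234, length=5)
```

Because the seed depends only on the key string and the shared private key, the recipient regenerates exactly the sequence used by the data hider, while a party without the private key obtains a different seed.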

Watermark embedding (Fig. 4)
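We first construct an intermediate sequence $M$ consisting of $n$ segments, $M = \{m_1, m_2, m_3, \dots, m_n\}$. The length of each $m_i$ is also $p$.
Afterwards, we define a transfer matrix $G$ as follows:
$$G = \begin{bmatrix} g_{1,1} & \cdots & g_{1,n} \\ \vdots & \ddots & \vdots \\ g_{k,1} & \cdots & g_{k,n} \end{bmatrix},$$
where $G$ is a $k \times n$ sparse matrix in which each $g_{i,j}$ equals either 0 or 1.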
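The construction of the sparse matrix $G$ is specified as follows. First, we generate the cumulative distribution function $\mathrm{cdf}_{RSD}$ of the robust soliton distribution (RSD). In its standard form, the probability $P_i$ of the RSD is
$$P_i = \frac{\rho_i + \tau_i}{\sum_{j=1}^{k} (\rho_j + \tau_j)}, \quad i = 1, \dots, k,$$
where $\rho_i$ is the ideal soliton distribution,
$$\rho_i = \begin{cases} 1/k, & i = 1, \\ 1/(i(i-1)), & i = 2, \dots, k, \end{cases}$$
and $\tau_i$ is defined as
$$\tau_i = \begin{cases} R/(ik), & i = 1, \dots, k/R - 1, \\ R \ln(R/\delta)/k, & i = k/R, \\ 0, & i = k/R + 1, \dots, k, \end{cases}$$
where $R = c \sqrt{k} \ln(k/\delta)$ denotes the expected ripple size. For the $i$-th column of $G$, we flip a certain number of distinct entries $g_{j,i}$ from zero to one using the pseudorandom sequence $S_i$. Here, $a[b]$ denotes the $b$-th element of sequence $a$, and $x_i$ determines the number of entries $g_{j,i}$ that are flipped.
Because $x_i$ is obtained through $\mathrm{cdf}_{RSD}$, the number of ones in each column also follows the robust soliton distribution.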
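As a rough illustration, the sketch below builds the cumulative RSD and derives one column of $G$ from a pseudorandom sequence. Several details are assumptions rather than the paper's exact rule: $k/R$ is truncated to an integer and clamped to $[1, k]$, a negative spike term is clamped to zero, the degree is read from the first element of the sequence by inverse-CDF lookup, and the following elements pick distinct rows. The function names are hypothetical.

```python
import bisect
import math
import random

def rsd_cdf(k, c=0.1, delta=0.5):
    """Cumulative robust soliton distribution over degrees 1..k."""
    R = c * math.log(k / delta) * math.sqrt(k)
    rho = [0.0] * (k + 1)
    tau = [0.0] * (k + 1)
    rho[1] = 1.0 / k
    for i in range(2, k + 1):
        rho[i] = 1.0 / (i * (i - 1))
    pivot = max(1, min(k, int(k / R)))  # integer stand-in for k/R (assumption)
    for i in range(1, pivot):
        tau[i] = R / (i * k)
    tau[pivot] = max(0.0, R * math.log(R / delta) / k)  # guard for tiny k
    beta = sum(rho) + sum(tau)
    cdf, acc = [], 0.0
    for i in range(1, k + 1):
        acc += (rho[i] + tau[i]) / beta
        cdf.append(acc)
    return cdf

def build_column(k, seq, cdf):
    """0/1 column of G: seq[0] picks the degree, later elements pick the rows."""
    degree = min(k, bisect.bisect_left(cdf, seq[0]) + 1)
    rows = list(range(k))
    column = [0] * k
    for u in seq[1:degree + 1]:
        j = rows.pop(int(u * len(rows)))  # pop() keeps the chosen rows distinct
        column[j] = 1
    return column

rng = random.Random(7)
seq = [rng.random() for _ in range(17)]  # needs at least k + 1 elements
cdf = rsd_cdf(16)
column = build_column(16, seq, cdf)
```

With $k = 16$, $c = 0.1$, and $\delta = 0.5$, the ripple size is about 1.39, so most of the probability mass falls on small degrees and the resulting matrix is sparse.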
We further define the procedure of transferring $W$ to $M$ as
$$M = W \otimes G.$$
Here, $\otimes$ represents the XOR operation between matrices, in which each $m_i \in M$ is calculated by
$$m_i = \bigoplus_{j=1}^{k} g_{j,i} \, w_j,$$
that is, $m_i$ is the XOR of all segments $w_j$ with $g_{j,i} = 1$. Then, the intermediate sequence $M$ can be generated. Afterwards, we iteratively embed $M$ into the pairs $P'$. For an intermediate packet $m_i$, we apply cyclic coding as the error-correction code. The to-be-embedded packet is generated as $t_i = \mathrm{Cyc}(m_i)$, where the length of $t_i$ is denoted as $s$ ($s > p$). Next, we convert the value of $x'_i$ from a number to a string and use least significant bit (LSB) modification to conduct data embedding. Denote the embedded version of $x'_i$ as $y'_i$. We embed $s$ bits into the value of a pair. Denote the value as $X$ and the to-be-embedded bits as $a$. According to $S_i$, $a[j]$ is embedded at position $(\Delta + j) \bmod \mathrm{len}(X)$, where $\Delta = S_i[1] + x_i$, which yields the watermarked value $Y$. We repeat this operation until all $s$ bits are embedded into the pair and then iteratively apply the same embedding procedure to all the pairs in the cover file.
Finally, we get the watermarked pairs as $Y' = \{y'_1, y'_2, \dots, y'_n\}$. At last, we restore the parsed pairs to the original format. We go through all the nodes of the tree structure, and if a value differs from the corresponding value in $Y'$, we modify it to $y'_i$. The watermarked JSON file $J'$ is thus obtained.
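The two coding steps, XOR combination through $G$ and cyclic encoding, can be sketched as follows. The generator polynomial $g(x) = x^3 + x + 1$ and the systematic form of the code are assumptions chosen to match $p = 4$ and $s = 7$; the text does not name the cyclic code it uses. The digit-level embedding rule is not reproduced here, and all function names are hypothetical.

```python
G_POLY = 0b1011  # g(x) = x^3 + x + 1, an assumed generator of a (7,4) cyclic code

def poly_mod(value, nbits=7):
    """Remainder of a GF(2) polynomial (given as an integer) modulo G_POLY."""
    for shift in range(nbits - 4, -1, -1):
        if value & (1 << (shift + 3)):
            value ^= G_POLY << shift
    return value

def cyc_encode(m):
    """Systematic encoding: 4 message bits followed by 3 parity bits."""
    shifted = m << 3
    return shifted | poly_mod(shifted)

def cyc_is_valid(t):
    """A 7-bit word is a codeword exactly when it is divisible by g(x)."""
    return poly_mod(t) == 0

def lt_encode(segments, G):
    """m_i is the XOR of every segment w_j whose entry G[j][i] equals 1."""
    k, n = len(G), len(G[0])
    packets = []
    for i in range(n):
        m = 0
        for j in range(k):
            if G[j][i]:
                m ^= segments[j]
        packets.append(m)
    return packets

segments = [0b1010, 0b0110, 0b0001]    # k = 3 watermark segments of p = 4 bits
G = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]  # toy 3 x 3 transfer matrix
packets = lt_encode(segments, G)
coded = [cyc_encode(m) for m in packets]
```

Since this particular code has minimum distance 3, flipping one or two bits of a coded packet always produces an invalid word, which is what lets the recipient discard tampered pairs.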

Watermark extraction
When the recipient gets the suspect semi-structured data $J^*$, the hidden watermark can be detected and extracted. The flowchart of the extraction procedure is depicted in Fig. 5. Note that the recipient is assumed to know the private key used in watermark embedding for the generation of the random sequences; otherwise, he cannot perform the watermark extraction. The recipient first conducts preprocessing, which parses $J^*$ into $P^*$. He applies the same hashing algorithm to the key and generates the pseudorandom sequence $S_i$, which is the same as the sequence used during data embedding. Afterwards, he starts watermark extraction by initializing $G$ as an empty $k \times n$ matrix. Then, $S_i[1]$ is used to get $x_i$, and $[S_i[1] + 1, \dots, S_i[1] + x_i]$ are used to flip $g_{j,i}$, as introduced in Section 2.2. In this way, he obtains the exact transfer matrix $G$ used for data embedding.
Afterwards, the pairs in $P^*$ are sorted in ascending lexicographical order. For each pair, denote the modified value as $Y$ and the embedded bits as $a$. According to $S_i$, $a[j]$ was embedded at position $(\Delta + j) \bmod \mathrm{len}(X)$, where $\Delta = S_i[1] + x_i$, and therefore the recipient obtains the hidden bits $a'_i$ by (10). He then applies cyclic decoding to recover the intermediate sequence $a_i$ from $a'_i$. If $a'_i$ is not a valid cyclic codeword, the recipient concludes that the data in the pair have been tampered with and does not use the packet for watermark extraction.
For valid packets, he further uses belief propagation (BP) for decoding. He first finds the intermediate packets with $x_i = 1$, i.e., those connected to only one watermark packet. Once these packets are recovered, they are used to decrease the degree of the packets connected to them; for example, a packet with $x_i = 2$ that is connected to a known packet can then also be recovered. By iteration, all the source packets can be decoded. Finally, he gets the watermark by concatenating all the recovered packets.
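The iterative decoding described above is often implemented as a peeling procedure. The sketch below assumes that each valid packet is given as the set of watermark segment indices it is connected to (the ones in its column of $G$) together with its decoded value; the function name `peel_decode` and this input layout are hypothetical.

```python
def peel_decode(packets, k):
    """Recover k source segments from (index_set, xor_value) packets, or None."""
    work = [[set(idx), val] for idx, val in packets]
    source = [None] * k
    progress = True
    while progress:
        progress = False
        for entry in work:
            # Remove every already-recovered segment from this packet.
            for j in list(entry[0]):
                if source[j] is not None:
                    entry[0].discard(j)
                    entry[1] ^= source[j]
            # A packet of degree one directly reveals its remaining segment.
            if len(entry[0]) == 1:
                j = entry[0].pop()
                source[j] = entry[1]
                progress = True
    return source if all(v is not None for v in source) else None

received = [({0, 1, 2}, 0b1101), ({0, 1}, 0b1100), ({0}, 0b1010)]
recovered = peel_decode(received, k=3)
```

In this toy case the degree-one packet releases segment 0, which reduces the degree-two packet to segment 1, and finally the degree-three packet yields segment 2. If too many packets are discarded, the procedure stalls and returns `None`.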

Extension
The proposed scheme can be extended to sheet-formatted data as well. We take the CSV format as an illustration. The embedding and extraction procedures of CSV watermarking using the proposed method can also be summarized by Fig. 2. We denote the data contained in the cell located in the $i$-th row and $j$-th column as $c_{i,j}$. Also, we denote the watermark sequence as $W$, which is equally divided into $k$ segments, $W = \{w_1, w_2, w_3, \dots, w_k\}$. The length of each $w_i$ is $p$.
In the preprocessing stage, a cover CSV file is first parsed into the data structure of pairs. Suppose the sheet has $r$ rows and $l$ columns, and the rows are regarded as pairs. We first choose a column that satisfies the following two requirements: (1) none of its elements is empty and (2) it contains the fewest repeated elements. The selected column is regarded as the index column, denoted as $t$. For the $k$-th row of the file, the key of pair $k$ is $c_{k,t}$. For the remaining cells in the row, we define the collection $C_k = \{c_{k,v_0}, c_{k,v_1}, \dots, c_{k,v_l}\}$ that contains valid data for watermark embedding, where cells containing non-numeric data are excluded to avoid semantic changes. Similarly, the data hider can define a collection of banned columns denoted as $C_{ban}$, and $C_k \cap C_{ban} = \emptyset$. Cells that are not included in $C_k$ therefore remain unchanged during embedding. To alleviate embedding distortion, the $b$ leading digits of each value are excluded according to (1) given a threshold $T$. Here, if the length of $c_{k,v_i}$ is less than $b$, $c_{k,v_i}$ is excluded from $C_k$. We further sort the collection $C_k$ according to the length of the values of the cells and denote the sorted collection as $C'_k$. Finally, pair $k$ is generated as $P_k = \{c_{k,t}, C'_k\}$, and the valid pairs are denoted as $P = \{P_0, P_1, \dots, P_q\}$, where $q$ denotes the number of valid pairs. Invalid pairs are defined by (11), where $s$ is the minimum required length for embedding. The invalid rows are therefore excluded during watermark embedding. A sketch of the preprocessing of a cover CSV file is shown in Fig. 6.
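The choice of the index column can be illustrated with the short sketch below. It assumes that the sheet is already loaded as a list of data rows without a header, that "fewest repeated elements" means the smallest number of duplicate cells, and that ties go to the leftmost column; the name `choose_index_column` is hypothetical.

```python
import csv
import io

def choose_index_column(rows):
    """Index of the column with no empty cell and the fewest repeated values."""
    best, best_repeats = None, None
    for col in range(len(rows[0])):
        cells = [row[col] for row in rows]
        if any(cell.strip() == "" for cell in cells):
            continue  # requirement (1): no empty element
        repeats = len(cells) - len(set(cells))
        if best is None or repeats < best_repeats:
            best, best_repeats = col, repeats  # requirement (2): fewest repeats
    return best

text = "A,s1,21.53\nA,s2,19.87\nB,s3,\n"
rows = list(csv.reader(io.StringIO(text)))
index_col = choose_index_column(rows)
```

In this example the first column repeats the value `A` and the third column has an empty cell, so the second column is selected as the index column.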
We generate a hash code for each pair using (2), where $x'_i$ in the equation is replaced with $c_{i,t}$, which is the key of pair $i$. Afterwards, pseudorandom sequences $S_{rand} = \{S_1, S_2, \dots, S_r\}$ can be generated under the same principle.
The procedures of transfer matrix construction and error-correction encoding are similar to those for semi-structured data. In the iterative embedding stage, for the $i$-th row, the to-be-embedded cyclic code is further fragmented and embedded into the cells of that row.
On the recipient's side, data preprocessing for sheets is also carried out ahead of watermark extraction. The recipient gets the pairs and the corresponding pseudorandom sequences. Then he can successfully construct the transfer matrix. Under the same principle, he knows how many bits are hidden in a given $c'_{i,j}$, and he extracts the secret bits with the help of the retrieved pseudorandom sequence. He then checks whether the recovered data sequence is a valid cyclic codeword. Finally, belief propagation is also applied for data extraction.

Results and discussion
To verify the proposed scheme, we have conducted experiments on a number of JSON and CSV files. The source files were provided by companies that permit their experimental use. We use random binary sequences as the digital watermark and test the robustness of the watermark against several attacks. Figure 7 shows two short JSON files before and after embedding an 8-bit watermark as an example. From Fig. 7, we can observe that the long floating-point numbers are modified to carry the secret data, while the shorter ones and the strings remain unchanged. Thus, we consider the modification imperceptible.

Settings and evaluations
The main parameters in the proposed scheme are the length of each watermark packet $p$, the length of the to-be-embedded packet $s$, and the two RSD parameters $c$ and $\delta$, as described in Section 2. In the experiments, we set $p = 4$, $s = 7$, $c = 0.1$, and $\delta = 0.5$.
The proposed scheme has low computational complexity, which makes it easy to deploy on hardware. Watermark embedding and extraction can be completed within several seconds on a personal laptop with a 2.60 GHz CPU and 8.00 GB RAM.
For an objective performance assessment of the proposed scheme, the overall distortion $D$ after watermark embedding is defined by (12), where the semi-structured data file contains $N$ key-value pairs, and $U_i \in \{0, 1\}$ indicates whether the $i$-th key-value pair has been modified. $A_i$ represents the magnitude of the modification, $A_i = |y_i - x_i| / x_i$, where $x_i$ and $y_i$ respectively denote the original value and the modified value. In particular, if the $i$-th key-value pair is deleted, $A_i$ is set to 1. A larger $D$ indicates a larger introduced distortion.
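One plausible reading of this measure can be sketched as follows, assuming that $D$ is the average of $U_i A_i$ over the $N$ pairs; this averaging form is an assumption, not taken from (12). The sketch also uses $|x_i|$ in the denominator so that the magnitude stays non-negative and assumes non-zero original values; the name `distortion` is hypothetical.

```python
def distortion(original, modified):
    """Average relative change over all original pairs (assumed form of D)."""
    total = 0.0
    for key, x in original.items():
        if key not in modified:
            total += 1.0  # a deleted pair counts as A_i = 1
        elif modified[key] != x:
            total += abs(modified[key] - x) / abs(x)  # U_i = 1, A_i = |y - x| / |x|
        # unchanged pairs have U_i = 0 and contribute nothing
    return total / len(original)

original = {"a": 100.0, "b": 2.0, "c": 5.0}
modified = {"a": 101.0, "b": 2.0}  # "a" altered, "b" intact, "c" deleted
d = distortion(original, modified)
```

Under these assumptions the example gives $(0.01 + 0 + 1) / 3 \approx 0.337$, which shows that a single deletion outweighs a small change in a numeric value.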

Embedding performances
In Table 1, we measure the introduced distortion according to (12) under different embedding capacities. As shown, the introduced distortion remains stable even as the embedding capacity grows, which is consistent with the analysis of our method. Meanwhile, the distortion of the data is quite small, which indicates that the watermarking scheme has little impact on the original data. We further test the robustness of semi-structured data watermarking to three types of attack when $l_w = 64$ bits. Here, typical attacks including pair deletion, value modification, and redundancy insertion are applied in various degrees. We use four JSON files of different sizes for testing: file1, file2, file3, and file4, containing 100, 200, 800, and 1500 pairs, respectively. We perform the same attack multiple times on each watermarked file. The ratios of successful extraction under different attacks for semi-structured data are shown in Table 2, where extractions with wrongly retrieved bits are not considered successful. In the table, $P_d$, $P_i$, and $P_m$ respectively denote the percentage of key-value pairs that are deleted, inserted, or tampered with in the file.
The proposed watermarking scheme is generally robust to all the abovementioned attacks, even when the attacks are strong. In particular, the scheme is robust against pair deletion and redundancy insertion. Also, files with more valid pairs tend to show stronger robustness. The main reasons for the high robustness are as follows. For pair deletion, the watermark can be extracted using the remaining valid pairs. For contextual modification, the recipient can identify the tampered location using cyclic code checks and discard the tampered pair. For insertion, data extracted from the inserted pairs can hardly pass the cyclic code checks, and thus they are also discarded.
The proposed scheme is also robust to combined attacks, as shown in Table 3, e.g., when the suspect file is both a modified and a truncated version of the watermarked file. The results indicate that the scheme is credible and applicable in practice.
We also test the robustness of the extended part of the proposed scheme, using several cover CSV files for copyright watermarking. Here, we use the same embedding capacity $l_w = 64$ bits. The attacks include sorting, row deletion, value modification, row insertion, and column insertion. The test is applied to four CSV files. In Table 4, $P_{dr}$ and $P_{ir}$ are the percentages of rows that are deleted from or inserted into the file, $P_m$ denotes the percentage of values in the sheet that are tampered with, and $l_i$ denotes the number of inserted columns.