Semistructured data protection scheme based on robust watermarking
EURASIP Journal on Image and Video Processing volume 2020, Article number: 12 (2020)
Abstract
Semistructured data is a widely used text format for data interchange and storage. This paper proposes a robust watermarking scheme for protecting semistructured data, using the JSON format as an illustrative example. We first parse a JSON file into a data structure of distinct pairs. We then generate a transfer matrix to obtain intermediate sequences, which are encoded using error-correction codes and embedded into the pairs. A private key is shared by the data hider and the recipient to resist collusion attacks. On the recipient’s side, data extraction can be carried out successfully even if the received stego data have been tampered with. Imperceptibility is achieved by embedding data into the less significant digits of numeric data in the cover file. The proposed scheme can be extended to several other formats. Experimental results show that the proposed scheme is robust to typical attacks such as contextual truncation, modification, and redundancy addition.
Introduction
Semistructured data are widely used for information storage and data interchange because of their convenience and efficiency in data transmission. They are a form of data that contains tags to separate semantic elements and allows hierarchies of records and fields within the data. Typical semistructured formats include Extensible Markup Language (XML) and JavaScript Object Notation (JSON). The dissemination of digital content and data has brought about a series of problems, for example, contextual tampering and illegal duplication. The prevalence of semistructured data highlights the importance of protecting the data and their copyrights.
Digital watermarking has been used in copyright protection and content authentication for various kinds of data formats. Secret data are embedded into the cover data in an imperceptible way, and can be successfully extracted despite attacks during data transmission. Robustness and fidelity are two basic requirements for digital watermarking. The embedded watermark must be robust against a variety of possible attacks, such as contextual truncation or tampering, that can destroy the watermark. Meanwhile, the watermarked data need to be close enough to the original data that users can barely distinguish the differences between them. Generally, there is a tradeoff between robustness and fidelity: stronger robustness usually comes at the cost of lower fidelity, and vice versa.
In recent decades, researchers have mainly focused on advancing watermarking technology in multimedia, including algorithms for audio [1,2,3], image [4,5,6,7,8], and video [9,10,11]. Researchers have also proposed many watermarking schemes for texts and databases. For text watermarking, a typical method embeds data based on document structure by shifting rows and columns [12]. Among semantic schemes, He et al. [13] use synonym substitution and a chaotic map to embed the watermark. Among syntactic schemes, Mali et al. [14] propose a technique that uses grammatical rules, such as those for conjunctions and pronouns, to generate the watermark. Image-based schemes [15] first convert the text into an image. For specific document formats, Zhang et al. [16] define new characters using TrueType for Microsoft Word to embed the watermark into a Word document, and Kuribayashi et al. [17] propose a scheme that uses the spaces between words and lines in the PDF format. Database watermarking schemes can be classified into two types: distortion-based watermarking and distortion-free watermarking [18, 19]. Distortion-free watermarking uses the data itself to generate watermarks and is generally fragile. In contrast, distortion-based watermarking schemes modify the data with minimal impact to embed the watermark. Watermarking technology utilizing contextual redundancy [20] embeds the watermark in redundant bits or meaningless strings. The technology based on data classification [21, 22] finds the categorical data in the database, i.e., the data with a limited set of values, for embedding. Recently, some schemes based on reversible watermarking technology have been proposed. Zhang et al. [23] use expansion on the data error histogram. Jawad et al. [24] propose a reversible scheme based on difference expansion, which estimates the distortion using the mean and standard deviation of the watermark relationship.
However, little effort has been devoted to copyright protection for semistructured data. The above-mentioned schemes for text watermarking are generally inapplicable to, or ineffective on, semistructured data. Moreover, these schemes are generally not robust enough against some typical attacks such as deletion. In this paper, we propose a robust watermarking scheme for semistructured data protection. For simplicity, we take the JSON format as a representative to present the scheme. The cover data are preprocessed before data embedding, which generates the valid embedding pairs for the file. We generate pseudorandom sequences using the hashes of the keys of the pairs as seeds. We then segment the watermark data and construct a transfer matrix to transform the segments into intermediate sequences. The sequences are further encoded using error-correction coding and embedded into the pairs. A private key is shared by the data hider and the recipient to resist collusion attacks. On the recipient’s side, data extraction can be carried out successfully even if the received watermarked data have been modified. The experimental results show that the proposed scheme is robust to various kinds of typical attacks. The proposed scheme can be easily extended to other typical formats, e.g., sheet-formatted data such as Microsoft Excel or comma-separated values (CSV).
Figure 1 illustrates an application of the proposed scheme. The data hider embeds different watermarks for different users before sharing the cover files. The existence of watermarks is imperceptible and does not affect normal usage of the file. If the recipients illegally upload the data to the Internet, the data hider can locate the source of the information leakage by conducting watermark extraction from the leaked file.
The rest of the paper is organized as follows. In Section 2, we present the embedding and extracting procedures of the proposed scheme. Experimental results are presented in Section 3, and Section 4 concludes the paper.
Proposed method
In this section, we describe the proposed watermarking scheme for semistructured data. An overview of the embedding and extraction procedures is shown in Fig. 2. The right-going flows represent the procedures of watermark embedding, and the left-going flows represent the procedures of watermark extraction. Note that matrices and sets are written in boldface, and sequences are represented in italics.
Preprocessing
Denote the watermark sequence as W with length l_{w}. We divide the watermark sequence W into k segments. W = {w_{1}, w_{2}, w_{3}, …, w_{k}}. The length of each w_{i} is denoted as p where p = l_{w}/k.
Figure 3a provides an example of a cover JSON file. A JSON file stores data in a hierarchical or parallel format. The elements in a JSON file can be categorized into three types, namely, JSON objects that hold data in a hierarchical way, JSON arrays that hold elements in a parallel way, and JSON primitives that directly hold data. JSON objects and JSON primitives contain pairs of keys and values, while JSON arrays carry parallel data without having their own keys. We use a tree structure to represent the hierarchies of data in JSON, as shown in Fig. 3b, where the received JSON data is considered the root JSON object, and data are stored in the leaves.
We parse the cover JSON file, denoted as J, into pairs P, which are the basic units for data embedding in the scheme. Each pair consists of a key and a value. We first initialize P as an empty set, and build the tree structure T for J to get the hierarchical relation. Then we start from the root of T and traverse all nodes in the tree. If we meet a JSON object during traversal, we append its key to the end of prefix, where prefix is originally an empty string. Then we append a “_” to the end to represent the hierarchical relation, and continue to visit the children of this node. If we meet a JSON array, we simply visit the elements in parallel. If we meet a JSON primitive, we append its key to prefix. The pair of the corresponding leaf node is p_{i}={prefix, value}, where value is the value of that leaf node. Then we backtrack and continue traversing until all the nodes are visited. After parsing, the set P= {p_{0}, …, p_{q}} consists of all pairs constructed from the leaf nodes of the tree, where q denotes the number of leaf nodes. Fig. 3c illustrates the parsing result of the file in Fig. 3a.
To avoid semantic changes, we only hide data in the pairs P_{num} whose values contain only numeric data. Other pairs are excluded from data embedding, and their values remain unchanged. The data hider can also define a set of pairs P_{ban} that are excluded from data embedding by specifying their keys. For example, telephone numbers, IDs, and timestamps should be excluded since they do not tolerate any modification. Therefore, we only embed data into the remaining pairs P^{′} = P_{num} − P_{ban}. We denote the valid pairs as \( {p}_i^{\prime}\in {P}^{\prime },i=\left\{1,2,\dots, n\right\} \), where n is the size of P^{′}.
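The traversal described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `parse_pairs` and the sample file are hypothetical, and the sketch assumes that the key of an array is also appended to the prefix, which the text does not state explicitly.

```python
import json

def parse_pairs(node, prefix=""):
    """Flatten a parsed JSON value into (prefix, value) pairs, one per leaf."""
    pairs = []
    if isinstance(node, dict):
        for key, child in node.items():
            if isinstance(child, (dict, list)):
                # Nested container: extend the prefix and mark the hierarchy with "_".
                # (Treating an array's key like an object's key is an assumption.)
                pairs.extend(parse_pairs(child, prefix + key + "_"))
            else:
                # JSON primitive: the accumulated prefix plus its own key names the pair.
                pairs.append((prefix + key, child))
    elif isinstance(node, list):
        # JSON array: elements carry no keys of their own, so visit them in parallel.
        for child in node:
            pairs.extend(parse_pairs(child, prefix))
    else:
        # Primitive stored directly inside an array: it inherits the current prefix.
        pairs.append((prefix, node))
    return pairs

# Hypothetical cover file used only for illustration.
cover = json.loads(
    '{"shop": {"name": "A", "geo": {"lat": 31.230416, "lng": 121.473701}},'
    ' "prices": [{"usd": 12.5075}]}'
)
pairs = parse_pairs(cover)
for key, value in pairs:
    print(key, value)
```

Note that `json.loads` converts numbers to Python floats; passing `parse_float=str` would keep the original digit strings, which matters once digits are modified.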
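As a rough sketch of this filtering step, the set difference P^{′} = P_{num} − P_{ban} can be written as follows. The function name `select_valid_pairs` and the sample keys are hypothetical, and the sketch assumes the pairs have already been parsed into (key, value) tuples.

```python
def select_valid_pairs(pairs, banned_keys=()):
    """Return P' = P_num - P_ban: numeric-valued pairs whose key is not banned."""
    banned = set(banned_keys)
    valid = []
    for key, value in pairs:
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and key not in banned:
            valid.append((key, value))
    return valid

# Hypothetical parsed pairs; "phone" is banned by the data hider.
pairs = [
    ("name", "A"),
    ("lat", 31.230416),
    ("phone", 13800000000),
    ("open", True),
    ("price", 12.5075),
]
valid = select_valid_pairs(pairs, banned_keys=["phone"])
print(valid)
```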
Further, to alleviate embedding distortion, the data hider can define a maximum accepted modification tolerance T. Let \( {x}_i^{\prime } \) denote a valid pair before embedding and \( {y}_i^{\prime } \) the corresponding pair after embedding, and denote their values as V(\( {x}_i^{\prime } \)) and V(\( {y}_i^{\prime } \)), respectively. V(\( {y}_i^{\prime } \)) must be within the range of \( \left[\left(1-T\right)\mathrm{V}\left({x}_i^{\prime}\right),\left(1+T\right)\mathrm{V}\left({x}_i^{\prime}\right)\right] \). We first exclude the leading zeros from the digit sequence. Then we further exclude b leading digits of \( {x}_i^{\prime } \) to form the embeddable digit sequence
where ⌊x⌋ denotes the largest integer not exceeding x. For example, when the tolerance is set as 0.1, b = 2, so we exclude the leading two digits from the embedding procedure.
Denote the required embedding length as s. We do not embed data into a pair if \( \mathrm{len}\left(\mathrm{V}\left({x}_i^{\prime}\right)\right)-b\le s \). Here, len(x) represents the length of string x. Such pairs are therefore excluded from P^{′} as well. The calculation of s will be discussed in Section 2.2.
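The digit selection and the length check can be sketched as below. This is an illustration under stated assumptions: b is passed in as a parameter rather than computed from T, len(·) is interpreted as the number of digits remaining after the leading zeros are removed, values are given as plain decimal strings without exponent notation, and the names `embeddable_digits` and `is_embeddable` are hypothetical.

```python
def embeddable_digits(value_text, b):
    """Digits of a decimal string after dropping leading zeros and b leading digits."""
    # Keep only the digits; sign and decimal point carry no payload.
    digits = [ch for ch in value_text if ch.isdigit()]
    start = 0
    while start < len(digits) and digits[start] == "0":
        start += 1
    significant = digits[start:]
    return "".join(significant[b:])

def is_embeddable(value_text, b, s):
    """A pair is kept only if more than s digits remain after excluding b digits."""
    return len(embeddable_digits(value_text, b)) > s

print(embeddable_digits("0.0123456", b=2))
print(is_embeddable("31.230416", b=2, s=7))
print(is_embeddable("121.4737012345", b=2, s=7))
```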
We further apply a hashing algorithm to the key of each \( {x}_i^{\prime } \) so that each pair has a unique hash \( {h}_i^{\prime } \). The calculation is as follows.
where \( {\left({x}_i^{\prime}\right)}_k \) takes the ASCII code of the k^{th} character of \( {x}_i^{\prime } \). σ is a bias parameter, which is set as the private key K shared between the data hider and the recipient. \( {h}_i^{\prime } \) is ensured to be unique in the cover data.
Afterwards, we generate a set of sequences S_{rand} = {S_{1}, S_{2}, …, S_{n}}, where S_{i} is pseudorandomly generated using \( {h}_i^{\prime } \) as the seed, and the elements of S_{i} are within the range [0, 1]. Finally, the pairs in P^{′} are sorted in ascending lexicographical order.
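The key property of this step is that the data hider and the recipient derive identical sequences from the pair key and the shared private key. The sketch below illustrates that property only: it substitutes a SHA-256 keyed hash for the hashing formula of the scheme, and the names `key_hash` and `pseudo_random_sequence` are hypothetical.

```python
import hashlib
import random

def key_hash(pair_key, private_key):
    """Stand-in keyed hash (SHA-256); this is NOT the hash formula of the scheme."""
    digest = hashlib.sha256(f"{private_key}:{pair_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def pseudo_random_sequence(pair_key, private_key, length):
    """Sequence S_i of floats in [0, 1), reproducible from the pair key and private key."""
    rng = random.Random(key_hash(pair_key, private_key))
    return [rng.random() for _ in range(length)]

# The hider and the recipient share the private key and obtain the same sequence.
s_embed = pseudo_random_sequence("shop_geo_lat", private_key=2020, length=5)
s_extract = pseudo_random_sequence("shop_geo_lat", private_key=2020, length=5)
# A party holding a different key obtains a different sequence.
s_other = pseudo_random_sequence("shop_geo_lat", private_key=2021, length=5)
print(s_embed == s_extract, s_embed == s_other)
```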
Watermark embedding (Fig. 4)
We first construct an intermediate sequence M consisting of n segments, M = {m_{1}, m_{2}, m_{3}, …, m_{n}}. The length of each m_{i} is also p.
Afterwards, we define a transfer matrix G as follows:
where G is a k × n sparse matrix in which each g_{i, j} equals either 0 or 1.
The construction of the sparse matrix G is specified as follows. First, we generate the cumulative distribution function cdf_{RSD} of the robust soliton distribution (RSD). The probability P_{i} of the RSD is defined as follows:
where ρ_{i} is the ideal soliton distribution, and τ_{i} is defined as:
where \( R=c\sqrt{k}\ln \left(\frac{k}{\delta}\right) \) denotes the expected ripple size.
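For reference, the ideal and robust soliton distributions are conventionally defined in the LT-code literature as follows. This is the standard textbook form, restated with the symbols ρ_{i}, τ_{i}, R, k, and δ used above; the name β for the normalizing constant is introduced here for illustration.

```latex
\rho_i =
\begin{cases}
  1/k, & i = 1 \\
  1/\bigl(i(i-1)\bigr), & i = 2, \dots, k
\end{cases}
\qquad
\tau_i =
\begin{cases}
  R/(ik), & i = 1, \dots, k/R - 1 \\
  R\ln(R/\delta)/k, & i = k/R \\
  0, & i = k/R + 1, \dots, k
\end{cases}

P_i = \frac{\rho_i + \tau_i}{\beta},
\qquad
\beta = \sum_{i=1}^{k} \left(\rho_i + \tau_i\right)
```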
For the i^{th} column of G, we flip a certain number of distinct entries g_{j, i} from zero to one according to the following principle.
where a[b] represents taking the b^{th} element from sequence a, and x_{i} determines the number of selected g_{j, i} that should be flipped. k ∈ [S_{i}[1] + 1, …, S_{i}[1] + x_{i}]. Therefore, the number of ones in each column also follows the robust soliton distribution (RSD).
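A possible way to realize this column construction is sketched below. The RSD computation follows the standard definition; everything else is an assumption made for illustration: k/R is rounded to the nearest integer, the degree x_{i} is obtained by inverting the CDF at the first element of S_{i}, and the rows are picked from the following elements of S_{i}, which is a stand-in for the selection rule of the scheme. All function names are hypothetical.

```python
import math
from bisect import bisect_left

def robust_soliton_cdf(k, c, delta):
    """CDF of the robust soliton distribution over degrees 1..k."""
    R = c * math.sqrt(k) * math.log(k / delta)
    pivot = int(round(k / R))  # position of the spike, k/R rounded (assumption)
    rho = [0.0] * (k + 1)
    tau = [0.0] * (k + 1)
    rho[1] = 1.0 / k
    for i in range(2, k + 1):
        rho[i] = 1.0 / (i * (i - 1))
    for i in range(1, min(pivot, k + 1)):
        tau[i] = R / (i * k)
    if 1 <= pivot <= k:
        tau[pivot] = R * math.log(R / delta) / k
    beta = sum(rho) + sum(tau)
    cdf, acc = [], 0.0
    for i in range(1, k + 1):
        acc += (rho[i] + tau[i]) / beta
        cdf.append(acc)
    return cdf

def degree_from_uniform(cdf, u):
    """Invert the CDF: smallest degree d with cdf(d) >= u, capped at k."""
    return min(bisect_left(cdf, u) + 1, len(cdf))

def build_column(k, cdf, s_i):
    """Row indices j with g_{j,i} = 1 for one column, driven by the sequence S_i."""
    degree = degree_from_uniform(cdf, s_i[0])
    rows = list(range(k))
    chosen = []
    for u in s_i[1:1 + degree]:
        # Map u to one of the rows not selected yet, so the rows stay distinct.
        index = min(int(u * len(rows)), len(rows) - 1)
        chosen.append(rows.pop(index))
    return sorted(chosen)

cdf = robust_soliton_cdf(k=16, c=0.1, delta=0.5)
print(build_column(16, cdf, [0.0] + [0.5] * 16))
```

Because the recipient can regenerate S_{i} from the key and the private key, the same function reproduces each column on the extraction side.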
We further define the procedure of transferring W to M as:
Here, ⊗ represents the XOR-based product between matrices, in which each m_{i} ∈ M is calculated by:
Then, the intermediate sequence M can be generated.
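The transfer from W to M can be illustrated with a short sketch, assuming each intermediate packet is the XOR of the watermark segments selected by the ones in the corresponding column of G. Segments are represented as small integers, and the function name `encode_segments` and the example matrix are hypothetical.

```python
def encode_segments(segments, columns):
    """m_i = XOR of all w_j with g_{j,i} = 1; columns[i] lists those row indices j."""
    packets = []
    for rows in columns:
        m = 0
        for j in rows:
            m ^= segments[j]
        packets.append(m)
    return packets

W = [0b1010, 0b0110, 0b1111]          # k = 3 segments, p = 4 bits each
columns = [[0], [0, 1], [1, 2], [2]]  # supports of the 4 columns of a 3 x 4 matrix G
M = encode_segments(W, columns)
print([format(m, "04b") for m in M])
```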
Afterwards, we iteratively embed M into the pairs P^{′}. For an intermediate packet m_{i}, we apply cyclic coding as the error-correction coding. The to-be-embedded packet t_{i} is then generated as t_{i} = Cyc(m_{i}), where the length of t_{i} is denoted as s (s > p). Next, we convert \( {x}_i^{\prime } \) from a number to a string, and use least significant bit (LSB) modification to conduct data embedding. Denote the embedded data of \( {x}_i^{\prime } \) as \( {y}_i^{\prime } \).
We then embed s bits into the value of a pair. Denote the value as X and the to-be-embedded bits as a. Then, according to S_{i}, a[j] will be embedded at position (∆ + j) mod len(X), where ∆ = S_{i}[1] + x_{i}, and therefore
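As an illustration of Cyc(·) for p = 4 and s = 7, the sketch below uses a systematic (7, 4) cyclic code. The generator polynomial g(x) = x^3 + x + 1 is an assumption, since the text does not name the code, and the function names are hypothetical. The validity check mirrors how a tampered packet can be detected and discarded.

```python
GENERATOR = 0b1011  # g(x) = x^3 + x + 1, an assumed generator of a (7, 4) cyclic code

def poly_mod(value, generator=GENERATOR):
    """Remainder of polynomial division over GF(2); bits are coefficients."""
    gen_len = generator.bit_length()
    while value.bit_length() >= gen_len:
        value ^= generator << (value.bit_length() - gen_len)
    return value

def cyc_encode(message):
    """Systematic encoding: 4 message bits followed by 3 parity bits."""
    shifted = message << 3
    return shifted | poly_mod(shifted)

def cyc_is_valid(word):
    """A 7-bit word is a codeword iff it is divisible by the generator."""
    return poly_mod(word) == 0

def cyc_decode(word):
    """Drop the parity bits of a valid codeword to recover the message."""
    return word >> 3

t = cyc_encode(0b1010)
print(format(t, "07b"), cyc_is_valid(t), cyc_is_valid(t ^ 0b0000100))
```

With minimum distance 3, this code flags every single-bit and double-bit modification of a packet as invalid.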
where Y denotes the watermarked value. We repeat this operation until all s bits are embedded into the pair and then iteratively conduct the same embedding procedure to all the pairs in the cover file.
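The position rule (∆ + j) mod len(X) can be sketched as follows. The digit-level rule used here, forcing the parity of a decimal digit to equal the bit, is an assumption standing in for the embedding equation of the scheme; ∆ is passed in as an integer `offset`, and the digit string is taken to be the embeddable digit sequence. The function names are hypothetical.

```python
def embed_bits(digits, bits, offset):
    """Embed bits into a decimal digit string by forcing digit parity (assumed rule)."""
    out = list(digits)
    for j, bit in enumerate(bits):
        pos = (offset + j) % len(out)
        d = int(out[pos])
        if d % 2 != bit:
            # Move the digit by one so that its parity equals the bit;
            # odd digits go down and even digits go up, so d stays within 0..9.
            d = d - 1 if d % 2 == 1 else d + 1
        out[pos] = str(d)
    return "".join(out)

def extract_bits(digits, count, offset):
    """Read back the parities at the same positions."""
    return [int(digits[(offset + j) % len(digits)]) % 2 for j in range(count)]

bits = [1, 0, 1, 1, 0, 0, 1]           # one s = 7 bit packet
stego = embed_bits("30416789", bits, offset=3)
print(stego, extract_bits(stego, len(bits), offset=3))
```

Each digit changes by at most one, and the length of the value is preserved, so the recipient can recompute the same positions.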
Finally, we get the watermarked pairs as \( {Y}^{\prime }=\left\{{y}_1^{\prime },{y}_2^{\prime },\dots, {y}_n^{\prime}\right\} \) and restore the parsed pairs to the original format. We go through all the nodes of the tree structure and, where a value differs from the corresponding value in Y^{′}, replace it with \( {y}_i^{\prime } \). The watermarked JSON file J^{′} is thus obtained.
Watermark extraction
When the recipient obtains the suspect semistructured data J^{∗}, the hidden watermark can be detected and extracted. The flowchart of the extraction procedure is depicted in Fig. 5. Note that the recipient is assumed to know the private key used during watermark embedding for the generation of the random sequences; otherwise, he cannot perform the watermark extraction.
The recipient first conducts preprocessing, which parses J^{∗} into P^{∗}. He applies the same hashing algorithm to each key and generates the pseudorandom sequence S_{i}, which is identical to the sequence used during data embedding. Afterwards, he starts watermark extraction by initializing G as an all-zero k × n matrix. Then, S_{i}[1] is used to get x_{i}, and [S_{i}[1] + 1, …, S_{i}[1] + x_{i}] are used to flip g_{j, i}, as introduced in Section 2.2. He thereby obtains the exact transfer matrix G used for data embedding.
Afterwards, the pairs in P^{∗} are sorted in ascending lexicographical order. For each pair, denote the modified value as Y and the embedded bits as a. According to S_{i}, a[j] was embedded at position (∆ + j) mod len(X), where ∆ = S_{i}[1] + x_{i}, and therefore the recipient gets the hidden bits by (10):
Then he applies cyclic decoding on a to recover the intermediate sequence a_{i} from \( {a}_i^{\prime } \). If \( {a}_i^{\prime } \) is not a valid cyclic code, the recipient concludes that the data in the pair have been tampered with and does not use the packet for watermark extraction.
For valid packets, he further uses belief propagation (BP) for decoding. He first finds the intermediate packets with x_{i}=1, i.e., those connected to only one watermark packet. Once these packets are recovered, they are used to reduce the degree of the packets connected to them; for example, a packet with x_{i}=2 that is connected to a known packet can then also be recovered. By iterating this process, all the source packets can be decoded. Finally, he obtains the watermark by concatenating all the recovered packets.
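The iterative decoding described above corresponds to the peeling form of belief propagation used for LT codes. A minimal sketch follows, assuming each valid packet is available as a pair of its connected watermark segment indices and its value; the function name `peel_decode` and the example packets are hypothetical.

```python
def peel_decode(k, packets):
    """Recover k source segments from (rows, value) packets, or return None."""
    pending = [(set(rows), value) for rows, value in packets]
    recovered = {}
    progress = True
    while progress and len(recovered) < k:
        progress = False
        next_pending = []
        for rows, value in pending:
            # Remove the contribution of every source segment already known.
            for j in list(rows):
                if j in recovered:
                    rows.discard(j)
                    value ^= recovered[j]
            if len(rows) == 1:
                # Degree-one packet: its value is exactly one source segment.
                recovered[next(iter(rows))] = value
                progress = True
            elif rows:
                next_pending.append((rows, value))
        pending = next_pending
    if len(recovered) < k:
        return None
    return [recovered[j] for j in range(k)]

# Packets built from W = [1010, 0110, 1111]; the degree-one packet on segment 0
# is missing (e.g., its pair was deleted), yet decoding still succeeds.
survivors = [([0, 1], 0b1100), ([1, 2], 0b1001), ([2], 0b1111)]
print(peel_decode(3, survivors))
```

Decoding fails only when no degree-one packet remains before all segments are known, which is why redundancy among the n packets yields robustness to deletion.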
Extension
The proposed scheme can be extended to sheet-formatted data as well. We take the CSV format as an illustration. The embedding and extraction procedures of CSV watermarking using the proposed method are also summarized by Fig. 2. We denote the data contained in the cell located in the i^{th} row and j^{th} column as c_{i, j}. Also, we denote the watermark sequence as W, and W is equally divided into k segments, W = {w_{1}, w_{2}, w_{3}, …, w_{k}}. The length of each w_{i} is p.
In the preprocessing stage, a cover CSV file is first parsed into the data structure of pairs. Suppose the sheet has r rows and l columns; each row is regarded as a pair. We first choose a column that satisfies the following two requirements: (1) none of its elements is empty and (2) it contains the fewest repeated elements. The selected column is regarded as the index column, denoted as t. For the k^{th} row of the file, the key of pair k is c_{k, t}. For the remaining cells in the row, we define the collection \( {C}_k=\left\{{c}_{k,{v}_0},{c}_{k,{v}_1},\dots, {c}_{k,{v}_l}\right\} \) that contains valid data for watermark embedding, where cells containing non-numeric data are excluded to avoid semantic changes. Similarly, the data hider can define a collection of banned columns denoted as C_{ban}, and C_{k} ∩ C_{ban} = ∅. Cells that are not included in C_{k} therefore remain unchanged during embedding. To alleviate embedding distortion, b leading digits of \( {x}_i^{\prime } \) are excluded according to (1) given a threshold T. Here, if the length of \( {c}_{k,{v}_i} \) is less than b, \( {c}_{k,{v}_i} \) is excluded from C_{k}. We further sort the collection C_{k} in descending order of the length of the cell values. Denote the sorted collection as \( {C}_k^{\prime }=\left\{{c}_{k,{v}_0}^{\prime },{c}_{k,{v}_1}^{\prime },\dots, {c}_{k,{v}_r}^{\prime}\right\} \), where r is the number of remaining valid values in the k^{th} row, and len(\( {c}_{k,{v}_0}^{\prime } \)) ≥ len(\( {c}_{k,{v}_1}^{\prime } \)) ≥ … ≥ len(\( {c}_{k,{v}_r}^{\prime } \)). Finally, pair k is generated as P_{k}={c_{k, t},\( {C}_k^{\prime } \)}, and the valid pairs are denoted as P = {P_{0}, P_{1}, …, P_{q}}, where q denotes the number of valid pairs. Invalid pairs are defined by (11).
where s is the minimum required length for embedding. Therefore, the invalid rows will be excluded during watermark embedding. The sketch of preprocessing of a cover CSV file is shown in Fig. 6.
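The choice of the index column can be sketched as follows. The sketch assumes the file has no header row, counts repeated elements as the number of cells minus the number of distinct values, and breaks ties in favor of the leftmost column, none of which is specified in the text; the function name `choose_index_column` and the sample data are hypothetical.

```python
import csv
import io

def choose_index_column(rows):
    """Index of the column with no empty cells and the fewest repeated values."""
    best, best_repeats = None, None
    for col in range(len(rows[0])):
        cells = [row[col] for row in rows]
        if any(cell.strip() == "" for cell in cells):
            continue  # requirement (1): no empty elements
        repeats = len(cells) - len(set(cells))
        if best_repeats is None or repeats < best_repeats:
            best, best_repeats = col, repeats  # requirement (2): fewest repeats
    return best

# Hypothetical cover sheet: column 1 has an empty cell and a repeated value.
text = "a1,Shanghai,21.5304\na2,Shanghai,19.8841\na3,,20.1177\n"
rows = list(csv.reader(io.StringIO(text)))
t = choose_index_column(rows)
print(t, [row[t] for row in rows])
```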
We generate a hash code for each pair using (2), where \( {x}_i^{\prime } \) in the equation is replaced with c_{i, t}, which is the key of packet i. Afterwards, pseudorandom sequences S_{rand} = {S_{1}, S_{2}, …, S_{r}} can be generated under the same principle.
The procedures of transfer matrix construction and error-correction encoding are similar to those for semistructured data. In the iterative embedding stage, for the i^{th} row, the to-be-embedded cyclic code is further fragmented and embedded into \( {C}_i^{\prime } \). For \( {c}_{i,{v}_0}^{\prime } \), we embed \( \left\lfloor \mathrm{len}\left({c}_{i,{v}_0}^{\prime}\right)/s\right\rfloor \) bits into the value. We iteratively embed all the bits into \( {C}_i^{\prime } \) until data embedding is finished. Thus, cells with longer values are likely to carry more additional bits.
On the recipient’s side, data preprocessing for sheets is also carried out ahead of watermark extraction. The recipient gets the pairs and the corresponding pseudorandom sequences. Then he can successfully construct the transfer matrix. Under the same principle, he knows how many bits are hidden in a given \( {c}_{i,\mathrm{j}}^{\prime } \), and he extracts the secret bits with the help of the retrieved pseudorandom sequence. He then checks whether the recovered data sequence is a valid cyclic code. Finally, belief propagation is also applied for data extraction.
Results and discussion
To verify the proposed scheme, we have conducted experiments on a number of JSON and CSV files. The source files were provided by companies that permit their experimental use. We use random binary sequences as the digital watermark and test the robustness of the watermark to several attacks.
Figure 7 shows two short JSON files before and after embedding an 8-bit watermark as an example. From Fig. 7, we can observe that the long floating-point numbers are modified to carry the secret data, while the shorter ones and the strings remain unchanged. Thus, the modification can be considered imperceptible.
Settings and evaluations
The main parameters in the proposed scheme are the length of each watermark packet p, the length of the to-be-embedded packet s, and the two RSD parameters c and δ, as described in Section 2. In the experiment, we set p = 4, s = 7, c = 0.1, δ = 0.5.
The proposed scheme has low computational complexity, which makes it easy to implement in hardware. The watermark embedding and extraction can be done within several seconds on a personal laptop with a 2.60 GHz CPU and 8.00 GB RAM.
For an objective performance assessment of the proposed scheme, the overall distortion after watermark embedding can be defined as:
where the semistructured data file contains N key-value pairs, and U_{i}, which equals 0 or 1, indicates whether the i^{th} key-value pair has been modified. A_{i} represents the magnitude of the modification, A_{i} = |y_{i} − x_{i}|/x_{i}, where x_{i} and y_{i} respectively denote the original value and the modified value. In particular, if the i^{th} key-value pair is deleted, A_{i} is set to 1. A larger D indicates larger introduced distortion.
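A distortion measure built from these quantities can be sketched as below. The averaged form D = (1/N) Σ U_{i} A_{i} is an assumption made for illustration and may differ from the exact equation of the paper; the sketch also assumes nonzero original values, and the function name `distortion` and the sample data are hypothetical.

```python
def distortion(original, modified):
    """Assumed form D = (1/N) * sum(U_i * A_i) over the N original pairs."""
    total = 0.0
    for key, x in original.items():
        if key not in modified:
            total += 1.0  # deleted pair: A_i is set to 1
        elif modified[key] != x:
            # Modified pair: U_i = 1 and A_i = |y_i - x_i| / |x_i| (x_i assumed nonzero).
            total += abs(modified[key] - x) / abs(x)
        # Unchanged pair: U_i = 0, so it contributes nothing.
    return total / len(original)

original = {"a": 100.0, "b": 50.0, "c": 2.0, "d": 8.0}
modified = {"a": 101.0, "b": 50.0, "c": 2.0}  # "a" modified by 1 %, "d" deleted
print(distortion(original, modified))
```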
Embedding performances
In Table 1, we measure the introduced distortion according to (12) under different embedding capacities. As shown, the introduced distortion gradually stabilizes even as the embedding capacity grows, which is consistent with the analysis of our method. Meanwhile, the distortion of the data is quite small, which indicates that the watermarking scheme has little impact on the original data.
We further test the robustness of semistructured data watermarking to three types of attack when l_{w} = 64 bits. Here, typical attacks including pair deletion, value modification, and redundancy insertion are applied to various degrees. We use four JSON files of different sizes for testing: file1, file2, file3, and file4, containing 100, 200, 800, and 1500 pairs, respectively. We perform the same attack multiple times on each watermarked file. The ratios of successful extraction under different attacks for semistructured data are shown in Table 2, where extractions with wrongly retrieved bits are not considered successful. In the table, P_{d}, P_{i}, and P_{m} respectively denote the percentage of key-value pairs that are deleted, inserted, or tampered with in the file.
The proposed watermarking scheme is generally robust to all the above-mentioned attacks, even when the attacks are strong. In particular, the scheme shows promising performance against pair deletion and redundancy insertion. Also, files with more valid pairs show stronger robustness. The main reasons for the high robustness are as follows. For pair deletion, the watermark can be extracted using the remaining valid pairs. For contextual modification, the recipient can identify the tampered location using cyclic code checks and discard the tampered pair. For insertion, data extracted from the inserted pairs can hardly pass the cyclic code checks, and thus they are also discarded.
The proposed scheme is also robust to combined attacks, e.g., when the suspect file is both a modified and a truncated version of the watermarked file, as shown in Table 3. The results indicate that the scheme is credible and applicable in real use.
We also test the robustness of the extended part of the proposed scheme, using several cover CSV files for copyright watermarking. Here, we use the same embedding capacity l_{w} = 64 bits. The attacks include sorting, row deletion, value modification, row insertion, and column insertion. The test is applied to four CSV files. In Table 4, P_{dr} and P_{ir} are the percentages of rows that are deleted from or inserted into the file. P_{m} denotes the percentage of values in the sheet that are tampered with, and l_{i} denotes the number of inserted columns.
As shown in Table 4, the extended part of the proposed scheme also shows strong robustness to typical attacks. Similarly, CSV files with more rows are more robust. As for column insertion, if the inserted values participate in the payload allocation in each row, the extracted bits are distorted and cannot pass the cyclic code checks. The robustness of the proposed scheme against column insertion is therefore limited.
Finally, for the security of the proposed scheme, we use a private key shared by the data hider and the recipient. Different private keys result in different embedding locations, which helps to resist the collusion attack, i.e., two recipients cannot infer the embedding locations by comparing two copies of the same file that carry different watermarks.
Conclusions
In this paper, a novel watermarking scheme for semistructured data protection is proposed. The cover file is first parsed into pairs. We generate a transfer matrix to obtain the intermediate sequences, which are then encoded using error-correction codes and embedded into the pairs. On the recipient’s side, data extraction can be carried out successfully even if the received data have been tampered with. The proposed scheme can be extended to several other formats. The experimental results show that the proposed scheme is robust to typical attacks such as contextual truncation, modification, and redundancy addition. Meanwhile, the introduced distortion is comparatively low. Finally, a private key also helps to resist collusion attacks.
Availability of data and materials
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
XML: Extensible markup language
JSON: JavaScript object notation
SS: Spread spectrum
DFT: Discrete Fourier transform
DCT: Discrete cosine transform
DWT: Discrete wavelet transform
CSV: Comma-separated values
RSD: Robust soliton distribution
BP: Belief propagation
References
1. Y. Xiang, I. Natgunanathan, D. Peng, et al., Spread spectrum audio watermarking using multiple orthogonal PN sequences and variable embedding strengths and polarities. IEEE/ACM Trans. Audio, Speech, Language Process. 26(3), 529–539 (2018)
2. Z. Liu, Y. Huang, J. Huang, Patchwork-based audio watermarking robust against desynchronization and recapturing attacks. IEEE Trans. Inf. Forensics Security 14(5), 1171–1180 (2019)
3. G. Hua, J. Goh, V.L.L. Thing, Cepstral analysis for the application of echo-based audio watermark detection. IEEE Trans. Inf. Forensics Security 10(9), 1850–1861 (2015)
4. T. Zong, Y. Xiang, I. Natgunanathan, et al., Robust histogram shape-based method for image watermarking. IEEE Trans. Circuits Syst. Video Technol. 25(15), 717–729 (2015)
5. M. Urvoy, D. Goudia, F. Autrusseau, Perceptual DFT watermarking with improved detection and robustness to geometrical distortions. IEEE Trans. Inf. Forensics Security 9(7), 1108–1119 (2014)
6. S.W. Byun, H.S. Son, S.P. Lee, Fast and robust watermarking method based on DCT specific location. IEEE Access 7, 100706–100718 (2019)
7. D. Zheng, S. Wang, J. Zhao, RST invariant image watermarking algorithm with mathematical modeling and analysis of the watermarking processes. IEEE Trans. Image Process. 18(5), 1055–1068 (2009)
8. P.C. Su, Y.C. Chang, C.Y. Wu, Geometrically resilient digital image watermarking by using interest point extraction and extended pilot signals. IEEE Trans. Inf. Forensics Security 8(12), 1897–1908 (2013)
9. T. Dutta, H. Gupta, A robust watermarking framework for high efficiency video coding (HEVC)-encoded video with blind extraction process. J. Vis. Commun. Image Represent. 38, 29–44 (2016)
10. J.S. Tsai, W.B. Huang, Y.H. Kuo, On the selection of optimal feature region set for robust digital image watermarking. IEEE Trans. Image Process. 20(3), 735–743 (2011)
11. M. Amini, M. Ahmad, M. Swamy, A robust multibit multiplicative watermark decoder using vector-based hidden Markov model in wavelet domain. IEEE Trans. Circuits Syst. Video Technol. 28(2), 402–413 (2018)
12. J. Brassil, S. Low, N. Maxemchuk, et al., Electronic marking and identification techniques to discourage document copying. IEEE J. Sel. Areas Commun. 13(8), 1495–1504 (1995)
13. L. He, X.l. Gui, An active attack on chaotic based text zero-watermarking. IEEE Conf. Anthol. 14 (2013)
14. M.L. Mali, N.N. Patil, J.B. Patil, Implementation of text watermarking technique using natural language watermarks. Proc. Int. Conf. Commun. Syst. Netw. Technol. 482–486 (2013)
15. D. Huang, H. Yan, Interword distance changes represented by sine waves for watermarking text images. IEEE Trans. Circuits Syst. Video Technol. 11(12), 1237–1245 (2001)
16. S. Zhang, Z. Yao, X. Meng, et al., New digital text watermarking algorithm based on new-defined characters. Proc. IEEE Int. Symp. Comput. Consum. Control 713–716 (2014)
17. M. Kuribayashi, T. Fukushima, N. Funabiki, Robust and secure data hiding for PDF text document. IEICE Trans. Inf. Syst. 102(1), 41–47 (2019)
18. I. Kamel, A schema for protecting the integrity of databases. Comput. Secur. 28(7), 698–709 (2009)
19. A. Khan, S.A. Husian, A fragile zero watermarking scheme to detect and characterize malicious modifications in database relations. Sci. World J. 2013(796726), 1–16 (2013)
20. R. Agrawal, P.J. Haas, J. Kiernan, Watermarking relational data: framework, algorithms and analysis. The VLDB J. 12(2), 157–169 (2003)
21. R. Sion, Proving ownership over categorical data. Proc. IEEE Int. Conf. Data Eng. (2004)
22. R. Sion, M. Atallah, S. Prabhakar, Rights protection for categorical data. IEEE Trans. Knowl. Data Eng. 17(7), 912–926 (2005)
23. Y. Zhang, B. Yang, X. Niu, Reversible watermarking for relational database authentication. J. Comput. 17(2), 59–66 (2006)
24. K. Jawad, A. Khan, Genetic algorithm and difference expansion based reversible watermarking for relational databases. J. Syst. Softw. 86(11), 2742–2753 (2013)
Acknowledgements
Many thanks to the anonymous reviewers for their constructive suggestions, which helped improve this paper.
Funding
This work was supported by the Natural Science Foundation of China (Grants U1736213, U1636206, and U1936214).
Author information
Contributions
The first author (JH) participated in the design of the scheme and drafted the manuscript. The second author (QY) carried out the experiments and participated in the design of the scheme. The third author (ZQ) conceived of the study, participated in the design, and helped to draft the manuscript. The fourth author (GF) and the fifth author (XZ) helped to design and improve the scheme. All authors read and approved the final manuscript.
Authors’ information
Jiahuan He received the B.S. degree from Shanghai University, China, in 2018, where he is currently pursuing the M.S. degree. His research interests include image processing and multimedia security. Qichao Ying received the B.S. degree from Shanghai University, China, in 2017, where he is currently pursuing the M.S. degree. His research interests include information hiding, image processing, and multimedia security.
Zhenxing Qian received the B.S. and Ph.D. degrees from the University of Science and Technology of China (USTC), in 2003 and 2007, respectively. He is currently a Professor with the School of Computer Science, Fudan University. He has published over 100 peer-reviewed papers in international journals and conferences. His research interests include information hiding, image processing, and multimedia security.
Guorui Feng received the B.S. and M.S. degrees in computational mathematics from Jilin University, China, in 1998 and 2001, respectively. He received the Ph.D. degree in electronic engineering from Shanghai Jiaotong University, China, in 2005. From January 2006 to December 2006, he was an assistant professor at East China Normal University, China. During 2007, he was a research fellow at Nanyang Technological University, Singapore. He is now with the School of Communication and Information Engineering, Shanghai University, China. His current research interests include image processing, image analysis, and computational intelligence.
Xinpeng Zhang received the B.S. degree in computational mathematics from Jilin University, China, in 1995, and the M.E. and Ph.D. degrees in communication and information system from Shanghai University, China, in 2001 and 2004, respectively, where he has been with the faculty of the School of Communication and Information Engineering, since 2004, and is currently a Professor. His research interests include information hiding, image processing, and digital forensics. He has published over 200 papers in these areas.
Ethics declarations
Ethics approval and consent to participate
Ain-Shams University Hospitals ethics committee approval (FMASU R59/2018). Informed consent to participate in the study was taken from all participants.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
He, J., Ying, Q., Qian, Z. et al. Semistructured data protection scheme based on robust watermarking. J Image Video Proc. 2020, 12 (2020). https://doi.org/10.1186/s13640-020-00500-y
DOI: https://doi.org/10.1186/s13640-020-00500-y
Keywords
 Copyright protection
 Collusion attack
 Semistructured data
 Watermarking