Image completion via transformation and structural constraints

Image completion is an approach to filling a damaged region (hole) in an image. In this study, we propose a novel method that repairs a target region under structural constraints in architectural scenes. We formulate image completion as the minimization of an objective function consisting of three terms. In the color term, we compute a parameterized transformation model from detected plane parameters and measure the distance between the target patch and the transformed source patch. This model helps to extend the patch search space and find an optimal solution. To improve patch matching accuracy, we add a guide term that includes a structure term and a consistency term. The structure term encourages sampling patches along the structural direction, and the consistency term maintains texture consistency. To account for color deviation between patches, we further add a gradient term, yielding a framework that can solve more challenging problems. Compared with previous methods, the proposed method performs well in preserving global structure and reasonably estimating perspective distortions. Moreover, it produces acceptable results in natural scenes. The experimental results illustrate that this method is a promising tool for image completion.


Introduction
Image completion methods aim to repair defects in digital images with plausibly synthesized content so that the images look more natural. This task is applied in many image editing applications, ranging from object removal to movie editing and image understanding [1][2][3]. In general, there are two main types of image completion methods: diffusion-based methods and exemplar-based methods.
A diffusion-based method completes the target region using partial differential equations that propagate image information from surrounding areas into the unknown region. Bertalmio et al. [4] first proposed a method in which information is propagated along the edges of contour lines into the occluded area. These methods fall into two further types: Euler's model [5] and the total variation model [6]. They perform well on images with thin cracks and scratches; however, they are not suitable for large damaged regions.
Exemplar-based methods sample pixels from the known region of the image and copy them into the damaged region. Efros and Leung [7] proposed a nonparametric method for texture synthesis that grows a new image outward from an initial seed, one pixel at a time. The method in Ref. [8] processes an image greedily to search for the best matching patches; owing to the greedy strategy, it produces discontinuous textures. Sun et al. [9] developed a method that first lets users draw lines in the target region and then completes the region along those lines. Meanwhile, some approaches [10,11] improve on Criminisi's method by posing the completion task as a global optimization problem with a well-defined objective function and proposing an algorithm to optimize it. However, the cost of optimizing such objective functions is usually exorbitant. The fast PatchMatch method [12] solves this problem to a considerable extent by propagating neighborhood information between neighboring patches; it has been adopted by Adobe Photoshop. With simple patch translation alone, however, it is difficult to find the most suitable patch without extending the search space. Many methods [13][14][15][16][17] have addressed this issue via geometric as well as photometric transformations. Xiao et al. [18] filled in the target region using a sample image with a similar texture and structure to enrich the search space. Le Meur et al. [19] used a coarse version of the input image to generate multiple inpainted images with different parameter settings and then recovered the full resolution of the final result. Using a Markov random field (MRF) model to build the energy function, several methods [20,21] optimize the energy function for its efficiency in realizing global image consistency. He and Sun [22] calculate statistical offsets to obtain regular structure information.
Their method demonstrates excellent results on images with a large amount of duplicated information; however, it still struggles when images contain perspective shape deformation. The methods in Refs. [23,24] use a convolutional neural network (CNN) to generate content according to its surroundings. These methods provide a good solution for filling in a large region while keeping the image semantically correct. However, they cannot handle images with perspective distortion.
In this study, we propose a novel method for inpainting damaged regions in structural scenes, and we extend it to different natural scenes. We observe that most textures in various scenes have structured (regular or linear) features. Therefore, we detect these parameters and use them to find a transformational relation between the source patch and the target patch. Moreover, we propose an objective function with two constraints to guide the texture synthesis. Unlike previous methods, these two constraints provide effective guidance when searching for the best matching patches. Finally, we add to our objective function a gradient term that is conducive to a gradual adjustment of colors and maintains texture details.
The three main contributions of the proposed method are as follows. First, we adopt a parameterized transformation model to guide the image completion process. Second, we propose an objective function with two constraints that together guide texture synthesis. Third, we combine these constraints and a gradient term into a framework that solves more challenging problems.
Given an input image I with a damaged region (hole), we aim to fill in the damaged region using pixels from the known region. In practice, it is challenging to fill in a damaged region and obtain satisfying results, especially in architectural scenes, because in many real scenes shapes can change dramatically under perspective distortion. For each target patch P in the damaged region, we calculate a transformation matrix T_i that relates the target patch to its best matching patch Q. To estimate the parameters of T_i, we adopt detected plane parameters [25] to generate the transformation model, instead of searching for the best matching patches by simple translation (details in Section 3.1). When searching for similar source patches, an unconstrained search usually causes poor results, so we constrain the patch sampling locations using texture direction and texture consistency (Section 3.3). Furthermore, we add a gradient term to our framework to obtain a smooth transition of color.
To obtain plausible texture in the hole region, we translate the problem into an optimization scheme. We define an objective function consisting of a color term, a guide term, and a gradient term. The color term specifies how the source patches should be transformed; the guide term constrains how the search process should be limited; and the gradient term provides an adjustment that leads to a smooth transition of color. Combining these three terms, we show that the proposed method effectively improves the visual consistency of the completion results. The flowchart of the proposed method is shown in Fig. 1.

Objective function
To achieve high-quality results, we develop an objective function for image completion. The objective function is a distance measure that includes three terms. Here, we develop a transformation parameterized by θ_i for each patch P.
We denote the improved energy minimization function as follows:

E(s, θ) = Σ_i [E_color(t_i, s_i, θ_i) + E_gradient(t_i, s_i, θ_i) + E_guide(t_i, s_i)],

where t_i = (t_i^x, t_i^y)^T is the center position of a target patch in Ω̄ and s_i = (s_i^x, s_i^y)^T is the center position of the corresponding source patch in Ω. Here, Ω and Ω̄ are the labels of known pixels and unknown pixels, respectively. We define θ_i as a set of parameters for generating a transformation matrix T_i. The three terms E_color, E_gradient, and E_guide are the color term, gradient term, and guide term, which together form the function. These terms are explained in detail in the following sections.
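As a concrete sketch of how the terms combine per patch (Python/NumPy; the unweighted sum and the patch representation are our own assumptions, since the paper folds its weight λ into the guide term itself):

```python
import numpy as np

def color_term(P, Q):
    # Sum of squared RGB distances between the target patch P and the
    # transformed source patch Q (see the "Color term" section).
    return float(np.sum((np.asarray(P, float) - np.asarray(Q, float)) ** 2))

def patch_energy(P, Q, grad_P, grad_Q, e_guide):
    # Per-patch energy E = E_color + E_gradient + E_guide for one
    # target/source pair; e_guide is assumed to be precomputed.
    e_color = color_term(P, Q)
    e_gradient = float(np.sum((np.asarray(grad_P, float)
                               - np.asarray(grad_Q, float)) ** 2))
    return e_color + e_gradient + e_guide
```

The total energy is then the sum of `patch_energy` over all target patches in the hole.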

Color term
The color matching term is similar to Ref. [8]:

E_color = Σ_i ||P(t_i) − Q(s_i, t_i, θ_i)||²,

where P(t_i) is the target patch centered at t_i, and Q(s_i, t_i, θ_i) denotes the matched source patch obtained by applying the transformation matrix T_i with parameters θ_i. The color term thus represents the distance between the target patch and the transformed source patch, computed as the sum of squared distances in RGB space. In Refs. [13][14][15], many geometric transformations were applied, e.g., rotation, scale, and flip. In contrast, we use a homography matrix to transform the patches into an affine-corrected space.
We now explain how we generate the transformation matrix T_i from the parameters θ_i. In many real scenes, shapes can change dramatically because of perspective distortion, and it is difficult to fill in a damaged region if only simple patch translation is taken into consideration. Xiao et al. [25] solved this problem by detecting planes and using them to generate a projective transformation matrix. In Fig. 2, we show the plane detection and the posterior probability map. In this paper, we use the detected planar parameters, where k_i is the index of the plane, to parameterize T_i. We define the transformation matrix as

T_i = H_p H_r H_s H_c H_t,

where H_p indicates the projective transformation between the source patch and the target patch, determined by the detected plane k_i; H_r indicates a rotation transformation built from the 2 × 2 rotation matrix M(θ_i); H_s indicates the scale transformation built from the 2 × 2 scaling matrix N(s_i); H_c indicates the shear transformation; and H_t indicates the translation transformation with translation parameters f_i^x and f_i^y. This transformation model is similar to the decomposition of a projective transformation matrix [26] and effectively captures the transformation relation between the source patch and the target patch.
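A minimal sketch of this composition in Python/NumPy (the composition order and the exact form of each factor are assumptions; H_p would come from the detected plane parameters and is passed in here):

```python
import numpy as np

def H_r(theta):
    # Rotation: embeds the 2x2 rotation M(theta) in a 3x3 homogeneous matrix.
    c, s = np.cos(theta), np.sin(theta)
    H = np.eye(3)
    H[:2, :2] = [[c, -s], [s, c]]
    return H

def H_s(scale):
    # Scale: embeds the 2x2 scaling N(scale).
    return np.diag([scale, scale, 1.0])

def H_c(shear):
    # Shear along the x axis.
    H = np.eye(3)
    H[0, 1] = shear
    return H

def H_t(fx, fy):
    # Translation by (fx, fy).
    H = np.eye(3)
    H[0, 2], H[1, 2] = fx, fy
    return H

def compose_T(Hp, theta, scale, shear, fx, fy):
    # T_i = H_p @ H_r @ H_s @ H_c @ H_t (composition order is an assumption).
    return Hp @ H_r(theta) @ H_s(scale) @ H_c(shear) @ H_t(fx, fy)
```

With Hp set to the identity, the composition reduces to a plain similarity-plus-shear transform.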

Guide term
Owing to the difficulty of acquiring excellent inpainting results using color and gradient alone, we apply a guide term to constrain the patch search. Our guide term includes two constraints:

E_guide = λ E_structure + E_consistency,

where λ is the weight of the structure term. These two constraints together guide the completion process.

Structure term
Many approaches [12,27,28] have demonstrated that limiting the search space by labeling the texture region can improve the completion result. Hence, we adopt a method based on the gray-level co-occurrence matrix (GLCM) to detect the dominant texture direction and then automatically generate a structure guidance map that serves as a position constraint; details of this method can be found in Ref. [28]. In this study, we improve the method by further analyzing the optimal direction angle. As shown in Ref. [28], the greater the GLCM contrast, the smaller the similarity between two pixels. We also observe a relation between the offset value d and the number of direction angles: the greater the offset, the more direction angles can be distinguished. Zarif et al. [28] analyzed the texture direction using eight direction angles (d = 2). In this study, we compute the minimum of the contrast to detect the current direction angle (the minimum direction angle), as shown in Fig. 3b, and we examine more directions to determine the optimal one. Note that a large offset value may reduce sensitivity to the texture direction; thus, we set the maximum offset to d_max = 20. The distribution of the minimum direction angle is illustrated in Fig. 3c. We take the average of all the minimum direction angles to determine the optimal texture direction.
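The direction detection above can be sketched as follows (our own NumPy reimplementation of GLCM contrast; the quantization level and the angle sampling are illustrative assumptions):

```python
import numpy as np

def glcm_contrast(q, dy, dx):
    # GLCM contrast for pixel pairs at offset (dy, dx): the mean squared
    # gray-level difference over all valid pixel pairs.
    h, w = q.shape
    y0, y1 = max(0, -dy), min(h, h - dy)
    x0, x1 = max(0, -dx), min(w, w - dx)
    a = q[y0:y1, x0:x1]
    b = q[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
    return float(np.mean((a - b) ** 2))

def dominant_direction(img, d=2, n_angles=8, levels=8):
    # Quantize the image, then pick the minimum-contrast offset direction:
    # smaller contrast means more similar pixel pairs along that direction.
    q = np.round(np.asarray(img, float) / max(float(np.max(img)), 1.0)
                 * (levels - 1)).astype(int)
    best_angle, best = 0.0, np.inf
    for k in range(n_angles):
        ang = np.pi * k / n_angles               # angles sampled in [0, pi)
        dy = int(round(d * np.sin(ang)))
        dx = int(round(d * np.cos(ang)))
        if dy == 0 and dx == 0:
            continue
        c = glcm_contrast(q, dy, dx)
        if c < best:
            best_angle, best = ang, c
    return best_angle
```

On an image of horizontal stripes the minimum-contrast direction is horizontal (angle 0); on its transpose it is vertical (angle π/2).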
In an original image, the content along a given direction usually has similar structure and texture. To exploit this property, we use the detected optimal direction to represent content changes. Rather than limiting the search space with flat colors, a color gradient is adopted, as shown in Fig. 3d. The structure guidance map is treated as a soft constraint in the completion process: locations with the same color in the map usually share the same texture, so the map encourages searching for similar patches along the same direction.
The structure term E_structure makes use of the guidance map to constrain the positions G_pos from which source patches are drawn. It is defined as

E_structure = L(|G_pos(t_i) − G_pos(s_i)|),

where L(·) is the ϵ-insensitive loss function L(x) = max(0, x − ϵ), and G_pos(·) denotes the position information of the source patch and target patch in the structure guidance map. Large differences between these values are penalized by this term, so sampling along the texture direction is encouraged when minimizing the energy function.
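In code, the ϵ-insensitive loss and the resulting structure penalty look like this (the value ϵ = 0.5 is an illustrative assumption; the paper does not report its setting):

```python
def eps_insensitive(x, eps=0.5):
    # L(x) = max(0, x - eps): differences smaller than eps cost nothing.
    return max(0.0, x - eps)

def structure_term(g_target, g_source, eps=0.5):
    # Penalize source samples whose guidance-map value differs from the
    # target's; sampling along the same texture direction is free.
    return eps_insensitive(abs(float(g_target) - float(g_source)), eps)
```

The dead zone around zero makes the guidance map a soft rather than hard constraint.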

Consistency term
Inspired by Refs. [12,20], we add a consistency term into the completion process to sample patches in adjacent regions. Given a target patch t_i with a matching patch s_i, their neighboring patches t_i^n and s_i^n are very likely to be the most similar patches. We assume that every patch has neighbors in four directions, and define t_i^{n_1}, t_i^{n_2}, t_i^{n_3}, and t_i^{n_4} as the neighboring patches of t_i and s_i^{n_1}, s_i^{n_2}, s_i^{n_3}, and s_i^{n_4} as the neighboring patches of s_i, respectively. If the difference between neighboring patches exceeds a threshold, we add a consistency constraint to encourage sampling patches from neighboring areas. The consistency term has the form

E_consistency = Σ_{j=1}^{4} C(t_i^{n_j}, s_i^{n_j}),

where C(·, ·) imposes a penalty of ε whenever the source patch matched to a neighbor of t_i does not lie adjacent to s_i. Here, we set ε = 1 to encourage sampling near the source patch, which helps to maintain texture consistency.
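One plausible reading of the penalty C, sketched in Python (the adjacency test and the radius are our own assumptions; ε = 1 follows the text):

```python
import numpy as np

def consistency_term(neighbor_matches, s_center, eps=1.0, radius=2.0):
    # For each of the four neighbors of the target patch, check whether
    # its matched source position lies near the source patch center s_i;
    # every neighbor matched far away costs eps.
    s = np.asarray(s_center, float)
    cost = 0.0
    for m in neighbor_matches:          # matched source positions (y, x)
        if np.linalg.norm(np.asarray(m, float) - s) > radius:
            cost += eps
    return cost
```

When all four neighbors are matched adjacently the term vanishes, so coherent copying of a contiguous source region is free.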

Gradient term
To improve the completion results, finding correct patches is necessary. Barnes et al. [12] adopted the L2 patch distance to compute the similarity between two patches. However, PatchMatch [12] may find incorrect patches when the texture is complicated, as shown in Fig. 4d; the method fails to find the correct texture because it does not consider gradients. Adding a gradient term helps to gradually adjust the colors. We define the gradient term as

E_gradient = Σ_i ||∇P(t_i) − ∇Q(s_i, t_i, θ_i)||²,

where ∇Q(s_i, t_i, θ_i) and ∇P(t_i) denote the gradients of the patches centered at s_i and t_i, respectively. The gradient term adjusts the local color of a patch and leads to a globally smooth transition of intensity and color [14], a property that is lacking in patch-based methods. Here, we again use the sum of squared distances in RGB space to calculate the distance. This term helps to find the most similar patch and achieve higher consistency.
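A sketch of the term in NumPy (np.gradient stands in for whatever discrete gradient operator the implementation uses). Note that a constant color offset leaves the term at zero, which is exactly what lets it tolerate color deviation between patches:

```python
import numpy as np

def gradient_term(P, Q):
    # Sum of squared distances between the gradients of the target patch P
    # and the matched, transformed source patch Q.
    gP = np.gradient(np.asarray(P, float), axis=(0, 1))
    gQ = np.gradient(np.asarray(Q, float), axis=(0, 1))
    return float(sum(np.sum((a - b) ** 2) for a, b in zip(gP, gQ)))
```

Two patches differing only by a constant brightness shift score zero here, while the color term would penalize them heavily.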
In the search step, we adopt the PatchMatch method [12] to accelerate our algorithm. When searching for the best matching patches, the position of a matching patch is found first; then, we search for a transformed matching patch. For every target patch, the nearest-neighbor patches in the source region are searched to minimize the function.
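For concreteness, a minimal single-scale sketch of the PatchMatch search (grayscale and translation-only; the full method additionally searches over the transformation parameters θ_i, and all sizes and iteration counts here are illustrative assumptions):

```python
import numpy as np

def patch(img, y, x, r):
    # Square patch of radius r centered at (y, x).
    return img[y - r:y + r + 1, x - r:x + r + 1]

def ssd(a, b):
    # Sum of squared differences between two patches.
    return float(np.sum((a - b) ** 2))

def patchmatch(src, tgt, r=1, iters=4, seed=0):
    # Randomly initialize a nearest-neighbor field (NNF), then alternate
    # propagation (reuse a neighbor's match shifted by one) with a
    # decaying random search around the current best match.
    rng = np.random.default_rng(seed)
    h, w = tgt.shape
    H, W = src.shape
    centers = [(y, x) for y in range(r, h - r) for x in range(r, w - r)]
    nnf = {}
    for (y, x) in centers:
        sy = int(rng.integers(r, H - r))
        sx = int(rng.integers(r, W - r))
        nnf[(y, x)] = (sy, sx, ssd(patch(tgt, y, x, r), patch(src, sy, sx, r)))
    for it in range(iters):
        order = centers if it % 2 == 0 else centers[::-1]
        step = 1 if it % 2 == 0 else -1
        for (y, x) in order:
            best = nnf[(y, x)]
            # Propagation: try an already-scanned neighbor's offset.
            for dy, dx in ((step, 0), (0, step)):
                n = nnf.get((y - dy, x - dx))
                if n is not None:
                    sy, sx = n[0] + dy, n[1] + dx
                    if r <= sy < H - r and r <= sx < W - r:
                        d = ssd(patch(tgt, y, x, r), patch(src, sy, sx, r))
                        if d < best[2]:
                            best = (sy, sx, d)
            # Random search: sample around the best match at shrinking radii.
            rad = max(H, W)
            while rad >= 1:
                sy = int(np.clip(best[0] + rng.integers(-rad, rad + 1), r, H - r - 1))
                sx = int(np.clip(best[1] + rng.integers(-rad, rad + 1), r, W - r - 1))
                d = ssd(patch(tgt, y, x, r), patch(src, sy, sx, r))
                if d < best[2]:
                    best = (sy, sx, d)
                rad //= 2
            nnf[(y, x)] = best
    return nnf
```

Because a match is only ever replaced by a strictly better one, every patch's distance is non-increasing across iterations.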
Unlike previous methods, we reject unlikely patch transformations in scale when finding similar patches, i.e., we require scale_1 ≤ S_scale(T_i, s_i, t_i) ≤ scale_2, where S_scale(·) is the estimated scale. A range of scales that is too large cannot provide an effective constraint when finding source patches, while one that is too small leads to a narrow patch search space. We set scale_1 = 0.7 and scale_2 = 1.3 as the acceptable range in our experiments and obtain valid results. The approximate scale can be estimated using a first-order Taylor expansion [29].
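The scale check can be sketched as follows (a numerical Jacobian stands in for the first-order Taylor estimate of Ref. [29]; the evaluation point and step size are assumptions):

```python
import numpy as np

def apply_h(T, x, y):
    # Apply the 3x3 projective transformation T to the point (x, y).
    p = T @ np.array([x, y, 1.0])
    return p[:2] / p[2]

def local_scale(T, x, y, h=1e-4):
    # First-order scale estimate at (x, y): the square root of the
    # absolute Jacobian determinant of the projective map.
    Jx = (apply_h(T, x + h, y) - apply_h(T, x - h, y)) / (2 * h)
    Jy = (apply_h(T, x, y + h) - apply_h(T, x, y - h)) / (2 * h)
    J = np.column_stack([Jx, Jy])
    return float(np.sqrt(abs(np.linalg.det(J))))

def scale_ok(T, x, y, lo=0.7, hi=1.3):
    # Reject transformations whose local scale leaves [scale_1, scale_2].
    return lo <= local_scale(T, x, y) <= hi
```

A uniform 1.2x scaling passes the test, while a 2x scaling is rejected.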
In the voting step, the overlapping patches containing a pixel p have corresponding patches in the source region. Wexler et al. [11] adopted a weighted voting scheme to fill the target region. Similarly, we take the median of all the votes as the pixel value to reduce the blurring of pixel colors.
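Per hole pixel, the voting step reduces to a per-channel median over all color proposals (a minimal sketch; the vote collection itself is omitted):

```python
import numpy as np

def median_vote(votes):
    # votes: one RGB color proposal per overlapping patch for a single
    # hole pixel; the per-channel median is less blur-prone than the mean.
    return np.median(np.asarray(votes, float), axis=0)
```

Unlike averaging, the median ignores outlier votes instead of smearing them into the result.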
When calculating the patch distance, following HaCohen et al. [30], bias and gain are added to obtain the best matching patches. In this study, we set the bias range to [−50, 50] and the gain range to [0.5, 1.5]; source patches whose gain or bias deviates from these ranges are rejected. This also helps to extend the patch search space and to match patches with wide color differences.
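A sketch of this rejection test (estimating gain from patch standard deviations and bias from patch means is our own simplification; Ref. [30] fits gain and bias per patch pair):

```python
import numpy as np

def gain_bias(P, Q):
    # Estimate gain g and bias b such that g * Q + b roughly matches P,
    # from simple patch statistics.
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    g = P.std() / (Q.std() + 1e-9)
    b = P.mean() - g * Q.mean()
    return g, b

def photometric_ok(P, Q, gain=(0.5, 1.5), bias=(-50.0, 50.0)):
    # Reject source patches whose gain or bias leaves the allowed range.
    g, b = gain_bias(P, Q)
    return gain[0] <= g <= gain[1] and bias[0] <= b <= bias[1]
```

A patch matched to itself passes, while a 3x brighter or heavily offset source patch is rejected.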

Implementation details
Our algorithm was implemented in MATLAB and C++. The number of PatchMatch iterations was 20-30; a large hole region requires more iterations. The running time of the proposed image completion method can be divided into two parts. The first is generating the guidance maps, which requires several seconds. The second, which dominates the running time, depends on the image size, the hole region, and the texture complexity. For instance, given a 400 × 600 image with a 120 × 140 damaged region, the inpainting process may require 2-3 min.

Comparison results
To demonstrate the results of the proposed method, we compare our method with several existing image completion algorithms, including Criminisi [8], image melding [14] and He and Sun [22]. We run these methods on six test images, as shown in Fig. 5.
In the first two rows, the buildings contain more than one plane. We can see that the proposed method deals well with structural scenes, whereas the other methods cannot maintain structural consistency using only patch translation. The third row shows buildings with projective distortions. Criminisi's method clearly propagates erroneous information into the damaged region because of a flaw in its priority computation in special cases. Image melding, while taking multiple patch transformations into account, fails to recover the original structure. He and Sun fill in the damaged region based on offset statistics; however, their method cannot find a solution in a perspective space. The results in the fourth and fifth rows show that our method recovers structural consistency: we transform patches sampled in the source region into the target region using the transformation model with a scale variation. The last row illustrates that our algorithm also performs outstandingly in maintaining textural consistency.

Qualitative evaluation
The real purpose of image completion is to find a completion that satisfies the user. One important test is visual inspection; another is obtaining quantitative results using the peak signal-to-noise ratio (PSNR). The PSNR comparison of the six images in Fig. 5 is shown in Table 1.
We observe that the PSNR values of the proposed algorithm are overall slightly higher than those of the other algorithms. The images completed by our method also appear more consistent and coherent to the human eye than those produced by the other methods. Figure 6 shows the comparisons.
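For reference, the PSNR between a ground-truth image and a completed one can be computed as follows (assuming 8-bit images with a peak value of 255):

```python
import numpy as np

def psnr(reference, completed, peak=255.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the
    # ground truth.
    err = np.asarray(reference, float) - np.asarray(completed, float)
    mse = float(np.mean(err ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```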

Other results
Object removal is one of the applications of image completion. To demonstrate the robustness of our method, we compare it with existing methods in natural scenes, where a set of plane parameters cannot be acquired. Our method still maintains the consistency of the textural structure, and the results satisfy human visual coherence. Figure 7 shows the comparisons with the methods of Criminisi [8], Komodakis [10], He [22], and Le Meur [19]. In Table 2, we give the quality scores of the inpainted images, as determined by the technique reported in Ref. [31]; the lower the score, the better the image quality. From the comparison, we can see that Criminisi's method introduces texture in wrong locations, and the methods of Komodakis and Le Meur can hardly guarantee structural continuity. He's method achieves a more satisfactory inpainting result, but small flaws still exist. Compared with these methods, ours achieves better texture coherence and structural continuity.
In Fig. 8, we compare the proposed algorithm with a method using a deep learning model [23]. The input images are 128 × 128, and we show results in both structural and natural scenes. Compared with the deep learning model, our method performs better in maintaining structural integrity and the global consistency of texture. The deep learning model repairs the damaged region in a "generative" way, so the quality of its results relies on abundant training data and a well-designed network structure. In contrast, we estimate perspective distortions using a transformation model and constrain the completion process using the guidance map. The PSNR (dB) values are shown in Table 3.
Figure 9 shows the comparison of results by the proposed method and Huang's method [16]. From the first row, we can see that our algorithm performs better when inpainting a large damaged region. The second row shows the comparison in a perspective scene: owing to the lack of search-space and scale constraints, the structure is distorted at the end of the building in Huang's result. The third row shows the comparison in keeping texture continuity; Huang's method fails to find the demarcation between the two kinds of texture. The fourth row demonstrates that our method performs plausibly in maintaining global texture consistency. We apply a gradient term and a consistency term in our objective function to maintain texture details and encourage sampling patches in adjacent areas. Therefore, the proposed method performs better in both continuity and visual effect. The PSNR (dB) values are shown in Table 4.

Effect of patch size

Figure 10 shows the impact of patch size on the completion results. Our algorithm performs poorly when the patch is too small, since a small patch cannot capture enough texture; similarly, redundant texture is copied when the patch is too large. We apply different patch sizes to an example, as shown in Fig. 10.
Figures 11 and 12 show the effect of the guidance map and the parameter λ. The guidance map offers significant guidance for the patch search process. Here, we show the results of our method with different parameter values and comparisons. Figure 11 shows that the structure guidance map helps preserve structural integrity. In Fig. 12, we show the effect of the parameter λ: the structure line of the house cannot be repaired reasonably if λ is too small, whereas the structural texture becomes discontinuous if λ is too large. In our experiments, λ was set to 2.5, which gave acceptable performance.

Effect of gradient term and consistency term
To gain some intuition about the importance of the gradient term and the consistency term, we illustrate four cases of information usage in Fig. 13. Figure 13b shows the completion result without any guidance: the result is blurry and the structure is wrong. In Fig. 13c, we use only the gradient term in the optimization process; since the structure information is insufficient, we obtain broken structures. In Fig. 13d, we use only the consistency term; while the completed region has structure information, the texture synthesis contains errors in detail. The best result is acquired using both the gradient and consistency terms, as shown in Fig. 13e.

Limitations
It is difficult for our method to handle texture details if the opposite sides of the hole have textures with very different dominant directions; we fail to complete the structure lines in such cases, as shown in Fig. 14. The results may be improved using more sophisticated computer vision methods, which we leave to future work. Running time is another limitation: our approach is only a prototype, and its time cost can be reduced with more efficient algorithms.

Conclusion
We have proposed an improved image completion method using structural constraints. First, we adopted a parameterized transformation model with detected plane parameters to extend the patch search space. Second, we proposed an objective function with two constraints to guide the completion process; these constraints provide effective guidance when searching for the best matching patches. Finally, we combined the constraints and a gradient term into a framework that can solve more challenging problems. We applied our method to many images of various scenes and acquired promising, visually consistent results.

Availability of data and materials

Data will not be shared; the reason for not sharing the data and materials is that the work submitted for review is not completed. The research is still ongoing, and the data and materials are still required by the authors for further investigations.