Enhanced computation method of topological smoothing on shared memory parallel machines
EURASIP Journal on Image and Video Processing volume 2011, Article number: 16 (2011)
Abstract
To prepare images for better segmentation, preprocessing operations such as smoothing are needed to reduce noise. In this paper, we present an enhanced computation method for smoothing 2D objects in binary images. Unlike existing approaches, the proposed method provides parallel computation and better memory management, while preserving the topology (number of connected components) of the original image by using homotopic transformations defined in the framework of digital topology. We introduce an adapted parallelization strategy, called split, distribute and merge (SDM), which allows efficient parallelization of a large class of topological operators. To achieve a good speedup and better memory allocation, we paid particular attention to task scheduling and management. The work distributed during the smoothing process is carried out by a variable number of threads. Tests on a 2D grayscale image (512*512), using a shared memory parallel machine (SMPM) with 8 CPU cores (2× Xeon E5405 running at 2 GHz), showed an enhancement of 5.2 with a cache success rate of 70%.
1. Introduction
Smoothing filters are a method of choice for image preprocessing and pattern recognition. For example, the analysis or recognition of a shape is often perturbed by noise, so smoothing object boundaries is a necessary preprocessing step. Also, when warping binary digital images, we obtain a crenellated result that must be smoothed for better visualization. The smoothing procedure can also be used to extract shape characteristics: by taking the difference between the original and the smoothed object, salient or carved parts can be detected and measured.
Shape smoothing has been extensively studied and many approaches have been proposed. The most popular one is linear filtering by Laplacian smoothing for 2D vectors [1] and 3D meshes [2]. Another approach, morphological filtering, can be applied directly to the shape [3] or to the curvature plot of the object's contour [4]. Unfortunately, none of these operators preserves the topology (number of connected components) of the original image. In 2004, our team introduced a new method for smoothing 2D and 3D objects in binary images while preserving topology [5]. Objects are defined as sets of grid points, and topology preservation is ensured by the exclusive use of homotopic transformations defined in the framework of digital topology [6]. Smoothness is obtained by the use of morphological openings and closings by metric discs or balls of increasing radius, in the manner of alternating sequential filters [7]. The authors addressed two major issues, namely preserving the topology and handling the multitude of objects in the scene to be smoothed, without considering the memory management, latency or cadence of their filter. This paper describes an enhanced computation method for the topological smoothing filter that ensures better performance. We also present a new parallelization strategy, called Split Distribute and Merge (SD&M). Our strategy is designed specifically for the parallelization of topological operators on shared memory architectures. The new strategy is based upon the exclusive combination of two patterns: divide and conquer, and event-based coordination.
This paper is organized as follows: in section 2, some basic notions of topological operators are summarized and the original smoothing filter is introduced. In section 3, the adopted parallelization strategy is introduced. We define the class of operators that our strategy may cover. Motivations for using shared memory parallel machines are also given. Thread coordination and task scheduling are discussed. In section 4, the new parallel smoothing method is introduced, and evaluations of acceleration, efficiency and cache memory access success rate are presented and discussed. Finally, we conclude with a summary and future work in section 5.
2. Theoretical background
In this section, we recall some basic notions of digital topology [6] and mathematical morphology for binary images [8]. We also define the homotopic alternating sequential filters [5]. For the sake of simplicity, we restrict ourselves to the minimal set of notions that will be useful for our purpose. We start by introducing morphological operators based on structuring elements which are balls in the sense of the Euclidean distance, in order to obtain the desired smoothing effect.
We denote by ℤ the set of integers, and by E the discrete plane ℤ^{2}. A point x ∈ E is defined by (x_{1}, x_{2}) with x_{ i } ∈ ℤ. Let x ∈ E and r ∈ ℤ; we denote by B_{ r }(x) the ball of radius r centred on x, defined by B_{ r }(x) = {y ∈ E, d(x, y) ≤ r}, where d is a distance on E. We denote by B_{ r } the map which associates to each x in E the ball B_{ r }(x). The Euclidean distance d on E is defined by: d(x, y) = [A^{2} + B^{2}]^{1/2} with A = (x_{1} − y_{1}) and B = (x_{2} − y_{2}).
An operator on E is a mapping from P(E) into P(E), where P(E) denotes the set of all subsets of E. Let r be an integer; the dilation by B_{ r } is the operator δ_{ r } defined by δ_{ r }(X) = ∪_{ x ∈ X } B_{ r }(x), ∀X ∈ P(E). The ball B_{ r } is termed the structuring element of the dilation. The erosion by B_{ r } is the operator ε_{ r } defined by duality: ε_{ r } = *δ_{ r }, where *δ_{ r } denotes the dual of δ_{ r }, that is, *δ_{ r }(X) = \overline{δ_{ r }(\overline{X})}.
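The two operators above can be illustrated with a minimal C sketch. It assumes a row-major 8-bit image where 1 marks object points, clips balls at the image frame, and obtains erosion by complementing, dilating and complementing again. The function names `dilate_ball` and `erode_ball` are hypothetical and the brute-force loops are chosen for readability, not speed.

```c
#include <stdlib.h>
#include <string.h>

/* Dilation of a binary image (w x h, row-major, 1 = object) by the
   Euclidean ball B_r: every object point stamps its ball into 'out'.
   'in' and 'out' must not overlap. */
void dilate_ball(const unsigned char *in, unsigned char *out,
                 int w, int h, int r)
{
    memset(out, 0, (size_t)w * (size_t)h);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            if (!in[y * w + x])
                continue;
            for (int dy = -r; dy <= r; dy++)
                for (int dx = -r; dx <= r; dx++) {
                    int xx = x + dx, yy = y + dy;
                    if (dx * dx + dy * dy > r * r)
                        continue;        /* outside the ball: d > r */
                    if (xx < 0 || xx >= w || yy < 0 || yy >= h)
                        continue;        /* clipped by the image frame */
                    out[yy * w + xx] = 1;
                }
        }
}

/* Erosion by duality: complement, dilate, complement again.
   Points outside the frame therefore behave as object points.
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int erode_ball(const unsigned char *in, unsigned char *out,
               int w, int h, int r)
{
    size_t n = (size_t)w * (size_t)h;
    unsigned char *tmp = malloc(n);
    if (tmp == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        tmp[i] = !in[i];
    dilate_ball(tmp, out, w, h, r);
    for (size_t i = 0; i < n; i++)
        out[i] = !out[i];
    free(tmp);
    return 0;
}
```

With r = 1 the ball reduces to a point and its four direct neighbours, so dilating a single pixel yields a five-pixel cross, and eroding that cross gives back the single pixel.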
Now, we introduce the notion of simple point, which is fundamental for the definition of topological operators in discrete spaces. We give a local characterization of simple points in E = ℤ^{2}. Let us consider the two neighbourhood relations Γ_{4} and Γ_{8} defined for each point x ∈ E by: Γ_{4}(x) = {y ∈ E; |y_{1} − x_{1}| + |y_{2} − x_{2}| ≤ 1} and Γ_{8}(x) = {y ∈ E; max(|y_{1} − x_{1}|, |y_{2} − x_{2}|) ≤ 1}.
For the general case, we define Γ_{n}^{*}(x) = Γ_{n}(x) \ {x} with n ∈ {4, 8}. Thus y is said to be n-adjacent to x if y ∈ Γ_{n}^{*}(x). We say also that two points x and y of X are n-connected in X if there is an n-path in X between these two points. The equivalence classes for this relation are the n-connected components of X. A subset X of E is said to be n-connected if it consists of exactly one n-connected component. The set of all n-connected components of X which are n-adjacent to a point x is denoted by C_{ n }[x, X]. In order to have a correspondence between the topology of X and the topology of \overline{X}, we use n-adjacency for X and \overline{n}-adjacency for \overline{X}, with (n, \overline{n}) equal to (8, 4) or (4, 8).
Informally, a simple point p of a discrete object X is a point which is inessential to its topology. In other words, we can remove p from X without changing its topology. A point x ∈ X is said to be simple if each n-component of X contains exactly one n-component of X \ {x} and if each \overline{n}-component of \overline{X} ∪ {x} contains exactly one \overline{n}-component of \overline{X}. Let X ⊂ E and x ∈ E; the two connectivity numbers are defined as follows (#X = cardinality of X): T(x, X) = #C_{n}[x, Γ_{8}^{*}(x) ∩ X]; \overline{T}(x, X) = #C_{\overline{n}}[x, Γ_{8}^{*}(x) ∩ \overline{X}].
The following property allows us to locally characterize simple points [6, 9], and hence to implement topology-preserving operators efficiently:
(x ∈ E) is simple for X ⊆ E ↔ T(x, X) = 1 and \overline{T}(x, X) = 1.
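The local characterization above can be sketched in C as follows. This is a minimal illustration, assuming (n, \overline{n}) = (8, 4), a row-major 8-bit image where non-zero pixels belong to X, and pixels outside the frame counted as background. The names `count_components` and `is_simple` are hypothetical, and the components are found by a small flood fill over the eight neighbours instead of the lookup tables usually preferred for speed.

```c
#include <stdlib.h>

/* Offsets of the 8 neighbours; even indices are the 4-neighbours. */
static const int DX[8] = { 1, 1, 0, -1, -1, -1, 0, 1 };
static const int DY[8] = { 0, -1, -1, -1, 0, 1, 1, 1 };

/* Number of n-connected components (n = 4 or 8) of the neighbour set
   'mask' (bit i set = neighbour i belongs to the set) that are
   n-adjacent to the central point. */
static int count_components(unsigned mask, int n)
{
    const unsigned touches_centre = (n == 4) ? 0x55u : 0xFFu;
    unsigned seen = 0;
    int count = 0;

    for (int s = 0; s < 8; s++) {
        if (!((mask >> s) & 1u) || ((seen >> s) & 1u))
            continue;
        unsigned comp = 1u << s, frontier = comp;
        while (frontier) {                   /* flood fill from s */
            unsigned next = 0;
            for (int i = 0; i < 8; i++) {
                if (!((frontier >> i) & 1u))
                    continue;
                for (int j = 0; j < 8; j++) {
                    if (!((mask >> j) & 1u) || ((comp >> j) & 1u))
                        continue;
                    int ax = abs(DX[i] - DX[j]), ay = abs(DY[i] - DY[j]);
                    int adjacent = (n == 4) ? (ax + ay == 1)
                                            : (ax <= 1 && ay <= 1);
                    if (adjacent) {
                        comp |= 1u << j;
                        next |= 1u << j;
                    }
                }
            }
            frontier = next;
        }
        seen |= comp;
        if (comp & touches_centre)
            count++;
    }
    return count;
}

/* Returns 1 if pixel (x, y) satisfies T = 1 and Tbar = 1, using
   8-adjacency for the object and 4-adjacency for the background.
   Pixels outside the image are treated as background. */
int is_simple(const unsigned char *img, int w, int h, int x, int y)
{
    unsigned object = 0;
    for (int i = 0; i < 8; i++) {
        int xx = x + DX[i], yy = y + DY[i];
        if (xx >= 0 && xx < w && yy >= 0 && yy < h && img[yy * w + xx])
            object |= 1u << i;
    }
    int t    = count_components(object, 8);
    int tbar = count_components(~object & 0xFFu, 4);
    return t == 1 && tbar == 1;
}
```

For a horizontal segment of three pixels, the two end points are simple while the middle point is not, since removing it would split the segment into two components.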
The homotopic alternating sequential filter is a composition of homotopic cuttings and fillings by balls of increasing radius. It takes an original image X and a control image C as input, and smooths X while respecting its topology and the geometrical constraints implicitly represented by C. A simple illustration is given in Figure 1. The smoothed image (b) is obtained using the HAS filter with a radius equal to five and 4-connectedness (Γ_{4}). More examples can be found in [5].
Based on this filter, the authors of [5] introduced a general smoothing procedure with a single parameter to control the smoothing degree. Let C ⊆ X, r ∈ ℕ and D ⊆ \overline{X}, with X any finite subset of E. The homotopic alternating sequential filter (HASF) of order n, with constraint sets C and D, is defined as follows:
In the previous formula, HC_{n}^{C} (i) refers to the homotopic cutting of X by B_{ n } with constraint set C, and HF_{n}^{D} (ii) refers to the homotopic filling of X by B_{ n } with constraint set D. These two homotopic operators can be defined as follows:
We recall that H(Z, W) is a homotopic constrained thinning operator. It gives the ultimate skeleton of Z constrained by W. The ultimate skeleton is obtained by selecting simple points in increasing order of their distance to the background, thanks to a precomputed Euclidean distance map [10]. We recall also that *H(Y, V) is a homotopic constrained thickening operator. It thickens the set Y by iterative addition of points which are simple for \overline{Y} and belong to the set V, until stability.
We have provided, in this section, the theoretical underpinnings for studying topological transforms. We also introduced the homotopic alternating sequential filter, which constitutes the basis for topological smoothing.
3. Parallelization Strategy
In this section, we start by defining the class of topological operators. We also present our motivation for parallelizing these algorithms on shared memory parallel machines. Then, after a brief classification of existing strategies, we introduce the different steps of our approach. We focus especially on the distribution phase and on task scheduling over the different processors. Scheduling and merging algorithms are presented and discussed, and scenarios are introduced to illustrate both algorithms.
3.1 Class of topological algorithms
In 1996, Bertrand and Couprie [11] introduced connectivity numbers for grayscale images. These numbers describe locally (in a 3 × 3 neighborhood) the topology of a point. According to this description, any point can be characterized by its topology. They also introduced some elementary operations able to modify the gray level of a point without modifying the image topology. These elementary point characterization operations form a fundamental link between a large class of topological operators including, mainly, skeletonization and crest restoring algorithms [12]. This class can also be extended, under conditions, to homotopic kernel and leveling kernel transformations [13], the topological watershed algorithm [14] and the topological smoothing algorithm [5], which is the subject of this article. All the mentioned algorithms also share many structural similarities. In fact, the associated characterization procedures evolve until stability, which induces a common recursion between the different algorithms. The grey level of any point can also be lowered or raised more than once. Finally, all the mentioned algorithms take a pixel array as input and output data structure. It is important to mention that, to date, this class has not been efficiently parallelized, unlike other classes such as the connected filters of morphological operators, which were recently parallelized in Wilkinson's work [15]. The parallelization strategy proposed by Seinstra [16] for local operators and point-to-point operators can also be cited as an example. For global operators, Meijster's strategy [17] has also proven consistent. Hence the need for a common parallelization strategy for topological operators that offers an adapted algorithm structure design space. The algorithm structure patterns chosen for the design must be suitable for SMP machines.
In reality, although the cost of communication (memory-processor and inter-processor) is fairly high, shared memory architectures meet our needs for several reasons: (i) these architectures have the advantage of allowing immediate sharing of data, which is very helpful in the design of any parallelization strategy; (ii) they are non-dedicated architectures using standard components (processor, memory...) and are therefore economically viable; (iii) they also offer some flexibility of use in many application areas, in particular image processing.
3.2 Split Distribute and Merge Strategy
In practice, the most effective parallel algorithm designs might make use of multiple algorithm structures; thus the proposed strategy is a combination of the divide and conquer pattern and the event-based coordination pattern, see Figure 2. Hence the name that we have assigned: SD&M (Split Distribute and Merge) strategy. Our strategy should not be confused with the mixed-parallelism approach (combining data-parallelism and task-parallelism): (i) it represents the last link in the decomposition chain of algorithm design patterns and provides a fine-grained description of the parallelization of topological operators, while the mixed-parallelism strategy provides a coarse-grained description without specifying a target algorithm; (ii) it covers only the case of recursive algorithms, while the mixed-parallelism strategy is effective only in the linear case; (iii) it is especially designed for shared memory architectures with uniform access.
3.2.1 Split phase
The Divide and Conquer pattern is applied first by recursively breaking down the problem into two or more subproblems of the same type, until these become simple enough to be solved directly. Splitting the original problem takes into account, in addition to the original algorithm's characteristics (mainly topology preservation), the mechanisms by which data are generated, stored, transmitted over networks (processor-processor or memory-processor), and passed between different stages of computation.
3.2.2 Distribute phase
Work distribution is a fundamental step to ensure full exploitation of the potential of multicore architectures. We start by briefly recalling some basic notions of distribution techniques, then we introduce our minimal distribution approach, which is particularly suitable for recursive topological algorithms where simple point characterization is necessary. Our approach is general and applicable to shared memory parallel machines. Critical cases are also introduced and discussed.
There are two main types of scheduler. The first are designed for real-time systems (RTS). In this case, the approaches most commonly used to schedule a real-time task system are Clock-Driven, Processor-Sharing and Priority-Driven. Further description of the different scheduling approaches can be found in [18–20]. According to [20], the Priority-Driven approach is far superior to the others. These schedulers must provide an operational RTS: completed work and results delivered on a timely basis. Other schedulers are designed for non-real-time systems. In this case, schedulers are not subject to the same constraints. Thus, the "Symmetric Multiprocessing" scheduler distributes tasks to minimize total execution time without load balancing between processors, see Figure 3(a). On multicore architectures, this can lead to a high occupancy rate of one processor while the others are idle.
We propose a novel task scheduling approach to prevent improper load distribution while improving total execution time, see Figure 3(b). In the literature, there are several schedulers that provide a balanced distribution of tasks, such as RSDL "Rotating Staircase Deadline" [21], which incorporates a foreground-background descending priority system (the staircase) with run queues managing minor and major epochs (rotation and deadline). Another scheduler, CFS "Completely Fair Scheduler" [22], has also proven consistent. It handles resource allocation for executing processes, and aims to maximize overall CPU utilization while maximizing interactive performance. These schedulers are based on the principle of task uniformity. Through task homogeneity, better distribution can be achieved and total execution time reduced. Unfortunately, these schedulers are not available in all operating system versions, especially for small systems. Based on the same principle of task uniformity, we propose a new scheduling algorithm that is simpler to implement and better adapted to the implementation of topological algorithms.
Let 'BasicNPS' be a basic non-preemptive scheduler, where T = {t_{1}, t_{2},..., t_{ k }} is the set of all tasks, T_{ T } = {t_{1}, t_{2}, ..., t_{ i }} is the set of tasks to process with T_{ T } ⊂ T, P = {p_{1}, p_{2},..., p_{ n }} is the set of all processors and P_{ a } = {p_{1}, p_{2},..., p_{ j }} is the set of available processors with P_{ a } ⊂ P.
BasicNPS (T_{ x } ⇒ P_{ y }) is able to schedule a set of tasks T_{ x } on a set of processors P_{ y }. Let {p} be the maximum number of processors that P_{ y } will contain. Then {p} can be defined as the maximum of the available processors already defined by the set P_{ a }: {p} = max p_{ j } / p_{ j } ∈ P_{ a }. While ([P_{ a } ≠ ∅] ∧ [T_{ T } ≠ ∅]), T_{ x } ⇒ P_{ y } with T_{ x } ∈ T_{ T } and P_{ y } ∈ P_{ a }. In this scheduler, each processor will treat at most m = max t_{ i } / t_{ i } → p_{ j } ≤ max(|T| / |P|) tasks, with j ∈ {1, 2,..., n}. Then, the worst case to process T is K(T) = max{max_{ i } T_{ T } → p_{1}, ..., max_{ j<...<i } T_{ T } → p_{ k }}. As proof, let us suppose that there exists a set L(T) such that Σ L(T) ≥ Σ K(T). As 'BasicNPS' manages L(T) and K(T), we have L(T) ≤ m and K(T) ≤ m. Thus, if (Σ L(T) ≥ Σ K(T)), then there exists at least one task {l}, with k ∈ K(T), such that (A ∧ B ∧ C) with A = (l ∈ L(T)), B = (l ∉ K(T)), C = (l > k). This is impossible according to the definition of K(T), which was defined as the worst case.
Algorithm 1 describes the 'BasicNPS' policy. The first step consists in asking the operating system for the number of available processors. Depending on this number, the algorithm generates processes: one active process is assigned to each available processor. These new processes belong to the SCHED_FIFO class in order to ensure preemption and, especially, to avoid context switching. A process only stops running when its work is complete or, less frequently, when another process of the same class with higher priority requests the processor. The global execution stops when there are no more tasks to process.
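The load bound stated for 'BasicNPS' can be illustrated with a small C sketch. It is not the authors' Algorithm 1: it assumes tasks of uniform cost and a fixed number of available processors, and it simply hands tasks out in FIFO order, one per processor in turn. The function name `basic_nps_assign` and the `owner` array are hypothetical.

```c
#include <stddef.h>

/* Assigns k uniform tasks to n processors in FIFO order.
   owner[i] receives the index of the processor that runs task i.
   Returns the largest number of tasks given to a single processor,
   or -1 if the arguments are invalid. */
int basic_nps_assign(int k, int n, int *owner)
{
    if (k < 0 || n <= 0 || owner == NULL)
        return -1;

    int max_load = 0;
    for (int i = 0; i < k; i++) {
        owner[i] = i % n;          /* next processor in turn */
        int load = i / n + 1;      /* tasks now held by that processor */
        if (load > max_load)
            max_load = load;
    }
    return max_load;
}
```

With 10 tasks and 4 processors, no processor receives more than 3 tasks, which is |T| / |P| rounded up. Under the uniform-cost assumption this is also the number of task slots needed to finish all the work.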
3.2.3 Merging phase
The key problem of any parallelization is merging the obtained results. Normally this phase is done at the end of the process, when all results have been returned by all threads, which usually means that only one output variable is declared and shared between all competing threads. But, as mentioned in section 3.1, we are dealing with a dynamic evolution, and if we take into account the different steps of simple point detection and then pixel characterization, we can plan the following: the original shared data structure, containing all pixels, is divided into n search zones {z_{1}, z_{2} ..., z_{ n }}. We associate one thread from the list {T_{1}, T_{2},..., T_{ n }} with each zone. Each thread can browse its zone freely and, if it detects a target pixel type, it lowers the characterized pixel and pushes its eight neighbors into one of the available FIFO queues. A queue is said to be available if only one thread (its owner) is using it. A queue cannot be shared by more than two threads, so if no queue is available, a thread can create a new one and become its owner.
As soon as two threads have finished, they merge directly: a new thread is created and the same process is launched again. The newly created thread inherits the queue shared between its parents, so it can restart the search. It is also important to mention that there is no hierarchical order in thread merging; the only criterion is finishing time. We mention also that a neighbor cannot be inserted twice, a precaution that minimizes cache consumption. A more formal description of the merging technique is given by Algorithm 2.
It is important to highlight the similarities and differences between our merging algorithm and Kahn process networks (KPN) [23]. Both are deterministic and do not depend on execution order. But a KPN may be executed sequentially or in parallel with the same outcome, while our merging algorithm is designed only for parallel execution. KPN supports recurrence and recursion, while our merging algorithm supports only recursion.
In large-scale applications, KPN has proven consistent. Examples include the Daedalus project [24], where generated KPN models are used to map processes onto FPGA architectures. Ambric architectures [25] also implement a KPN model using bounded buffers to create massively DMP machines based on a structural object programming model.
In a narrower framework limited to simple point characterization, the implementation of such a model would be very expensive, and it is preferable to find a simpler and more specific algorithm.
In Figure 4, we illustrate the merging algorithm with four threads. The original shared data structure is divided into 4 search areas {z_{1}, z_{2}, z_{3}, z_{4}}. Threads {T_{1}, T_{2}, T_{3}, T_{4}} start browsing the different zones in parallel. T_{1} is the first to detect a target point (constructible, destructible...), so it lowers the characterized pixel (in z_{1}) and pushes its eight (or four) neighbors into the FIFO queue F_{1}, which it has created, before continuing to browse. Later, T_{3} detects a new target point, so it lowers the characterized pixel (in z_{3}) and then pushes its neighbors into F_{1} before continuing to browse. T_{3} does not need to create a new FIFO queue since F_{1} is available. T_{1} and T_{3} repeat this procedure twice. Once they finish browsing, they merge and a new thread T_{5} is born. T_{5} starts by browsing only F_{1}. When it detects a new target point, it lowers the characterized pixel (in z_{5} = z_{1} + z_{3}) and then pushes its neighbors into F_{3}, which it has created, before continuing to browse. Similarly, T_{2} and T_{4} lead to the creation of F_{2} and T_{6}. T_{6} eventually merges with T_{5} to give birth to T_{7}. Finally, there is a single thread T_{7}, which browses F_{3} without detecting any target point.
We have introduced, in this section, the three steps necessary to implement our parallelization strategy (SDM). It is important to mention that, from a structural perspective, some similarity may exist between our split/merge phases and the alpha-extension/beta-reduction phases. Both approaches are intended to provide more guarantees that the parallelism will actually be achieved, but their contexts of use are different. The team of Jean-Paul Sansonnet [26–28] introduced the notions of alpha-extension (diffusion) and beta-reduction (merging) for stream manipulation in the framework of Declarative Data Parallel language definition, and their techniques cannot be applied without a scalar function. Our proposal, in contrast, is restricted to topological characterization in the framework of the parallelization of topological operators, and no scalar function is required during the application of these two phases.
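The rule that a neighbor cannot be inserted twice can be sketched with a FIFO queue guarded by a per-pixel flag. This is a single-threaded illustration under stated assumptions, not the authors' Algorithm 2: the type `PixelFifo` and all function names are hypothetical, a pixel is never re-queued once it has been inserted, and a queue shared by two threads would additionally need a lock or atomic updates around `fifo_push` and `fifo_pop`.

```c
#include <stdlib.h>

typedef struct {
    int *items;             /* pixel indices in insertion order */
    unsigned char *queued;  /* queued[p] = 1 once pixel p was inserted */
    int head, tail;
} PixelFifo;

/* Each pixel enters at most once, so npixels slots are always enough.
   Returns 0 on success, -1 on allocation failure. */
int fifo_init(PixelFifo *q, int npixels)
{
    q->items = malloc(sizeof *q->items * (size_t)npixels);
    q->queued = calloc((size_t)npixels, 1);
    q->head = q->tail = 0;
    if (q->items == NULL || q->queued == NULL) {
        free(q->items);
        free(q->queued);
        q->items = NULL;
        q->queued = NULL;
        return -1;
    }
    return 0;
}

void fifo_free(PixelFifo *q)
{
    free(q->items);
    free(q->queued);
    q->items = NULL;
    q->queued = NULL;
}

/* Returns 1 if p was inserted, 0 if it had already been queued. */
int fifo_push(PixelFifo *q, int p)
{
    if (q->queued[p])
        return 0;
    q->queued[p] = 1;
    q->items[q->tail++] = p;
    return 1;
}

/* Returns the oldest pixel index, or -1 if the queue is empty. */
int fifo_pop(PixelFifo *q)
{
    return (q->head == q->tail) ? -1 : q->items[q->head++];
}

/* Pushes the 8 neighbours of (x, y) that lie inside the w x h image.
   Returns how many were actually inserted. */
int fifo_push_neighbours(PixelFifo *q, int w, int h, int x, int y)
{
    int inserted = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int xx = x + dx, yy = y + dy;
            if ((dx == 0 && dy == 0) ||
                xx < 0 || xx >= w || yy < 0 || yy >= h)
                continue;
            inserted += fifo_push(q, yy * w + xx);
        }
    return inserted;
}
```

Because every pixel can enter the queue only once, the buffer never needs to grow beyond the number of pixels, which keeps the memory footprint bounded.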
4. Parallel smoothing filter
In this section, we start by analyzing the overall structure of the original algorithm. We then continue with the parallelization of the Euclidean distance, thinning and thickening algorithms. We conclude with a performance analysis of the entire topological smoothing operator. The obtained execution time, efficiency, speedup and cache misses are introduced and discussed.
As we have shown in Section 2, the smoothing algorithm receives as input a binary image and a maximum radius. It uses two procedures for homotopic opening and closing, see Figure 5(a) (b). The call is looped to ensure an ongoing relationship between input and output. The opening process is a consecutive execution of erosion, thinning, dilation and thickening. The closing procedure performs the same four consecutive functions with a single difference: the roles of erosion and dilation are exchanged. Thinning and thickening ensure the topological control of erosion and dilation. This control is based on searching for and removing all destructible points. When a destructible point is deleted, its neighbors are re-examined to check whether they are destructible in turn.
A preliminary assessment of the first implementation, see Table 1, shows that Euclidean distance computation (EucDis) takes more time than topological point characterization (Topcar). For a (200*200) image, the computation of the Euclidean distance (E.D) with an infinite radius takes 46.67% of the total time, while the characterization of 2.4 million points occupies only 18.15%. If we limit the radius to between 5 and 10, the share of E.D computation increases further: it can reach 64.44% of the total time with a radius equal to 5, whereas topological characterization takes only 8.89% for 1 million points. These findings remain the same if we increase the image size. Beyond (512*512), the computing time of point characterization becomes considerable.
4.1 Euclidean distance computing
4.1.1 Study on Euclidean distance algorithms
During the previous evaluation, the 4SED algorithm [10] was used for Euclidean distance computation. We are therefore looking for another algorithm that is faster and parallelizable. The new algorithm must have a Euclidean distance computation error less than or equal to that produced by 4SED in order to maintain the homotopic characteristics of the image. In the literature, several algorithms for Euclidean distance computation exist. The algorithms of Lemire [29] and Shih [30] are bad candidates: Lemire's algorithm does not use a Euclidean circle as structuring element, so the homotopic property would not be preserved, and Shih's algorithm has a strong data dependency, which penalizes parallelization. In [31], Cuisenaire proposes a first algorithm for Euclidean distance computation, called PSN "Propagation Using a Single Neighborhood", that uses the following structuring element:
He also proposes a second algorithm, called PMN "Propagation Using Multiple Neighborhood", that uses eight neighbors. In [32], he proposes a third algorithm with O(n^{3/2}) complexity, which offers an accurate computation of the Euclidean distance. The only drawback of this third algorithm is its computation time, which is very high and exceeds that of the two algorithms mentioned above. Even though the computing error produced by PSN is greater than that produced by PMN, it is comparable to that produced by 4SED. Its low data dependence and its ability to operate on 3D images make the PSN algorithm a potential candidate to replace 4SED.
Meijster [17] proposes an algorithm to compute the exact Euclidean distance. Its complexity is O(n) and it operates in two independent but successive steps. The first step scans the columns and computes the distance between each point and the existing objects. The second step applies the same treatment to the lines. It is important to note that the strong independence between the processing steps and a computing error equal to zero make Meijster's algorithm another potential candidate to replace 4SED. The algorithm is also able to operate on 3D images. A theoretical analysis of the algorithms of Meijster and Cuisenaire can be found in Fabbri's work [33].
In the following, we propose a first analysis based on implementations of the different algorithms in order to compare them. We implemented the 4SED algorithm using a fixed-size stack. This stack uses a FIFO queue and has a small size, since the 4SED algorithm does not need to store a temporary image: results are stored directly in the output image. We retain this implementation because the 4SED assessment serves only as a reference for comparison. For the PSN implementation, we used stacks with dynamic sizes. Memory is allocated in small blocks defined at stack creation. When an object is added to the queue, the algorithm uses the available memory of the last block. If no space is available, a new block is allocated automatically. The block size is proportional to the image size (N × M/100). Finally, we used a simple memory structure to implement Meijster's algorithm: a simple matrix was used to compute the distance between points and objects in each column, and three vectors were used to compute the distance in each line. We recall that this comparison is done in order to select the best algorithm among the three candidates.
Figure 6 describes the results obtained by the different implementations on a single-processor P4 architecture. During this evaluation we used a binary test image (200 × 200) and varied the ball radius. We used the Valgrind software to evaluate the different designs. The Callgrind tool returns the execution cost of each program by counting IF (Instruction Fetch). Results show that the PSN algorithm is the most expensive in all cases (for any radius). Meijster's algorithm is moderately faster than 4SED. The output images returned by Meijster's algorithm have the best visual quality, and the Euclidean distance computation error is almost zero; our efforts will therefore focus on the parallelization of Meijster's algorithm.
4.1.2 Parallelization of Meijster algorithm
We denote by I the input image with m columns and n rows, and by B an object included in I. The idea is to compute, for each point p ∈ I ∧ p ∉ B, the distance separating p from the closest point b, with b ∈ B and ∀(0 ≤ b ≤ m), b = (b_{ x }, b_{ y }). This amounts to computing the following matrix:
If we assume that the minimum distance of an empty group K is ∞ and that, ∀ z ∈ K, (z_{ y } + ∞) = ∞, then the EDT(p) formula can be written as follows: ∀b_{ x } < n, ∀b_{ y } ≤ m, EDT(p) = min (p_{ y } − b_{ y })^{2} + G(p_{ x }, b_{ y })^{2} with G(p_{ x }, y) = min |p_{ x } − b_{ x }|: b = (b_{ x }, y).
Thus we can split the Euclidean distance transform procedure into two steps. The first step scans the columns and computes EDT for each column y. The second step repeats the same procedure for each line. We now detail these two steps. In the first step, G(p_{ x }, y) can be computed through the two following sub-functions: G_{ T }(p_{ x }, y) = min p_{ x } − b_{ x }: b = (b_{ x }, y), and G_{ B }(p_{ x }, y) = min b_{ x } − p_{ x }: b = (b_{ x }, y), with ∀0 ≤ b_{ x } ≤ n. To compute G_{ T }(p_{ x }, y) and G_{ B }(p_{ x }, y), we scan each column y from top to bottom and then from bottom to top, using the two following formulas: G_{ T }(p_{ x }, y) = G_{ T }(p_{ x } − 1, y) + 1, G_{ B }(p_{ x }, y) = G_{ B }(p_{ x } + 1, y) + 1. Thus the sequential algorithm of the first step can be written as follows. Its complexity is O(n × m).
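A minimal C sketch of this first step is given here. It assumes a row-major 8-bit image where non-zero pixels belong to the object B, and it uses the value rows + cols to stand for ∞ in columns that contain no object point, as in the formulation of Meijster et al. The function name `edt_columns` is hypothetical; `row` plays the role of p_{x} and `col` the role of y.

```c
/* Step 1 of the two-step distance transform: for every pixel, the
   vertical distance to the nearest object pixel of the same column.
   img and g are rows x cols, row-major; non-zero img = object.
   The value rows + cols stands for "no object in this column". */
void edt_columns(const unsigned char *img, int *g, int rows, int cols)
{
    const int inf = rows + cols;

    for (int col = 0; col < cols; col++) {
        /* Top-to-bottom scan: distance to the nearest object above. */
        g[col] = img[col] ? 0 : inf;
        for (int row = 1; row < rows; row++) {
            int up = g[(row - 1) * cols + col];
            if (img[row * cols + col])
                g[row * cols + col] = 0;
            else
                g[row * cols + col] = (up < inf) ? up + 1 : inf;
        }
        /* Bottom-to-top scan: keep the nearer of "above" and "below". */
        for (int row = rows - 2; row >= 0; row--) {
            int down = g[(row + 1) * cols + col];
            if (down < g[row * cols + col])
                g[row * cols + col] = down + 1;
        }
    }
}
```

Each pixel is visited twice, which matches the n × m complexity order, and no column reads or writes the data of another column.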
Let us move to the second step. We start by defining f(p, y) = (p_{ y } − y)^{2} + G(p_{ x }, y)^{2}. Then we can define EDT(p) = min f(p, y), ∀0 ≤ y ≤ m. For each row u, we note that there is, for the same point p, the same value of f(p, y) for different values of y, so we can introduce the concept of "region of column".
Let S be the set of points y such that f(p, y) is minimal and unique. The formula of S, ∀0 ≤ y ≤ u, is S_{ p }(u) = min y: f(p, y) ≤ f(p, i), ∀0 ≤ i ≤ u ∧ u ≤ m. Let T be the set of points with coordinate greater than, or equal to, the horizontal coordinate of the intersection with a region: T_{ p }(u) = Sep_{ p_{ x } }(S_{ p }(u − 1), u) + 1.
Let Sep(i, u) be the separation between regions of i and u, defined by:
Thus lines are processed from left to right and then from right to left. During the first pass, from left to right, two vectors S and T are created. These two vectors contain respectively all regions and all intersections. During the second pass, from right to left, we compute f for each value of S and for each respective value of T. Algorithm 4 is associated with the second step. For the first pass, the complexity order is q + 2(m − u), whereas the complexity order of the second pass is only m.
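The two passes over one line can be sketched in C as follows. The sketch follows the second phase published by Meijster et al. [17]; in particular the separation function Sep(i, u) = (u² − i² + g(u)² − g(i)²) div (2(u − i)) is taken from that reference and is an assumption with respect to the present text. The names `edt_row`, `f_row` and `sep_row` are hypothetical, g holds the vertical distances G of the line, and the output is the squared distance.

```c
/* f(x, i) = (x - i)^2 + g(i)^2: squared distance from column x to the
   nearest object point of column i. */
static long f_row(const int *g, int x, int i)
{
    long d = x - i;
    return d * d + (long)g[i] * g[i];
}

/* Abscissa from which column u gives a smaller f than column i (i < u). */
static int sep_row(const int *g, int i, int u)
{
    long num = (long)u * u - (long)i * i
             + (long)g[u] * g[u] - (long)g[i] * g[i];
    return (int)(num / (2L * (u - i)));
}

/* Step 2 for one line of m columns.
   g: vertical distances of the line, dt: squared distances (output),
   s, t: scratch vectors of m entries (regions and their start points). */
void edt_row(const int *g, long *dt, int m, int *s, int *t)
{
    int q = 0;
    s[0] = 0;
    t[0] = 0;

    /* Left-to-right pass: build the regions S and intersections T. */
    for (int u = 1; u < m; u++) {
        while (q >= 0 && f_row(g, t[q], s[q]) > f_row(g, t[q], u))
            q--;
        if (q < 0) {
            q = 0;
            s[0] = u;
        } else {
            int w = 1 + sep_row(g, s[q], u);
            if (w < m) {
                q++;
                s[q] = u;
                t[q] = w;
            }
        }
    }

    /* Right-to-left pass: evaluate f with the region owning each column. */
    for (int u = m - 1; u >= 0; u--) {
        dt[u] = f_row(g, u, s[q]);
        if (u == t[q])
            q--;
    }
}
```

For a line whose only object point lies in column 1, the squared distances are 1, 0, 1 and 4, which is what the second pass returns when the other columns carry the "infinite" value.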
The independence of data processing between rows and columns is the key to apply of SDM parallelization strategy. In the first stage, column processing, we can define data interdependence by the following equation:
It follows that the values of each column y of G depend only on lines $p_x$, $p_x + 1$ and $p_x - 1$. Similarly, at the second stage, we can introduce the following interrelationship: $Edt(p) = f(p, S_p(q))$.
Then, $\forall\, (0 \le y \le u),\ (0 \le i \le u) \wedge (u < m)$: $S_p(u) = \min\{y : f(p, y) \le f(p, i)\}$. Thus, if $u = T_p(q)$, then $q \leftarrow q - 1$, which implies the following: $T_p(u) = Sep_{p_x}(S_p(q), u) + 1$.
According to this formalization, the values of $f(p, i)$ and $Sep_{p_x}(i, u)$ are independent of the modified data. So using two vectors S and T and a private variable q for each line ensures complete independence in writing. We start the splitting step by sharing the column and line processing between multiple processors. A thread can process one or more columns, and the number of threads used depends on the number of processors. The results returned by all threads in this first stage are merged before line processing starts. Algorithms 5 and 6 give the parallel version of the Meijster algorithm for both steps. The associated complexity is $O((n \times m)/N)$, where $n \times m$ is the image size and N is the number of processors.
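The column split can be sketched with OpenMP as below, mirroring the loop header of Algorithm 5 (thread t handles columns t, t + t_max, and so on). This is only a sketch under assumptions: the function name and row-major layout are hypothetical, and the `#ifdef _OPENMP` guards let the same code build sequentially when OpenMP is not enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Column stage of the EDT with a cyclic column split. Hypothetical layout:
   row-major image, n lines by m columns, b[x*m + y] != 0 for object points. */
void edt_columns_parallel(const unsigned char *b, int *g, int n, int m)
{
    const int inf = n + m;              /* stand-in for infinity */
    assert(n > 0 && m > 0);
#ifdef _OPENMP
#pragma omp parallel
#endif
    {
        int t = 0, tmax = 1;            /* sequential fallback without OpenMP */
#ifdef _OPENMP
        t = omp_get_thread_num();
        tmax = omp_get_num_threads();
#endif
        /* Thread t owns columns t, t + tmax, t + 2*tmax, ... */
        for (int y = t; y < m; y += tmax) {
            g[y] = b[y] ? 0 : inf;
            for (int x = 1; x < n; x++) {
                int up = g[(x - 1) * m + y];
                g[x * m + y] = b[x * m + y] ? 0 : (up < inf ? up + 1 : inf);
            }
            for (int x = n - 2; x >= 0; x--)
                if (g[(x + 1) * m + y] < g[x * m + y])
                    g[x * m + y] = g[(x + 1) * m + y] + 1;
        }
    }   /* end of the parallel region: every column is complete here */
}
```

The implicit barrier at the end of the parallel region plays the role of the merge point: line processing can only start once all columns of G are available, which is also where the waiting time between the two stages arises.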
The proposed parallel version of the Meijster algorithm was implemented in C using OpenMP directives. Speedups were determined for 1, 2, 4, 8 and 16 threads. The efficiency measure $\Psi(n)$, with n the number of processors, is given by the following formula: $\Psi(n) = \text{seq. time} / (n \times \text{para. time})$ (ii)
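As a small worked illustration of formula (ii), the two measures can be written as C helpers. The function names and the timing values used with them are hypothetical; only the formula itself comes from the text.

```c
#include <assert.h>

/* Speedup: sequential time divided by parallel time. */
double speedup(double t_seq, double t_par)
{
    assert(t_par > 0.0);
    return t_seq / t_par;
}

/* Efficiency Psi(n) = seq. time / (n * para. time), n = number of processors. */
double efficiency(double t_seq, double t_par, int n)
{
    assert(n > 0 && t_par > 0.0);
    return t_seq / (n * t_par);
}
```

Efficiency is simply the speedup divided by n, so a speedup of 5.2 on eight processors corresponds to $\Psi(8) = 5.2 / 8 = 0.65$, and a speedup of 1.6 on eight processors to 0.2.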
Timings were performed on an eight-core (2× Xeon E5405) shared memory parallel computer, on an Intel quad-core Xeon E5335, on an Intel Core 2 Duo E8400 and on a single-processor Intel Pentium 4 660. The minimum value of 5 timings was taken as most indicative of algorithm speed. More information about the characteristics of these architectures is given in Section 4.
The measurements were done on a 2D binary image (512 × 512). If the outcome is satisfactory for this standard size, it will also be satisfactory for smaller images. Given cache size limits, larger images were not tested. Figure 7 shows that the number of instructions needed to compute the Euclidean distance drops from an average of $9.5 \times 10^{8}$ with the 4SED algorithm to $7.6 \times 10^{8}$ with the Meijster algorithm. Despite the move from a sequential version running on a single core to a parallel version running on 8 processors, the speedup is only 1.6, as shown in Figure 8(a). This can be explained by the choke point between column processing and line processing: the waiting time between these two stages significantly penalizes the speedup. Figure 8(b) shows that efficiency varies with the number of threads. It is also proportional to the number of processors. Moving to an odd number of threads (3, 5 or 7) significantly decreases the efficiency, which reaches its maximum each time the number of threads equals the number of processors.
4.2 Thinning and thickening computing
The thinning and thickening algorithms are almost the same. The only difference is the following: in the thinning algorithm, destructible points are detected and their values are lowered, whereas in the thickening algorithm, constructible points are detected and their values are increased. For parallelization, we apply the same techniques introduced in [34]. We propose a similar version using two loops. Target points are first detected, then their value is lowered or raised according to the appropriate treatment. Their eight (or four) neighbors are copied into a "buffer" and rechecked. This treatment is repeated until stability. Algorithm 7 presents an adapted version of Couprie's thinning algorithm.
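To make the detection step concrete, here is a hedged C sketch of a destructibility test for the binary case only. It assumes that, for a binary image, a destructible point can be taken to be a simple point in the sense of [9], with 8-connectivity for the object and 4-connectivity for the background. The function names and the 3 × 3 patch encoding (row-major, center at index 4) are illustrative and not taken from the paper.

```c
#include <assert.h>
#include <stdlib.h>

/* Flood-fill one component of cells equal to `val` inside the 3x3 patch
   (center excluded), using 8- or 4-adjacency between neighbors. */
static void fill(const int nb[9], int seen[9], int start, int val, int conn8)
{
    int stack[9], top = 0;
    stack[top++] = start;
    seen[start] = 1;
    while (top > 0) {
        int c = stack[--top], cr = c / 3, cc = c % 3;
        for (int k = 0; k < 9; k++) {
            int dr = abs(k / 3 - cr), dc = abs(k % 3 - cc);
            int adjacent = conn8 ? (dr <= 1 && dc <= 1) : (dr + dc == 1);
            if (k != 4 && !seen[k] && nb[k] == val && adjacent) {
                seen[k] = 1;
                stack[top++] = k;
            }
        }
    }
}

/* Returns 1 if the object point at the center of the patch is simple. */
int is_simple_point(const int nb[9])
{
    static const int four[4] = {1, 3, 5, 7};   /* 4-neighbors of the center */
    int seen[9] = {0}, t8 = 0, t4bar = 0;
    assert(nb[4] == 1);                        /* only object points qualify */

    /* Number of 8-connected object components around the center. */
    for (int k = 0; k < 9; k++)
        if (k != 4 && nb[k] == 1 && !seen[k]) {
            fill(nb, seen, k, 1, 1);
            t8++;
        }

    /* Number of 4-connected background components 4-adjacent to the center. */
    for (int j = 0; j < 4; j++)
        if (nb[four[j]] == 0 && !seen[four[j]]) {
            fill(nb, seen, four[j], 0, 0);
            t4bar++;
        }

    return t8 == 1 && t4bar == 1;
}
```

With this test, the end point of a one-pixel-wide line is reported as simple, while a point in the middle of that line, an isolated point and an interior point are not.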
Unfortunately, the parallel processing introduced above cannot be applied directly to the set of all points. Some points, called critical points, cannot be removed in parallel because the initial topology of the image may be broken. Figure 9 illustrates this case: the critical points of an input image (a) are identified in (b). If these points are deleted in one iteration (c), the topology is necessarily broken (d).
To resolve this problem, we propose that the search area assigned to each thread be composed of at least six lines of the image. Each thread uses two buffers to treat each group of three lines, so four buffers are used to treat six lines, as shown in Figure 9(e).
With this organization, threads can start running in parallel on $Z_{11}$, $Z_{21}$ and $Z_{31}$. Once this processing is completed, threads can restart on $Z_{12}$, $Z_{22}$ and $Z_{32}$. In some cases, a neighbor of a destructible point lies on the border of a contiguous area. To prevent such a neighbor from escaping the recheck, it must be injected into the buffer of the appropriate thread. Suppose that a point $p \in Z_2$ is considered destructible by T2, so its value is lowered and its four neighbors $\{v_1, v_2, v_3, v_4\}$ must be rechecked. Neighbors $\{v_1, v_2, v_4\}$ belong to $Z_2$, so they are pushed into the T2 buffers. The neighbor $v_3$ belongs to $Z_3$, so it is pushed into the T3 buffers.
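The routing of neighbors to the buffer of the owning thread can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the `zone_buffer` type, its fixed capacity, and the rule that zone k owns a contiguous block of `lines_per_zone` lines are assumptions. In a threaded run, a push into another thread's buffer would also need to be protected (for example with an OpenMP critical section), which is left out here.

```c
#include <assert.h>

#define ZONE_CAP 64   /* illustrative capacity, not a value from the paper */

typedef struct {
    int rows[ZONE_CAP];
    int cols[ZONE_CAP];
    int count;
} zone_buffer;

/* Zone owning a given line when each thread gets lines_per_zone lines. */
int zone_of_line(int line, int lines_per_zone, int nzones)
{
    int z = line / lines_per_zone;
    return z < nzones ? z : nzones - 1;   /* last zone absorbs the remainder */
}

/* Push the 4-neighbors of (r, c) into the buffers of the zones that own
   them; returns the number of neighbors pushed. */
int route_neighbours(zone_buffer *bufs, int nzones, int lines_per_zone,
                     int height, int width, int r, int c)
{
    static const int dr[4] = {-1, 0, 1, 0}, dc[4] = {0, 1, 0, -1};
    int pushed = 0;
    assert(lines_per_zone >= 6);          /* constraint stated in the text */
    for (int k = 0; k < 4; k++) {
        int nr = r + dr[k], nc = c + dc[k];
        if (nr < 0 || nr >= height || nc < 0 || nc >= width)
            continue;                     /* neighbor outside the image */
        zone_buffer *b = &bufs[zone_of_line(nr, lines_per_zone, nzones)];
        if (b->count < ZONE_CAP) {
            b->rows[b->count] = nr;
            b->cols[b->count] = nc;
            b->count++;
            pushed++;
        }
    }
    return pushed;
}
```

For a point on the last line of the second zone, three neighbors land in that zone's buffer and one in the buffer of the third zone, which matches the $\{v_1, v_2, v_4\}$ and $v_3$ example above.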
The performance of the adapted version of Couprie's algorithm is shown in Figure 10. On the eight-core architecture, the speedup does not exceed 3.4. This moderate result can be explained by the processing of critical borders. Regarding efficiency, the best performance is achieved when the number of threads equals the number of processors; otherwise, the efficiency decreases. The problem with odd numbers of threads persists.
The next step is to combine the parallel version of the Meijster algorithm and the adapted version of Couprie's algorithm to build the parallel topological smoothing process.
4.3 Global analyses
In this section, we present a global evaluation of the parallel smoothing operator. We start by presenting performance evaluations in terms of acceleration and efficiency. Then, we evaluate cache memory consumption.
4.3.1 Execution time
We implemented two versions of the proposed parallel topological smoothing algorithm, the first using the 'Symmetric Multiprocessing' scheduler and the second using the 'basicNPS' scheduler. Wall-clock execution times were determined for 1, 2, 4, 8 and 16 threads. The minimum value of 2 timings was taken as most indicative of algorithm speed. The measurements were done on a 2D binary image (512 × 512). Results of the second implementation on the eight-core machine are shown in Figure 11.
We note that the number of instructions drops from an average of $1879 \times 10^{8}$ FI with a single thread to $1652 \times 10^{8}$ with 8 threads. As expected, the speedup of the second implementation, using the 'basicNPS' scheduler, is higher than that of the one using the 'Symmetric Multiprocessing' scheduler, thanks to the balanced distribution of tasks. A remarkable result about speedup is also shown in Figure 12(a): the speedup keeps increasing as the number of threads grows beyond the number of processors in our machine (eight cores). In the first implementation, using the 'Symmetric Multiprocessing' scheduler, the speedup at 8 threads is 1.9 ± 0.01. For the second implementation, using our scheduler, the speedup increases to 5.2 ± 0.01. Another result common to the different architectures is the stability of the execution time on each n-core machine once the code uses n or more threads.
For better readability of our results, we also tested the efficiency of our algorithm on various architectures (see Figure 12(b)) using the $\Psi(n)$ formula introduced earlier. For the parallel time, we used the best time obtained with 8 threads ('basicNPS' scheduler).
4.3.2 Cache Memory Evaluation
As memory access is a principal bottleneck in current-day computer architectures, a key enabler for high performance is masking the memory overhead. We start from the basic theory that two classic cache design parameters dramatically influence cache performance: the block size and the cache associativity. The simplest way to reduce the miss rate is to increase the block size, even though this increases the miss penalty. The second solution is to decrease associativity in order to decrease the hit time: to retrieve a block in an associative cache, the block must be searched for inside an entire set, since there is more than one place where it can be stored.
Unfortunately, we are dealing with non-reconfigurable architectures whose cache associativity and block size are predefined by the manufacturer. New approaches to reduce cache misses have been developed, such as taking advantage of locality of references to memory, or using aggressive multithreading so that whenever a thread is stalled waiting for data, the system can efficiently switch to another thread. Despite their power, the application of both approaches remains limited. Applications of the locality approach are still experimental, even with the Larrabee technology introduced by Intel, and the aggressive multithreading approach has been specially designed for graphics processing engines, which manage thousands of in-flight threads concurrently. It is therefore not recommended for general SMP machines with a limited number of processors and threads. Given all these limitations, the most intuitive solution is to rely on scheduling. Thanks to our basicNPS scheduler, we balance the load and prevent context switching, and thus minimize cache misses.
In the following, we present our experimental analysis. We consider commonly used Intel processor configurations (more details are given in Table 2). The number of processors varies from one to eight. The frequency varies between 1.73 GHz and 3.4 GHz. The L1 caches have at least a 32-byte block size, their capacity varies between 16 Kbytes and 32 Kbytes, and only eight-way associativity is considered. The L2 caches have at least a 64-byte block size, their capacity varies between 512 Kbytes and 6 Mbytes, and their associativity varies between two and twenty-four ways.
The scheduler relies on our basicNPS scheduling policy. As a result of this experiment (see Figure 13(A1)), three performance regions are clearly evident. In the leftmost region, as long as the cache capacity can effectively serve the growing number of threads, increasing the number of threads improves performance, as more processors are utilized. This area is generally identified as the cache-efficiency zone. At some point, the cache becomes too small for the growing stream of access requests, so memory latency is no longer masked by the cache and instruction cache misses decrease more moderately. As the number of available threads increases again, the multithread-efficiency zone (on the right) is reached, where adding more threads improves performance up to the maximal performance of the machine, or up to the bandwidth wall. Balanced workloads offer higher locality and better exploit the cache, and hence expand the cache-efficiency zone to the right and up. An outstanding example is given in Table 3, which summarizes the number of L1 instruction misses on the Intel Dual Core T1400 architecture using the SMP and BasicNPS scheduling policies. The number of instruction misses drops from an average of 18844 L1 instruction misses (using SMP) with two threads to 6030 L1 instruction misses (using BasicNPS), still with two threads. Here the success rate is well above the average of 50%. Practically the same rate is maintained when the number of threads increases (Figure 14).
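One plausible reading of the success rate quoted above is the relative reduction in L1 instruction misses between the two scheduling policies. Under that assumption (the paper does not spell out the definition), it can be computed as in the following sketch, where the function name `miss_reduction` is hypothetical.

```c
#include <assert.h>

/* Relative reduction in cache misses between two runs;
   positive when the second run misses less often. */
double miss_reduction(long misses_before, long misses_after)
{
    assert(misses_before > 0 && misses_after >= 0);
    return (double)(misses_before - misses_after) / (double)misses_before;
}
```

With the two-thread figures from Table 3, `miss_reduction(18844, 6030)` gives $(18844 - 6030) / 18844 \approx 0.68$, that is, roughly 68% fewer L1 instruction misses with BasicNPS than with SMP.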
Moreover, the shape of the performance curve depends on how fast the cache hit rate degrades as a function of the number of threads. Any successful access to L1 eliminates an attempt to access L2, so the performance curve (Figure 15) evolves in the same way. By reducing the number of instruction cache misses, the processor or thread of execution does not have to wait (stall) until the instruction is fetched from main memory, which immediately impacts execution time.
Figures 14(A) and 16(A1) show how much load balancing, and implicitly context switching between processes, can affect performance in terms of reading data from caches. However, the improvement in writing data (see Figures 14(B) and 16(B1)) in the two caches remains modest. When there are more computation instructions per memory access, performance climbs more steeply with additional threads. This is because, as more instructions are available for each memory access, fewer threads are needed to fill the stall time resulting from waiting for memory.
5 Conclusion
Topological characteristics are fundamental attributes of an object. In many applications, it is mandatory to preserve or control the topology of an image. Nevertheless, the design of transformations which preserve both topological and geometrical features of images is not an obvious task, especially for parallel processing.
In this paper, we have presented a new parallel computation method for topological smoothing, combining the parallel computation of the Euclidean distance transform using the Meijster algorithm with parallel thinning and thickening processes using an adapted version of Couprie's algorithm.
We have also presented a new parallelization strategy called SDM (Split, Distribute and Merge). The proposed strategy is partially based on the divide-and-conquer principle, associated with event-based coordination techniques. Beyond the smoothing operator, the SDM strategy can be applied to a large class of topological operators, as shown in Section 3.1. In addition to the conditions identified during the splitting step, we introduced an adapted scheduler called basicNPS (Basic Non-Preemptive Scheduler), able to distribute a set of active tasks over the available processors in a balanced way. Finally, we introduced an adapted merging policy designed especially for dynamic systems evolving until stability.
Parallel topological operator computation poses many challenges, ranging from parallelization strategies to implementation techniques. We tackle these challenges using successive refinement, starting with highly local operators, which process only by characterizing points and then deleting target pixels, and gradually moving to more complex topological operators with nonlocal behavior. In future work, we will study parallel computation of the topological watershed [14].
Algorithm 1. Scheduling policy
T: Set of all tasks
P: Set of all processors
While (T ≠ ∅) repeat:
    N_{T} = Nbr_active_tasks();
    N_{P} = Nbr_available_processors();
    If (N_{P} ≠ 0) then
        If (N_{T} < N_{P}) then
            For each processor N_{Pi}:
                Generatenewprocess (N_{Ti});
                Identifyclass (N_{Ti}, SCHED_FIFO);
            EndFor
        Else:
            N_{DT} = Desable_tasks (N_{T} - N_{P});
            Insert_desabled_tasks (N_{DT}, T);
            For each processor N_{Pi}:
                Generatenewprocess (N_{Ti});
                Identifyclass (N_{Ti}, SCHED_FIFO);
            EndFor
        EndIf
    EndIf
EndWhile
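The round structure of this policy can be illustrated with a small C simulation. It is a simplified sketch under assumptions: it only counts how many tasks are started per round when at most one task per available processor is dispatched and the excess is deferred, and the function name `dispatch_rounds` is hypothetical. It does not create processes or set a scheduling class; on Linux, the SCHED_FIFO class named above would be requested through `sched_setscheduler`, which usually requires elevated privileges.

```c
#include <assert.h>

/* Simulates the dispatch rounds of a non-preemptive batch policy:
   each round starts at most n_procs tasks and defers the rest.
   per_round[i] receives the number of tasks started in round i. */
int dispatch_rounds(int n_tasks, int n_procs, int *per_round, int max_rounds)
{
    int rounds = 0;
    assert(n_procs > 0 && n_tasks >= 0);
    while (n_tasks > 0 && rounds < max_rounds) {
        int started = n_tasks < n_procs ? n_tasks : n_procs;
        int deferred = n_tasks - started;   /* tasks disabled and re-queued */
        per_round[rounds++] = started;
        n_tasks = deferred;
    }
    return rounds;
}
```

For example, ten tasks on four processors are dispatched in three rounds of 4, 4 and 2 tasks, so no processor ever holds more than one running task at a time.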
Algorithm 2. Merging technique
Z: Set of research zones
T: Set of threads
FIFO_Q: Set of available FIFO queues
P_{T}: Target pixel type; P_{D}: Detected pixel
For all zones (Z_{i} ∈ Z) do:
    Parallel_browsing (T_{i}, Z_{i});
EndFor
For each thread (T_{i} ∈ T) do:
    If (pixel_caract(T_{i}, P_{T}) == True) then
        modify_value(P_{D});
        If (FIFO_Q ≠ ∅) then
            usedstatus(FIFO_Q_{j}, true);
            insert_neighbors(T_{i}, P_{D}, FIFO_Q_{j});
        Else:
            add_new_fifo (FIFO_Q);
            usedstatus(FIFO_Q_{j+1}, False);
            insert_neighbors(T_{i}, P_{D}, FIFO_Q_{j+1});
        EndIf
    EndIf
EndFor
Algorithm 3. Meijster original version [1st Step]
Data: m: columns, n: lines, b: image
Forall y ∈ [0..m - 1] do
    if (0, y) ∈ B then g[0, y] = 0
    else g[0, y] = ∞
    endif
    /* G_{T} */
    for (x = 1) to (n - 1) do
        if (x, y) ∈ B then g[x, y] = 0
        else g[x, y] = g[x - 1, y] + 1
        endif
    endfor
    /* G_{B} */
    for (x = n - 2) downto (0) do
        if g[x + 1, y] < g[x, y] then
            g[x, y] = g[x + 1, y] + 1
        endif
    endfor
endforall
Algorithm 4. Meijster original version [2nd Step]
Data: b: image, g: G_Table, m: columns, n: lines
Forall x ∈ [0..n - 1] do
    q = 0
    s[0] = 0
    t[0] = 0
    /* First part */
    for (u = 1) to (m - 1) do
        while (q ≥ 0) ∧ (f((x, t[q]), s[q]) > f((x, t[q]), u)) do
            q ← (q - 1)
        end while
        if (q < 0) then
            q ← 0
            s[0] ← u
        else
            w ← Sep(s[q], u, x) + 1
            if (w < m) then
                q ← (q + 1)
                s[q] ← u
                t[q] ← w
            endif
        endif
    endfor
    /* Second part */
    for (u = m - 1) downto (0) do
        Edt[x, u] = f((x, u), s[q])
        if (u = t[q]) then q ← (q - 1)
        endif
    endfor
End forall
Algorithm 5. Meijster parallel version [1st step]
For (y = t, y < m, y = y + t_{max}) do
    if (0, y) ∈ B then g[0, y] ← 0
    else g[0, y] ← ∞
    endif
    /* G_{T} */
    for (x = 1) to (n - 1) do
        if (x, y) ∈ B then g[x, y] ← 0
        else g[x, y] ← g[x - 1, y] + 1
        endif
    endfor
    /* G_{B} */
    for (x = n - 2) downto (0) do
        if (g[x + 1, y] < g[x, y]) then
            g[x, y] ← g[x + 1, y] + 1
        endif
    endfor
EndFor
Algorithm 6. Meijster parallel version [2nd Step]
For (x = t, x < n, x = x + t_{max}) do
    q = 0; s[0] = 0;
    t[0] = 0;
    /* First part */
    for (u = 1) to (m - 1) do
        while (q ≥ 0) ∧ (f((x, t[q]), s[q]) > f((x, t[q]), u)) do
            q ← (q - 1)
        end while
        if (q < 0) then
            q ← 0
            s[0] ← u
        else
            w ← Sep(s[q], u, x) + 1
            if (w < m) then
                q ← (q + 1)
                s[q] ← u
                t[q] ← w
            endif
        endif
    endfor
    /* Second part */
    for (u = m - 1) downto (0) do
        Edt[x, u] ← f((x, u), s[q])
        if (u = t[q]) then q ← (q - 1)
        endif
    endfor
EndFor
Algorithm 7. Adapted version of thinning algorithm
for each point x of input do
    if (input[x] is destructible) then
        push(x, stack 1)
    endif
endfor
output ← input
While (stack 1 ≠ ∅) ∧ (max_{iter} > 0) do
    While (stack 1 ≠ ∅) do
        x ← pop(stack 1)
        if (output[x] is destructible) then
            output[x] ← reduce_pt(x)
            push(x, stack 2)
        endif
    endwhile
    While (stack 2 ≠ ∅) do
        x ← pop(stack 2)
        v ← neighbors(x)
        i ← 0
        While (i < 8) do
            if (v[i] ∉ stack 1) then
                push(v[i], stack 1)
            endif
            i ← i + 1
        endwhile
    endwhile
    max_{iter} ← max_{iter} - 1
Endwhile
References
1. Taubin G: Curve and surface smoothing without shrinkage. Proceedings of ICCV'95 1995, 852-857.
2. Liu X, Bao H, Shum HY, Peng Q: A novel volume constrained smoothing method for meshes. Graphical Models 2002, 64: 169-182. 10.1006/gmod.2002.0576
3. Asano A, Yamashita T, Yokozeki S: Active contour model based on mathematical morphology. ICPR 1998, 98: 1455-1457.
4. Leymarie F, Levine MD: Curvature morphology. Proceedings of Vision Interface 1989, 102-109.
5. Couprie M, Bertrand G: Topology preserving alternating sequential filter for smoothing 2D and 3D objects. J Electron Imaging 2004, 13: 720-730. 10.1117/1.1789986
6. Kong TY, Rosenfeld A: Digital topology: introduction and survey. Comput Vision Graphics Image Process 1989, 48: 357-393. 10.1016/0734-189X(89)90147-3
7. Sternberg SR: Grayscale morphology. Comput Vision Graphics Image Understanding 1986, 35: 333-355. 10.1016/0734-189X(86)90004-6
8. Serra J: Image Analysis and Mathematical Morphology. In Theoretical Advances. Volume II. Academic Press, New York; 1988. Chap. 10
9. Bertrand G: Simple points, topological numbers and geodesic neighbourhoods in cubic grids. Pattern Recognition Letters 1994, 15: 1003-1011. 10.1016/0167-8655(94)90032-9
10. Danielsson PE: Euclidean distance mapping. Computer Graphics and Image Processing 1980, 14: 227-248. 10.1016/0146-664X(80)90054-4
11. Bertrand G, Everat JC, Couprie M: Topological approach to image segmentation. SPIE Vision Geometry V 1996, 2826: 65-76.
12. Couprie M, Bezerra FN, Bertrand G: Topological operators for greyscale image processing. Journal of Electronic Imaging 2001, 10: 1003-1015. 10.1117/1.1408316
13. Bertrand G, Everat JC, Couprie M: Image segmentation through operators based on topology. Journal of Electronic Imaging 1997, 6: 395-405. 10.1117/12.276856
14. Bertrand G: On topological watersheds. J Math Imaging Vision 2005, 22: 217-230. 10.1007/s10851-005-4891-5
15. Wilkinson MHF, Gao H, Hesselink WH, Jonker J, Meijster A: Concurrent computation of attribute filters on shared memory parallel machines. Trans Pattern Anal Mach Intell 2007, 1800-1813.
16. Seinstra FJ, Koelma D, Geusebroek JM: A software architecture for user transparent parallel image processing. International Euro-Par conference 2001, 2150: 653-662.
17. Meijster A, Roerdink JBTM, Hesselink WH: A general algorithm for computing distance transforms in linear time. In Mathematical Morphology and its Applications to Image and Signal Processing. Kluwer Academic Publishers, Dordrecht; 2000:331-340.
18. Natarajan S, ed: Imprecise and Approximate Computation. Kluwer, Boston; 1995.
19. Van Tilborg AM, Koob GM, eds: Foundations of Real-Time Computing: Scheduling and Resources Management. Kluwer, Boston; 1991.
20. Leung J, Zhao H: Real-time scheduling analysis report. Department of Computer Science, New Jersey Institute of Technology 2005.
21. Kolivas C: RSDL completely fair starvation free 64 interactive cpu scheduler. lwn.net 2007.
22. Molnar I: Modular scheduler core and completely fair scheduler. lwn.net 2007.
23. Kahn G: The semantics of a simple language for parallel programming. In Proceedings of the IFIP Congress 74. North-Holland Publishing Co., Amsterdam; 1974.
24. Nikolov H, Thompson M, Stefanov T, Pimentel AD, Polstra S, Bose R, Zissulescu C, Deprettere EF: Daedalus: Toward Composable Multimedia MPSoC Design, invited paper. In Proceedings of the ACM/IEEE International Design Automation Conference (DAC '08), Anaheim, USA; 2008:574-579.
25. Halfhill T: Ambric's new parallel processor. Microprocessor Report 2006.
26. Giavitto JL, Sansonnet JP: Introduction à 8 1/2. Rapport interne, LRI Orsay 1994.
27. Giavitto JL, Sansonnet JP: 8 1/2: data-parallélisme et data-flow. Techniques et Sciences Informatiques 1993, 12(5).
28. Mahiout A, Giavitto JL, Sansonnet JP: Distribution and scheduling data-parallel dataflow programs on massively parallel architectures. In SMS-TPE '94: Software for Multiprocessors and Supercomputers. Office of Naval Research USA & Russian Basic Research Foundation, Moscow; 1994.
29. Lemire D: Streaming maximum-minimum filter using no more than three comparisons per element. Nordic J Comput 2006, 13(4):328-339.
30. Shih FY, Wu Y: Fast Euclidean distance transformation in two scans using a 3 × 3 neighborhood. Comput Vis Image Understanding 2004, 94: 195-205.
31. Cuisenaire O, Macq B: Fast Euclidean distance transformation by propagation using multiple neighborhoods. CVIU 1999, 76(2):163-172.
32. Cuisenaire O, Macq B: Fast and exact signed Euclidean distance transformation with linear complexity. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP99) 1999, 3293-3296.
33. Fabbri R, Costa LF, Torelli JC, Bruno OM: 2D Euclidean distance transform algorithms: A comparative survey. ACM Computing Surveys 2008, 40.
34. Mahmoudi R, Akil M, Matas P: Parallel image thinning through topological operators on shared memory parallel machines. Signals, Systems and Computers Conference 2009, 723-730.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Mahmoudi, R., Akil, M. Enhanced computation method of topological smoothing on shared memory parallel machines. J Image Video Proc. 2011, 16 (2011). https://doi.org/10.1186/1687-5281-2011-16
DOI: https://doi.org/10.1186/1687-5281-2011-16
Keywords
 Shared Memory
 Total Execution Time
 Simple Point
 Parallelization Strategy
 Merging Algorithm