# Enhanced computation method of topological smoothing on shared memory parallel machines

- Ramzi Mahmoudi^{1} (corresponding author)
- Mohamed Akil^{1}

**2011**:16

**DOI: **10.1186/1687-5281-2011-16

© Mahmoudi and Akil; licensee Springer. 2011

**Received: **1 March 2011

**Accepted: **27 October 2011

**Published: **27 October 2011

## Abstract

To prepare images for better segmentation, preprocessing operations such as smoothing are needed to reduce noise. In this paper, we present an enhanced computation method for smoothing 2D objects in binary images. Unlike existing approaches, the proposed method provides parallel computation and better memory management, while preserving the topology (number of connected components) of the original image by using homotopic transformations defined in the framework of digital topology. We introduce an adapted parallelization strategy, called split, distribute and merge (SDM), which allows efficient parallelization of a large class of topological operators. To achieve a good speedup and better memory allocation, we paid particular attention to task scheduling and management. The work distributed during the smoothing process is done by a variable number of threads. Tests on a 2D grayscale image (512 × 512), using a shared memory parallel machine (SMPM) with 8 CPU cores (2× Xeon E5405 running at 2 GHz), showed a speedup of 5.2 with a cache success rate of 70%.

## 1. Introduction

Smoothing filters are a method of choice for image preprocessing and pattern recognition. For example, the analysis or recognition of a shape is often perturbed by noise, so smoothing object boundaries is a necessary preprocessing step. Also, when warping binary digital images, we obtain a crenellated result that must be smoothed for better visualization. The smoothing procedure can also be used to extract some shape characteristics: by taking the difference between the original and the smoothed object, salient or carved parts can be detected and measured.

Shape smoothing has been extensively studied and many approaches have been proposed. The most popular one is linear filtering by Laplacian smoothing for 2D vectors [1] and 3D meshes [2]. Another approach, morphological filtering, can be applied directly to the shape [3] or to the curvature plot of the object's contour [4]. Unfortunately, none of these operators preserves the topology (number of connected components) of the original image. In 2004, our team introduced a new method for smoothing 2D and 3D objects in binary images while preserving topology [5]. Objects are defined as sets of grid points, and topology preservation is ensured by the exclusive use of homotopic transformations defined in the framework of digital topology [6]. Smoothness is obtained by the use of morphological openings and closings by metric discs or balls of increasing radius, in the manner of alternating sequential filters [7]. That work addressed two major issues, namely preserving the topology and handling the multitude of objects in the scene to be smoothed, but it did not consider the memory management, latency or throughput of the filter. This paper describes an enhanced computation method for the topological smoothing filter that ensures better performance. We also present a new parallelization strategy, called Split, Distribute and Merge (SDM). Our strategy is designed specifically for the parallelization of topological operators on shared memory architectures. The new strategy is based upon the exclusive combination of two patterns: divide and conquer and event-based coordination.

This paper is organized as follows. In section 2, some basic notions of topological operators are summarized and the original smoothing filter is introduced. In section 3, the adopted parallelization strategy is introduced: we define the class of operators that our strategy may cover, give our motivations for using shared memory parallel machines, and discuss thread coordination and task scheduling. In section 4, the new parallel smoothing method is introduced, and evaluations of acceleration, efficiency and cache memory access success rate are presented and discussed. Finally, we conclude with a summary and future work in section 5.

## 2. Theoretical background

In this section, we recall some basic notions of digital topology [6] and mathematical morphology for binary images [8]. We define also the homotopic alternating sequential filters [5]. For the sake of simplicity, we restrict ourselves to the minimal set of notions that will be useful for our purpose. We start by introducing morphological operators based on structuring elements which are balls in the sense of Euclidean distance, in order to obtain the desired smoothing effect.

We denote by ℤ the set of integers, and by *E* the discrete plane ℤ^{2}. A point *x* ∈ *E* is defined by (*x*_{1}, *x*_{2}) with *x*_{i} ∈ ℤ. Let *x* ∈ *E* and *r* ∈ ℤ; we denote by *B*_{r}(*x*) the ball of radius *r* centred on *x*, defined by *B*_{r}(*x*) = {*y* ∈ *E*, *d*(*x*, *y*) ≤ *r*}, where *d* is a distance on *E*. We denote by *B*_{r} the map which associates to each *x* in *E* the ball *B*_{r}(*x*). The Euclidean distance *d* on *E* is defined by *d*(*x*, *y*) = [A^{2} + B^{2}]^{1/2} with A = (*x*_{1} − *y*_{1}) and B = (*x*_{2} − *y*_{2}).

An operator on *E* is a mapping from *P*(*E*) into *P*(*E*), where *P*(*E*) denotes the set of all subsets of *E*. Let *r* be an integer; the dilation by *B*_{r} is the operator *δ*_{r} defined by *δ*_{r}(*X*) = ∪_{x ∈ X} *B*_{r}(*x*), ∀*X* ∈ *P*(*E*). The ball *B*_{r} is termed the structuring element of the dilation. The erosion by *B*_{r} is the operator *ε*_{r} defined by duality: *ε*_{r} = \**δ*_{r}, that is, $\varepsilon_r(X) = \overline{\delta_r(\overline{X})}$.
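The definitions of the ball, the dilation and the erosion by duality can be illustrated with a minimal Python sketch. The function names `ball`, `dilation` and `erosion` are hypothetical, and the complement is taken inside a finite `domain` (an assumption made here so that the computation terminates; points outside the domain are simply ignored).

```python
from itertools import product

def ball(center, r):
    """Euclidean ball B_r(x): grid points y with d(x, y) <= r."""
    cx, cy = center
    return {(cx + dx, cy + dy)
            for dx, dy in product(range(-r, r + 1), repeat=2)
            if dx * dx + dy * dy <= r * r}

def dilation(X, r):
    """delta_r(X): union of the balls B_r(x) for all x in X."""
    out = set()
    for x in X:
        out |= ball(x, r)
    return out

def erosion(X, r, domain):
    """epsilon_r(X) by duality: complement of the dilated complement.
    Assumption: the complement is restricted to the finite set `domain`."""
    return domain - dilation(domain - X, r)

# Usage: erode a 5 x 5 square placed inside a 9 x 9 domain.
domain = set(product(range(9), repeat=2))
square = set(product(range(2, 7), repeat=2))
core = erosion(square, 1, domain)
```

With a radius of 1, the ball reduces to the point and its four direct neighbours, so the erosion above strips the one-pixel border of the square.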

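Recall that *E* = ℤ^{2}. Let us consider the two neighbourhood relations Γ_{4} and Γ_{8} defined for each point *x* ∈ *E* by:

$\Gamma_4(x) = \{y \in E;\ |y_1 - x_1| + |y_2 - x_2| \le 1\}$ and $\Gamma_8(x) = \{y \in E;\ \max(|y_1 - x_1|, |y_2 - x_2|) \le 1\}$.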

In the general case, we define ${\Gamma}_{n}^{*}\left(x\right)={\Gamma}_{n}\left(x\right)\backslash \left\{x\right\}$ with *n* ∈ {4, 8}. Thus *y* is said to be n-adjacent to *x* if $y\in {\Gamma}_{n}^{*}\left(x\right)$. We also say that two points *x* and *y* of *X* are n-connected in *X* if there is an n-path between these two points. The equivalence classes for this relation are the n-connected components of *X*. A subset *X* of *E* is said to be n-connected if it consists of exactly one n-connected component. The set of all n-connected components of *X* which are n-adjacent to a point *x* is denoted by *C*_{n}[*x*, *X*]. In order to have a correspondence between the topology of *X* and the topology of $\overline{X}$, we use n-adjacency for *X* and $\overline{n}$-adjacency for $\overline{X}$, with $\left(n,\overline{n}\right)$ equal to (8, 4) or (4, 8).

Informally, a simple point *p* of a discrete object *X* is a point which is inessential to its topology. In other words, we can remove *p* from *X* without changing its topology. A point *x* ∈ *X* is said to be simple if each n-component of *X* contains exactly one n-component of *X*\{*x*} and if each $\overline{n}$-component of $\overline{X}\cup \left\{x\right\}$ contains exactly one $\overline{n}$-component of $\overline{X}$. Let *X* ⊂ *E* and *x* ∈ *E*; the two connectivity numbers are defined as follows (#*X* = cardinality of *X*): $T\left(x,X\right)=\#{C}_{n}\left[x,{\Gamma}_{8}^{*}\left(x\right)\cap X\right]$; $\overline{T}\left(x,X\right)=\#{C}_{\overline{n}}\left[x,{\Gamma}_{8}^{*}\left(x\right)\cap \overline{X}\right]$.

The following property allows us to characterize simple points locally [6, 9], and hence to implement topology-preserving operators efficiently:

*x* ∈ *E* is simple for *X* ⊆ *E* ↔ *T*(*x*, *X*) = 1 and $\overline{T}\left(x,X\right)=1$.
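The local characterization above can be sketched in Python as follows. This is only an illustration under stated assumptions: the object *X* is represented as a set of `(row, column)` tuples, and the helper names `connectivity_numbers` and `is_simple` are hypothetical. The sketch counts, inside the 8-neighbourhood of *x*, the components of the object (for n-adjacency) and of the background (for the dual adjacency) that are adjacent to *x*.

```python
N8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
N4 = [(-1, 0), (0, -1), (0, 1), (1, 0)]

def _components_adjacent(points, n):
    """Count the n-connected components of `points` (offsets taken in the
    8-neighbourhood of the centre) that are n-adjacent to the centre (0, 0)."""
    steps = N8 if n == 8 else N4
    seen, count = set(), 0
    for start in points:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:                       # flood fill of one component
            p = stack.pop()
            comp.add(p)
            for d in steps:
                q = (p[0] + d[0], p[1] + d[1])
                if q in points and q not in seen:
                    seen.add(q)
                    stack.append(q)
        if any(p in steps for p in comp):  # component touches the centre
            count += 1
    return count

def connectivity_numbers(x, X, n=8):
    """Return (T, T_bar) for point x and object X (a set of points)."""
    n_bar = 4 if n == 8 else 8
    obj = {d for d in N8 if (x[0] + d[0], x[1] + d[1]) in X}
    bg = set(N8) - obj
    return _components_adjacent(obj, n), _components_adjacent(bg, n_bar)

def is_simple(x, X, n=8):
    """x is simple for X iff T = 1 and T_bar = 1."""
    return connectivity_numbers(x, X, n) == (1, 1)
```

For instance, the end point of a two-pixel segment is simple, whereas an isolated point (*T* = 0), an interior point ($\overline{T}$ = 0) and the middle point of a three-pixel line (*T* = 2) are not.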

The homotopic alternating sequential (HAS) filter takes an object *X* and a control image *C* as input, and smooths *X* while respecting its topology and the geometrical constraints implicitly represented by *C*. A simple illustration is given in Figure 1. The smoothed image (b) is obtained using the HAS filter with a radius equal to five and 4-connectedness (Γ_{4}). More examples can be found in [5].

Let *C* ⊆ *X*, *r* ∈ ℕ and $D\subseteq \overline{X}$, with *X* any finite subset of *E*. The homotopic alternating sequential filter (*HASF*) of order *n*, with constraint sets *C* and *D*, is defined as follows:

Here $H{C}_{n}^{C}$ refers to the homotopic cutting of *X* by *B*_{n} with constraint set *C*, and $H{F}_{n}^{D}$ refers to the homotopic filling of *X* by *B*_{n} with constraint set *D*. These two homotopic operators can be defined as follows:

We recall that *H*(*Z*, *W*) is a homotopic constrained thinning operator. It gives the ultimate skeleton of *Z* constrained by *W*. The ultimate skeleton is obtained by selecting simple points in increasing order of their distance to the background, thanks to a pre-computed Euclidean distance map [10]. We also recall that *H*\*(*Y*, *V*) is a homotopic constrained thickening operator. It thickens the set *Y* by iterative addition of points which are simple for $\overline{Y}$ and belong to the set *V*, until stability.

In this section, we have provided the theoretical underpinnings for studying topological transforms. We have also introduced the homotopic alternating sequential filter, which forms the basis of topological smoothing.

## 3. Parallelization Strategy

In this section, we start by defining the class of topological operators. We also present our motivation to parallelize these algorithms on parallel shared memory machines. Then, we will introduce different steps of our approach after making a brief classification over existing strategies. We will focus especially on distribution phase and tasks scheduling over different processors. Scheduling and merging algorithms are presented and discussed. To illustrate both algorithms, scenarios are also introduced and discussed.

### 3.1 Class of topological algorithms

In 1996, Bertrand and Couprie [11] introduced connectivity numbers for grayscale images. These numbers describe locally (in a 3 × 3 neighborhood) the topology of a point, so that any point can be characterized according to its topology. They also introduced elementary operations able to modify the gray level of a point without modifying the image topology. These elementary point-characterization operations are the fundamental link between a large class of topological operators including, mainly, skeletonization and crest restoring algorithms [12]. This class can also be extended, under conditions, to homotopic kernel and leveling kernel transformations [13], the topological watershed algorithm [14] and the topological smoothing algorithm [5], which is the subject of this article. All these algorithms also share many structural similarities. First, the associated characterization procedures evolve until stability, which induces a common recursion between the different algorithms. Second, the grey level of any point can be lowered or raised more than once. Finally, all these algorithms use a pixel array as input and output data structure. It is important to mention that, to date, this class has not been efficiently parallelized, unlike other classes such as the connected filters of morphological operators, which were recently parallelized in Wilkinson's work [15]. The parallelization strategy proposed by Seinstra [16] for local operators and point-to-point operators can also be cited as an example. For global operators, Meijster's strategy [17] has also proved effective. Hence the need for a common parallelization strategy for topological operators that offers an adapted algorithm structure design space. The algorithm structure patterns chosen for the design must be suitable for SMP machines.

In practice, although the cost of communication (memory-processor and inter-processor) is fairly high, shared memory architectures meet our needs for several reasons: (i) these architectures allow immediate sharing of data, which is very helpful in the design of any parallelization strategy; (ii) they are non-dedicated architectures using standard components (processor, memory...) and are therefore economical; (iii) they offer some flexibility of use in many application areas, in particular image processing.

### 3.2 Split Distribute and Merge Strategy

#### 3.2.1 Split phase

The Divide and Conquer pattern is applied first, by recursively breaking down the problem into two or more sub-problems of the same type until these become simple enough to be solved directly. Splitting the original problem takes into account, in addition to the original algorithm's characteristics (mainly topology preservation), the mechanisms by which data are generated, stored, transmitted over networks (processor-processor or memory-processor), and passed between different stages of computation.

#### 3.2.2 Distribute phase

Work distribution is a fundamental step in fully exploiting the potential of multi-core architectures. We start by briefly recalling some basic notions of distribution techniques, then we introduce our minimal distribution approach, which is particularly suitable for recursive topological algorithms where simple point characterization is necessary. Our approach is general and applicable to shared memory parallel machines. Critical cases are also introduced and discussed.

We propose a novel task scheduling approach to prevent improper load distribution while improving total execution time, see Figure 3(b). In the literature, there are several schedulers that provide a balanced distribution of tasks, such as RSDL "Rotating Staircase Deadline" [21], which incorporates a foreground-background descending priority system (the staircase) with a run-queue managing minor and major epochs (rotation and deadline). Another scheduler, CFS "Completely Fair Scheduler" [22], has also proved effective. It handles resource allocation for executing processes, and aims to maximize overall CPU utilization while maximizing interactive performance. These schedulers are based on the principle of task uniformity: with homogeneous tasks, a better distribution can be achieved and the total execution time reduced. Unfortunately, these schedulers are not available in all operating system versions, especially for small systems. Based on the same principle of task uniformity, we propose a new scheduling algorithm that is simpler to implement and better adapted to the implementation of topological algorithms.

Let 'Basic-NPS' be a basic non-preemptive scheduler, where *T* = {*t*_{1}, *t*_{2},..., *t*_{k}} is the set of all tasks, *T*_{T} = {*t*_{1}, *t*_{2}, ..., *t*_{i}} is the set of tasks to process with *T*_{T} ⊂ *T*, *P* = {*p*_{1}, *p*_{2},..., *p*_{n}} is the set of all processors and *P*_{a} = {*p*_{1}, *p*_{2},..., *p*_{j}} is the set of available processors with *P*_{a} ⊂ *P*.

Basic-NPS (*T*_{x} ⇒ *P*_{y}) is able to schedule a set of tasks *T*_{x} on a set of processors *P*_{y}. Let {*p*} be the maximum number of processors that *P*_{y} can contain. Then {*p*} can be defined as the maximum number of available processors, given by the set *P*_{a}: {*p*} = max *p*_{j} / *p*_{j} ∈ *P*_{a}. While ([*P*_{a} ≠ ∅] ∧ [*T*_{T} ≠ ∅]), we have *T*_{x} ⇒ *P*_{y} with *T*_{x} ∈ *T*_{T} and *P*_{y} ∈ *P*_{a}. In this scheduler, each processor treats at most $m=\max \left\{{t}_{i}\,/\,{t}_{i}\to {p}_{j}\right\}\le \max \left(\frac{\mid T\mid }{\mid P\mid }\right)$ tasks, with *j* ∈ {1, 2,..., *n*}. Then the worst case to process *T* is $K\left(T\right)=\max \left\{\underset{i}{\max }\,{T}_{T}\to {p}_{1},\dots ,\underset{j<\dots <i}{\max }\,{T}_{T}\to {p}_{k}\right\}$. As proof, suppose that there exists a set *L*(*T*) such that $\sum L\left(T\right)\ge \sum K\left(T\right)$. As 'Basic-NPS' manages both *L*(*T*) and *K*(*T*), we have |*L*(*T*)| ≤ *m* and |*K*(*T*)| ≤ *m*. Thus, if $\sum L\left(T\right)\ge \sum K\left(T\right)$, then there exists at least one task *l* and a task *k* ∈ *K*(*T*) such that (*A* ∧ *B* ∧ *C*), with *A* = (*l* ∈ *L*(*T*)), *B* = (*l* ∉ *K*(*T*)) and *C* = (*l* > *k*). This is impossible according to the definition of *K*(*T*), which was defined as the worst case.

Algorithm 1 describes the 'Basic-NPS' policy. The first step consists in asking the operating system for the number of available processors. Depending on this number, the algorithm generates processes: one active process is assigned to each available processor. These new processes belong to the SCHED_FIFO class, so that they are not preempted by time slicing and, especially, so that context switching is avoided. A process only stops running when its work is complete or, less frequently, when another process of the same class with a higher priority requests the processor. The global execution stops when there are no more tasks to process.
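The policy described above can be approximated by the following Python sketch. It is not the authors' Algorithm 1: the name `basic_nps` is hypothetical, worker threads stand in for the processes, and no real-time scheduling class is requested (setting SCHED_FIFO generally requires privileges). What the sketch does keep is one worker per available processor, tasks that run to completion once started, and a cap of ⌈|T|/|P|⌉ tasks per worker.

```python
import os
import queue
import threading

def basic_nps(tasks, n_workers=None):
    """Run callables in `tasks` on one worker per available processor.
    Returns (results, tasks_done_per_worker)."""
    n_workers = n_workers or os.cpu_count() or 1
    limit = -(-len(tasks) // n_workers)     # ceil(|T| / |P|)
    pending = queue.Queue()
    for index, task in enumerate(tasks):
        pending.put((index, task))
    results = [None] * len(tasks)
    per_worker = [0] * n_workers

    def worker(w):
        # Each worker takes at most `limit` tasks and never interrupts
        # a task once it has started it (non-preemptive at task level).
        while per_worker[w] < limit:
            try:
                index, task = pending.get_nowait()
            except queue.Empty:
                return                      # no more tasks to process
            results[index] = task()
            per_worker[w] += 1

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, per_worker

# Usage: ten small tasks on three workers.
results, load = basic_nps([lambda i=i: i * i for i in range(10)], n_workers=3)
```

Because the total capacity `n_workers * limit` is at least the number of tasks, every task is eventually picked up even though each worker stops at its cap.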

#### 3.2.3 Merging phase

The key problem of any parallelization is merging the results obtained. Normally this phase is done at the end of the process, when all results have been returned by all threads, which usually means that only one output variable is declared and shared between all competing threads. But, as mentioned in section 3.1, we are dealing with a dynamic evolution, and if we take into account the different steps of simple point detection and pixel characterization, we can plan the following. The original shared data structure, containing all pixels, is divided into *n* search zones {*z*_{1}, *z*_{2}, ..., *z*_{n}}. We associate one thread from the list {*T*_{1}, *T*_{2},..., *T*_{n}} with each zone. Each thread can browse its zone freely; if it detects a target pixel type, it lowers the characterized pixel and pushes its eight neighbors into one of the available FIFO queues. A queue is said to be available if only one thread (its owner) is using it. A queue cannot be shared by more than two threads, so if no queue is available, a thread can create a new one and become its owner.

As soon as two threads have finished, they merge directly, a new thread is created and the same process is launched again. The newly created thread inherits the queue shared between its parents, so it can restart the search. It is also important to mention that there is no hierarchical order in thread merging: the only criterion is finishing time. We also mention that a neighbor cannot be inserted twice, a precaution that minimizes cache consumption. A more formal description of the merging technique is given by Algorithm 2.

It is important to highlight the similarities and differences between our merging algorithm and Kahn process networks (KPN) [23]. Both are deterministic and do not depend on execution order. However, a KPN may be executed sequentially or in parallel with the same outcome, while our merging algorithm is designed only for parallel execution. KPN supports recurrence and recursion, while our merging algorithm supports only recursion.

In large-scale applications, KPN has proved effective. Examples include the Daedalus project [24], where generated KPN models are used to map processes onto FPGA architectures. Ambric architectures [25] also implement a KPN model, using bounded buffers to create massively parallel DMP machines based on a structural object programming model.

In a narrower framework limited to simple point characterization, the implementation of such a model will be very expensive and it would be better to find an easier and more specific algorithm.

Consider, for example, a shared data structure divided into four zones {*z*_{1}, *z*_{2}, *z*_{3}, *z*_{4}}. Threads {*T*_{1}, *T*_{2}, *T*_{3}, *T*_{4}} start browsing the different zones in parallel. *T*_{1} is the first to detect a target point (constructible, destructible...), so it lowers the characterized pixel (in *z*_{1}) and pushes its eight (or four) neighbors into the FIFO queue *F*_{1}, which it has created, before continuing to browse. Later, *T*_{3} detects a new target point, so it lowers the characterized pixel (in *z*_{3}) and then pushes its neighbors into *F*_{1} before continuing to browse. *T*_{3} does not need to create a new FIFO queue since *F*_{1} is available. *T*_{1} and *T*_{3} repeat this procedure twice. Once they have finished browsing, they merge and a new thread *T*_{5} is born. *T*_{5} starts browsing only *F*_{1}. When it detects a new target point, it lowers the characterized pixel (in *z*_{5} = *z*_{1} + *z*_{3}) and then pushes its neighbors into *F*_{3}, which it has created, before continuing to browse. Similarly, *T*_{2} and *T*_{4} lead to the creation of *F*_{2} and *T*_{6}. *T*_{6} eventually merges with *T*_{5} to give birth to *T*_{7}. Finally, there is a single thread *T*_{7}, which browses *F*_{3} without detecting any target points.
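The merging order in this scenario depends only on finishing times, which the following small simulation tries to make explicit. It is a simplified model, not the authors' Algorithm 2: queues and pixels are not represented, the name `simulate_merging` is hypothetical, and every merged thread is assumed to need the same fixed time `merged_duration` to browse the queue it inherits.

```python
import heapq

def simulate_merging(finish_times, merged_duration=1.0):
    """finish_times maps a thread id to the time at which it finishes
    browsing its zone. Threads merge pairwise in finishing order.
    Returns (list of (parent_a, parent_b, child), last thread id)."""
    events = [(t, tid) for tid, t in finish_times.items()]
    heapq.heapify(events)
    next_id = max(finish_times) + 1
    waiting = None            # a finished thread waiting for a partner
    merges = []
    while events:
        now, tid = heapq.heappop(events)
        if waiting is None:
            waiting = tid
        else:
            # Two finished threads merge; the child inherits their queue
            # and finishes after an assumed fixed browsing time.
            merges.append((waiting, tid, next_id))
            heapq.heappush(events, (now + merged_duration, next_id))
            waiting = None
            next_id += 1
    return merges, waiting

# Usage: T1 and T3 finish first, then T2 and T4, as in the scenario above.
merges, last = simulate_merging({1: 1.0, 3: 2.0, 2: 3.0, 4: 4.0},
                                merged_duration=5.0)
```

With these assumed times, the simulation reproduces the sequence of the scenario: *T*_{1} and *T*_{3} give *T*_{5}, *T*_{2} and *T*_{4} give *T*_{6}, and *T*_{5} and *T*_{6} give the final thread *T*_{7}.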

In this section, we have introduced the three steps needed to implement our parallelization strategy (SDM). It is important to mention that, from a structural perspective, some similarity may exist between our split/merge phases and the alpha-extension/beta-reduction phases. Both approaches are intended to provide stronger guarantees that the parallelism will actually be achieved, but their contexts of use are different. Jean Paul Sansonnet's team [26–28] introduced the notions of alpha-extension (diffusion) and beta-reduction (merging) for stream manipulation in the framework of a declarative data-parallel language definition, and their techniques cannot be applied without a scalar function. Our proposal, in contrast, is restricted to topological characterization in the framework of the parallelization of topological operators, and no scalar function is required during the application of these two phases.

## 4. Parallel smoothing filter

In this section, we start by analyzing the overall structure of the original algorithm. We then continue with the parallelization of the Euclidean distance, thinning and thickening algorithms. We conclude with a performance analysis of the entire topological smoothing operator: the execution time, efficiency, speedup and cache misses obtained are introduced and discussed.

Share of execution time of the Euclidean distance (EucDis) and topological characterization (TopCar) functions

| Function | 200 × 200, r = 5 | 200 × 200, r = 10 | 200 × 200, r = ∞ | 168 × 288, r = 5 | 168 × 288, r = 10 | 168 × 288, r = ∞ |
|---|---|---|---|---|---|---|
| EucDis (%) | 64.44 | 54.93 | 46.67 | 59.25 | 49.79 | 35.25 |
| TopCar (%) | 8.89 | 13.89 | 18.15 | 11.58 | 16.50 | 24.03 |

### 4.1 Euclidean distance computing

#### 4.1.1 Study on Euclidean distance algorithms

Cuisenaire also proposes a second algorithm, called PMN "Propagation Using Multiple Neighborhood", that uses eight neighbors. In [32], he proposes a third algorithm with *o*(*n*^{3/2}) complexity, which offers an accurate computation of the Euclidean distance. The only drawback of this third algorithm is its computation time, which is very high and exceeds that of the two algorithms mentioned above. Even though the computing error produced by PSN is greater than that produced by PMN, it is comparable to that produced by 4SED. Its low data dependence and its ability to operate on 3D images make the PSN algorithm a potential candidate to replace 4SED.

Meijster [17] proposes an algorithm to compute the exact Euclidean distance. The algorithm complexity is *o*(*n*) and it operates in two independent, but successive, steps. The first step scans the columns and computes the distance between each point and the existing objects. The second step applies the same treatment to the lines. It is important to note that the strong independence between the processing steps and a computing error equal to zero make the Meijster algorithm another potential candidate to replace 4SED. The algorithm is also able to operate on 3D images. A theoretical analysis of the Meijster and Cuisenaire algorithms can be found in Fabbri's work [33].

In the following, we propose a first analysis based on implementations of the different algorithms in order to compare them. We implemented the 4SED algorithm using a fixed-size stack. This stack uses a FIFO queue and is small, since the 4SED algorithm does not need to store a temporary image: results are stored directly in the output image. We retain this implementation because the 4SED assessment serves only as a reference for comparison. For the PSN implementation, we used stacks with dynamic sizes. Memory is allocated in small blocks defined at stack creation. When an object is added to the queue, the algorithm uses the available memory of the last block. If no space is available, a new block is allocated automatically. The block size is proportional to the image size (N × M/100). Finally, we used a simple memory structure to implement the Meijster algorithm: a simple matrix was used to compute the distance between the points and the object in each column, and three vectors were used to compute the distance in each line. We recall that this comparison is done in order to select the best algorithm among the three candidates.

#### 4.1.2 Parallelization of Meijster algorithm

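Let *I* be an input image with *m* columns and *n* rows. We denote by *B* an object included in *I*. The idea is to compute, for each point *p* ∈ *I* ∧ *p* ∉ *B*, the distance separating *p* from the closest point *b* ∈ *B*, with *b* = (*b*_{x}, *b*_{y}), 0 ≤ *b*_{x} ≤ *n* and 0 ≤ *b*_{y} ≤ *m*. This amounts to computing the following matrix:

$EDT\left(p\right)=\min \left\{{\left({p}_{x}-{b}_{x}\right)}^{2}+{\left({p}_{y}-{b}_{y}\right)}^{2}:b=\left({b}_{x},{b}_{y}\right)\in B\right\}$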

If we assume that the minimum distance of an empty group *K* is ∞ and that ∀ *z* ∈ *K* we have (*z*_{y} + ∞) = ∞, then the *EDT*(*p*) formula can be written as follows: ∀ *b*_{x} < *n*, ∀ *b*_{y} ≤ *m*, $EDT\left(p\right)=\underset{{b}_{y}}{\min }\left[{\left({p}_{y}-{b}_{y}\right)}^{2}+G{\left({p}_{x},{b}_{y}\right)}^{2}\right]$ with $G\left({p}_{x},y\right)=\min \left\{\mid {p}_{x}-{b}_{x}\mid :b=\left({b}_{x},y\right)\in B\right\}$.

Thus we can split the Euclidean distance transform procedure into two steps. The first step scans the columns and computes *G* for each column *y*. The second step consists in repeating a similar procedure for each line. We now detail these two steps. In the first step, $G\left({p}_{x},y\right)$ can be computed through the two following sub-functions: ${G}_{T}\left({p}_{x},y\right)=\min \left\{{p}_{x}-{b}_{x}:b=\left({b}_{x},y\right)\in B,\ 0\le {b}_{x}\le {p}_{x}\right\}$ and ${G}_{B}\left({p}_{x},y\right)=\min \left\{{b}_{x}-{p}_{x}:b=\left({b}_{x},y\right)\in B,\ {p}_{x}\le {b}_{x}\le n\right\}$. To compute them, we scan each column *y* from top to bottom for ${G}_{T}$ and from bottom to top for ${G}_{B}$, using the two following formulas for points outside *B* (both values are 0 on *B*): ${G}_{T}\left({p}_{x},y\right)={G}_{T}\left({p}_{x}-1,y\right)+1$ and ${G}_{B}\left({p}_{x},y\right)={G}_{B}\left({p}_{x}+1,y\right)+1$, so that $G=\min \left({G}_{T},{G}_{B}\right)$. The sequential algorithm of the first step follows directly from these two scans. Its complexity order is *o*(*n* × *m*).

Let us move to the second step. We start by defining $f\left(p,y\right)={\left({p}_{y}-y\right)}^{2}+G{\left({p}_{x},y\right)}^{2}$. Then we can define $EDT\left(p\right)=\min \left\{f\left(p,y\right):0\le y\le m\right\}$. In each row, several consecutive points *p* share the same minimizing column *y*, so we can introduce the concept of "region of a column": the set of points of the row for which that column gives the minimum of *f*.

Let *S* be the set of *y* points such that *f*(*p*, *y*) is minimal and unique. The formula of *S*, ∀ 0 ≤ *y* ≤ *u*, is ${S}_{p}\left(u\right)=\min \left\{y:f\left(p,y\right)\le f\left(p,i\right)\right\}$, ∀ 0 ≤ *i* ≤ *u* ∧ *u* ≤ *m*. Let *T* be the set of points with a coordinate greater than, or equal to, the horizontal coordinate of the intersection with a region: ${T}_{p}\left(u\right)=Se{p}_{{p}_{x}}\left({S}_{p}\left(u-1\right),u\right)+1$.

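Let *Sep*(*i*, *u*) be the separation between the regions of *i* and *u*, defined by:

$Se{p}_{{p}_{x}}\left(i,u\right)=\left({u}^{2}-{i}^{2}+G{\left({p}_{x},u\right)}^{2}-G{\left({p}_{x},i\right)}^{2}\right)\ \mathrm{div}\ \left(2\left(u-i\right)\right)$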

Thus lines are processed from left to right, then from right to left. During the first pass, from left to right, two vectors *S* and *T* are created. These two vectors contain, respectively, all regions and all intersections. During the second pass, from right to left, we compute *f* for each value of *S*; *f* is also computed for the respective values of *T*. Algorithm 4 is associated with this second step. For the first pass, the complexity order is *q*+2(*m*-*u*), whereas the complexity order of the second pass is only *m*.
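The two steps can be put together in a short sequential sketch. It follows the structure of Meijster's published two-phase algorithm rather than the authors' own listings, which are not reproduced here; the function name `edt_squared`, the list-of-lists image representation and the use of `rows + cols` as a finite stand-in for ∞ are assumptions. Object pixels are the non-zero entries, and the result holds squared distances.

```python
def edt_squared(image):
    """Squared Euclidean distance from every pixel to the nearest
    non-zero (object) pixel. Assumes the image contains an object."""
    rows, cols = len(image), len(image[0])
    INF = rows + cols                      # finite stand-in for infinity

    # Step 1: scan each column y top-down, then bottom-up, to get G.
    G = [[0] * cols for _ in range(rows)]
    for y in range(cols):
        G[0][y] = 0 if image[0][y] else INF
        for x in range(1, rows):
            G[x][y] = 0 if image[x][y] else min(INF, G[x - 1][y] + 1)
        for x in range(rows - 2, -1, -1):
            if G[x + 1][y] + 1 < G[x][y]:
                G[x][y] = G[x + 1][y] + 1

    # Step 2: scan each line, building regions (s) and intersections (t).
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        g = G[x]

        def f(u, i):
            return (u - i) ** 2 + g[i] ** 2

        def sep(i, u):
            return (u * u - i * i + g[u] ** 2 - g[i] ** 2) // (2 * (u - i))

        q, s, t = 0, [0] * cols, [0] * cols
        for u in range(1, cols):           # left-to-right pass
            while q >= 0 and f(t[q], s[q]) > f(t[q], u):
                q -= 1
            if q < 0:
                q, s[0] = 0, u
            else:
                w = 1 + sep(s[q], u)
                if w < cols:
                    q += 1
                    s[q], t[q] = u, w
        for u in range(cols - 1, -1, -1):  # right-to-left pass
            out[x][u] = f(u, s[q])
            if u == t[q]:
                q -= 1
    return out
```

Each column in step 1 and each line in step 2 only writes its own data, which is the property the parallel version relies on.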

It follows that the values of each column *y* of *G* depend only on lines *p*_{x}, *p*_{x}+1 and *p*_{x}-1. Similarly, at the second stage, we can introduce the following relationship: *EDT*(*p*) = *f*(*p*, *S*_{p}(*q*)).

Then, ∀(0 ≤ *y* ≤ *u*), (0 ≤ *i* ≤ *u*) ∧ (*u* < *m*), ${S}_{p}\left(u\right)=\min \left\{y:f\left(p,y\right)\le f\left(p,i\right)\right\}$. Thus, if *u* = *T*_{p}(*q*), then *q* = (*q*-1), which implies the following: ${T}_{p}\left(u\right)=Se{p}_{{p}_{x}}\left({S}_{p}\left(q\right),u\right)+1$.

According to this formalization, the values of *f*(*p*, *i*) and *Sep*_{p_x}(*i*, *u*) are independent of the modified data. Using two vectors *S* and *T* and a private variable *q* for each line therefore ensures complete independence in writing. We start applying the splitting step by sharing the processing of columns and lines between multiple processors. A thread can process one or more columns, and the number of threads used depends on the number of processors. The results returned by all threads in this first stage are merged before line processing starts. In the following, we introduce the parallel version of the Meijster algorithm for both steps. The associated algorithm complexity is *o*((*n × m*)/*N*), where (*n* × *m*) refers to the image size and *N* to the number of processors.

The proposed parallel version of the Meijster algorithm was implemented in C using OpenMP directives. Speedups for numbers of threads equal to 1, 2, 4, 8, and 16 were determined. The efficiency measure Ψ(*n*), with *n* the number of processors, is given by the following formula: Ψ(*n*) = (sequential time) / (*n* × parallel time).
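As a small worked example of this formula, the sketch below computes speedup and efficiency from a sequential and a parallel time. The function names are hypothetical; the only figures used are the speedups of 5.2 and 1.9 on eight cores reported in this paper, expressed as time ratios.

```python
def speedup(seq_time, par_time):
    """Ratio of sequential to parallel execution time."""
    return seq_time / par_time

def efficiency(seq_time, par_time, n):
    """Psi(n) = sequential time / (n * parallel time)."""
    return seq_time / (n * par_time)

# A speedup of 5.2 on 8 processors gives an efficiency of about 0.65,
# and a speedup of 1.9 on 8 processors gives about 0.24.
psi_nps = efficiency(5.2, 1.0, 8)
psi_smp = efficiency(1.9, 1.0, 8)
```

In other words, efficiency is simply the speedup divided by the number of processors.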

Timings were performed on an eight-core (2× Xeon E5405) shared memory parallel computer, on an Intel quad-core Xeon E5335, on an Intel Core 2 Duo E8400 and on a single-processor Intel Pentium 4 660. The minimum value of 5 timings was taken as most indicative of algorithm speed. More information about the characteristics of these architectures is given in Section 4.

… × 10^{8} using the 4SED algorithm down to 7.6 × 10^{8} ms with the Meijster algorithm. Despite moving from a sequential version running on a single core to a parallel version running on 8 processors, acceleration is only multiplied by 1.6, as shown in Figure 8(a). This can be explained by the choke point between column processing and line processing: the waiting time between these two treatments significantly penalizes acceleration. Figure 8(b) shows that the efficiency varies with the number of threads. It is also proportional to the number of processors. Moving to 3, 5 or 7 threads (an odd number) significantly decreases the efficiency, which reaches its maximum each time the number of threads is equal to the number of processors.

### 4.2 Thinning and thickening computing

The thinning and thickening algorithms are almost the same. The only difference between them is the following: in the thinning algorithm, destructible points are detected and their values are lowered, whereas in the thickening algorithm, constructible points are detected and their values are increased. For parallelization, we apply the same techniques introduced in [34]. We propose a similar version using two loops. Target points are first detected, then their values are lowered or raised according to the appropriate treatment. Their eight (or four) neighbors are copied into a "buffer" and rechecked. This treatment is repeated until stability. In the following, we present an adapted version of Couprie's thinning algorithm.
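The two-loop scheme with a recheck buffer can be sketched for the binary case as follows. This is a sequential illustration, not the authors' adapted algorithm: the names `thin` and `is_simple` are hypothetical, destructible points are taken to be simple points, the object is a set of `(row, column)` tuples with 8-adjacency for the object and 4-adjacency for the background, and points are not ordered by distance as they would be with a distance map.

```python
N8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
N4 = [(-1, 0), (0, -1), (0, 1), (1, 0)]

def _count(points, steps):
    """Components of `points` (neighbourhood offsets) adjacent to the centre."""
    seen, count = set(), 0
    for start in points:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            p = stack.pop()
            comp.add(p)
            for d in steps:
                q = (p[0] + d[0], p[1] + d[1])
                if q in points and q not in seen:
                    seen.add(q)
                    stack.append(q)
        count += any(p in steps for p in comp)
    return count

def is_simple(x, X):
    """T = 1 and T_bar = 1, with (n, n_bar) = (8, 4)."""
    obj = {d for d in N8 if (x[0] + d[0], x[1] + d[1]) in X}
    return _count(obj, N8) == 1 and _count(set(N8) - obj, N4) == 1

def thin(X, constraint=frozenset()):
    """Remove simple points until stability, rechecking only neighbours."""
    X = set(X)
    buffer = sorted(X)                  # first loop: every point is a candidate
    while buffer:                       # second loop: repeat until stability
        next_buffer, queued = [], set()
        for p in buffer:
            if p in X and p not in constraint and is_simple(p, X):
                X.remove(p)             # "lower" the destructible point
                for d in N8:            # its neighbours must be rechecked
                    q = (p[0] + d[0], p[1] + d[1])
                    if q in X and q not in queued:
                        queued.add(q)
                        next_buffer.append(q)
        buffer = next_buffer
    return X
```

On a filled 3 × 3 square this sketch leaves a single point, while on a 3 × 3 ring it stops at a four-point loop around the hole, so the number of components and holes is preserved in both cases.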

To resolve this problem, we propose that research areas assigned to each thread must be composed of at least six lines (of the image). Each thread will use two buffers to treat each three lines thus four buffers are used to treat six lines as shown in Figure 9(e).

With this organization, threads can start running in parallel on Z_{11}, Z_{21} and Z_{31}. Once this processing is completed, threads can restart on Z_{12}, Z_{22} and Z_{32}. In some cases, a neighbor of a destructible point lies on the border of a contiguous area. To prevent such a neighbor from escaping the recheck, it must be injected into the buffer of the thread that owns it. Suppose that a point *p* ∈ Z_{2} is considered destructible by T2, so its value is lowered and its four neighbors {*v*_{1}, *v*_{2}, *v*_{3}, *v*_{4}} should be rechecked. Neighbors {*v*_{1}, *v*_{2}, *v*_{4}} belong to Z_{2}, so they are pushed into T2's buffers. The neighbor {*v*_{3}} belongs to Z_{3}, so it is pushed into T3's buffers.

The next step is to combine the parallel version of Meijster algorithm and the adapted version of Couprie algorithm to build the parallel processing of topological smoothing.

### 4.3 Global analyses

In this section, we present a global evaluation of the parallel smoothing operator. We start by presenting performance evaluations in terms of acceleration and efficiency. Then, we evaluate cache memory consumption.

#### 4.3.1 Execution time

^{8}FI with a single thread down to 1652 × 10^{8} ms with 8 threads. As expected, the speedup of the second implementation, using the 'basic-NPS' scheduler, is higher than that of the one using the "Symmetric Multiprocessing" scheduler, thanks to the balanced distribution of tasks. A remarkable result about speedup is also shown in Figure 12(a): the speedup increases as we increase the number of threads beyond the number of processors in our machine (eight cores). In the first implementation, using the "Symmetric Multiprocessing" scheduler, the speedup at 8 threads is 1.9 ± 0.01. For the second implementation, using our scheduler, the speedup rises to 5.2 ± 0.01. Another result common to the different architectures is the stability of execution time on each n-core machine once the code uses n or more threads.

For better readability of our results, we also tested the efficiency of our algorithm on various architectures (see Figure 12(b)) using the *ψ*(*n*) formula introduced earlier. For the parallel time, we used the best time obtained with 8 threads ('basic-NPS' scheduler).
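As a reading aid, the relation between speedup and efficiency can be sketched as follows. The sketch assumes the conventional definitions (speedup as sequential time over parallel time, efficiency as speedup divided by the number of processors); the *ψ*(*n*) formula defined earlier in the paper may differ in detail, and the timings passed in are placeholders rather than measured values.

```python
def speedup(t_sequential, t_parallel):
    """Conventional speedup: sequential time divided by parallel time."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, n_processors):
    """Conventional efficiency: speedup per processor (assumed form of psi(n))."""
    return speedup(t_sequential, t_parallel) / n_processors

# Placeholder timings chosen so that the speedups match the reported 1.9 and 5.2.
smp_efficiency = efficiency(1.9, 1.0, 8)   # "Symmetric Multiprocessing" scheduler
nps_efficiency = efficiency(5.2, 1.0, 8)   # basic-NPS scheduler
```

Under these assumptions, a speedup of 5.2 on eight cores corresponds to an efficiency of 0.65.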

#### 4.3.2 Cache Memory Evaluation

As memory access is a principal bottleneck in current-day computer architectures, a key enabler for high performance is masking the memory overhead. Basic theory tells us that two classic cache design parameters dramatically influence cache performance: the block size and the cache associativity. The simplest way to reduce the miss rate is to increase the block size, even though this increases the miss penalty. The second solution is to decrease associativity in order to decrease hit time: to retrieve a block in an associative cache, the block must be searched for inside an entire set, since there is more than one place where it can be stored.

Unfortunately, we are dealing with non-reconfigurable architectures with caches whose associativity and block size are predefined by the manufacturer. Nowadays, new approaches to reduce cache misses are being developed, such as taking advantage of locality of memory references, or using aggressive multithreading so that whenever a thread is stalled waiting for data, the system can efficiently switch to another thread. Despite their power, the application of both approaches remains limited. In fact, applications of the locality approach are still experimental, even with the Larrabee technology introduced by Intel. The aggressive multithreading approach has been designed specifically for graphics processing engines, which manage thousands of in-flight threads concurrently, so it is not recommended for general SMP machines with a limited number of processors and threads. Given these limitations, the most intuitive solution is to rely on scheduling. Thanks to our basic-NPS scheduler, we balance the load and prevent context switching, and thus minimize cache misses.

Hardware configuration

Component | Parameter | Intel P4 | Intel Dual Core T1400 | Intel C2 Quad Q9550 | Intel Xeon E5405
---|---|---|---|---|---
Number of processors | | 1 | 2 | 4 | 2 × 4
SMT | | Yes | Yes | Yes | Yes
Frequency | | 3.4 GHz | 1.73 GHz | 2.83 GHz | 2.00 GHz
L1 Instruction Cache | Size | 16 KB | 32 KB | 32 KB | 32 KB
L1 Instruction Cache | Associativity | 8-way | 8-way | 8-way | 8-way
L1 Instruction Cache | Block size | 32 bytes | 32 bytes | 32 bytes | 32 bytes
L1 Data Cache | Size | 16 KB | 32 KB | 32 KB | 32 KB
L1 Data Cache | Associativity | 8-way | 8-way | 8-way | 8-way
L1 Data Cache | Block size | 64 bytes | 64 bytes | 64 bytes | 64 bytes
L2 Cache | Size | 2 MB | 512 KB | 6 MB | 6 MB
L2 Cache | Associativity | 8-way | 8-way | 8-way | 8-way
L2 Cache | Block size | 64 bytes | 64 bytes | 64 bytes | 64 bytes
RAM size | | 1 GB | 2 GB | 2 GB | 8 GB

L1 instruction misses (Symmetric Multiprocessing scheduler vs. basic-NPS scheduler)

Scheduler | 2 threads | 3 threads | 4 threads | 5 threads | 6 threads | 7 threads | 8 threads
---|---|---|---|---|---|---|---
SMP scheduler | 18844 | 19476 | 18638 | 19726 | 20058 | 20324 | 18946
Basic-NPS scheduler | 6030 | 6262 | 6035 | 6437 | 7202 | 7804 | 7085
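The relative reduction in L1 instruction misses implied by the table can be worked out with a few lines of Python. The figures are copied from the table above; the helper name `miss_reduction` is only illustrative.

```python
threads = [2, 3, 4, 5, 6, 7, 8]
smp_misses = [18844, 19476, 18638, 19726, 20058, 20324, 18946]
nps_misses = [6030, 6262, 6035, 6437, 7202, 7804, 7085]

def miss_reduction(baseline, improved):
    """Fraction of baseline misses that the improved scheduler avoids."""
    return 1.0 - improved / baseline

reductions = {
    n: miss_reduction(smp, nps)
    for n, smp, nps in zip(threads, smp_misses, nps_misses)
}

for n, r in reductions.items():
    print(f"{n} threads: {100 * r:.1f}% fewer L1 instruction misses")
```

With these numbers, the basic-NPS scheduler avoids roughly 62% to 68% of the L1 instruction misses observed with the SMP scheduler, depending on the number of threads.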

## 5 Conclusion

Topological characteristics are fundamental attributes of an object. In many applications, it is mandatory to preserve or control the topology of an image. Nevertheless, the design of transformations which preserve both topological and geometrical features of images is not an obvious task, especially for parallel processing.

In this paper, we have presented a new parallel computation method for topological smoothing that combines the parallel computation of the Euclidean distance transform using Meijster's algorithm with parallel thinning and thickening processes using an adapted version of Couprie's algorithm.

We have also presented a new parallelization strategy called SDM (Split, Distribute and Merge). The proposed strategy is partially based on the divide-and-conquer principle, associated with event-based coordination techniques. Beyond the smoothing operator, the SDM strategy can be applied to a large class of topological operators, as shown in Section 3.1. In addition to the conditions identified for the splitting step, we introduced an adapted scheduler called basic-NPS (Basic Non-Preemptive Scheduler), able to distribute a set of active tasks over the available processors in a balanced way. Finally, we introduced an adapted merging policy designed especially for dynamic systems that evolve until stability.

Parallel topological operator computation poses many challenges, ranging from parallelization strategies to implementation techniques. We tackle these challenges using successive refinement, starting with highly local operators, which only characterize points and then delete target pixels, and gradually moving to more complex topological operators with non-local behavior. In future work, we will study the parallel computation of the topological watershed [14].

## Algorithm 1. Scheduling policy

1. *T*: Set of all tasks

2. *P*: Set of all processors

3. While (*T* ≠ ∅) repeat:

4. *N*_{T} = Nbr_active_tasks();

5. *N*_{P} = Nbr_available_processors();

6. If (*N*_{P} ≠ 0) then

7. If (*N*_{T} < *N*_{P}) then

8. For each processor *N*_{Pi}:

9. Generate-new-process (*N*_{Ti});

10. Identify-class (*N*_{Ti}, SCHED_FIFO);

11. Endfor

12. Else: *N*_{DT} = Disable_tasks (*N*_{T} - *N*_{P});

13. Insert_disabled_tasks (*N*_{DT}, *T*);

14. For each processor *N*_{Pi}:

15. Generate-new-process (*N*_{Ti});

16. Identify-class (*N*_{Ti}, SCHED_FIFO);

17. Endfor

18. EndIf

19. EndIf

20. EndWhile
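The dispatch logic of this scheduling policy can be simulated with a short Python sketch. It only models how active tasks are grouped onto the available processors when there are more tasks than processors; it does not create processes or set the `SCHED_FIFO` class, and the function name `basic_nps_waves` is hypothetical. The sketch assumes that tasks exceeding the number of processors are put back in the task set and dispatched in a later round.

```python
from collections import deque

def basic_nps_waves(tasks, n_processors):
    """Group tasks into successive rounds of at most n_processors tasks.

    Each round models one pass of the scheduling loop: up to n_processors
    tasks are dispatched (one per processor, run to completion without
    preemption) and the remaining tasks wait in the task set.
    """
    if n_processors <= 0:
        raise ValueError("at least one processor is required")
    pending = deque(tasks)
    waves = []
    while pending:                                  # While (T != empty set)
        n_dispatch = min(n_processors, len(pending))
        waves.append([pending.popleft() for _ in range(n_dispatch)])
    return waves

# Ten tasks on four processors are dispatched in three rounds.
waves = basic_nps_waves(range(10), 4)
```

Because no task is preempted, each processor runs exactly one task per round, which is the property the text relies on to limit context switches.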

## Algorithm 2. Merging technique

1. *Z*: Set of research zones

2. *T*: Set of threads

3. *FIFO*_*Q*: Set of available FIFO queues

4. *P*_{T}: Target pixel type; *P*_{D}: Detected pixel

5. For all zones (*Z*_{i} ∈ *Z*) do:

6. Parallel_browsing (*T*_{i}, *Z*_{i});

7. EndFor

8. For each thread (*T*_{i} ∈ *T*) do:

9. If (pixel_caract(*T*_{i}, *P*_{T}) == True) then

10. modify_value(*P*_{D});

11. If (*FIFO*_*Q* ≠ ∅) then

12. usedstatus(*FIFO*_*Q*_{j}, true);

13. insert_neighbors(*T*_{i}, *P*_{D}, *FIFO*_*Q*_{j});

14. Else: add_new_fifo (*FIFO*_*Q*)

15. usedstatus(*FIFO*_*Q*_{j+1}, False);

16. insert_neighbors(*T*_{i}, *P*_{D}, *FIFO*_*Q*_{j+1});

17. EndIf;

18. EndIf;

19. EndFor;

## Algorithm 3. Meijster original version [1st Step]

1. Data: *m*: columns, *n*: lines, *b*: image

2. Forall *y* ∈ [0..*m*-1] do

3. if (0, *y*) ∈ *B* then *g*[0, *y*] = 0

4. else *g*[0, *y*] = ∞

5. endif

6. /* G_{T} */

7. for (*x* = 1) to (*n*-1) do

8. if (*x*, *y*) ∈ *B* then *g*[*x*, *y*] = 0

9. else *g*[*x*, *y*] = *g*[*x*-1, *y*]+1

10. endif

11. endfor

12. /* G_{B} */

13. for (*x* = *n*-2) downto (0) do

14. if *g*[*x*+1, *y*] < *g*[*x*, *y*] then

15. *g*[*x*, *y*] = *g*[*x*+1, *y*]+1

16. endif

17. endfor

18. endforall

## Algorithm 4. Meijster original version [2nd Step]

1. Data: *b*: image, *g*: G_Table, *m*: columns, *n*: lines

2. Forall *x* ∈ [0..*n*-1] do

3. *q* = 0

4. *s*[0] = 0

5. *t*[0] = 0

6. /* First part */

7. for (*u* = 1) to (*m*-1) do

8. while (*q* ≥ 0) ∧ (*f*((*x*, *t*[*q*]), *s*[*q*]) > *f*((*x*, *t*[*q*]), *u*)) do

9. *q* ← (*q*-1)

10. end while

11. if (*q* < 0) then (*q* ← 0)

12. (*s*[0] ← *u*)

13. else *w* ← *Sep*(*s*[*q*], *u*, *x*)+1

14. if (*w* < *m*) then *q* ← (*q*+1)

15. *s*[*q*] ← *u*

16. *t*[*q*] ← *w*

17. endif

18. endif

19. endfor

20. /* Second part */

21. for (*u* = *m*-1) downto (0) do

22. *Edt*[*x*, *u*] = *f*((*x*, *u*), *s*[*q*])

23. if (*u* = *t*[*q*]) then *q* ← (*q*-1)

24. endif

25. endfor

26. endforall

## Algorithm 5. Meijster parallel version [1st Step]

1. For (*y* = *t*, *y* < *m*, *y* = *y*+*t*_{max}) do

2. if (0, *y*) ∈ *B* then *g*[0, *y*] ← 0

3. else *g*[0, *y*] ← ∞

4. endif

5. /* G_{T} */

6. for (*x* = 1) to (*n*-1) do

7. if (*x*, *y*) ∈ *B* then *g*[*x*, *y*] ← 0

8. else *g*[*x*, *y*] ← *g*[*x*-1, *y*]+1

9. endif

10. Endfor

11. /* G_{B} */

12. for (*x* = *n*-2) downto (0) do

13. if (*g*[*x*+1, *y*] < *g*[*x*, *y*]) then

14. *g*[*x*, *y*] ← *g*[*x*+1, *y*]+1

15. endif

16. Endfor

17. Endfor

## Algorithm 6. Meijster parallel version [2nd Step]

1. For (*x* = *t*, *x* < *n*, *x* = *x*+*t*_{max}) do

2. *q* = 0; *s*[0] = 0;

3. *t*[0] = 0;

4. /* First part */

5. for (*u* = 1) to (*m*-1) do

6. while (*q* ≥ 0) ∧ (*f*((*x*, *t*[*q*]), *s*[*q*]) > *f*((*x*, *t*[*q*]), *u*)) do

7. *q* ← (*q*-1)

8. end while

9. if (*q* < 0) then (*q* ← 0)

10. (*s*[0] ← *u*)

11. else *w* ← *Sep*(*s*[*q*], *u*, *x*)+1

12. if (*w* < *m*) then *q* ← (*q*+1)

13. *s*[*q*] ← *u*

14. *t*[*q*] ← *w*

15. endif

16. endif

17. Endfor

18. /* Second part */

19. for (*u* = *m*-1) downto (0) do

20. *Edt*[*x*, *u*] ← *f*((*x*, *u*), *s*[*q*])

21. if (*u* = *t*[*q*]) then *q* ← (*q*-1)

22. endif

23. Endfor

24. Endfor
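The two parallel Meijster steps can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' code: the image is a list of lines of booleans, `True` marks the pixels of *B*, infinity is replaced by the finite value *n* + *m* as in Meijster's published formulation, the forward scan reads the previous line, and the inner `while` loop decrements *q*. Columns and lines are dealt to threads cyclically (thread *t* takes indices *t*, *t* + *t*_{max}, ...). The function name `edt_squared` and the use of `ThreadPoolExecutor` are illustrative choices; in CPython the threads show the work distribution rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def edt_squared(b, t_max=4):
    """Squared Euclidean distance from each pixel to the nearest True pixel of b.

    Assumes b contains at least one True pixel.
    """
    n, m = len(b), len(b[0])            # n lines, m columns
    inf = n + m                         # finite stand-in for infinity
    g = [[0] * m for _ in range(n)]
    dt = [[0] * m for _ in range(n)]

    def scan_columns(t):
        # 1st step: thread t handles columns t, t + t_max, t + 2*t_max, ...
        for y in range(t, m, t_max):
            g[0][y] = 0 if b[0][y] else inf
            for x in range(1, n):                   # top-down scan
                g[x][y] = 0 if b[x][y] else g[x - 1][y] + 1
            for x in range(n - 2, -1, -1):          # bottom-up scan
                if g[x + 1][y] < g[x][y]:
                    g[x][y] = g[x + 1][y] + 1

    def scan_lines(t):
        # 2nd step: thread t handles lines t, t + t_max, t + 2*t_max, ...
        for x in range(t, n, t_max):
            row = g[x]

            def f(u, i):
                return (u - i) ** 2 + row[i] ** 2

            def sep(i, u):
                return (u * u - i * i + row[u] ** 2 - row[i] ** 2) // (2 * (u - i))

            q, s, tt = 0, [0] * m, [0] * m
            for u in range(1, m):                   # first part: lower envelope
                while q >= 0 and f(tt[q], s[q]) > f(tt[q], u):
                    q -= 1
                if q < 0:
                    q, s[0] = 0, u
                else:
                    w = 1 + sep(s[q], u)
                    if w < m:
                        q += 1
                        s[q], tt[q] = u, w
            for u in range(m - 1, -1, -1):          # second part: read distances
                dt[x][u] = f(u, s[q])
                if u == tt[q]:
                    q -= 1

    with ThreadPoolExecutor(max_workers=t_max) as pool:
        list(pool.map(scan_columns, range(t_max)))  # wait for every column
        list(pool.map(scan_lines, range(t_max)))    # only then process lines
    return dt

distances = edt_squared([[True, False, False]])
```

The first `list(pool.map(...))` call acts as a barrier: no line can be processed before every column is finished, which is the choke point between the two steps discussed in Section 4.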

## Algorithm 7. Adapted version of thinning algorithm

1. forall (*x* ∈ *input*) do

2. if (*input*[*x*] is destructible) then *push*(*x*, *stack* 1)

3. endif

4. endforall

5. *output* ← *input*

6. While (*stack* 1 ≠ ∅) ∧ (max_{iter} > 0) do

7. While (*stack* 1 ≠ ∅) do

8. *x* ← *pop*(*stack* 1)

9. if (*output*[*x*] is destructible) then

10. *output*[*x*] ← *reduce*_*pt*(*x*)

11. *push*(*x*, *stack* 2)

12. endif

13. end while

14. While (*stack* 2 ≠ ∅) do

15. *x* ← *pop*(*stack* 2)

16. *v* ← *neighbors*(*x*)

17. *i* ← 0

18. While (*i* < 8) do

19. if (*v*[*i*] ∉ *stack* 1) then

20. *push*(*v*[*i*], *stack* 1)

21. endif

22. *i* ← *i*+1

23. endwhile

24. endwhile

25. max_{iter} ← max_{iter} -1

26. Endwhile
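A sequential, binary version of this two-stack loop can be sketched in Python. Two assumptions are made loudly here: the image is binary (so "reducing" a point sets it to 0), and the destructibility test is replaced by a stand-in based on the Yokoi 8-connectivity number, which equals 1 for simple points of an 8-connected object with a 4-connected background. The paper's own test works on grayscale values and is not reproduced; the function names are hypothetical.

```python
def is_destructible(img, p):
    """Stand-in test: object pixel whose Yokoi 8-connectivity number is 1."""
    r, c = p
    if not img[r][c]:
        return False
    # 8 neighbors, counterclockwise from east; outside the image is background.
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

    def bg(k):
        rr, cc = r + offsets[k % 8][0], c + offsets[k % 8][1]
        inside = 0 <= rr < len(img) and 0 <= cc < len(img[0])
        return 0 if inside and img[rr][cc] else 1

    number = sum(bg(k) - bg(k) * bg(k + 1) * bg(k + 2) for k in (0, 2, 4, 6))
    return number == 1

def thin(image, max_iter=1000):
    """Two-stack thinning: delete destructible points, then recheck neighbors."""
    out = [row[:] for row in image]
    n, m = len(out), len(out[0])
    stack1 = [(r, c) for r in range(n) for c in range(m)
              if is_destructible(out, (r, c))]
    while stack1 and max_iter > 0:
        stack2 = []
        while stack1:                       # first loop: recheck and delete
            p = stack1.pop()
            if is_destructible(out, p):
                out[p[0]][p[1]] = 0         # binary counterpart of reduce_pt
                stack2.append(p)
        queued = set()
        while stack2:                       # second loop: queue the 8 neighbors
            r, c = stack2.pop()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    q = (r + dr, c + dc)
                    if (dr or dc) and 0 <= q[0] < n and 0 <= q[1] < m and q not in queued:
                        queued.add(q)
                        stack1.append(q)
        max_iter -= 1
    return out

# A filled 5 x 5 square inside a 7 x 7 image has no hole.
square = [[1 if 1 <= r <= 5 and 1 <= c <= 5 else 0 for c in range(7)] for r in range(7)]
thinned = thin(square)
```

Because every candidate is rechecked against the current image just before it is deleted, the number of connected components and holes is preserved; an object with a hole is thinned to a closed curve around that hole rather than to a point.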

## References

1. Taubin G: Curve and surface smoothing without shrinkage. *Proceedings of ICCV'95, 852-857* 1999.
2. Liu X, Bao H, Shum H-Y, Peng Q: A novel volume constrained smoothing method for meshes. *Graphical Models* 2002, 64: 169-182. 10.1006/gmod.2002.0576
3. Asano A, Yamashita T, Yokozeki S: Active contour model based on mathematical morphology. *ICPR* 1998, 98: 1455-1457.
4. Leymarie F, Levine MD: Curvature morphology. *Proceedings of Vision Interface* 1989, 102-109.
5. Couprie M, Bertrand G: Topology preserving alternating sequential filter for smoothing 2D and 3D objects. *J Electron Imaging* 2004, 13: 720-730. 10.1117/1.1789986
6. Yung Kong T, Rosenfeld A: Digital topology: introduction and survey. *Comput Vision Graphics Image Process* 1989, 48: 357-393. 10.1016/0734-189X(89)90147-3
7. Sternberg SR: Grayscale morphology. *Comput Vision Graphics Image Understanding* 1986, 35: 333-355. 10.1016/0734-189X(86)90004-6
8. Serra J: Image Analysis and Mathematical Morphology. In *Theoretical Advances*. *Volume II*. Academic Press, New York; 1988. Chap. 10
9. Bertrand G: Simple points topological numbers and geodesic neighbourhoods in cubic grids. *Pattern Recognition Letters* 1994, 15: 1003-1011. 10.1016/0167-8655(94)90032-9
10. Danielson PE: Euclidean distance mapping. *Computer Graphics and Image Processing* 1980, 14: 227-248. 10.1016/0146-664X(80)90054-4
11. Bertrand G, Everat JC, Couprie M: Topological approach to image segmentation. *SPIE Vision Geometry V* 1996, 2826: 65-76.
12. Couprie M, Bezerra FN, Bertrand G: Topological operators for greyscale image processing. *Journal of Electronic Imaging* 2001, 10: 1003-1015. 10.1117/1.1408316
13. Bertrand G, Everat JC, Couprie M: Image segmentation through operators based on topology. *Journal of Electronic Imaging* 1997, 6: 395-405. 10.1117/12.276856
14. Bertrand G: On topological watersheds. *J Math Imaging Vision* 2005, 22: 217-230. 10.1007/s10851-005-4891-5
15. Wilkinson MHF, Gao H, Hesselink WH, Jonker J, Meijster A: Concurrent computation of attribute filters on shared memory parallel machines. *Trans Pattern Anal Mach Intell* 2007, 1800-1813.
16. Seinstra FJ, Koelma D, Geusebroek JM: A software architecture for user transparent parallel image processing. *International Euro-Par conference* 2001, 2150: 653-662.
17. Meijster A, Roerdink JBTM, Hesselink WH: A general algorithm for computing distance transforms in linear time. In *Mathematical Morphology and its Applications to Image and Signal Processing*. Kluwer Academic Publishers, Dordrecht; 2000:331-340.
18. Natarajan S, ed: *Imprecise and Approximate Computation*. Kluwer, Boston; 1995.
19. Van Tilborg AM, Koob GM, eds: *Foundations of Real-Time Computing: Scheduling and Resources Management, Kluwer, Boston*. 1991.
20. Leung J, Zhao H: Real-time scheduling analysis report. *Department of Computer Science New Jersey Institute of Technology* 2005.
21. Kolivas C: RSDL completely fair starvation free 64 interactive cpu scheduler. *lwn.net* 2007.
22. Molnar I: Modular scheduler core and completely fair scheduler. *lwn.net* 2007.
23. Kahn G: The semantics of a simple language for parallel programming. In *Proceedings of the IFIP Congress 74*. North-Holland Publishing Co., Amsterdam; 1974.
24. Nikolov H, Thompson M, Stefanov T, Pimentel AD, Polstra S, Bose R, Zissulescu C, Deprettere EF: Daedalus: Toward Composable Multimedia MP-SoC Design, invited paper. In *Proceedings of the ACM/IEEE International Design Automation Conference*. (DAC '08), Anaheim, USA; 2008:574-579.
25. Halfhill T: Ambric's new parallel processor. *Microprocessor Report* 2006.
26. Giavitto J-L, Sansonnet J-P: Introduction à 8 1/2 Rapport interne. *LRI Orsay* 1994.
27. Giavitto J-L, Sansonnet J-P: 8 1/2: data-parallélisme et data-flow. *Techniques et Sciences Informatiques* 1993, 12(5).
28. Mahiout A, Giavitto J-L, Sansonnet J-P: Distribution and scheduling data-parallel dataflow programs on massively parallel architectures. In *SMS-TPE '94: Software for Multiprocessors and Supercomputers*. Office of Naval Research USA & Russian Basic Research Foundation, Moscow; 1994.
29. Lemire D: Streaming maximum-minimum filter using no more than three comparisons per element. *Nordic J Comput* 2006, 13(4):328-339.
30. Shih FY, Wu Y: Fast Euclidean distance transformation in two scans using a 3 × 3 neighborhood. *Comput Vis Image Understanding* 2004, 94: 195-205.
31. Cuisenaire O, Macq B: Fast Euclidean distance transformation by propagation using multiple neighborhoods. *CVIU* 1999, 76(2):163-172.
32. Cuisenaire O, Macq B: Fast and exact signed Euclidean distance transformation with linear complexity. *IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP99)* 1999, 3293-3296.
33. Fabbri R, Costa LF, Torelli JC, Bruno OM: 2D Euclidean distance transform algorithms: A comparative survey. *ACM Computing Surveys* 2008, 40.
34. Mahmoudi R, Akil M, Matas P: Parallel image thinning through topological operators on shared memory parallel machines. *Signals, Systems and Computers Conference* 2009, 723-730.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.