
Segmentation and size estimation of tomatoes from sequences of paired images

Abstract

In this paper, we present a complete system to monitor the growth of tomatoes from images acquired in open fields. This is a challenging task because of the severe occlusion and poor contrast in the images. We approximate the tomatoes by spheres in the 3D space, hence by ellipses in the image space. The tomatoes are first identified in the images using a segmentation procedure. Then, the size of the tomatoes is measured from the obtained segmentation and camera parameters. The shape information combined with temporal information, given the limited evolution from an image to the next one, is used throughout the system to increase the robustness with respect to occlusion and poor contrast.

The segmentation procedure presented in this paper is an extension of our previous work based on active contours. Here, we present a method to update the position of the tomato by comparing the SIFT descriptors computed at predetermined points in two consecutive images. This leads to a very accurate estimation of the tomato position, from which the entire segmentation procedure benefits. The average error between the automatic and manual segmentations is around 4 % (expressed as the percentage of tomato size) with a good robustness with respect to occlusion (up to 50 %).

The size estimation procedure was evaluated by calculating the size of tomatoes under a controlled environment. In this case, the mean percentage error between the actual radius and the estimated size is around 2.35 % with a standard deviation of 1.83 % and is less than 5 % in most (91 %) cases. The complete system was also applied to estimate the size of tomatoes cultivated in open fields.

1 Introduction

Monitoring the growth of crops provides important information about the status of the crop and helps the farmer better manage resource requirements (such as storage and transportation) after the harvest. It also allows better planning and marketing well in advance, as well as better negotiation of the terms and conditions of crop insurance. Moreover, any abnormal growth of the crop can be detected through continuous monitoring during the entire agriculture season [1].

Existing methods for monitoring the growth of crops can be broadly divided into two categories. In the first category, the growth of the crop is monitored and the yield of the field is estimated based on remote sensing data [2, 3]. Various vegetation indices, such as the normalized difference vegetation index (NDVI) and the vegetation condition index (VCI), are used to determine the growth stage of the crop and then estimate the yield. However, the quality of the acquired data may decrease due to adverse climatic conditions (such as clouds) [3]. Moreover, since NDVI is based on reflected radiation in the near-infrared and visible wavelengths, the condition of the soil could result in unreliable measured indices. Crop growth modeling is another method, used to model the growth of the crop based on crop variety, soil, and weather information [4]. The drawback of this method is that it assumes an ideal scenario with no infection in the field. In case of an infection, the estimated model would not accurately represent the actual growth status of the crop.

A few studies have monitored the growth of crops based on captured images of the field [5, 6]. For instance, the authors in [5] proposed to detect apples cultivated in apple orchards and then measure their size based on morphological operations. However, the proposed method does not take into account any possible occlusion, which may be a strong limitation. The authors in [6] developed a model that can predict the yield of the field at harvest, given the flower density calculated from the captured image. In order to have maximal contrast in the image and the least influence of sunlight conditions, a black textile screen was placed behind the trees, which is a heavy and laborious task. Moreover, the yield of the field at harvest depends not only on the flower density but also on the meteorological conditions during the season. These methods are limited to a controlled environment where there is little occlusion and little movement between consecutive images.

This study proposes an innovative system for monitoring the growth of tomatoes cultivated in open fields, from images acquired at regular time intervals by two cameras. The purpose is to estimate the size of the tomatoes remotely, all along their maturation (from flowers to ripe fruits), in order to detect any abnormal development and predict the yield of the field. No specific installation is required to control the environment. The two cameras are placed in the open field and the images are transferred to a central server via a wireless network (2G, 3G, 4G). The data are stored and analyzed at the central server. Note that a judicious amount of data should be transferred to the central server in order to minimize the cost. From the estimated sizes, any abnormal development can be deduced and an estimate of the yield of the field can be computed remotely.

The growth monitoring and yield estimation require identifying the tomatoes in the images and performing quantitative measurements based on calibrated acquisitions. Figure 1 shows typical images of the field using a two-camera acquisition system. Given the challenges of the system (see Section 2), we suppose that a tomato can be modeled as a sphere in the 3D space. Therefore, the tomatoes are first identified in the images using a segmentation procedure, which exploits shape information combined with temporal information, given the limited evolution from an image to the next one. The size of the tomatoes is then measured from the obtained segmentation and camera parameters. We chose to capture and analyze images of the tomatoes during their whole development while keeping costs as low as possible, processing one pair of images per day, as analyzing images more frequently would increase the cost of transmission and processing without bringing significant extra information.

Fig. 1 Images of the scene acquired using two cameras

This work is an extension of our previous works [7, 8] which presented the segmentation procedure only. In contrast, this paper presents the complete system to follow the growth of tomatoes, which consists in segmenting the tomatoes and then estimating their sizes. Furthermore, we propose a more accurate method for estimating the position of the tomato as compared to our previous works [7, 8], thereby improving the obtained segmentation and leading to more accurate experimental results.

As can be observed from Fig. 1, detecting the tomatoes is a very challenging task. This point is discussed in Section 2, and the model we propose to overcome the difficulties is presented in Section 3. Sections 4 and 5 describe the proposed methods for the two parts of the system: the segmentation procedure and the size estimation procedure. Experimental results obtained with our complete system on data acquired in open fields are presented in Section 6.

2 Challenges of the system

One of the major challenges of the system results from occlusion. Most of the tomatoes are either partially or completely hidden by other tomatoes or branches/leaves. Figure 2 shows images of three different tomatoes. It is worth noting the variation in the amount of occlusion. In some cases, due to severe occlusion, it is impossible to correctly detect the tomatoes even manually.

Fig. 2 Variation in the amount of occlusion

Color information is not very useful as the tomatoes and leaves are almost of the same color during a major part of the agriculture season (Fig. 1). Moreover, the position of the tomato is not fixed during the agriculture season. This might be due to external climatic conditions (wind, rain) or to the increasing weight of the tomato as the season progresses.

Since the images are acquired in open fields, we do not have any control on the external illumination of the scene. As a result, a shadow is observed in some images (Fig. 3). Moreover, due to the presence of neighboring tomatoes in the background, a portion of the contour is blurred (Fig. 3). This results in an imprecision regarding the delineation of the actual position of the contour.

Fig. 3 Imprecision on the actual position of the contour due to shadow (left) and blurring (right)

In order to measure the size of the tomato, we need to determine the image points in the two images corresponding to the same point in the 3D space. This correspondence problem is another very challenging task given the complexity of the scene.

In order to overcome all these problems, we propose to exploit available a priori information. The next section introduces our system and describes how this information is integrated.

3 Proposed system

We suppose that the tomato is a sphere in the 3D space. Using the properties of projective geometry, it can be shown that the image of a sphere in the 3D space is an ellipse (Section 3.1). Moreover, we found experimentally that the ellipse parameters vary slowly from one day to the next, allowing us to introduce temporal knowledge in our models (Section 3.2). All this information is used throughout our segmentation method, which relies on an active contour model with shape constraint (Section 3.3). The complete workflow is introduced in Section 3.4.

Figure 4 shows the geometry of the acquisition system. The camera parameters (intrinsic and extrinsic) were computed by observing a calibration pattern at different locations and orientations in the scene [9].

Fig. 4 The acquisition system consists of two cameras installed in an open field

3.1 Geometric model

The contour generator Γ of a surface \(Q_{r}\) in the 3D space is, in general, a space curve composed of all points \(\mathbf{X}\) situated on the surface at which the imaging rays are tangent. The apparent contour \(C_{n}\) is the image of this contour generator.

Under the camera projection matrix P, the apparent contour of the quadric \(Q_{r}\) (a surface in the 3D space) is the conic \(C_{n}\) defined as:

$$ \mathbf{C}_{n}^{*} \simeq \mathbf{P}\,\mathbf{Q}_{r}^{*}\,\mathbf{P}^{T} $$
(1)

where \(\mathbf{M}^{*}\) represents the adjoint of \(\mathbf{M}\) (or \(|\mathbf{M}|\,\mathbf{M}^{-1}\) for a non-singular matrix \(\mathbf{M}\)) and \(\simeq\) denotes equality up to a scale factor. Note that a dual conic \(C_{n}^{*}\) is used here because the apparent contour arises from tangency (see Chapter 8 in [10]).
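As an illustration, the following minimal numpy sketch (our own, not code from the paper) builds the dual quadric of a sphere and applies Eq. 1 to obtain the conic of its apparent contour. The sphere representation and function names are assumptions for illustration only:

```python
import numpy as np

def sphere_dual_quadric(center, radius):
    """Dual quadric Qr* of a sphere (4x4 matrix, defined up to scale)."""
    # Point quadric of a sphere: [X;1]^T Q [X;1] = |X - c|^2 - r^2, with
    # Q = [[I, -c], [-c^T, c^T c - r^2]]; its dual is its adjoint,
    # proportional to the inverse since this Q is non-singular.
    c = np.asarray(center, dtype=float)
    Q = np.eye(4)
    Q[:3, 3] = -c
    Q[3, :3] = -c
    Q[3, 3] = c @ c - radius**2
    return np.linalg.inv(Q)  # equal to the adjoint up to a scale factor

def apparent_contour(P, center, radius):
    """Dual conic Cn* = P Qr* P^T of the sphere's image (Eq. 1)."""
    Cn_star = P @ sphere_dual_quadric(center, radius) @ P.T
    # The point conic of the apparent contour is its inverse (up to scale)
    return np.linalg.inv(Cn_star)
```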

In our case, the contour generator Γ of a sphere is a circle whose center is the center \(\mathbf{X}_{c}\) of the sphere (Fig. 5). The curve Γ is included in the plane Π orthogonal to the line joining the center of the sphere \(\mathbf{X}_{c}\) and the camera center \(\mathbf{C}\). The rays from the camera center \(\mathbf{C}\) tangent to the contour generator Γ form a cone, with the camera center as its vertex. The intersection of the image plane with this cone gives the image of the contour generator. Therefore, the apparent contour of this contour generator (a circle) is an ellipse.

Fig. 5 The image of a sphere is an ellipse

Introducing this a priori shape information in the segmentation procedure increases the robustness of the segmentation with respect to occlusion. Besides, this also simplifies the size estimation procedure since the radius of the sphere can be estimated without a full 3D reconstruction of the scene.

3.2 Temporal model

As discussed earlier, there is little growth of the tomato in a given day. Therefore, only two images are acquired every day, one for each camera. This creates a sequence of images for a given tomato. We manually segmented five tomatoes and approximated the delineated contour by an ellipse. We then studied the evolution of the length of their major and minor axes for the entire agriculture season. This study confirmed that there is little growth in the tomato during a given day [7].

Moreover, it was observed that under normal circumstances, there is little movement of the tomato as the season progresses. However, this movement is not uniform and is very difficult to predict, especially in case of strong winds or heavy rains, which led us to propose a new algorithm for tomato detection, presented in Section 4.2.

This a priori temporal information is integrated in the segmentation algorithm by defining intervals which represent the acceptable variation of the ellipse parameters in a given day. The admissible range of variation of the ellipse parameters is quite wide and defined as follows:

$$ -0.1 < \frac{a^{i+1}-a^{i}}{a^{i}} < 0.2 $$
(2)
$$ -0.1 < \frac{b^{i+1}-b^{i}}{b^{i}} < 0.2 $$
(3)
$$ -0.1 < \frac{SA^{i+1}-SA^{i}}{SA^{i}} < 0.25 $$
(4)
$$ \left| \frac{Ecc^{i+1}-Ecc^{i}}{Ecc^{i}} \right| < 0.1 $$
(5)
$$ \left| \frac{\varphi^{i+1}-\varphi^{i}}{\varphi^{i}} \right| < 0.2 $$
(6)

where \(a^{i}\), \(b^{i}\), \(\varphi^{i}\), \(SA^{i}\), and \(Ecc^{i}\) are the semi-major axis length, semi-minor axis length, orientation, area, and eccentricity of the ellipse in the previous (i th) image, respectively.
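These intervals translate directly into a simple admissibility test. The sketch below is a hypothetical helper (the parameter representation is ours) that checks Eqs. 2-6 for a candidate ellipse:

```python
def satisfies_temporal_model(prev, cand):
    """Check Eqs. 2-6: admissible day-to-day variation of ellipse parameters.

    prev and cand are dicts with keys 'a', 'b', 'SA', 'Ecc', 'phi'
    (semi-axes, area, eccentricity, orientation) for day i and day i+1.
    """
    rel = lambda key: (cand[key] - prev[key]) / prev[key]
    return (-0.1 < rel('a') < 0.2 and
            -0.1 < rel('b') < 0.2 and
            -0.1 < rel('SA') < 0.25 and
            abs(rel('Ecc')) < 0.1 and
            abs(rel('phi')) < 0.2)
```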

3.3 Active contour with shape constraint

The proposed segmentation procedure is based on an active contour model [11] with shape constraint. A brief description of the proposed active contour model [7] is presented below.

As discussed, the image of the tomato is assumed to be an ellipse. Thus, we propose to use a parametric active contour with an elliptic shape constraint, derived from a reference ellipse. Assuming the center of the reference ellipse as the origin, a point z(θ) on the evolving contour can be defined as:

$$ z(\theta) = r(\theta)e^{j\theta} $$
(7)

where θ is the angular coordinate and r(θ) is the radial coordinate. Similarly, a point situated on the reference ellipse can be denoted as \(z_{e}(\theta)=r_{e}(\theta)e^{j\theta}\). The energy functional with shape prior is defined as:

$$ \mathbf{E}_{\text{Total}}(r,r_{e}) = \int_{0}^{2\pi} \frac{\alpha}{2}\,|r'(\theta)|^{2}\,\mathrm{d}\theta + \int_{0}^{2\pi} E_{\text{Image}}\!\left(r(\theta)e^{j\theta}\right)\mathrm{d}\theta + \frac{\psi}{2}\int_{0}^{2\pi} |r(\theta)-r_{e}(\theta)|^{2}\,\mathrm{d}\theta $$
(8)

In the above equation, the first term is the internal energy, which enforces the regularity of the contour, and the last term is the shape prior energy, which restricts the evolution of the contour with respect to the reference ellipse. The coefficient α controls the variations of r and makes it regular, while ψ controls the influence of the shape prior on the total energy. In our application, these two parameters were set experimentally (α=10, ψ=0.5) on some examples, and the same values were used for all images. The second term, the image energy term, is calculated using gradient vector flow (GVF) [12]. The minimization of Eq. 8 is classically performed using an iterative algorithm. The reference ellipse is regularly updated, based on both the position of the current curve z(θ) and the knowledge of the final ellipse in the previous image (temporal model).
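For concreteness, the following sketch shows one explicit gradient-descent step on a discretized version of Eq. 8. The time step, the discretization, and the radial image force f_image (e.g., a GVF field projected onto the radial direction) are our assumptions, not details from the paper:

```python
import numpy as np

def radial_snake_step(r, r_e, f_image, alpha=10.0, psi=0.5, dt=0.1):
    """One explicit descent step on the discretized Eq. 8.

    r, r_e  : radial coordinates of the evolving and reference contours,
              sampled at n uniformly spaced angles theta_k
    f_image : radial component of the image force (e.g., GVF) at r(theta_k)
    """
    n = len(r)
    h = 2 * np.pi / n
    # Second derivative r''(theta) with periodic boundary conditions
    r_pp = (np.roll(r, -1) - 2 * r + np.roll(r, 1)) / h**2
    # Euler-Lagrange descent: alpha r'' + image force - psi (r - r_e)
    return r + dt * (alpha * r_pp + f_image - psi * (r - r_e))
```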

3.4 Summary of the proposed algorithm

Here, we briefly present the main steps for processing a tomato in a series of images captured by the left and right cameras. The global system is composed of two main steps: tomato segmentation and size estimation (Fig. 6). We assume that the images of a given day (i+1), \(\text {Im}_{l}^{i+1}\) and \(\text {Im}_{r}^{i+1}\), are processed knowing the elliptic approximations, \(\text {Ell}_{l}^{i}\) and \(\text {Ell}_{r}^{i}\), found in the previous day (i).

Fig. 6 The proposed system

We first summarize the segmentation procedure applied to the left and right images separately (Fig. 7). Since the position of the tomato is not fixed in the image as the season progresses, we first update it (Section 4.2). Then, gradient information is combined with region information in order to propose a first elliptic approximation of the tomato boundary, from which the active contour with elliptic shape constraint can be applied. Finally, four ellipse estimates are computed from which the operator has to select the best one as the final segmentation (Section 4.3).

Fig. 7 Different steps of the segmentation procedure

In order to perform metric measurements in the scene, we need to determine the camera projection matrices for the two cameras (P,Q). These matrices are computed using the method presented in [9]. Then, from the obtained segmentation in the two images \(\left (\text {Ell}_{\mathrm {l}}^{\mathrm {i+1}}, \text {Ell}_{\mathrm {r}}^{\mathrm {i+1}}\right)\) and the camera projection matrices, the set of 3D space points situated on the contour generator is computed (Section 5.1). From the two sets of 3D space points corresponding to the left and the right cameras, we then estimate the radius of the sphere using least square minimization techniques (Section 5.2). Finally, a joint optimization is performed to obtain the final radius estimate of the sphere (Section 5.3).

4 Segmentation procedure

This section presents the proposed algorithm for detecting the tomatoes. Let us denote by \(\text{Im}^{i+1}\) the (i+1)th image (left or right) in which we wish to identify the tomato. In our sequential approach, the contour in the (i+1)th image is computed based on the information present in \(\text{Im}^{i+1}\) and the contour of the tomato in the i th image (\(\text{Im}^{i}\)), which has been validated by the operator (Fig. 7). So, in the following steps, it is assumed that the contour representing the tomato in the i th image is available and reliable. It is denoted by the ellipse \(\text{Ell}_{f}^{i}=\left[xc_{f}^{i}, yc_{f}^{i}, a_{f}^{i}, b_{f}^{i}, \varphi_{f}^{i}\right]\), where \(C_{f}^{i}=\left[xc_{f}^{i}, yc_{f}^{i}\right]\) represents the center of the ellipse, whose semi-major and semi-minor axis lengths are \(a_{f}^{i}\) and \(b_{f}^{i}\), respectively, and which has a rotation angle of \(\varphi_{f}^{i}\).

4.1 Pre-processing

Color information is not very useful since the tomatoes turn to red only at the end of the season. However, the edges of the tomatoes are more contrasted in the red component of the image, even during the first stages of the maturation. Hence, only this component is considered. A contrast stretching transformation is applied to this image.
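As a sketch, this pre-processing step might be implemented as follows; the percentile bounds of the stretch are our assumption, since the paper does not specify them:

```python
import numpy as np

def preprocess(img_rgb):
    """Extract the red component and apply a linear contrast stretch."""
    red = img_rgb[:, :, 0].astype(float)
    # Robust bounds for the stretch (percentile choice is an assumption)
    lo, hi = np.percentile(red, (1, 99))
    return np.clip((red - lo) / (hi - lo), 0.0, 1.0)
```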

4.2 Tomato localization

We first update the position of the tomato in the current ((i+1)th) image using a descriptor-based approach. Given the complexity of the scene, detecting interest points in the entire image and then matching their descriptors would be computationally expensive. Instead, we propose to compare the descriptors computed at a predetermined set of points in the previous (i th) and current ((i+1)th) images. The points in the i th image are computed from the final segmentation validated by the operator (Section 4.3.3). As such, these points are very likely to be situated on the actual boundary of the tomato, or at least very close to it. In the (i+1)th image, the candidate points are computed based on gradient magnitude and direction; they may or may not lie on the actual boundary of the tomato. By matching these two sets of descriptors, we then compute the translation undergone by the tomato from the i th image to the (i+1)th image.

4.2.1 Selection of relevant points in the i th image

Considering the final segmentation \({v_{f}^{i}}\) (output by the active contour algorithm) and its least square estimate \(\text {Ell}_{f}^{i}\) in the i th image (Fig. 8), the first step aims at selecting from \({v_{f}^{i}}\) a set P i of points that are surely situated on the tomato.

Fig. 8 (left) Selecting the points situated on the actual boundary of the tomato in the i th image; (right) detecting the candidate points of strong gradient

For each candidate point \(P_{v,h}^{i}\) of \({v_{f}^{i}}, h=1,\ldots,n_{P_{v}}^{i}\), the nearest point \(Q_{v,h}^{i}\) situated on the ellipse \(\text {Ell}_{f}^{i}\) is first determined. The normal to the ellipse \(\text {Ell}_{f}^{i}\) at the point \(Q_{v,h}^{i}\) is calculated and denoted by \({n_{h}^{i}}\).

In order to search for the points in the neighborhood of \(P_{v,h}^{i}\) with prominent gradient magnitude and whose gradient direction is normal to the ellipse \(\text{Ell}_{f}^{i}\), an intensity profile is created between \(P_{h,1}^{i}\) and \(P_{h,2}^{i}\), where \(P_{h,1}^{i} = P_{v,h}^{i}-0.25\; r_{f}^{i}\; n_{h}^{i}\) and \(P_{h,2}^{i} = P_{v,h}^{i}+0.1\; r_{f}^{i}\; n_{h}^{i}\), with

$$ r_{f}^{i}=\frac{a_{f}^{i}+b_{f}^{i}}{2} $$
(9)

If we denote by \(P_{h,\text {max}}^{i}\) the point with maximum gradient magnitude along the segment \(\left [P_{h,1}^{i}P_{h,2}^{i}\right ]\), then \(P_{v,h}^{i}\) is selected as a point situated on the actual contour of the tomato if:

$$ d\left(P_{h,\text{max}}^{i},P_{v,h}^{i}\right) < 2 $$
(10)

where d represents the Euclidean distance. Hereinafter, the set of points situated on the actual contour of the tomato computed using the above condition is represented as \(\mathbf {P}^{i}= \left \{{P_{h}^{i}},h=1,2,\ldots {n_{P}^{i}}\right \}\), where \({n_{P}^{i}}\leq n_{P_{v}}^{i}\).
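The following sketch outlines this selection step; nearest_on_ellipse, ellipse_normal, and sample_profile are hypothetical helpers standing in for standard geometric routines (closest point on an ellipse, unit normal there, and pixel coordinates along a segment):

```python
import numpy as np

def select_boundary_points(v_f, ell_f, grad_mag,
                           nearest_on_ellipse, ellipse_normal, sample_profile):
    """Keep the points of the final contour v_f that pass the test of Eq. 10."""
    r_f = 0.5 * (ell_f['a'] + ell_f['b'])        # Eq. 9
    kept = []
    for P in map(np.asarray, v_f):
        Q = nearest_on_ellipse(P, ell_f)
        n = ellipse_normal(Q, ell_f)             # unit normal at Q
        p1 = P - 0.25 * r_f * n                  # inner end of the profile
        p2 = P + 0.10 * r_f * n                  # outer end of the profile
        profile = sample_profile(p1, p2)         # (x, y) samples on [p1, p2]
        # Point of maximum gradient magnitude along the profile
        P_max = max(profile, key=lambda q: grad_mag[int(q[1]), int(q[0])])
        if np.linalg.norm(np.asarray(P_max) - P) < 2:    # Eq. 10
            kept.append(P)
    return kept
```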

Figure 9 shows the points detected in the 17th image of sequence S=7. Most of the non-occluded contour points of the tomato have been selected using this approach.

Fig. 9 The two sets of points to be matched: (left) P i in S = 7, i = 17; (right) \({P}_{c}^{i+1}\) in S = 7, i = 18

4.2.2 Selection of candidate points in the (i+1) th image

We now wish to compute a set of candidate contour points in the (i+1)th image. We first roughly determine the position of the tomato in the (i+1)th image, based on the pattern matching method presented in [7]. Let us denote by \(C_{m}=[x_{m},y_{m}]\) the estimated position. The purpose of the proposed method is to refine this position.

Using a polar representation with C m as the origin, let us denote by \(\left \{P_{u,c}^{i+1}=\rho _{u,c}^{i+1}e^{j\theta _{u,c}^{i+1}}, c=1,\ldots,n_{P_{u}}^{i+1}\right \}\) the set of points situated inside two concentric circles (Fig. 8) whose radii are respectively \(0.5 {r^{i}_{f}}\) and \(1.5{r^{i}_{f}}\). Points from this set are selected as candidate points if they satisfy the following two conditions:

$$ \left| \arg \left(\nabla \text{Im}^{i+1}\left(P_{u,c}^{i+1}\right)\right) -\theta_{u,c}^{i+1}\right| \leq \delta \theta_{\text{max}} $$
(11)
$$ \left|\nabla \text{Im}^{i+1}\left(P_{u,c}^{i+1}\right) \right| > \eta $$
(12)

where \(\arg\left(\nabla \text{Im}^{i+1}\left(P_{u,c}^{i+1}\right)\right)\) and \(\left|\nabla \text{Im}^{i+1}\left(P_{u,c}^{i+1}\right)\right|\) are respectively the angle and magnitude of the gradient at \(P_{u,c}^{i+1}\) in \(\text{Im}^{i+1}\).

The threshold values have been determined experimentally (η = 0.2 pixels, \(\delta\theta_{\text{max}}=\frac{\pi}{8}\) rad). The above conditions can be viewed as selecting the points with strong gradient whose gradient direction is within an acceptable limit with respect to the normal vector to a circle with radius \(r^{i}_{f}\). Finally, for every angle \(\theta_{u,c}^{i+1}\), if several points satisfy the criteria defined in Eqs. 11 and 12, only the one closest to the center point \(C_{m}\) is retained. Thus, at most one candidate point is retained per angle. This reduces the number of candidate points to be processed, in a way that is consistent with the fact that the pixels inside the non-occluded part of the tomato do not have a prominent gradient.

Hereinafter, the set of candidate points in the (i+1)th image is denoted by \(\mathbf {P}_{c}^{\,i+1}=\left \{P_{c,l}^{\,i+1},l =1,\ldots,n_{P}^{i+1}\right \},n_{P}^{i+1}\leq n_{P_{u}}^{i+1} \).
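A possible implementation of this selection is sketched below; the 1-degree angular binning used to keep at most one point per angle is an assumption:

```python
import numpy as np

def candidate_points(grad_x, grad_y, c_m, r_f,
                     eta=0.2, dtheta_max=np.pi / 8):
    """Candidate contour points in Im^{i+1} (Eqs. 11-12).

    Scans the annulus 0.5 r_f <= rho <= 1.5 r_f around the rough center c_m
    and keeps, per angle, the innermost point with a strong, roughly radial
    gradient.
    """
    h, w = grad_x.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - c_m[0], ys - c_m[1]
    rho = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)
    mag = np.hypot(grad_x, grad_y)
    ang = np.arctan2(grad_y, grad_x)
    # Angular difference wrapped to [-pi, pi]
    dang = np.angle(np.exp(1j * (ang - theta)))
    mask = ((rho >= 0.5 * r_f) & (rho <= 1.5 * r_f) &
            (mag > eta) & (np.abs(dang) <= dtheta_max))
    # Keep at most one point per (discretized) angle: the closest to c_m
    best = {}
    for y, x in zip(*np.nonzero(mask)):
        key = int(np.degrees(theta[y, x]))      # 1-degree bins (assumed)
        if key not in best or rho[y, x] < rho[best[key]]:
            best[key] = (y, x)
    return [(x, y) for (y, x) in best.values()]
```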

Figure 9 shows the set of points \(\mathbf {P}_{c}^{i+1}\) detected for the 18th image of sequence S = 7. Note that the points lying on the actual contour of the tomato have been detected along with several other points lying on the adjacent leaves.

4.2.3 Descriptor matching

Next, we wish to match the descriptors computed at these two sets of points. Let us denote by \(D_{\mathbf {P}^{i}}\) and \(D_{\mathbf {P}_{c}^{i+1}}\) the scale invariant feature transform (SIFT) descriptors [13] computed at P i and \(\mathbf {P}_{c}^{i+1}\), respectively. For every interest point of P i, the best match is determined by minimizing the Euclidean distance between the two descriptor vectors (using the keypoint matching approach presented in [13]). Let us suppose that the k th point of the set P i matches with the l th point in the set \(\mathbf {P}_{c}^{i+1}\). The corresponding translation \({T_{k}^{i}}\) is then computed as:

$$ T_{k}^{i} = P_{c,l}^{i+1}-P_{k}^{i}, \quad k = 1,\ldots,n_{P}^{i} $$
(13)

Now, we wish to determine which translation among the \({n_{P}^{i}}\) possible candidates represents the actual movement of the tomato. For each candidate, we define the translated set of points \(\mathbf {P}_{T_{k}}^{i} = \left \{P_{T_{k,h}}^{i}\right \}\) where

$$ P_{T_{k,h}}^{i}= P^{i}_{h}+T_{k}^{i}, \quad h = 1,\ldots,n_{P}^{i} $$
(14)

Among the \({n_{P}^{i}}\) possible translations, we first select \(n_{h}\left (n_{h}<{n_{P}^{i}}\right)\) translations which maximize the number of inliers. An inlier is defined as a point of \(\mathbf {P}_{T_{k}}^{i}\) whose distance to a point of \(\mathbf {P}_{c}^{i+1}\) is less than 5 pixels. However, due to the high variability that can generally be observed between two consecutive images, selecting a translation based only on the maximization of the number of inliers would not give optimal results. Hence, we propose to introduce additional information.
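The matching and preselection steps could be sketched as follows with OpenCV's SIFT implementation. The keypoint scale and the value of n_h are assumptions, and a plain nearest-neighbor match is used here instead of the full matching strategy of [13]:

```python
import numpy as np
import cv2

def candidate_translations(img_i, img_next, pts_i, pts_next, n_h=10):
    """Match SIFT descriptors at the two point sets (Eq. 13) and preselect
    the n_h translations with the largest number of inliers (< 5 px).

    img_i, img_next must be 8-bit grayscale images; for simplicity we
    assume compute() preserves the keypoint order.
    """
    sift = cv2.SIFT_create()
    kp_i = [cv2.KeyPoint(float(x), float(y), 8) for x, y in pts_i]
    kp_n = [cv2.KeyPoint(float(x), float(y), 8) for x, y in pts_next]
    _, des_i = sift.compute(img_i, kp_i)
    _, des_n = sift.compute(img_next, kp_n)
    matches = cv2.BFMatcher(cv2.NORM_L2).match(des_i, des_n)
    P_i = np.asarray(pts_i, dtype=float)
    P_n = np.asarray(pts_next, dtype=float)
    scored = []
    for m in matches:
        T = P_n[m.trainIdx] - P_i[m.queryIdx]           # Eq. 13
        shifted = P_i + T                               # Eq. 14
        # Inlier: a shifted point within 5 px of some candidate point
        d = np.linalg.norm(shifted[:, None, :] - P_n[None, :, :], axis=2)
        scored.append((int((d.min(axis=1) < 5).sum()), T))
    scored.sort(key=lambda s: -s[0])
    return [T for _, T in scored[:n_h]]
```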

We first compute the region representing the tomato in the i th image using a classical region growing algorithm, with seed calculated from the brightest pixels and growth limited to the region inside \(\text {Ell}_{f}^{i}\). We thus obtain the region \({\omega _{t}^{i}}\) representing the non-occluded part of the tomato in the previous image (Fig. 10). This region is used as a reference for computing the optimal translation.

Fig. 10 \({\omega _{t}^{i}}\): region representing the tomato in the i th image

We then apply the region growing algorithm in the (i+1)th image, considering every preselected translation (translation of seed and of the limiting area). We denote by \(\omega _{k}^{i+1}\) the binary image so obtained corresponding to a particular T k .

Let us define \(\nu_{k}\), the ratio of the two regions, as:

$$ \nu_{k}= \frac{\left| \omega_{k}^{i+1}\right|}{\left|\omega_{t}^{i}\right|} $$
(15)

where |A| represents the cardinality of set A. We retain the translation \(T_{f}\) (Fig. 11) which maximizes \(\nu_{k}\), among the translations that satisfy two preliminary conditions:

$$ \nu_{k} > 0.65 $$
(16)
$$ \frac{1}{\left| \omega_{k}^{i+1}\right|} \sum_{(x,y)\in\omega_{k}^{i+1}}\text{Im}^{i+1}(x,y) > 0.65\, \frac{1}{\left|\omega_{t}^{i}\right|} \sum_{(x,y)\in\omega_{t}^{i}} \text{Im}^{i}(x,y) $$
(17)

Fig. 11 Translated points

The first condition eliminates those cases where there is an inconsistency between the size of the region representing the tomatoes in the two consecutive images. The second condition ensures that the mean gray level measured in the (i+1)th image on an area which is supposed to be a non-occluded part of the tomato is consistent with the one found in the previous image.
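Putting Eqs. 15-17 together, a sketch of this final selection might look as follows; grow_region is a hypothetical helper encapsulating the translated region growing:

```python
import numpy as np

def select_translation(translations, grow_region, img_next,
                       omega_t, img_prev):
    """Select T_f (Eq. 18) among the preselected translations.

    grow_region(T) is an assumed helper: it runs the region growing in
    Im^{i+1} with the translated seed and limiting area and returns a
    boolean mask omega_k.
    """
    mean_gray_prev = img_prev[omega_t].mean()
    best = None
    for T in translations:
        omega_k = grow_region(T)
        nu_k = omega_k.sum() / omega_t.sum()                    # Eq. 15
        if nu_k <= 0.65:                                        # Eq. 16
            continue
        if img_next[omega_k].mean() <= 0.65 * mean_gray_prev:   # Eq. 17
            continue
        if best is None or nu_k > best[0]:
            best = (nu_k, T)
    # None signals that all preselected translations failed, in which case
    # all initial translations are tested, as described below
    return None if best is None else best[1]
```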

Due to the varying configurations of occlusion, it is possible that none of the \(n_{h}\) preselected translations, which maximize the number of inliers, represents the actual translation of the tomato. This scenario is generally detected by the above two conditions. In that case, all the \(n_{P}^{i}\) initial translations are tested. We denote by \(C_{t}^{i+1}\) the updated location of the tomato center in the (i+1)th image:

$$ C_{t}^{i+1} = C_{f}^{i} + T_{f} $$
(18)

4.3 Estimation of the tomato boundary

In order to reduce the region to be analyzed, a smaller image \(\text{ImS}^{i+1}\) is extracted from \(\text{Im}^{i+1}\), centered at \(C_{t}^{i+1}\). A contrast stretching transformation is then applied to \(\text{ImS}^{i+1}\).

The segmentation procedure is based on our active contour model with elliptic shape constraint (Section 3.3), which provides better robustness with respect to occlusions and lack of contrast. However, given the complexity of the scene, a good initialization is required; otherwise, the active contour will not converge towards the searched contours but will get trapped in another local minimum of the energy functional. Thus, the tomato detection procedure presented in Section 4.2 is a major improvement over our previous work [7, 8], as it increases the accuracy of the initialization (Section 4.4.1) and consequently the global performance (Section 4.4.2).

A brief overview of the initialization procedure is presented in Section 4.3.1. More details can be found in [7, 8]. Then, additional information about the active contour model is given in Section 4.3.2.

4.3.1 Initialization of the active contour model

We use both gradient and region information to determine the initial position of the active contour model.

First, the method described in Section 4.2.2 (Eqs. 11 and 12) is applied in order to determine a set of candidate points, based on gradient magnitude and direction. Since the position of the tomato center is detected more accurately than in our previous work, it has been possible to restrict the size of the region of interest, which now lies inside two concentric circles with radii \(0.9b^{i}_{f}\) and \(1.1a^{i}_{f}\) instead of \(0.5r^{i}_{f}\) and \(1.5r^{i}_{f}\) (Fig. 12).

Fig. 12 Initialization of the elliptic active contour model. (upper left) Selection of candidate contour points; (upper right) selection of N a ellipses through the RANSAC algorithm; (lower left) detection of the non-occluded parts of the tomato, \(\omega _{t}^{i+1}\), through the region growing algorithm; (lower right) initialization of the elliptic active contour model

Due to the presence of outliers, even in small numbers, a least square estimate of an ellipse from all of the candidate points would not accurately represent the tomato boundary. Consequently, a RANSAC estimate based on an elliptic model is used to determine the parameters of several candidate ellipses. In this step, only those ellipses which satisfy the temporal constraint formulated in Section 3.2 are considered. Thus, a total of \(N_{a}=20\) ellipses, \(\text{Ell}_{u}^{i+1}, u=1,\ldots,N_{a}\), are retained: the ellipses with the largest number of inliers and whose parameters are compatible with the ones of the tomato in the previous image (Fig. 12). Note that both spatial and temporal regularization have been used in this step, increasing the reliability of the segmentation procedure.

The third step of the initialization procedure aims at introducing region information, in order to select the best ellipse among the N a previously retained. For that, we apply a region growing algorithm, whose seed is formed from the brightest pixels inside the intersection of all the candidate ellipses. The result is a binary image \(\omega _{t}^{i+1}\) (Fig. 12), where the pixels set to 1 correspond to non-occluded parts of the tomato. Then, for every candidate ellipse \(\text {Ell}_{u}^{i+1}, u=1,\ldots,N_{a}\), we compare the region \(\omega _{t}^{i+1}\) with the region inside the considered ellipse, denoted by \(\omega _{u}^{i+1}\), by calculating the following index:

$$ \tau(u) = \frac{\left| \omega_{u}^{i+1}\cap \left(1-\omega_{t}^{i+1}\right)\right| +\left| \omega_{t}^{i+1} \cap \left(1-\omega_{u}^{i+1}\right) \right|}{\left|\omega_{t}^{i+1} \cap \omega_{u}^{i+1} \right|} $$
(19)

where |A| represents the cardinality of a set A. The ratio τ(u) measures the consistency between the segmentation obtained through the contour analysis \(\left (\omega _{u}^{i+1}\right)\) and the region analysis \(\left (\omega _{t}^{i+1}\right)\). It reaches a minimum (zero) when \(\omega _{u}^{i+1}\) and \(\omega _{t}^{i+1}\) match perfectly.

Let us denote by \(a_{u}^{i+1}\) and \(b_{u}^{i+1}\) the semi-axis lengths of the candidate ellipse \(\text{Ell}_{u}^{i+1}\). The final elliptic initialization \(\text{Ell}_{v}^{i+1}\) (Fig. 12) is selected among the ellipses for which a good match is achieved (Eq. 20), considering another regularization condition which imposes that the size and shape of the ellipse in \(\text{ImS}^{i+1}\) are close to the ones in \(\text{ImS}^{i}\) (Eq. 21):

$$ S_{\tau} = \left\{ u \in [1,N_{a}] \mid \tau(u)\leq 1.1 \min_{w\in [1,N_{a}]}\tau(w) \right\} $$
(20)
$$ v = \arg\min_{u\in S_{\tau}}\left(a^{i+1}_{u}-a^{i}\right)^{2}+\left(b^{i+1}_{u}-b^{i}\right)^{2} $$
(21)

Combining \(\omega _{v}^{i+1}\) and \(\omega _{t}^{i+1}\) also enables us to determine the region of potential occlusions.
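A compact sketch of this selection (Eqs. 19-21) is given below, with candidate ellipses represented by their parameters and inside masks (the data layout is our assumption):

```python
import numpy as np

def select_initial_ellipse(candidates, omega_t, a_prev, b_prev):
    """Select the initialization ellipse using Eqs. 19-21.

    candidates : list of (params, omega_u) pairs, where params is a dict
                 with semi-axes 'a' and 'b', and omega_u is the boolean
                 mask of the region inside the candidate ellipse
    omega_t    : boolean mask output by the region growing algorithm
    """
    def tau(omega_u):                                   # Eq. 19
        mismatch = (omega_u & ~omega_t).sum() + (omega_t & ~omega_u).sum()
        return mismatch / (omega_t & omega_u).sum()

    taus = [tau(om) for _, om in candidates]
    t_min = min(taus)
    # Eq. 20: ellipses whose mismatch index is within 10 % of the minimum
    S_tau = [u for u, t in enumerate(taus) if t <= 1.1 * t_min]
    # Eq. 21: among them, the one closest in size to the previous ellipse
    v = min(S_tau, key=lambda u: (candidates[u][0]['a'] - a_prev) ** 2
                               + (candidates[u][0]['b'] - b_prev) ** 2)
    return candidates[v][0]
```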

4.3.2 Applying the elliptic active contour model

The elliptic active contour model (Section 3.3) is then applied from the initialization \(\text{Ell}_{v}^{i+1}\). During the first \(n_{\text{start}}\) iterations, the parameter ψ is set to zero, so that z moves towards the most prominent contours. Then, the elliptic shape constraint is introduced for \(n_{\text{ellipse}}\) iterations (ψ≠0) in order to guarantee robustness with respect to occlusion. Finally, the shape constraint is relaxed (ψ=0) for a few \(n_{\text{end}}\) iterations, which allows reaching the boundary more accurately, as a tomato is not a perfect ellipse (Fig. 13).
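The paper fixes ψ=0.5 but not the phase lengths; the sketch below simply illustrates the three-phase schedule with assumed iteration counts:

```python
def psi_schedule(n_start=20, n_ellipse=100, n_end=10, psi=0.5):
    """Three-phase weighting of the shape constraint.

    The iteration counts are assumptions; the paper only fixes psi = 0.5.
    """
    for it in range(n_start + n_ellipse + n_end):
        if it < n_start:
            yield 0.0   # free evolution towards prominent contours
        elif it < n_start + n_ellipse:
            yield psi   # elliptic shape constraint active
        else:
            yield 0.0   # constraint relaxed to reach the true boundary
```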

Fig. 13 Tomato segmentation. (upper left) Result provided by the elliptic active contour model; (upper right, lower left) two sets of contour points extracted from (upper left); (lower right) four elliptic approximations

During the \(n_{\text{ellipse}}\) iterations, the reference ellipse is regularly and automatically updated from the current curve z. Again, a least square estimate calculated from all the points of the curve z is not relevant because some of them may lie on false contours (e.g., leaves). So a procedure similar to the one described in Section 4.2.1 is applied in order to select a subset of points that are very likely to lie on the boundary of the tomato. From these points, the parameters of the reference ellipse are optimized in a root mean square error sense and thus automatically updated every 10 iterations. Note that the lengths of the major and minor axes are estimated only once, at the beginning of the process, as the initial values are supposed to be very close to the actual values (temporal regularization), contrary to the other parameters of the ellipse, which are much more unstable due to the global movement of the tomato.

It is also worth noting that the image forces are not considered in the regions of occlusion, in every step of this process.

4.3.3 Final elliptic approximation

Finally, four elliptic estimates (Fig. 13) of the tomato boundary are determined: points that are likely to be on the actual boundary are extracted based on several selection criteria [7, 8]; then, the RANSAC algorithm or a least square estimation [14] is applied to get the four elliptic approximations from the sets of selected points. In general, the four ellipses are almost the same in the case of little occlusion, while they may differ more significantly in the case of higher occlusion. The operator only has to select the best estimate, if he considers it correct, or to manually define a better elliptic approximation otherwise. The latter case is however very rare and arises when the tomato is highly occluded (see the experimental results presented in Section 4.4).

4.4 Evaluation of the segmentation procedure

We first compare the proposed descriptor-based method to update the position of the tomato (Section 4.4.1) with our previous work. We then evaluate the segmentation procedure by comparing the obtained segmentations with manual segmentations (Section 4.4.2).

The proposed segmentation procedure was evaluated on images acquired during three agriculture seasons (April–August, 2011–2013). Although the tomato variety was the same, differences in vegetation were observed due to external climatic conditions. We identified 21 tomatoes for our study, covering different sites and different seasons, thus ensuring a good representation of the variability. Analyzing only one pair of images per day for each tomato, we therefore created 21 pairs of image sequences.

Not all flowers develop into tomatoes at the same time. Besides, some tomatoes may be totally hidden by other tomatoes or leaves. As a result, the total number of days a particular tomato can be observed is not identical for all the 21 tomatoes. Flowers that mature into tomatoes early and are not hidden by leaves or other tomatoes can be observed for the maximum number of days, thus creating the maximum number of images in the corresponding tomato sequence.

It is difficult to evaluate the segmentation procedure on the entire image dataset, given the variable degrees of occlusion. Therefore, the influence of the amount of occlusion on the final radius estimate was studied, and the image dataset was then divided into three categories [7, 8]:

$$ \text{category} = \left\{\begin{array}{ll} 1, & \text{if the amount of occlusion is less than 30\,\%} \\ 2, & \text{if the amount of occlusion is between 30 and 50\,\%} \\ 3, & \text{if the amount of occlusion is more than 50\,\%} \end{array}\right. $$
(22)

Our evaluation is based on the comparison of the automatic segmentation results with elliptic approximations of manual segmentations. We denote the manual segmentation of the i th image of a given sequence by \(\text{Ell}_{s}^{i}=\left[xc_{s}^{i}, yc_{s}^{i}, a_{s}^{i}, b_{s}^{i}, \varphi_{s}^{i}\right]\), where \(C_{s}^{i}=\left[xc_{s}^{i}, yc_{s}^{i}\right]\) represents the center of the ellipse whose semi-major and semi-minor axes are \(a_{s}^{i}\) and \(b_{s}^{i}\), respectively, and which has a rotation angle of \(\varphi_{s}^{i}\). Results are presented for category 1 and category 2 separately. Note that only images acquired from the left camera are presented in this section. Indeed, the images acquired using the left and the right cameras exhibit similar characteristics overall, even if different percentages of occlusion can be observed for some pairs.

4.4.1 Tomato localization

In this section, we compare the proposed descriptor-based approach to update the position of the tomato with the pattern matching approach presented in our previous work [7, 8]. For this, we measure the distances between the centers of the tomato estimated by the two approaches and the actual center of the tomato given by the manual segmentation.

Let us define for any image i two distance measures \(D_{\text {pm}}^{i}\) and \(D_{\text {desc}}^{i}\):

$$ D_{\text{pm}}^{i} = d\left(C_{m}^{i},C_{s}^{i}\right) $$
(23)
$$ D_{\text{desc}}^{i} = d\left(C_{t}^{i},C_{s}^{i}\right) $$
(24)

\(D_{\text {pm}}^{i}\) and \(D_{\text {desc}}^{i}\) represent the error on the estimation of the tomato center for the pattern matching method \(\left ({C_{m}^{i}}\right)\) and the descriptor-based method \(\left ({C_{t}^{i}}\right)\), respectively. Tables 1 and 2 show the percentage of images in a given sequence for which the distance measure \(D_{\text {pm}}^{i}\) or \(D_{\text {desc}}^{i}\) is less than 10 pixels, \(\frac {{r_{s}^{i}}}{4}\) and \(\frac {{r_{s}^{i}}}{2}\), where \({r_{s}^{i}}=\frac {{a_{s}^{i}}+{b_{s}^{i}}}{2}\).

Table 1 Percentages of images in category 1 where the distance measure \(D_{\text{pm}}^{i}\) or \(D_{\text{desc}}^{i}\) is less than a given threshold (10 pixels, \(\frac{r_{s}^{i}}{2}\), or \(\frac{r_{s}^{i}}{4}\)). Also shown is the total number of images (\(N_{1}\)) for each sequence in category 1
Table 2 Percentages of images in category 2 where the distance measure \(D_{\text{pm}}^{i}\) or \(D_{\text{desc}}^{i}\) is less than a given threshold (10 pixels, \(\frac{r_{s}^{i}}{2}\), or \(\frac{r_{s}^{i}}{4}\)). Also shown is the total number of images (\(N_{2}\)) of category 2 in each sequence

For the images of category 1, the position of the tomato was precisely detected in 97 % of the images using the descriptor-based approach (\(D_{\text {desc}}^{i} <10 \) pixels) as compared to 65 % in case of pattern matching. This demonstrates a significant improvement regarding the accuracy of the tomato localization. Moreover, almost all tomatoes (98 %) are correctly detected \(\left (D_{\text {desc}}^{i}<\frac {{r_{s}^{i}}}{2}\right)\) based on this method with a significant improvement (+3.6 %) as compared with the pattern matching approach.

The images of category 2 contain a significant amount of occlusion, with more than 30 % of the elliptical contour hidden. Therefore, the pattern matching approach fails to find the position of the tomato \(\left (D_{\text {pm}}^{i}>\frac {{r_{s}^{i}}}{2}\right)\) in 9 % of cases. However, the descriptor-based approach correctly detects the position of the tomato for 96.5 % of the images with a significant improvement (+5.1 %). For instance, in sequence 13, the position of the tomato was correctly detected \(\left (D_{\text {pm}}^{i}<\frac {{r_{s}^{i}}}{2}\right)\) in only 76 % of the images by the pattern matching approach, against 100 % with the new one. Moreover, as in the case of images of category 1, a significant improvement regarding the accuracy of the estimated position of the tomato is observed. Using the descriptor-based approach, the position is accurately detected for 85 % of the images (\(D_{\text {desc}}^{i} < 10 \) pixels), compared to 59 % with pattern matching.

The pattern matching approach [7, 8] is based on the detection of the non-occluded region. In case of partial occlusion, the maximum of correlation may be rather far from the actual center of the tomato. Moreover, it cannot provide an accurate estimation when several tomatoes overlap. The descriptor-based approach overcomes these difficulties since it relies on both region and contour information. The movement of the tomato can be accurately estimated by matching feature vectors calculated at contour points. Overall, the descriptor-based approach is more robust to occlusion and provides a more accurate estimation of the tomato center. This benefits the complete segmentation procedure, as the detection of the candidate contour points becomes much more reliable, leading to a better initialization of the elliptic active contour model.

4.4.2 Tomato segmentation

In order to evaluate the segmentation procedure, the obtained segmentation A (one of the four final estimates \(\text {Ell}_{f1}^{i}\), \(\text {Ell}_{f2}^{i}\), \(\text {Ell}_{f3}^{i}\), or \(\text {Ell}_{f4}^{i}\)) is compared with the manual segmentation \(\text {Ell}_{s}^{i}\) by computing the average \(D^{i}_{\text {mean}}\) and the maximum \(D^{i}_{\text {max}}\) distance between A and \(\text {Ell}_{s}^{i}\) for every tomato image i. These distance measures, expressed in pixels, are normalized with respect to the size of the tomato \(\left ({r_{s}^{i}}\right)\) in order to better interpret the results. The normalized distance measures are defined as:

$$ D^{i}_{\text{mean}R}= \frac{D^{i}_{\text{mean}}}{r^{i}_{s}} \times 100 $$
(25)
$$ D^{i}_{\text{max}R} = \frac{D^{i}_{\text{max}}}{r^{i}_{s}} \times 100 $$
(26)

Tables 3 and 4 present the mean and the standard deviation of \(D^{i}_{\text {mean}R}\) and \(D^{i}_{\text {max}R}\) for the images of category 1 and category 2, respectively. These tables do not consider the images with incorrect estimation of the tomato position (i.e., \(D_{\text {desc}}^{i}>\frac {{r^{i}_{s}}}{2}\)). Moreover, two series of results are presented: the statistics computed for \(\text {Ell}_{f4}^{i}\), which would correspond to a fully automatic process, and the statistics computed with the best estimate \(\text {Ell}_{\text {opt}}^{i}\) (i.e. minimizing \(D_{\text {mean}R}^{i}\)), which, in practice, would be selected by an operator among the four possibilities.

Table 3 Distribution of \(D_{\text {mean}R}^{i}\) and \(D_{\text {max}R}^{i}\) computed assuming \(\text {Ell}_{f4}^{i}\) or \(\text {Ell}_{\text {opt}}^{i}\) as the final segmentation for the images of category 1. Also shown are the total number of images (N 1) of category 1 for each sequence and the number of images N c where the position of the tomato is correctly estimated \(\left (D_{\text {desc}}^{i}<\frac {{r_{s}^{i}}}{2}\right)\). Only four images were discarded using this criterion
Table 4 Distribution of \(D_{\text {mean}R}^{i}\) and \(D_{\text {max}R}^{i}\) computed assuming \(\text {Ell}_{f4}^{i}\) and \(\text {Ell}_{\textit {opt}}^{i}\) as the final segmentation for the images of category 2. Also shown are the total number of images (N 2) of category 2 for each sequence and the number of images N c where the position of the tomato is correctly estimated \(\left (D_{\text {desc}}^{i}<\frac {{r_{s}^{i}}}{2}\right)\). Only five images were discarded using this criterion

For the images of category 1, very good results (Fig. 14) were obtained, with \(\mu_{D_{\text{mean}R}}\) less than 10 % for all the sequences. The mean error averaged over all images is less than 5 % (even for \(\text{Ell}_{f4}\), without manual selection), which fits the requirements of the final user. Moreover, a low \(\sigma_{D_{\text{mean}R}}\) demonstrates the robustness of our method. Even when the occlusion degree is less than 30 %, many images present a blurred contour or low contrast, due to shadowing effects or overlap with nearby tomatoes. However, a good segmentation is obtained even in these challenging cases (Fig. 14 b, c). For the images acquired in the agriculture season 2013 (Sequences 12–21), the size of the tomatoes is smaller compared to the seasons 2011 and 2012. This accounts for slightly higher relative distance measures for these sequences (Fig. 14 d).

Fig. 14 Original image (left) and final segmentation (right) shown in red, obtained on images of category 1. The contour in cyan represents the manual segmentation. The distance measures (D meanR , D maxR ) are (top to bottom) (1.06 %, 2.98 %), (2.12 %, 4.79 %), (1.12 %, 3.25 %), (5.08 %, 13.38 %). Note that even in the presence of occlusion (a) or smoothed contours (b, c), a good segmentation is obtained. Also notice the small size of the tomato, which results in a higher relative distance measure (d)

For the images of category 2, which contain a significant amount of occlusion (30 to 50 %), good results are observed, with \(\mu_{D_{\text{mean}R}}\) less than 10 % in the case of \(\text{Ell}_{f4}\) for almost all sequences (except sequence 10). When considering the optimal ellipse \(\text{Ell}_{\text{opt}}\), \(\mu_{D_{\text{mean}R}}\) is less than 10 % for all the sequences and less than 5 % for 60 % of them. The results have been significantly improved compared to our previous work, thanks to the more accurate detection of the position of the tomato. Figure 15 shows some examples where a good segmentation is obtained even in the presence of noise, blurred contours, and/or low resolution, in addition to severe occlusion.

Fig. 15 Original image (left) and final segmentation (right) shown in red, obtained on four (a-d) images of category 2. The contour in cyan represents the manual segmentation. The distance measures (D meanR , D maxR ) are (top to bottom) (1.25 %, 2.74 %), (5.50 %, 17.71 %), (2.08 %, 5.89 %), (1.20 %, 3.74 %)

The significant segmentation errors occur when neighboring leaves and/or branches produce strong gradients near the desired contour, which mislead the movement of the active contour. This effect is prominent in sequence 13, which results in higher distance measures for this sequence (Fig. 16 a). Other typical cases of errors result from blurring due to tomato overlap combined with partial occlusion (Fig. 16 b).

Fig. 16 Original image (left) and final segmentation (right) shown in red, obtained on two (a-b) images of sequence 13. The contour in cyan represents the manual segmentation. The distance measures (D meanR , D maxR ) are (top to bottom) (10.39 %, 37.76 %), (19.87 %, 50.23 %)

Finally, it is worth noting that it may be very difficult to determine the contour position at some places even manually, because of blurring, shadowing (Fig. 17 a, b) or the presence of leaves near the “head” of the tomato at the beginning of the maturation (Fig. 17 c). This imprecision can also explain middling results in some cases.

Fig. 17 Original image (left) and final segmentation (right) shown in red for three images (a-c). The contour in cyan represents the manual segmentation. The distance measures (D meanR , D maxR ) are (top to bottom) (6.54 %, 21.97 %), (7.92 %, 14.46 %), (15.95 %, 37.54 %)

Table 5 compares the statistics obtained for the two methods, the one presented in [7, 8] \(\left(\mu_{\text{mean}R}^{o}, \sigma_{D_{\text{mean}R}}^{o}\right)\) and the new one proposed in this paper \(\left(\mu_{\text{mean}R}, \sigma_{D_{\text{mean}R}}\right)\). For each method, the only images considered in the table are the ones where the position of the tomato is correctly detected (\(D_{\text{pm}}^{i}<\frac{r_{s}^{i}}{2}\) and \(D_{\text{desc}}^{i}<\frac{r_{s}^{i}}{2}\), respectively), since the segmentation algorithm never provides correct results otherwise. In this way, more images \(\left(N_{c}>N_{c}^{o}\right)\) can be processed with the descriptor-based approach, which shows better robustness with respect to occlusion (+8 % for images of category 2). The descriptor-based approach provides a significant benefit when the image quality is poor or the tomato is highly occluded (e.g., sequences 3, 4, 10, 11, 14, 15, 17, 20, 21). Similar results are obtained otherwise. This detailed study demonstrates that the update of the tomato position is a crucial step which conditions the quality of the final segmentation.

Table 5 Comparing the mean and standard deviation of \(D_{\text{mean}R}\) for the two methods in the images of categories 1 and 2 (\(\text{Ell}_{\text{opt}}\)) where the position of the tomato is correctly estimated

5 Size estimation

In this section, we wish to estimate the size of the tomato approximated by a sphere in the 3D space. In order to calculate the parameters of the sphere, it is assumed that the camera projection matrices, P and Q for the left and right cameras respectively, as well as the parameters of the apparent contours (ellipse) in the two images, have been calculated.

The camera parameters are determined once, at the beginning of the season, by observing a calibration pattern at different positions and orientations in the scene, as described in [9]. Note that the acquisition system is firmly fixed to the ground, so that the calculated projection matrices P and Q are valid throughout the agricultural season. The parameters of the apparent contours of the tomatoes in the left and right images are provided by the segmentation procedure, after validation by an operator (Section 4.3.3).

In the following, we use bold letters to denote vectors (\(\mathbf{X}\)) and italic letters to denote scalars (\(X\)). Quantities in the 3D space are denoted by upper-case letters (\(\mathbf{X}\), \(X\)) while image quantities are denoted by lower-case letters (\(\mathbf{x}\), \(x\)). Finally, points lying on the ellipses obtained from the segmentation algorithm are denoted by \(\mathbf{x}_{o,l}\) and \(\mathbf{x}_{o,r}\), \(o=1,\ldots,N\), for the left and right images respectively.

From the segmentation in the two images, we first recover the ellipse centers, \(\mathbf{x}_{l}\) in the left image and \(\mathbf{x}_{r}\) in the right image. It is assumed that these are the image points of the sphere center \(\mathbf{X}_{ct}\), which is then determined based on a triangulation procedure [9]. Using the property presented in Section 3.1, a set of 3D space points lying on the contour generator is determined from the points on the elliptic contour, for each image (Section 5.1). From the two sets of 3D space points, corresponding to the left and right images, two values of the sphere radius are computed based on a least square minimization approach (Section 5.2). Finally, a joint functional is minimized in order to obtain the final estimate of the sphere radius (Section 5.3).
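The triangulation procedure of [9] is not detailed here; as an assumption, a standard linear (DLT) triangulation of the sphere center from the two ellipse centers could be sketched as follows:

```python
import numpy as np

def triangulate_center(P, Q, x_l, x_r):
    """Linear (DLT) triangulation of the sphere center X_ct from the two
    ellipse centers x_l = (x, y) and x_r, given the 3x4 projection
    matrices P and Q."""
    A = np.vstack([
        x_l[0] * P[2] - P[0],
        x_l[1] * P[2] - P[1],
        x_r[0] * Q[2] - Q[0],
        x_r[1] * Q[2] - Q[1],
    ])
    # The solution is the right singular vector with smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]      # inhomogeneous 3D coordinates
```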

5.1 3D space points lying on the contour generator

Let us first consider the left image and the image points \(\mathbf{x}_{o,l}\), \(o=1,\ldots,N\). The 3D space points lying on the contour generator are computed as the intersection of the rays back-projected from the image points and the plane \(\Pi_{l}\), which contains the sphere center \(\mathbf{X}_{ct}\) and is orthogonal to the line joining the camera center \(\mathbf{C}_{l}\) (deduced from the projection matrix P) and \(\mathbf{X}_{ct}\). Its equation is given by:

$$ AX+BY+CZ+D = 0, $$
(27)

where

$$ \mathbf{N}=[A,B,C]^{T}=\frac{\mathbf{X}_{ct}-\mathbf{C}_{l}}{\Vert \mathbf{X}_{ct}-\mathbf{C}_{l} \Vert}, \quad D = -AX_{ct}-BY_{ct}-CZ_{ct}. $$
(28)

The 3D space point \(\mathbf{X}_{o,l}=[X_{o,l},Y_{o,l},Z_{o,l}]^{T}\) is projected to its corresponding image point \(\mathbf{x}_{o,l}=[x_{o,l},y_{o,l}]^{T}\) via the projection matrix P as:

$$ \beta \left[ \begin{array}{ccc} x_{o,l} & y_{o,l} & 1 \end{array} \right]^{T} = \mathbf{P}\left[ \begin{array}{cccc} X_{o,l} & Y_{o,l} & Z_{o,l} & 1 \end{array} \right]^{T} $$
(29)

where β is an unknown non-zero scalar factor. By expanding the above equation and eliminating the unknown β, two equations are obtained in the unknowns \(X_{o,l}\), \(Y_{o,l}\), \(Z_{o,l}\). Moreover, it is assumed that \(\mathbf{X}_{o,l}\) lies on the plane \(\Pi_{l}\) (Eq. 27). This gives a set of three equations which may be written in the form:

$$ \mathbf{A}_{p}\mathbf{X}_{o,l} = \mathbf{B}_{p} $$
(30)

where,

$$ \mathbf{A}_{p} = \left[ \begin{array}{ccc} x_{o,l}P_{31}-P_{11} & x_{o,l}P_{32}-P_{12} & x_{o,l}P_{33}-P_{13}\\ y_{o,l}P_{31}-P_{21} & y_{o,l}P_{32}-P_{22} & y_{o,l}P_{33}-P_{23} \\ A & B & C \end{array} \right] $$
(31)
$$ \mathbf{B}_{p}= \left[ \begin{array}{c} P_{14}-P_{34}x_{o,l} \\ P_{24}-P_{34}y_{o,l} \\ -D \end{array} \right] $$
(32)

Assuming A p to be non-singular, the solution of this equation is given by:

$$ \mathbf{X}_{o,l}= \mathbf{A}_{p}^{-1}\mathbf{B}_{p} $$
(33)

The same process is performed for the right image. Consequently, two sets of 3D space points \(\mathbf{X}_{o,l}\) and \(\mathbf{X}_{o,r}\) are computed from their respective image points \(\mathbf{x}_{o,l}\) and \(\mathbf{x}_{o,r}\) in the left and right images.
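A direct transcription of Eqs. 27-33 into numpy might look as follows (a sketch, not the authors' code):

```python
import numpy as np

def backproject_to_plane(P, C, X_ct, pts):
    """3D contour-generator points (Eqs. 27-33): intersect each back-projected
    ray with the plane through X_ct orthogonal to the line (C, X_ct)."""
    N = (X_ct - C) / np.linalg.norm(X_ct - C)           # normal of Eq. 28
    D = -N @ X_ct
    out = []
    for x, y in pts:
        A_p = np.array([
            [x*P[2,0]-P[0,0], x*P[2,1]-P[0,1], x*P[2,2]-P[0,2]],
            [y*P[2,0]-P[1,0], y*P[2,1]-P[1,1], y*P[2,2]-P[1,2]],
            N,
        ])                                              # Eq. 31
        B_p = np.array([P[0,3] - P[2,3]*x,
                        P[1,3] - P[2,3]*y,
                        -D])                            # Eq. 32
        out.append(np.linalg.solve(A_p, B_p))           # Eq. 33
    return np.asarray(out)
```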

5.2 Least square estimation of the circle

This estimation is performed on the left and right images independently. Let us consider the left image. Due to measurement errors, the 3D space points \(\mathbf{X}_{o,l}\), \(o=1,\ldots,N\), might not exactly lie on a perfect circle in the plane \(\Pi_{l}\). Hence, we search for a least square estimate of the circle.

In order to simplify the calculation, every 3D space point \(\mathbf{X}_{o,l}\) is transformed to a new coordinate system (X′, Y′, Z′) linked to the plane \(\Pi_{l}\), with the X′, Y′ axes lying on the plane and the Z′-axis orthogonal to the plane. The least square estimation of the circle enables us to get a first estimate of the sphere radius, \(R_{l}\). We get the second estimate \(R_{r}\) in the same way.
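The paper does not specify the circle fitting method; a common choice is the algebraic (Kåsa) fit, sketched below under that assumption:

```python
import numpy as np

def fit_circle(xy):
    """Least-squares (Kasa) circle fit to points expressed in the plane
    coordinate system (X', Y'); returns the center and radius."""
    x, y = xy[:, 0], xy[:, 1]
    # Solve x^2 + y^2 + c0 x + c1 y + c2 = 0 in the least-squares sense
    A = np.column_stack([x, y, np.ones_like(x)])
    b = -(x**2 + y**2)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    cx, cy = -c[0] / 2, -c[1] / 2
    R = np.sqrt(cx**2 + cy**2 - c[2])
    return (cx, cy), R
```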

5.3 Joint optimization

Let us denote by \([X'_{ct,l},Y'_{ct,l}]\) (respectively \([X'_{ct,r},Y'_{ct,r}]\)) the center of the sphere for the left image (respectively the right image) in the new coordinate system linked to \(\Pi_{l}\) (respectively \(\Pi_{r}\)). From an initial radius estimate defined as the mean of \(R_{l}\) and \(R_{r}\) (i.e., \(R=\frac{R_{l}+R_{r}}{2}\)), the following function:

$$ F(R) = \sum_{o=1}^{N} \left[\left(X'_{o,l}-X'_{ct,l}\right)^{2}+\left(Y'_{o,l}-Y'_{ct,l}\right)^{2}- R^{2}\right]^{2} +\left[\left(X'_{o,r}-X'_{ct,r}\right)^{2}+\left(Y'_{o,r}-Y'_{ct,r}\right)^{2}- R^{2}\right]^{2} $$
(34)

is minimized using the Gauss-Newton method, in order to determine the final estimate of the sphere radius, denoted by \(R_{\text{est}}\).
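Since R is the only unknown, the Gauss-Newton iteration reduces to a scalar update; the following is a sketch under that observation (the iteration count is an assumption):

```python
import numpy as np

def refine_radius(d2_l, d2_r, R0, n_iter=20):
    """Gauss-Newton minimization of Eq. 34 over the single unknown R.

    d2_l, d2_r : squared in-plane distances of the 3D contour points to the
                 fitted circle centers, for the left and right images
    R0         : initial estimate, (R_l + R_r) / 2
    """
    d2 = np.concatenate([d2_l, d2_r])
    R = R0
    for _ in range(n_iter):
        res = d2 - R**2                 # residuals of Eq. 34
        J = np.full_like(d2, -2 * R)    # d(res)/dR
        R = R - (J @ res) / (J @ J)     # scalar normal-equation step
    return R
```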

5.4 Evaluation: Size estimation

Since a tomato is not a perfect sphere, two reference distances \(D_{1}\), \(D_{2}\), which approximate the size of the tomato, were measured manually (Fig. 18). These reference distances were compared with the estimated radius \(R_{\text{est}}\), and relative error percentages \(PE_{D_{1}}\), \(PE_{D_{2}}\) were computed with respect to \(D_{1}\) and \(D_{2}\):

$$ PE_{D_{1}} = 100\left| \frac{D_{1}-R_{\text{est}}}{D_{1}} \right| $$
(35)
$$ PE_{D_{2}} = 100 \left|\frac{D_{2}-R_{\text{est}}}{D_{2}} \right| $$
(36)

Fig. 18 Reference distances \(D_{1}\), \(D_{2}\)

In order to evaluate the proposed method, the radii of ten tomatoes (\(T_{a}=1,\ldots,10\)) were measured from images acquired in the laboratory under ideal conditions (correct illumination, no occlusion). Images of these tomatoes were acquired at two positions (Pos A and B) and two heights (\(H=10\) and 30 cm) from the ground, using the same acquisition system as in the open field. The relative position between the camera and the tomatoes was also identical, providing images of similar resolution. At each position, the tomatoes were observed at three different orientations (Orn = 1, 2, 3), not necessarily identical for all positions. The radii of these tomatoes were computed as described above using a manual segmentation, as we wish to focus on the evaluation of the second part of the system (estimation of the tomato size). Table 6 shows \({PE}_{D_{1}}\) for the different positions and orientations. This error is always less than 10 % and most (91 %) of the values are less than 5 %, which demonstrates the robustness of the radius estimation. Moreover, the estimated radius \(R_{\text{est}}\) is closer to the reference distance \(D_{1}\) than to \(D_{2}\). This is expected, since a sphere with radius \(D_{1}\) covers the entire tomato; hence it is logical that \(R_{\text{est}}\) is closer to \(D_{1}\).

Table 6 Percentage error \({PE}_{D_{1}}\) for the tomatoes \(T_{a}\)

Using the estimated radius \(R_{\text{est}}\), an estimate of the volume, \(V_{R_{\text{est}}}\), is computed using the spherical hypothesis \(\left(V_{R_{\text{est}}}=\frac{4}{3}\pi R_{\text{est}}^{3}\right)\). However, since a tomato is not a perfect sphere, we determined a correction factor \(\alpha_{cc}\) that can be applied to the radius in order to get a measure closer to the actual volume.

Let us denote by \(V_{D_{1}}^{\text{correc}}\) the volume estimated with the corrected radius \(R=\alpha_{cc}D_{1}\) (i.e., \(V_{D_{1}}^{\text{correc}}=\frac{4}{3}\pi(\alpha_{cc}D_{1})^{3}\)). The value of \(\alpha_{cc}\) was determined experimentally so as to minimize the relative difference \(\frac{\left| V_{\text{actual}}-V_{D_{1}}^{\text{correc}} \right|}{V_{\text{actual}}}\) for four tomatoes (\(T_{a}=1,2,3,4\)) studied in all positions, where \(V_{\text{actual}}\) is the actual volume. We found \(\alpha_{cc}=0.95\). The relative error percentage \({PE}_{V_{\text{actual}}}^{\text{correc}}\) between the corrected estimated volume \(V_{R_{\text{est}}}^{\text{correc}}=\frac{4}{3}\pi(\alpha_{cc}R_{\text{est}})^{3}\) and the actual volume \(V_{\text{actual}}\) was then studied for the other tomatoes at different positions and orientations. The error percentage is less than 15 % in 87 % of the cases. This experiment suggests that the measurements made under the spherical hypothesis can be corrected to take into account the specific shape of the tomato variety cultivated in the field, and it validates the volume estimation using the spherical hypothesis. However, the proposed correction model is very basic and was parametrized on a small set of tomatoes. Further studies need to be conducted on a larger dataset to increase the robustness of the volume estimation.
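As a small worked sketch of this correction, the corrected volume and the associated error percentage can be computed as follows (the numerical values are purely illustrative, not measurements from our dataset):

import numpy as np

ALPHA_CC = 0.95   # empirical correction factor from Section 5.4

def corrected_volume(R_est, alpha_cc=ALPHA_CC):
    # Volume under the spherical hypothesis, with the radius corrected to
    # account for the non-spherical shape of the tomato.
    return (4.0 / 3.0) * np.pi * (alpha_cc * R_est) ** 3

# Hypothetical example: relative error percentage against a measured volume
R_est, V_actual = 3.2, 130.0              # cm, cm^3 (illustrative values)
PE = 100.0 * abs(V_actual - corrected_volume(R_est)) / V_actual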

6 Result: entire system

The method proposed in Section 5 was used to measure the size of 10 tomatoes cultivated in the open field during the agriculture season 2013 (Sequences 12–21). It is assumed that Ellopt is the final segmentation selected by the operator among the four possibilities, in both images (no manual correction). Table 7 presents the estimated radius \(R_{\text{est}}^{f}\) along with the error percentages \({PE}_{D_{1}}, {PE}_{D_{2}}\) with respect to the two reference distances \(D_{1}, D_{2}\) (Eqs. 35 and 36).

Table 7 The estimated radius \(R_{\text{est}}^{f}\) for 10 tomatoes is compared with the reference distances \(D_{1}\) and \(D_{2}\). The distances are expressed in centimeters

For 8 tomatoes out of 10, the estimated radius actually lies in the interval \([D_{2}, D_{1}]\) and is generally closer to \(D_{1}\) than to \(D_{2}\), as discussed earlier. Moreover, for all sequences except 19, the error percentage \({PE}_{D_{1}}\) is less than 10 %, which demonstrates the robustness of our method and meets the requirements of the project. For sequence 19, the position of the tomato is not correctly detected due to the presence of identical neighboring tomatoes (Fig. 19). As a result, an incorrect radius is estimated for this tomato.

Fig. 19 For sequence 19, the position of the tomato is not correctly detected. The manual segmentation is shown in red while Ellopt is shown in green

Note that the accuracy of the final radius estimate depends on the reliable estimation of several parameters at the different steps of the method (segmentation, camera parameters, etc.). An imprecision in any of these parameters would result in an inaccurate radius estimate. The influence of an imprecision in the segmentation on the estimated radius was also studied theoretically using Eq. 1. For a 1-pixel error in the length of the major axis (respectively minor axis) of the ellipse, the corresponding relative percentage error in the radius was found to be between 0.5 and 3 % (respectively between 0.6 and 3 %), depending on the position of the object in the scene, which is acceptable for the considered application.

7 Conclusions

This paper presents a complete system to monitor the growth of tomatoes from images captured in open fields. One of the major challenges is occlusion. Moreover, poor illumination and the presence of neighboring tomatoes may cause the tomato contours to be smoothed, resulting in imprecision in their actual position. To overcome these challenges, we proposed to model the tomato as a sphere in the 3D space. This enables us to introduce a priori shape information in the segmentation procedure, which increases the robustness with respect to occlusion and lack of contrast. Besides, the spherical hypothesis allows us to simplify the size estimation procedure.

The segmentation method presented in this paper is an extension of our previous work [7, 8] based on active contours. In this paper, we propose to estimate the movement of the tomato between two consecutive images by comparing SIFT descriptors computed at points of the contour. This leads to a more accurate estimate of the position of the tomato than with the pattern matching approach presented in [7, 8]. The improvement is most prominent in images with significant occlusion (between 30 and 50 %) and poor contrast. For instance, the descriptor-based approach correctly detects the position in 96.5 % of the images where occlusion is between 30 and 50 %, an improvement of 5 % over the previous approach. Moreover, a high accuracy is reached for 85 % of these images, against 59 % with the previous approach. The precision is also significantly improved for weakly occluded images (97 % against 65 %). This is very important since a more accurate estimation of the tomato position results in a more reliable estimation of the candidate contour points, which in turn leads to a better initialization of the elliptic active contour model. The entire segmentation procedure thus benefits from this new algorithm. The average error (expressed as a percentage of the tomato size) is now around 4 %, even for tomatoes with a degree of occlusion as high as 50 %.

We also presented a method to estimate the size of the tomato from the obtained segmentation. This method was first tested under ideal acquisition conditions and using manual segmentation. In this case, the percentage error between the actual radius and the estimated size was always less than 10 %, with most (91 %) of the errors less than 5 %, which demonstrates the robustness of the radius estimation. The complete system was also applied to estimate the size of tomatoes cultivated in open fields during the agriculture season 2013. The percentage error was less than 10 % in most of the cases, despite the poor quality of the images acquired during this season (small, pixelated images).

The segmentation procedure, based on shape information in each image separately, can be extended to include the information in both images in order to propose a joint energy minimization scheme. For instance, suppose that a 3D space point \(\mathbf{X}\) situated on the tomato is projected onto \(\mathbf{x}_{L}\) in the left image and \(\mathbf{x}_{R}\) in the right image. Then, the evolution of the contour in the two images can be controlled by using the epipolar constraint \(\left(\mathbf{x}_{L}^{T}F\mathbf{x}_{R}=0\right)\) in a joint energy minimization functional, where F is the fundamental matrix computed from the two camera matrices. This approach would increase the robustness of the segmentation procedure with respect to occlusion, particularly in image pairs where the percentage of occlusion is not identical.
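As a sketch of this suggested extension (not part of the evaluated system), the fundamental matrix can be assembled from the two camera matrices using the standard construction described in [10]; the function names and the squared-residual penalty form are our own assumptions:

import numpy as np

def fundamental_matrix(P_l, P_r):
    # Fundamental matrix F such that x_L^T F x_R = 0, built from the two
    # camera matrices as F = [e_L]_x P_l P_r^+ (see [10]).
    _, _, Vt = np.linalg.svd(P_r)
    C_r = Vt[-1]                          # right camera center (null space of P_r)
    e_l = P_l @ C_r                       # epipole in the left image
    e_x = np.array([[0.0, -e_l[2], e_l[1]],
                    [e_l[2], 0.0, -e_l[0]],
                    [-e_l[1], e_l[0], 0.0]])
    return e_x @ P_l @ np.linalg.pinv(P_r)

def epipolar_penalty(F, x_l, x_r):
    # Squared residual of the epipolar constraint for homogeneous image
    # points; this could serve as an additional term in a joint energy.
    return float(x_l @ F @ x_r) ** 2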

One possible approach to improve the robustness of the yield estimation would be to automatically detect the number of tomatoes present in an image without necessarily performing the segmentation procedure. This could be done toward the end of the season, when most tomatoes are red. By exploiting the color information, the density of tomatoes could be determined and combined with the size estimation performed on a subset of tomatoes acquired at a higher image resolution. This strategy would result in a more accurate estimate of the yield before the harvest. Moreover, the correction factor \(\alpha_{cc}\) involved in the volume computation was estimated using a small set of tomatoes (Section 5.4). Further studies need to be conducted to develop a more accurate volume estimation model. The first experimental results obtained during the agricultural season 2013 were very encouraging. However, we plan to conduct larger experiments in open fields to assess the robustness and the accuracy of the entire system.

In the future, we wish to integrate the proposed algorithm in a gateway/platform-based machine to machine (M2M) architecture in order to develop an operational system for the farmer to remotely monitor the growth of tomatoes. The proposed system may also be used to monitor the growth of other crops such as apples.

References

  1. Estimating crop yields; a brief guide (2013). http://agriculture.vic.gov.au/agriculture/grains-and-other-crops/crop-production/estimating-crop-yields-a-brief-guide. Accessed October 2015.

  2. A Prasad, L Chai, R Singh, M Kafatos, Crop yield estimation model for Iowa using remote sensing and surface parameters. Int. J. Appl. Earth Observation Geoinformation. 8(1), 26–33 (2006). doi:http://dx.doi.org/10.1016/j.jag.2005.06.002.


  3. M Mkhabela, P Bullock, S Raj, S Wang, Y Yang, Crop yield forecasting on the Canadian Prairies using MODIS NDVI data. Agric. Forest Meteorol. 151(3), 385–393 (2011). doi:http://dx.doi.org/10.1016/j.agrformet.2010.11.012.


  4. H Zhao, Z Pei, in Second International Conference on Agro-Geoinformatics (Agro-Geoinformatics). Crop growth monitoring by integration of time series remote sensing imagery and the WOFOST model, Fairfax, VA (IEEE, 2013), pp. 568–571. doi:http://dx.doi.org/10.1109/Argo-Geoinformatics.2013.6621940.

  5. D Stajnko, Z Cmelik, Modelling of apple fruit growth by application of image analysis. Agric. Conspec. Sci. 70, 59–64 (2005).


  6. A Aggelopoulou, D Bochtis, S Fountas, K Swain, T Gemtos, G Nanos, Yield prediction in apple orchards based on image processing. J. Precision Agric. 12, 448–456 (2011).


  7. U Verma, F Rossant, I Bloch, J Orensanz, D Boisgontier, in International Conference on Pattern Recognition Applications and Methods (ICPRAM). Shape-based segmentation of tomatoes for agriculture monitoring (Angers, France, 2014), pp. 402–411.

  8. U Verma, F Rossant, I Bloch, J Orensanz, D Boisgontier, Segmentation of tomatoes in open field images with shape and temporal constraints, Pattern Recognition, Applications and Methods. (A Fred, et al., eds.) (LNCS 9443: ICPRAM 2014 Best Papers, Springer, 2015). (forthcoming).

  9. J Bouguet, Camera calibration toolbox for Matlab (2013). http://www.vision.caltech.edu/bouguetj/calib_doc/. Accessed April 2013.

  10. R Hartley, A Zisserman, Multiple View Geometry in Computer Vision, 2nd edn (Cambridge University Press, New York, NY, USA, 2004).


  11. M Kass, A Witkin, D Terzopoulos, Snakes: active contour models. Int. J. Comput. Vis. 1(4), 321–331 (1988).


  12. C Xu, J Prince, Snakes, shapes, and gradient vector flow. IEEE Trans. Image Process. 7(3), 359–369 (1998).


  13. D Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004).


  14. W Gander, G Golub, R Strebel, Least-squares fitting of circles and ellipses. BIT Numerical Math. 34(4), 558–578 (1994).



Acknowledgements

This work was partly supported by the MCUBE project (European Regional Development Fund (ERDF)), which aims at integrating multimedia processing capabilities in a classical machine to machine (M2M) framework, thus allowing the user to remotely monitor an agricultural field. The authors would like to thank Jérôme Grangier, for his participation in this project. This work was performed while the first author was doing his PhD at ISEP and Telecom ParisTech.

Author information

Corresponding author

Correspondence to Ujjwal Verma.


Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Verma, U., Rossant, F. & Bloch, I. Segmentation and size estimation of tomatoes from sequences of paired images. J Image Video Proc. 2015, 33 (2015). https://doi.org/10.1186/s13640-015-0087-0
