Evaluation of the use of box size priors for 6D plane segment tracking from point clouds with applications in cargo packing

Abstract

This paper addresses the problem of 6D pose tracking of plane segments from point clouds acquired from a mobile camera. This is motivated by manual packing operations, where an opportunity exists to enhance performance by aiding operators with instructions based on augmented reality. The approach takes point clouds as input because they facilitate the extraction of geometric information relevant to estimating the 6D pose of rigid objects. The proposed algorithm begins with a RANSAC fitting stage on the raw point cloud. It then implements strategies to compute the 2D size and 6D pose of plane segments from geometric analysis of the fitted point cloud. Redundant detections are combined using a new quality factor that predicts point cloud mapping density and allows the selection of the most accurate detection. The algorithm is designed for dynamic scenes, employing a novel particle concept in the point cloud space to track detections’ validity over time. A variant of the algorithm uses box size priors (available in most packing operations) to filter out irrelevant detections. The impact of this prior knowledge is evaluated through an experimental design that compares the performance of a plane segment tracking system, considering variations in the tracking algorithm and in the speed of the camera (carried by the packing operator). The tracking algorithm varies at two levels: algorithm \(A_{wpk}\), which integrates prior knowledge of box sizes, and algorithm \(A_{woutpk}\), which assumes ignorance of box properties. Camera speed is evaluated at low and high speeds. Results indicate increases in precision and F1-score associated with the \(A_{wpk}\) algorithm and consistent performance across both speeds. These results confirm that including previous knowledge of the elements in the scene enhances the performance of a tracking system in a real-life and complex scenario.
The proposed algorithm is limited to tracking plane segments of boxes fully supported on surfaces parallel to the ground plane and not stacked. Future works are proposed to include strategies to resolve this limitation.

1 Introduction

Efficient cargo packing operations are relevant in distribution centers, influencing product prices, customer satisfaction, and distributor resilience against market fluctuations. The packing operation goal is to assign a six-dimensional (6D) pose to the cargo within the container, optimizing space use, and reducing dispatch times, while considering practical constraints [1].

The manual packing operation involves three stages: (1) The items to pack and their specific 6D poses in a container are computed in a solution known as the optimal packing pattern. This stage assumes knowledge of the box and container dimensions and of the practical constraints [2, 3]. (2) The order in which the cargo will be loaded is computed to ensure stability and reduce physical strain on workers [4, 5]. The solution is known as a physical packing sequence (PPS). (3) A packing operator assembles the optimal packing pattern following the physical packing sequence.

In the third stage, the operator explores a consolidation zone, seeking the box indicated by the PPS, manually picks the box, and arranges it on a packing zone, as illustrated in Fig. 1. The consolidation zone occupies around 20 \(m^2\), and contains multiple boxes with unknown distribution, hence an unknown initial pose. In addition, the boxes in the consolidation zone reside on the ground and exhibit diversity in texture and size. The boxes are extracted from the consolidation zone sequentially, based on the PPS and assuming the handling of a single box at a time. The PPS, box size, and number of boxes are available before the operator begins assembling the packing pattern.

Fig. 1

Consolidation and packing zones in the dispatching area of a warehouse

The third stage involves interfaces, often in the form of instructions printed on paper or shown on handheld screens, with reported limitations [6, 7]. Virtual elements are suitable to facilitate the assembly of the optimal packing pattern and to focus the operator's attention on the manipulated cargo; consequently, augmented reality (AR) is being considered as an alternative to enhance interface performance. This kind of AR interface relies on visual perception systems capable of tracking the 6D pose of the cargo items in the consolidation zone, a problem known as 6D pose visual tracking.

This study focuses on the 6D pose tracking of the plane segments that make up boxes, from point clouds. The problem involves determining the three-dimensional position and rotation of each instance of a plane segment within a box, captured in each frame of a video sequence. Point clouds are selected because this format facilitates the extraction of geometric information relevant to estimating the 6D pose of rigid objects.

The study assumes that the boxes are distributed in a consolidation zone without being stacked. Furthermore, this work considers a mobile camera anchored to the packing operator. This approach is motivated by the distinctive skills and economic advantages of humans for assembly tasks [8, 9], the benefits related to low instrumentation of the working area, the flexibility to explore wide areas and resolve occlusions during the exploration, and the possibility of refining detections as mapping conditions improve. The main challenges of 6D pose tracking of plane segments in packing operations include:

  1. Keeping an updated object map in a dynamic environment where boxes are repositioned and sequentially removed from the consolidation zone.

  2. Mitigating false positives and false negatives resulting from occlusion between boxes, high density of cuboids, the diversity of color and texture of the tracked boxes, and the heterogeneous sizes of cuboids.

  3. Addressing the challenges of a mobile camera in a wide area, including managing low-density mappings of objects outside the nominal distances suggested by the sensor vendor, handling redundant detections between frames, and integrating detections from multiple camera viewpoints.

Our approach is inspired by methods in the area of 3D object modeling based on geometrical fitting. Most of these approaches [10,11,12] define two stages to solve a geometric fitting: adjusting planes and then adjusting boxes. While successful in scenarios with multiple boxes, their scope is limited to fitting point clouds to geometric primitives without addressing the estimation of 6D pose between frames. The proposals in Refs. [11, 13, 14] deal with mobile cameras using an incremental approach that gradually integrates a local map of detections into a global map in each iteration. While this approach is associated with benefits in the inside-out sensor arrangement [15] explored in this research, it has yet to address the challenge of dynamic scenes with sequential output of boxes in packing applications. In addition, algorithms integrating prior knowledge of 3D models of objects have been proposed [16, 17], resulting in improvements in the reconstruction performance. However, these studies were not assessed on scenes with box-type objects. The 3D models are not available in the packing application, but the 3D size of each box is, and this information is relevant to improving tracking performance.

Consequently, this study extends the scope of 3D object modeling by including 6D pose estimation routines and strategies to handle dynamic scenes. The proposed algorithm begins with a RANSAC fitting stage on the point cloud, then implements novel strategies to build plane segment objects: entities with descriptors of 3D position, 3D rotation and 2D length; these strategies rely on geometrical analysis on each fitted point cloud. The algorithm reuses the incremental approach in Refs. [11, 13, 14] but introduces a new quality factor to solve the redundant detections. In addition, the proposed approach addresses the sequential output of boxes from the scene using a novel particle management system that eliminates obsolete plane segments. A variant of the algorithm uses box size priors (available in most packing operations) to filter out irrelevant detections. The contribution of this prior knowledge is assessed by an experimental design that compares the performance of a plane segment tracking system, considering variations in the tracking algorithm and camera speed (onboard the packing operator). The algorithm factor is assessed at two levels: \(A_{woutpk}\) (an algorithm without prior scene knowledge) and \(A_{wpk}\) (an algorithm using prior knowledge of box sizes in the scene). Here, wpk and woutpk mean “with previous knowledge” and “without previous knowledge,” respectively. The speed factor, denoted as V, is evaluated at two levels: \(V_{low}\) for low velocity and \(V_{high}\) for high velocity. The literature review indicates that this study represents a new exploration of 6D pose tracking, which can potentially improve the performance of industrial manual packing operations. The main contributions of this exploration to the area of point cloud processing and 6D pose tracking are:

  1. A novel procedure based on a geometric analysis to compute 3D orientation and 2D length in point clouds fitted with perpendicular planes.

  2. A new quality factor that predicts the point cloud mapping density and solves redundancy in detecting plane segments.

  3. A novel particle concept in the point cloud space that allows tracking the validity of detections in dynamic scenes.

  4. The assessment of the proposed algorithm in realistic and complex scenarios compared to those reported in the existing literature on 6D pose tracking, including significant variability of boxes in terms of quantity, size, and texture, occlusion between boxes, larger consolidation area for boxes, mapping of a wide working area using a camera with reduced field of view, presence of noise in point clouds due to moving cameras, and sequential output of boxes from the consolidation.

  5. The integration of box size priors into algorithms designed for the 6D pose tracking of plane segments in manual cargo packing applications, resulting in a demonstrated reduction of false positives and enhancement in tracker performance.

2 Related works

The main problem in this work is the 6D pose multi-tracking of plane segments in manual packing applications using a mobile camera. This application presents specific challenges, such as the need for tracking in a wide, dynamic environment where boxes exit sequentially and the lack of information about the initial pose of tracked objects. None of the identified works have directly addressed this specific problem, but significant advances have been made in four related areas: 6D pose tracking with traditional approaches, logistic parcel detection, learning-based trackers, and 3D modeling based on geometric fitting.

2.1 6D pose tracking with traditional approaches

Traditional approaches, such as model-based [18,19,20] and template-based algorithms [21,22,23], have shown high performance when tracking single or multiple objects in a confined area (i.e., with camera field of view greater than the scanned area). However, they assume the existence of templates or 3D models of the tracked objects, which are not available in the manual packing application. Furthermore, these algorithms are not designed to handle a camera with a limited field of view combined with a wide area to scan.

2.2 Logistic parcel detection

This area aims to locate the region of interest within an image and classify it as a package or box [24,25,26,27]. This approach has made advances in 3D detection under scenes with stacked boxes and varied textures using single images, but these methods do not solve the problem of 6D pose estimation. Furthermore, they typically work with single images and fixed cameras. As a consequence, the approach does not deal with the challenges imposed by a mobile camera scanning a wide area.

2.3 Learning-based trackers

Learning-based trackers have been widely adopted in recent years, significantly increasing performance on multiple applications. They rely on knowing the initial pose of the tracked object, followed by pose estimation in subsequent frames through regression [28] or searching of 3D–3D correspondences, and pose refinement through optimization routines [29,30,31,32,33,34,35]. Consequently, the main limitation is the need for initial pose of boxes and plane segments in packing applications. Moreover, they do not report assessment in areas greater than 20 \(m^2\) as required in packing operations considered in this work. Among these works, Ref. [36] stands out for applying a top-performing pose estimation method [37] to address tracking challenges in a bin-picking operation involving objects with cuboid shapes. Consequently, Ref. [36] is used in this work as a benchmark for performance evaluation.

2.4 3D modeling based on geometric fitting

The works from this area have been validated on wide scanned areas with multiple boxes. Despite some limitations, these methods offer a suitable starting point for addressing the main problem. Consequently, this section provides a deeper insight into these related works.

Nguyen et al. [38] propose a system for scene reconstruction, utilizing a SLAM stage for depth image integration. Their pipeline uses a plane segmentation algorithm to find planar regions in the scene and integration strategies to recover geometric primitives. They report challenges with high error accumulation from the SLAM stage in large scanning areas. Their proposal lacks pose estimation and is validated on smaller static scenes compared to manual packing operations.

A historic fusion stage to reduce error accumulation in SLAM is introduced in Ref. [13]. Although their experiments include multiple boxes in the scene, the diversity of box sizes and the scanned area are below the values expected in packing operations. The proposal is assessed under static scenes, and its scope is limited to shape reconstruction without pose estimation. An incremental historic fusion pipeline, incorporating local and global maps, is proposed in Refs. [11, 14]. The local map gathers data from the current frame, while the global map integrates data from current and past frames. This approach refines reconstructions as the camera explores new zones and accumulates information about objects across frames. However, these proposals lack pose estimation and are not validated under dynamic scenarios.

A deep learning-based pipeline to abstract real-world environments using cuboids from single RGB images or point clouds is proposed in Ref. [39]. They use a sequential fitting of points to cuboids, defining iterative stages to compute weights for the depth map and primitives with state s to describe the input image. The state s is updated based on fitted points at each iteration and serves as input for weights computation in the depth map. However, their algorithm does not include pose estimation or feature tracking between frames. It does not address the challenges posed by a camera field of view smaller than the scanned area or by camera movement.

Landgraf et al. [40] address the problem of reconstruction, shape detection, and instance segmentation from depth images of a stack of objects. Their pipeline includes instance segmentation by the 3D Hough Voting algorithm and the autoencoder strategy defined in Ref. [41]. The main limitation of the method is a high dependence between the camera point of view and the instance segmentation results: under diverse points of view, different segmentations are computed. The problem increases with severe occlusions (as expected in packing scenes). In addition, their scope includes neither the pose estimation of the objects nor the dynamic environment with sequential output of boxes.

The problem of tracking cuboids from multi-RGB-D images is addressed in Ref. [42]. The pipeline includes the fusion of point clouds from diverse cameras, removal of irrelevant areas, application of a voxel grid filter for downsampling, computation of planar segments using region growing with the difference in normals as a growth criterion, cuboid detection based on the detected planar segments, and cuboid identification across frames. However, they require markers to calibrate multiple cameras, and their scope does not include pose estimation for the shape. Although the validation is performed with multiple boxes, the test scenarios are static, and the camera arrangement differs from the one of interest in this work (inside-out sensor arrangement).

The problem of cuboid fitting from a noisy point cloud is approached in Ref. [12]. Initially, plane segments are extracted, and cuboids are constructed from pairs of plane segments and from the 3D fitting of a bounding box (BB) to each unpaired plane segment; pairs are established based on proximity criteria and geometric relationships between plane segments. Then they apply cuboid filtering using a Monte Carlo tree search (MCTS) algorithm. While their approach demonstrates acceptable performance, its high computational complexity limits its applications to offline tasks such as labeling datasets for training data-driven processing architectures.

Table 1 Comparison between problem features and related works in 3D modeling based on geometric fitting

In conclusion, while traditional tracking, logistic parcel detection, learning-based trackers, and 3D modeling approaches provide valuable insights, they leave unresolved specific challenges of 6D pose multi-tracking in dynamic scenes with sequential output of boxes and a mobile camera. Despite some limitations, 3D modeling based on geometric fitting offers a suitable starting point for addressing the main problem. Table 1 compares the features of manual packing operations and those reported in the literature on 3D modeling based on geometric fitting, including Ref. [36] as a reference from learning-based trackers. The first column indicates the referenced work. The second column is marked for works that process point clouds or depth images. The third column is marked for works that use a single mobile camera to explore the scanned area. The fourth and fifth columns indicate works validated under various box textures and sizes. Works that consider multi-box scenes are marked in the sixth column. The seventh column marks works validated under scanned areas greater than 20 \(m^2\). The eighth column identifies works dealing with the sequential output of boxes. The last column is marked for proposals that do not require the initial pose of each box. This table not only aids in identifying the unique features proposed in this paper compared to related works but also shows that previous proposals only encompass some of the features present in manual packing operations. Consequently, none of the identified works can fully solve the tracking problem defined in this work.

3 Methods

Fig. 2

Tracking system in packing operation

The 6D tracking system for plane segments is illustrated in Fig. 2. This system maps a consolidation zone using an RGB-D camera attached to an operator’s head who simultaneously performs manual cargo packing. A SLAM algorithm processes RGB-D images obtained from the mapping process to yield pairs \(\{\textbf{p}(k), \mathbf {P_{c_h}}(k)\}\), where \(\textbf{p}(k)\) represents a frame (in point cloud format) and \(\mathbf {P_{c_h}}(k)\) denotes the 6D pose of the mobile camera \(c_h\) at instant \(k\). The pair of data \(\{\textbf{p}(k), \mathbf {P_{c_h}}(k)\}\), along with prior knowledge of box sizes within the consolidation area, are processed by the “6D plane segment tracking algorithms” block to generate a global map of plane segments \(\mathbf {gp_s}(k)\) with length and pose properties for each detected element. Figure 3 provides an example of the pair \(\{\textbf{p}(k), \mathbf {P_{c_h}}(k)\}\). It is important to note that the point cloud corresponds not only to the presence of boxes but also to the floor, fragments of walls, objects delineating the consolidation zone and holes due to occlusion.

Fig. 3

Tracking system input at instant 25, session 10. The input composed by the point cloud \(\textbf{p}(k)\) and the camera pose \(\mathbf {P_{c_h}}(k)\) is illustrated. Furthermore, the coordinate system \(q_h\) is shown

The following sections describe the designed tracking algorithms, the datasets used to test them in a natural packing environment, metrics and error functions selected for system assessment and the experimental design used.

3.1 6D plane segment tracking algorithm

Fig. 4

Block diagram of the 6D plane segment tracking algorithms

Figure 4 illustrates a block diagram for tracking the 6D pose of plane segments. It begins with the “plane detector” block, which analyzes the point cloud \(p(k)\) using the RANSAC algorithm to detect a local map of planes, as shown in Fig. 5. Next, the “plane segment detector” block computes each plane segment’s 3D position, 3D orientation, and 2D size, resulting in a local plane segment map illustrated in Fig. 6. Following this, the “filtering by previous knowledge” block discards plane segments outside the expected size in the scene, including those belonging to walls, artifacts around the scene, or point clouds that combine two or more box faces into a single plane segment. Finally, the “creation of a global plane segment map” block integrates local maps generated by different camera viewpoints, as presented in Fig. 7. This latter block includes strategies to maintain an updated version of the global map of plane segments by discarding segments that exit the consolidation zone and including new segments detected during user exploration.

Two versions of the tracking algorithm were implemented based on the block diagram in Fig. 4: (1) a tracking algorithm that uses the available previous knowledge by implementing the “filtering by previous knowledge” block, defined as \(A_{wpk}\) (with previous knowledge), and (2) a tracking algorithm that does not implement the “filtering by previous knowledge” block, assuming unknown box properties in scenes, defined as \(A_{woutpk}\) (without previous knowledge).
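As an illustration of what a size-based filter could look like, the sketch below keeps a plane segment only when its two edge lengths match some face of a known box within a tolerance. The matching rule, the function names, and the tolerance `tol` are assumptions made for this example; the text only states that segments outside the expected size are discarded.

```python
from itertools import combinations


def face_sizes(box_dims):
    """Return the distinct (short, long) face sizes of a box (l, w, h)."""
    return {tuple(sorted(pair)) for pair in combinations(box_dims, 2)}


def matches_prior(l1, l2, box_priors, tol=0.03):
    """True if a segment of size l1 x l2 fits some box face within tol.

    box_priors: list of (l, w, h) box sizes in metres (prior knowledge).
    tol: hypothetical tolerance in metres, not a value from the paper.
    """
    seg = sorted((l1, l2))
    for dims in box_priors:
        for face in face_sizes(dims):
            if abs(seg[0] - face[0]) <= tol and abs(seg[1] - face[1]) <= tol:
                return True
    return False


# Hypothetical priors: two box sizes expected in the consolidation zone.
priors = [(0.4, 0.3, 0.2), (0.6, 0.4, 0.4)]
keep_face = matches_prior(0.31, 0.41, priors)   # plausible box face
keep_wall = matches_prior(2.5, 1.0, priors)     # wall-sized segment
```

Under this rule a wall fragment or a segment that merges two box faces fails the test, while a face measured with a small error is kept. A more lenient rule would be needed to retain partially occluded faces.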

3.1.1 Plane detector

In this section, the process of detecting plane primitives from unstructured, colorless point clouds \(\textbf{p}(k)\), as illustrated in Fig. 3, is described. This process is solved in four stages. (1) Initially, the ground plane model \(\varvec{\Omega }_{ground}\) is detected using the RANSAC algorithm. Points that fit this model are removed, resulting in the point cloud \(\mathbf {p_{woutground}}(k)\). A distance threshold \(th_{ground}\) of 5 cm, determined experimentally, is used for fitting points to \(\varvec{\Omega }_{ground}\). (2) The efficient RANSAC algorithm [43] is then applied to \(\mathbf {p_{woutground}}(k)\) to detect plane-type primitives only. The output is a local plane map \(\textbf{lp}(k) = \{lp_1, lp_2, \ldots , lp_{N_{lp}}\}\), where \(lp_i\) represents the \(i\)-th local plane and \(N_{lp}\) is the total number of detected local planes. Each \(lp_i\) includes descriptors of the plane model \(lp_{i}.\varvec{\Omega }\), the geometrical center \(lp_{i}.\textbf{gc}\), and the inliers \(lp_{i}.\textbf{p}\). The plane model \(\varvec{\Omega }\) includes the normal vector \(\textbf{n}\) and the parameter \(D\), which represents the signed distance from the plane surface to the origin of the coordinate frame \(q_h\) along the normal vector \(\textbf{n}\). (3) The outliers are removed using a statistical filter in three steps [44]: (i) compute the distance to the \(k\) nearest neighbors for each point in \(lp_{i}.\textbf{p}\), (ii) calculate the mean (\(\mu \)) and standard deviation (\(\sigma \)) of these distances, and (iii) discard points whose distances exceed \(\mu + N\sigma \), with parameters set experimentally to \(k = 10\) and \(N = 1\). (4) Finally, the geometrical center of each plane in the local map is updated. This involves projecting the inliers \(lp_{i}.\textbf{p}\) onto the model \(lp_{i}.\varvec{\Omega }\) and recomputing the geometrical center. 
The resulting map is termed “local” because it depends solely on the current point cloud \(\textbf{p}(k)\), in contrast to a global map that integrates data from multiple time instants. Figure 5 illustrates the local map for a sample frame with \(N_{lp} = 14\). For each plane, the figure shows a numerical identifier (\(lp_i.ID\)), the inlier points, i.e., the points from \(\textbf{p}(k)\) that fit the model \(Ax + By + Cz = D\), and the normal vector of the plane model (\(lp_i.\textbf{n} = [A, B, C]\)), positioned at the geometric center of the inliers. Identifier 1 corresponds to the ground plane and has been omitted from the figure to facilitate point visualization.
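The ground removal of stage (1) can be sketched with a basic RANSAC plane fit. This is a minimal illustration, not the implementation used in this work: the iteration count, the random seed, and the synthetic scene are assumptions, and only the 5 cm distance threshold comes from the text.

```python
import random


def plane_from_points(a, b, c):
    """Unit normal n and offset D of the plane n.x = D through three points.

    Returns None when the points are (nearly) collinear.
    """
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(x * x for x in n) ** 0.5
    if norm < 1e-9:
        return None
    n = [x / norm for x in n]
    return n, sum(n[i] * a[i] for i in range(3))


def ransac_plane(points, threshold=0.05, iterations=200, seed=0):
    """Return the plane with most inliers and the inlier indices."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        inliers = [i for i, p in enumerate(points)
                   if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] - d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


def remove_ground(points, threshold=0.05):
    """Fit the dominant plane and return (model, points not on it)."""
    model, inliers = ransac_plane(points, threshold)
    ground = set(inliers)
    return model, [p for i, p in enumerate(points) if i not in ground]


# Synthetic scene (assumption): a flat floor and the top face of one box.
ground = [(0.1 * i, 0.1 * j, 0.0) for i in range(20) for j in range(20)]
box_top = [(0.5 + 0.05 * i, 0.5 + 0.05 * j, 0.5) for i in range(5) for j in range(5)]
omega_ground, p_woutground = remove_ground(ground + box_top)
```

The sketch treats the plane with the most inliers as the ground, which holds only when the floor dominates the frame. Any point within the threshold of the fitted plane is removed, including the lower rows of lateral box faces.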

Fig. 5

Local map of planes at instant 25, session 10

Comparing Fig. 5 with Fig. 3 confirms the method successfully eliminates inlier points to the ground plane model. However, this also removes some points from lateral planes adjoining the ground, affecting their length and pose features. This comparison also reveals inaccuracies and omissions: plane 14 has incorrect parameters (its normal is not orthogonal to the normal of planes 7 and 11), and some planes forming a box are missing (the plane that should accompany the pair of planes (8, 12) from Box 1). These issues are related to low-density scanned areas, which occur when the camera’s distance and orientation relative to the object are outside the sensor limits or when objects are close to the RGB-D sensor’s spatial resolution limits.

Figure 5 demonstrates that the method effectively groups points into subgroups corresponding to box faces in the scene. However, additional planes unrelated to the boxes, such as parts of the wall and workspace limiters, are also detected (planes 6, 10, 13). This figure also shows that normal vectors are not standardized, with some pointing inside the boxes and others outside (planes 9 and 12).
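The statistical outlier filter of stage (3) in this section can be sketched as follows. The sketch uses the mean distance from each point to its \(k\) nearest neighbors, as in common implementations of this filter; that reading of step (i), the brute-force neighbor search, and the synthetic data are assumptions, while \(k = 10\) and \(N = 1\) are the values reported in the text.

```python
import math


def statistical_outlier_filter(points, k=10, n_sigma=1.0):
    """Drop points whose mean k-NN distance exceeds mu + n_sigma * sigma.

    Brute-force O(n^2) neighbour search, adequate for a small sketch only.
    """
    mean_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        mean_dists.append(sum(nearest) / len(nearest))
    mu = sum(mean_dists) / len(mean_dists)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in mean_dists) / len(mean_dists))
    limit = mu + n_sigma * sigma
    return [p for p, d in zip(points, mean_dists) if d <= limit]


# Synthetic plane patch (assumption) plus one isolated point far away.
patch = [(0.1 * i, 0.1 * j, 0.0) for i in range(6) for j in range(6)]
stray = (5.0, 5.0, 0.0)
filtered = statistical_outlier_filter(patch + [stray])
```

With \(N = 1\) the filter is fairly aggressive, so on real scans it is expected to trim sparse borders of a plane as well as isolated points.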

3.1.2 Plane segment detector

This block converts the local plane map \(\textbf{lp}(k)\) into a local map of plane segments \(\mathbf {lp_s}(k) = \{ps_1, ps_2, \ldots , ps_{N_{lps}}\}\), adding length and pose descriptors to each local plane \(lp_i\). It is assumed that each box has full support, and that the support of each box is either the ground plane or a plane coplanar with the ground. Figure 6 presents a local map of plane segments for the same sample frame used in Fig. 5. Each plane segment \(ps_i\) is illustrated with an identifier composed of two numbers (\(ps_i.ID = \text {[frame identifier - plane identifier]}\)), the contour delimiting the extent of the plane segment, and the pose \(ps_i.\textbf{P}\) represented by a coordinate system with its origin at the geometric center of the plane. Some adjacent plane segments exhibit separation between their edges, extending beyond the apparent box limits. This behavior is linked to two main conditions: (1) inaccuracies related to the depth sensor, which depend on object size, distance between the object and camera, and the orientation between the camera’s line of sight and the plane segment normal, and (2) deformation of some cardboard boxes used in the dataset. Irrelevant detections, such as plane segments that do not belong to boxes (e.g., 25–13), are also illustrated. The following sections present the methods used to compute plane segment properties and analyze their impact on the detections.

Fig. 6

Local map of plane segments at instant 25, session 10

3.1.2.1 Plane type

Detected planes are classified as a function of the angle \(\alpha \) between the normal vector of the ground plane and the normal vector of the detected plane, using Eq. 1. Type 0 is assigned to top planes (parallel to the ground), type 1 to lateral planes (perpendicular to the ground), and type 2 to all other planes. The threshold \(th_{\alpha }\), expressed in degrees, gives lateral and top planes a tolerance in their orientation during classification; planes outside this tolerance are classified as type 2. Type 2 planes are discarded; they appear when planes are detected from a low point density or when planes do not belong to boxes. The threshold \(th_{\alpha }\) was experimentally set to 18\(^{\circ }\), which compensates for inaccuracies derived from the scanning and detection stages. Figure 6 shows the correct classification of plane types using blue color in lateral planes and black color in top planes. Note that there is no plane segment related to plane 14 in Fig. 5 because it was classified as plane type 2 and then discarded.

$$\begin{aligned} \text {{type}}(\alpha , th_{\alpha }) = {\left\{ \begin{array}{ll} 0, &{} \text {if } |\cos (\alpha )| > \cos (th_{\alpha }) \\ 1, &{} \text {if } |\cos (\alpha )| < \cos \left( \frac{\pi }{2} - th_{\alpha }\right) \\ 2, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)
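Equation 1 translates directly into a small function. The sketch below takes angles in degrees and uses a default threshold of 18 degrees; the function names are illustrative only.

```python
import math


def angle_between_deg(n1, n2):
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.hypot(*n1) * math.hypot(*n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def plane_type(alpha_deg, th_alpha_deg=18.0):
    """Eq. 1: 0 = top plane, 1 = lateral plane, 2 = other (discarded)."""
    c = abs(math.cos(math.radians(alpha_deg)))
    if c > math.cos(math.radians(th_alpha_deg)):
        return 0
    if c < math.cos(math.radians(90.0 - th_alpha_deg)):
        return 1
    return 2


# Example: ground normal along z, a plane whose normal is horizontal.
alpha = angle_between_deg((0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
lateral = plane_type(alpha)
```

Because the test uses \(|\cos (\alpha )|\), a normal pointing down (\(\alpha \) near 180 degrees) is classified as a top plane just like one pointing up, so the classification does not depend on the sign of the normal.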
3.1.2.2 Standardization of the orientation in normal vectors

This stage defines methods to standardize the orientation of normal vectors so that they point outside the box, allowing the use of the convexity check defined in Ref. [45] when tracking boxes. The planes whose normal must be reversed are selected by analyzing the distance \(D_{c_h}\) between the plane surface and the origin of the coordinate system \(q_{c_h}\), i.e., the coordinate system rigidly attached to the mobile camera. Any plane with negative \(D_{c_h}\) requires a sign reversal of its normal. Figure 6 confirms that all the normal vectors point outside the boxes as a consequence of this stage; in particular, with respect to Fig. 5, the normal vectors 9 and 12 were reversed, corresponding to plane segments 25–9 and 25–12 in Fig. 6.
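A minimal sketch of this sign test is given below. It assumes the plane model \(\textbf{n} \cdot \textbf{x} = D\) and takes \(D_{c_h}\) as the signed distance of the camera origin from the plane measured along \(\textbf{n}\), so that a negative value means the normal points away from the camera. This sign convention and the function name are assumptions of the example.

```python
def orient_towards_camera(normal, d, camera_position):
    """Flip (normal, d) of the plane n.x = d if the normal points away
    from the camera.

    The signed distance of the camera from the plane along the normal is
    n.c - d; a negative value triggers the reversal (assumed convention).
    """
    signed = sum(n * c for n, c in zip(normal, camera_position)) - d
    if signed < 0:
        return tuple(-n for n in normal), -d
    return tuple(normal), d


# Plane z = 1 seen from above keeps its normal; seen from below it is flipped.
kept = orient_towards_camera((0.0, 0.0, 1.0), 1.0, (0.0, 0.0, 3.0))
flipped = orient_towards_camera((0.0, 0.0, 1.0), 1.0, (0.0, 0.0, -2.0))
```

Flipping both the normal and the offset leaves the set of points on the plane unchanged. Since the camera only observes the outer faces of a box, a normal facing the camera also points outside the box.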

3.1.2.3 Rotation matrix R and length of the plane segment L1, L2

This section presents methods to compute the rotation and length of each plane segment by processing the parameters of the plane model \(ps_i.\varvec{\Omega }\) and the inliers to this model \(ps_i.\textbf{p}\).

For top planes (type 0), the method is implemented with an algorithm for simultaneous estimation of length and rotation based on a convex hull model [46]. This algorithm was initially proposed to estimate the pose of vehicles in autonomous driving applications [45], and exhibits the precision and robustness required in dynamic scenes with noise and a moving camera, as used in this work. In terms of notation, the rotation matrix \(ps_i.\textbf{R}\) of top planes orients its \(\textbf{x}_{ps}\) axis parallel to the edge \(ps_i.L_2\), as illustrated by the black-colored plane segments in Fig. 6.

No existing method to compute the rotation and length of lateral plane segments was identified. Thus, a novel method based on the analysis of the geometric properties of the fitted point cloud is used. Initially, it computes the tilt angle \(\theta _{tilt}\) between the lateral plane and the coordinate plane formed by the pair of axes (\(\textbf{x}_h, \textbf{y}_h\)) of the coordinate system \(\mathbf {q_h}\). Then it computes the rotation \(\textbf{R}\) as a function of \(\theta _{tilt}\). Finally, the length of the plane segment is computed in two steps: (i) rotation of the fitted point cloud to the nearest coordinate plane in \(\mathbf {q_h}\), and (ii) computation of the maximal and minimal values of the projected point cloud in 2D. Furthermore, this procedure computes a Boolean flag \(L_2toY\) (false/true), which is activated when edge \(L_2\) is parallel to the anti-gravity vector. In terms of notation, the rotation matrix \(ps_i.\textbf{R}\) of lateral planes orients its \(y_{ps}\) axis parallel to the anti-gravity vector, as illustrated by the blue-colored plane segments in Fig. 6.

Figure 6 shows the result of applying this method to a sample frame. In this figure, most detected planes belong to boxes (with IDs 1, 5, 13, and 17). The figure also validates the consistency in the size and orientation of the estimations and the expected orientation in the coordinate frames attached to each plane segment.

3.1.2.4 6D pose of the plane segment P

The 6D pose of the plane segment is a matrix composed of the rotation matrix \(\textbf{R}\) and the translation vector \(\textbf{t}\). The rotation matrix is computed as stated in the previous section. The translation vector is calculated as the geometric center of the point cloud \(ps_i.\textbf{p}\) projected onto the plane model with parameters \(ps_i.\varvec{\Omega }\).
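As an illustration of the construction above, the following sketch projects the center of the inliers onto the plane model and assembles a pose matrix. Several points are assumptions of this sketch rather than details given in the text: the geometric center is taken as the arithmetic mean of the inliers, the plane model is \((a, b, c, d)\) with a unit normal, and the pose is stored as a 4x4 homogeneous matrix. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Plane = Tuple[float, float, float, float]  # a*x + b*y + c*z + d = 0, unit normal assumed


def project_center_on_plane(points: Sequence[Vec3], plane: Plane) -> Vec3:
    """Mean of the inliers, projected orthogonally onto the plane model."""
    a, b, c, d = plane
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    # Signed distance of the center to the plane (valid for a unit normal).
    dist = a * cx + b * cy + c * cz + d
    return (cx - dist * a, cy - dist * b, cz - dist * c)


def compose_pose(R: Sequence[Sequence[float]], t: Vec3) -> List[List[float]]:
    """Assemble the homogeneous matrix [R t; 0 0 0 1] (storage layout assumed)."""
    return [
        [R[0][0], R[0][1], R[0][2], t[0]],
        [R[1][0], R[1][1], R[1][2], t[1]],
        [R[2][0], R[2][1], R[2][2], t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

Projecting the center removes the residual offset caused by depth noise, so the translation lies exactly on the fitted plane.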

3.1.2.5 Quality factor of the detected plane segment q

This factor allows comparing multiple detections that belong to a single box face and selecting the most accurate one. It assumes that the detection accuracy is proportional to the density of scanned points per area. The density of scanned points cannot be determined online because the actual area of each plane segment is unknown. However, it can be inferred from the distance \(d_{co}\) between the camera and the object, and the angle \(\theta _{co}\) between the axis \(\textbf{z}_{c_h}\) and the normal vector of the plane segment, as presented in Eq. 2. The proposed quality factor is used in the tracking algorithm as a selection criterion when solving redundancy in detected planes. This redundancy is common in consecutive frames, as explained in section 3.1.4.

$$\begin{aligned} {q}(\theta _{co}, d_{co}) = {\left\{ \begin{array}{ll} \left( \frac{\theta _{co}}{90^{\circ }} - 1\right) \frac{th_{dmax}-d_{co}}{th_{dmax}-th_m}, &{} \text {if } \theta _{co} > 90^{\circ } \wedge (th_{dmin}< d_{co} < th_{dmax}) \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

The thresholds \(th_{dmin}\) and \(th_{dmax}\) are the minimum and maximum distances between the object and the depth sensor recommended by the depth sensor manufacturer, and \(th_m\) is the midpoint between these distances. Equation 2 makes two assumptions: (1) \(\theta _{co}\) varies in the range [1, 180]\(^{\circ }\); for values outside this range, its complement to 360\(^{\circ }\) is used; (2) the standardization of the orientation of normal vectors has already been applied (described in section 3.1.2).

The quality factor varies in the range [0, 1], taking the value 0 when the plane segment was mapped at distances \(d_{co}\) or orientations \(\theta _{co}\) outside the depth sensor limits. This type of mapping corresponds to sparse point clouds and imprecise pose estimations. The value 1 is reached when the mapping is done within the distance limits \(d_{co}\) and with an orientation \(\theta _{co}\) of \(180^{\circ }\). This value corresponds to the highest precision in the estimated pose. For variations of \(\theta _{co}\) below \(180^{\circ }\), the quality is reduced, reaching its minimum value when \(\theta _{co} = 90^{\circ }\).
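A direct transcription of Eq. 2 may help in checking these limit cases. The sketch below assumes angles in degrees and distances in millimetres; the default thresholds are the values reported with Table 2 (1500 mm and 5460 mm), and the function name `quality_factor` is hypothetical.

```python
def quality_factor(theta_co_deg: float, d_co: float,
                   th_dmin: float = 1500.0, th_dmax: float = 5460.0) -> float:
    """Evaluate Eq. 2 for an angle in degrees and a distance in millimetres."""
    # Assumption (1) of Eq. 2: angles beyond 180 degrees are replaced
    # by their complement to 360 degrees.
    theta = theta_co_deg % 360.0
    if theta > 180.0:
        theta = 360.0 - theta
    # th_m is the midpoint between the recommended sensor distances.
    th_m = 0.5 * (th_dmin + th_dmax)
    if theta > 90.0 and th_dmin < d_co < th_dmax:
        return (theta / 90.0 - 1.0) * (th_dmax - d_co) / (th_dmax - th_m)
    return 0.0
```

For example, a plane facing the camera (\(\theta _{co} = 180^{\circ }\)) at the midpoint distance of 3480 mm evaluates to 1, while the same plane at \(\theta _{co} = 135^{\circ }\) evaluates to 0.5.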

Table 2 Descriptors for plane segments belonging to box 13 in session 10

Table 2 presents descriptors of detections of faces that belong to box 13 in session 10 and at different time instants: instant 25 (illustrated in Figure  6) and instant 27. The first column indicates the plane type. The second column contains the identifier. The third column presents the quality factor computed with Eq.  2 and using values \(th_{dmin} = 1500 \, \text {mm}\) and \(th_{dmax} = 5460 \, \text {mm}\). Columns four to six present the distance \(d_{co}\), the angle \(\theta _{co}\), and the number of inliers mapped by the acquisition system, respectively. The last column contains the point density per square centimeter, computed using the ground truth area of each plane segment that belongs to box 13.

Comparing the detections that correspond to top planes (rows 1 and 2 in Table 2), it is observed that the detection with the highest density \(\rho \) corresponds to the detection with the maximum quality factor: plane segment 27–2. Likewise, comparing the detections of lateral planes (rows 3 and 4 in Table 2), it is observed that the detection with the maximum density \(\rho \) corresponds to the detection with the maximum quality factor: plane segment 27–4. These comparisons allow us to conclude that the quality factor correctly infers the density of points per unit area.

3.1.3 Filtering by prior knowledge

A direct application of the tracking algorithm is the tracking of boxes during manual packing operations, where the box size is known before the assembly of the packing pattern begins. Consequently, the tracking algorithm uses this prior knowledge to estimate plane lengths and filter out detections that fall outside the expected values. As a result, some planes detected in Fig. 5 and associated with the wall (planes 6 and 10) do not appear in Fig. 6. It is also possible to integrate other priors, such as the number of boxes, the physical packing sequence, and geometrical relationships between plane segments that belong to the same box.
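One possible form of this size-based filter is sketched below. It is only an illustration: the tolerance `tol_mm` and the function names are hypothetical, since the text does not state how close a detected length must be to the expected value. Box sizes are given as (height, width, depth) in millimetres.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

BoxDims = Tuple[float, float, float]  # (height, width, depth) in mm


def face_sizes(dims: BoxDims) -> List[Tuple[float, float]]:
    """The three distinct face sizes of a box, each sorted as (short, long)."""
    return [tuple(sorted(pair)) for pair in combinations(dims, 2)]


def matches_prior(l1: float, l2: float, known_boxes: Sequence[BoxDims],
                  tol_mm: float = 30.0) -> bool:
    """True when (l1, l2) fits a face of some known box within tol_mm.

    tol_mm is a hypothetical tolerance, not a value taken from the paper.
    """
    short, long_ = sorted((l1, l2))
    for dims in known_boxes:
        for face_short, face_long in face_sizes(dims):
            if abs(short - face_short) <= tol_mm and abs(long_ - face_long) <= tol_mm:
                return True
    return False
```

With box type 4 (250 x 400 x 300 mm) as the only prior, a detection of 395 x 305 mm is kept, whereas a wall-sized detection of 2000 x 3000 mm is discarded.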

3.1.4 Creation of a global plane segments map

Fig. 7 Global map of plane segments at instant 24, session 10

The objective of this stage is to create a global map of plane segments \(\mathbf {gp_s (k)}\), combining plane segments detected in current and past frames and discarding non-valid plane segments, i.e., detections that correspond to boxes that have exited the consolidation zone. An example of a global map from session 10, instant 24, is illustrated in Fig. 7. This figure shows combinations of redundant planes (with a single digit identifier), detections of past frames (13–9, 19–10, 19–55), and detections from the current frame (all planes with identifiers beginning with 24). The procedure to compute this global map is described in Fig. 8. The block “Combination of Redundant Detections” detects and combines redundant plane segments, i.e., detections that belong to the same physical object and were mapped in different frames. For example, consider the detection 25–2 in Fig. 6 and the detection 1 in Fig. 7. These detections must become a single plane segment in the global map \(\mathbf {gp_s}(25)\). The block “particle management” defines a novel strategy to compute the validity of plane segments in the global map and, consequently, to handle dynamic scenes. Finally, the block “update of global map” implements strategies to discard non-valid plane segments.

Fig. 8 Block “Creation of a global plane segment map”

3.1.4.1 Combination of redundant detections

Redundant detections are two or more detections of the same face of a box (or plane segment). They arise from the camera path, which scans each object from distinct points of view. Redundancy is accentuated when the operator revisits the consolidation zone. The “combination of redundant detections” block aims to improve the precision of the estimations by combining these detections, either through fusion or selection between detections. The block identifies redundant plane segments based on the spatial overlap between elements of the local map \(\mathbf {lp_s} (k)\) and the global map \(\mathbf {gp_s}(k-1)\). Pairs of redundant plane segments have high overlap (greater than 50%) or low overlap (in the range [5,50]%). Low-overlap cases are combined through fusion: the inliers of each plane segment are merged; subsequently, new properties are calculated for the merged plane segment, and the latter is added to the global map \(\mathbf {gp_s}(k)\). High-overlap cases are combined through selection: the plane segment with the highest quality factor is selected to remain on the global map. When the quality factor is equal between redundant planes, the plane segment with the highest number of inliers is selected.
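The decision rule for a pair of overlapping segments can be sketched as follows. This is a simplified illustration under stated assumptions: the overlap is passed in as a precomputed fraction in [0, 1], the recomputation of the merged segment's properties is delegated to a caller-supplied function `recompute_q`, and the merged segment keeps the identifier of the stored one. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class PlaneSegment:
    ident: str
    q: float  # quality factor (Eq. 2)
    inliers: List[Point] = field(default_factory=list)


def combine_redundant(local: PlaneSegment, stored: PlaneSegment, overlap: float,
                      recompute_q: Callable[[List[Point]], float]
                      ) -> Optional[PlaneSegment]:
    """Return the combined segment, or None when the pair is not redundant."""
    if overlap > 0.50:
        # High overlap: selection by quality factor, ties broken by inlier count.
        if local.q != stored.q:
            return local if local.q > stored.q else stored
        return local if len(local.inliers) > len(stored.inliers) else stored
    if 0.05 <= overlap <= 0.50:
        # Low overlap: fusion of the inliers; properties are then recomputed.
        merged = stored.inliers + local.inliers
        return PlaneSegment(ident=stored.ident, q=recompute_q(merged), inliers=merged)
    return None
```

Pairs for which the function returns `None` are handled as non-redundant detections.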

Plane segments belonging to the local plane map \(\mathbf {lp_s}(k)\) and not classified as redundant are added to the global map \(\mathbf {gp_s}(k)\). These detections are expected to correspond to new plane segments that appear in the camera’s field of view as the operator explores the consolidation zone.

Plane segments belonging to the global map \(\mathbf {gp_s}(k-1)\) and not classified as redundant remain in the global map \(\mathbf {gp_s}(k)\). If this “non-redundant” classification persists for several iterations within a time window (greater than \(th_{validity}\)), then these segments are removed in the “obsolete particle and plane segments elimination” block in Fig. 8. The time window was set experimentally to 10 iterations and \(th_{validity}\) to 0.3.

3.1.4.2 Particle management

The validity of a plane segment is computed by analyzing the historical validity record of the particle associated with the plane segment. A particle is valid at an instant k if there exists a plane segment detected in the local map \(\mathbf {lp_s}(k)\) and located in the neighborhood of the particle. Each particle is related to a plane segment \(ps_i\) and has four descriptors: identifier (\(pt.ID\)), position in the 3D space (\(pt.position\)), quality factor (\(pt.q\)), and historical record of validity (\(pt.\textbf{v}\)). Particle management is performed in three iterative stages:

3.1.4.3 Initialization

This procedure is executed after creating the first local map of plane segments \(\mathbf {lp_s}(k)\). A particle vector \({\textbf {PT}} = pt_1, pt_2,..., pt_N\) is created in this stage. Each particle is associated with a plane segment \(ps_i \in \mathbf {lp_s}(k)\). The position and quality factor properties of \(pt_i\) are taken from \(ps_i\). The validity vector of each particle is initialized with the value \([k \, \text {true}]\), where \(k\) is the instant at which the processed frame \(\textbf{p}(k)\) was acquired.

3.1.4.4 Creating relationships between particles and global plane segments

Establishing a relationship between particles and plane segments is essential for discarding non-valid plane segments. The process assumes the existence of a particle with its historical validity vector. The mean value of this vector is computed within each time window \(K_{\text {window}}\). If the mean value is below a predefined threshold, the particle has lost its validity. In this scenario, the established relationship enables the identification of plane segments that must be discarded due to non-valid particles. It is expected that these plane segments correspond to boxes that have exited the consolidation zone. The procedure to set the relationship is as follows: each time a global plane map \(\mathbf {gp_s}(k)\) is created, Eq. 3 is evaluated between the plane segments \(ps_i \in \mathbf {gp_s}(k)\) and particles \(pt_j \in \textbf{PT}\), where \(d( )\) is a distance function between the geometric center of a plane segment \(ps_i\) and the position of a particle \(pt_j\). If the distance is less than a threshold (\(th_{radii}\)), a relationship \(\langle ps_i, pt_j \rangle \) is established; otherwise, it is discarded.

$$\begin{aligned} d(ps_i.gc, pt_j.position) < th_{radii} \end{aligned}$$
(3)

The threshold \(th_{radii}\) is defined based on the length of the edges of boxes in the consolidation area as defined in Eq. 4, where \(\overline{\mathbf {L_{x,onscene}}(k)}\) is a vector with the edge lengths \(L_x\) expected in the consolidation area at instant \(k\); with the subscript \(x = {1,2}\). It is expected that each particle in the \({\textbf {PT}}\) vector is part of at least one global plane segment.

$$\begin{aligned} th_{radii}=\frac{\sqrt{\max (\overline{\mathbf {L_{1,onscene}}(k)})^2 +\max (\overline{\mathbf {L_{2,onscene}}(k)})^2}}{2} \end{aligned}$$
(4)
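Equations 3 and 4 can be evaluated with a few lines of code. The sketch below assumes that \(d( )\) is the Euclidean distance and that lengths and positions share the same unit (millimetres in the example); both function names are hypothetical.

```python
import math
from typing import Sequence


def radius_threshold(l1_on_scene: Sequence[float],
                     l2_on_scene: Sequence[float]) -> float:
    """Eq. 4: half the diagonal built from the largest expected edge lengths."""
    return math.hypot(max(l1_on_scene), max(l2_on_scene)) / 2.0


def is_related(gc: Sequence[float], position: Sequence[float],
               th_radii: float) -> bool:
    """Eq. 3, taking d() as the Euclidean distance (an assumption of this sketch)."""
    return math.dist(gc, position) < th_radii
```

For expected edge lengths of up to 400 mm and 300 mm, the threshold is 250 mm, so a particle located 100 mm from the geometric center of a plane segment is related to it.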

3.1.4.5 Updating particle properties

This section describes the method to update the particle properties at each iteration. When a local plane map \(\mathbf {lp_s}(k)\) is created, Eq. 3 is evaluated between plane segments \(ps_i \in \mathbf {lp_s}(k)\) and particles \(pt_j\) in the \(\textbf{PT}\) vector. If the distance between \(ps_i\) and \(pt_j\) is lower than the threshold \(th_{radii}\), then the particle properties of \(pt_j\) are updated as follows. First, its validity vector is appended with the value \([k, \text {true}]\). Then its position is updated using Eq. 5, which is defined as a weighted average of the \(pt_j\) and \(ps_i\) positions, with weights determined by the quality factor as stated in Eq. 6. Finally, the quality factor of \(pt_j\) is updated with the average of the qualities of \(pt_j\) and \(ps_i\). This update procedure is applied only to the particle closest to the plane segment.

If a plane segment \(ps_i\) does not satisfy Eq.  3 with any particle in the \({\textbf {PT}}\) vector, a new particle is added to the \({\textbf {PT}}\) vector with properties taken from \(ps_i\). The validity vector is also initialized with the values \([k \, \text {true}]\). Particles \(pt_j\) that have no relationship with any of the plane segments \(ps_i\) accumulate a new element in their historical validity vector with the value \([k \, \text {false}]\).

$$\begin{aligned} pt_j.\textbf{position}(k)= & {} \,w_{pt} pt_j.\textbf{position}(k) + w_{ps} ps_i.\textbf{gc}(k) \end{aligned}$$
(5)
$$\begin{aligned} w_{pt}= & {} 1 + \frac{ps_i.q}{pt_j.q + 1}; w_{ps} = w_{pt} \cdot \frac{ps_i.q}{pt_j.q} \end{aligned}$$
(6)

3.1.4.6 Update of global map

At regular time intervals defined by the observation window \(K_{\text {window}}\), the algorithm evaluates the following condition for each particle \(pt_i\) in the particle vector \(\textbf{PT}\): if the mean validity component over the instants within the observation window is less than a specified validity threshold (\(th_{validity}\)), then the system initiates the removal of particles and associated plane segments from the global map \(\mathbf {gp_s}(k)\). These plane segments are expected to correspond to those previously extracted from the packing zone or relocated within that zone. The threshold \(th_{validity}\) is set experimentally to 0.3.
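The validity test can be sketched as follows, assuming that each particle stores its history as a list of \([k, \text {valid}]\) pairs and using the window of 10 iterations and the threshold of 0.3 reported in the text. The handling of an empty window and the function names are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

ValidityRecord = Tuple[int, bool]  # (instant k, valid at k)


def mean_validity(history: List[ValidityRecord], k: int, k_window: int = 10) -> float:
    """Mean of the validity flags recorded in the last k_window instants up to k."""
    recent = [valid for instant, valid in history if k - k_window < instant <= k]
    if not recent:
        return 0.0  # assumption: a particle without recent records counts as non-valid
    return sum(recent) / len(recent)


def obsolete_particles(particles: Dict[str, List[ValidityRecord]], k: int,
                       k_window: int = 10, th_validity: float = 0.3) -> List[str]:
    """Identifiers of particles whose mean validity falls below th_validity."""
    return [pid for pid, history in particles.items()
            if mean_validity(history, k, k_window) < th_validity]
```

The plane segments related to the returned particles are the candidates for removal from the global map.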

3.2 Dataset

The dataset utilized is composed of annotations acquired from a motion capture system and point cloud videos recorded from a mobile camera. These recordings were obtained from a packing operator wearing a head-mounted camera while assembling optimal packing patterns during a packing operation [47]. The dataset is divided into sessions. Each session begins with a set of boxes randomly distributed in the consolidation zone. The operator then enters the consolidation zone, picks a box, and transfers it to the packing zone (see Fig. 1). This task is repeated for each box, following a predefined physical packing sequence (PPS). The session concludes when all boxes have been extracted from the consolidation zone.

Fig. 9 Instances of box type 4 in consolidation zone

Boxes used in the sessions are described by types and instances. A box type is a conceptual definition with predefined values for texture, deformation by use, and size (height, width, depth). For example, box type 4 has the descriptors: low texture, not deformed, 25 cm height, 40 cm width, and 30 cm depth. A box instance is the physical implementation of the box type. Figure 9 shows boxes used in a sample session, with four instances of box type 4. Each box indicated by an arrow has the properties defined for box type 4. In the 18 sessions evaluated, 10 different types of boxes were used, each one with between one and four instances. Properties of these boxes are reviewed in Table 3. The texture property is associated with the presence (high texture) or absence (low texture) of advertising marks on the visible surface of the box. Four box types had high texture, and the remaining six had low texture. Five of these types of boxes had deformations due to use. Each type of box has a different size. The sum of height, width, and depth was calculated, and the category “big” was applied for sums greater than 0.9 m and “small” otherwise.

Table 3 Box types and their properties. H: height, W: width, D: depth

Table 4 presents descriptors of the 18 sessions. The “session ID” column contains the identifier of each session in the original dataset [47], and the next column contains the number of frames that mapped the consolidation area in each session. The “camera speed” column includes descriptors of the mean speed, standard deviation, and maximum speed reached during the session. Sessions with a mean speed > 0.6 m/s are categorized as high-speed sessions, while the others are categorized as low-speed sessions. The following column describes the number of boxes (Q) at the beginning of each session, indicating the boxes with high texture (\(Q_{highTxt}\)), boxes categorized as big (\(Q_{big}\)), and boxes deformed by use (\(Q_{defByUse}\)). The last column contains the order in which the boxes were extracted from the consolidation area, expressed in terms of the identifier of each box. The first nine rows in Table 4 contain low-speed sessions, and the last nine contain high-speed sessions.

Table 4 Sessions and their properties. The sessions have been sorted based on the average camera speed

Images used to assess the 6D pose tracking system were acquired with a HoloLens 2 and a motion capture system (MoCap). The HoloLens 2 provides the pairs \(\{\textbf{p}(k),\mathbf {P_{c_h}}(k)\}\) of point cloud and camera pose at each instant k. The proposed algorithm processes colorless point clouds; therefore, the original format of the point cloud (which includes color information) was converted to a colorless format. The MoCap system provides the ground truth poses \({\overline{\textbf{P}}_{\textbf{i}^\textbf{m}}}(k)\) of the planes belonging to boxes located in the consolidation area. The dataset is complemented with annotations of planes visible from the HoloLens 2 camera in each frame. For details about the devices and image capture methods, please refer to Ref. [47].

3.3 Experimental design

The experiment aims to compare performance metrics from a 6D pose tracking system, considering variations in the tracking algorithm and camera speed. The experiment was set up in a split-plot design, based on a randomized complete block design [48], with nine replicates. The experimental unit was defined as the manual packing operation. The main plots consisted of two levels of speed (\(V_{low}\), \(V_{high}\) ). The algorithm (with levels \(A_{wpk}\), \(A_{woutpk}\)) was selected as the subplot. Therefore, four treatments were considered (\(V_{low}\)+\(A_{wpk}\); \(V_{low}\)+ \(A_{woutpk}\); \(V_{high}\)+\(A_{wpk}\); \(V_{high}\)+\(A_{woutpk}\)). The response variable is the performance metric, specifically precision, recall, and F1 score.

3.4 Experimental method to compute performance metrics

This section describes the methods to compute the performance metrics, to compare them, to compute the error function used to assess the estimated poses, and to obtain true positives, false positives, and false negatives in the dataset. The defined method does not include comparisons with other techniques because none of the identified works can solve the tracking problem defined in this study, as discussed in the related works section.

3.4.1 Performance metrics

The selected metrics to evaluate the performance of the tracking algorithms were precision, recall, and F1 score. These metrics were chosen to facilitate performance comparison with other authors who report at least one of them.

The calculation of the metrics per session uses as input the sets \( \hat{\textbf{GP}}_i^j \) and \(\overline{\textbf{GP}}^j \). Here, \( i \in \{wpk, woutpk\} \) is the index of the algorithm used, and \( j \in \{top, lateral\} \) is the index of the type of plane evaluated. \(\hat{\textbf{GP}}_i^j = \{\hat{\textbf{gp}}_s^j(1), \hat{\textbf{gp}}_s^j(2), \ldots , \hat{\textbf{gp}}_s^j(N_{kf}) \} \) is a set of maps of global plane segments of type \( j \), estimated with algorithm \( i \), where \( N_{kf} \) is the number of frames processed in the session. \(\overline{\textbf{GP}}^j = \{ \overline{\textbf{gp}}_s^j(1), \overline{\textbf{gp}}_s^j(2), \ldots , \overline{\textbf{gp}}_s^j(N_{kf}) \} \) is a set of maps, each one containing the ground truth global plane segments visible from the RGB-D camera in the corresponding processed frame. The output consists of the mean value and standard deviation of each of the metrics of interest: \( \overline{precision}_i^j \), \( \overline{recall}_i^j \), \( \overline{F1-score}_i^j \), \( precision\_std_i^j \), \( recall\_std_i^j \), \( F1-score\_std_i^j \). The calculation of true positives, false positives, and false negatives per frame is performed by comparing estimations \(\hat{\textbf{gp}}_s^j(k) \) and reference poses \( \overline{\textbf{gp}}_s^j(k) \), as explained in section 3.4.4.

Precision measures the proportion of relevant detections in the global map \(\hat{\textbf{gp}}_{\textbf{s}}(k) \) to the total detections \( |\hat{\textbf{gp}}_{\textbf{s}}(k)| \). Relevant detections are those corresponding to annotated global plane segment maps \( (\overline{\textbf{gp}}_s(k)) \), which are plane segments satisfying two conditions: (1) they belong to boxes present in the consolidation zone at time \( k \), and (2) they are visible from the mobile camera at time \( k \) or in previous instants. Precision is defined as \( precision(k) = \frac{TP}{TP + FP} \), where \( TP \) represents true positives and \( FP \) represents false positives. Considering an augmented reality (AR) application for manual packing operation, a false positive from the tracking system reduces the application’s usability due to the appearance of virtual objects that do not correspond to the physical objects the operator interacts with.

Recall is the proportion of relevant detections in the global plane map \(\hat{\textbf{gp}}_{\textbf{s}}(k) \) to the total number of plane segments in the annotated global plane map \( |\overline{\textbf{gp}}_{\textbf{s}}(k)| \). Recall is defined as \( recall(k) = \frac{TP}{TP + FN} \), where \( FN \) represents false negatives. When considering an AR application for manual packing operation, a false negative can cause application blockages since the tracking module continues to search for objects that, although in the scene, cannot be detected by the vision system.

The F1 score combines precision and recall into a single metric. This metric is useful when the costs of false positives and false negatives are similar, as is the case in tracking applications that assist manual packing.
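The per-frame and per-session computations can be sketched as follows. The F1 score is computed as the harmonic mean of precision and recall. Two choices are assumptions of this sketch and are not specified in the text: metrics with an empty denominator are set to 0, and the per-session deviation is computed as a population standard deviation. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple


def frame_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 score for one frame."""
    # Assumption: a metric with an empty denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


def session_metrics(counts: Sequence[Tuple[int, int, int]]
                    ) -> Dict[str, Tuple[float, float]]:
    """Mean and standard deviation of each per-frame metric along a session."""
    per_frame = [frame_metrics(tp, fp, fn) for tp, fp, fn in counts]
    columns = list(zip(*per_frame))
    names = ("precision", "recall", "f1")
    # Assumption: population standard deviation (pstdev).
    return {name: (mean(col), pstdev(col)) for name, col in zip(names, columns)}
```

For a frame with 3 true positives, 1 false positive, and 2 false negatives, precision is 0.75, recall is 0.6, and the F1 score is approximately 0.667.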

3.4.2 Methodology to compare performance metrics

The comparison between metrics derived from the two available algorithms (\(A_{wpk}\), \(A_{woutpk}\)) is performed in two scenarios: (1) tracking of top-plane segments and (2) tracking of lateral-plane segments. Each test applies a two-step methodology: (i) graphical analysis and (ii) statistical analysis based on the experimental design in section 3.3.

The graphical analysis uses a standard graph layout, presented in Figs. 10 to 15. In this layout, the outer horizontal axis enumerates the nine blocks of the experiment. Each block groups metrics computed from the available treatments (\(V_{low}\)+\(A_{wpk}\); \(V_{low}\)+\(A_{woutpk}\); \(V_{high}\)+\(A_{wpk}\); \(V_{high}\)+\(A_{woutpk}\)). The inner horizontal axis presents the camera speed with the session identifier in the form \(V-s\), where V is the camera speed and s the session ID. The vertical axis indicates the magnitude of the metric. The colored bars show the average metric achieved with each algorithm: orange for \(A_{woutpk}\) and blue for \(A_{wpk}\). The black vertical line on each bar represents one standard deviation among estimates of the same session. In each graph, the best result is associated with bars close to 100% and standard deviation lines of low magnitude.

The statistical analysis was performed on metrics that satisfy the normality and equal variances assumptions (considering a significance level of 0.05). The analysis includes an ANOVA test and comparisons using the Tukey method on groups formed by the levels of the relevant factors (using a confidence of 95%).

3.4.3 Error function \(e_{ADD_{p}}\)

The calculation of the metrics was based on the error function called the average distance between distinguishable point clouds (\(e_{ADD_p}\)). The metric is based on the \(e_{ADD}\) defined in Ref. [49] and adjusted to be applied directly on point clouds. The property of distinguishable views is used in 3D objects that look different from each possible viewpoint in a scene. Objects that meet this property are considered non-ambiguous regarding the estimation of their pose. According to this definition, plane segments are ambiguous in pose estimation; however, they can be regarded as low ambiguity since there are few viewpoints for which the property of distinguishable views is not fulfilled. In addition, the symmetry of plane segments allows resolving the limitations of correspondences defined in Ref. [49] when calculating error functions in objects with indistinguishable poses. Therefore, the authors consider the use of the error function (\(e_{ADD_p}\)) relevant for the case of plane segments analyzed in this work.

The function \(e_{ADD_{p}}\) is calculated for an estimated pose \(\hat{\textbf{P}}\), assuming the existence of references for the pose \(\bar{\textbf{P}}\) and lengths \((\bar{L}_1, \bar{L}_2)\) of the plane segment present in an image \(\textbf{p}(k)\). The function requires defining the spatial sampling frequency \(ss\) for the point cloud and a tolerance \(\tau \) for misalignment. The procedure for calculating the error \(e_{ADD_{p}}\) for a plane segment in an image frame is as follows:

First, a synthetic point cloud \(\textbf{p}\) is created with the sampling frequency \(ss\), the reference length \((\bar{L}_1, \bar{L}_2)\), and geometric center at the origin of the coordinate system \(\mathbf {q_m}\). From this point cloud, point clouds \(\hat{\textbf{p}}\) and \(\bar{\textbf{p}}\) are calculated through projection of \(\textbf{p}\) with the estimated pose \(\hat{\textbf{P}}\) and the ground truth pose \(\bar{\textbf{P}}\), respectively. Furthermore, the rotation error \(e_R\) is calculated using Eq. 7 on the rotation matrices contained in \(\hat{\textbf{P}}\) and \(\bar{\textbf{P}}\). Second, the average distance between distinguishable point clouds is calculated by evaluating Eq. 8. An estimation is considered correct if \(e_{ADD_p}(\hat{P},\bar{P})<th_{ADD}\), where \(th_{ADD}\) is an acceptance threshold.

$$\begin{aligned} e_R(\hat{\textbf{R}}, \bar{\textbf{R}})= & {} \arccos \left( \frac{\text {Tr}(\hat{\textbf{R}}\bar{\textbf{R}}^{-1}) - 1}{2}\right) \end{aligned}$$
(7)
$$\begin{aligned} e_{\text {ADD}_p}(\hat{P}, \bar{P})= & {} {\left\{ \begin{array}{ll} \frac{1}{N} \sum _{i=1}^{N} \left[ |\bar{\textbf{p}}(i) - \hat{\textbf{p}}(i)|> \tau \right] , &{} \text {if } e_R < 90^{\circ } \\ \frac{1}{N} \sum _{i=1}^{N} \left[ |\bar{\textbf{p}}(i) - \hat{\textbf{p}}(N-i)| > \tau \right] , &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

The parameters to compute the error function were:

  • Spatial sampling frequency \( ss \). This parameter normalizes the point density between the synthetic point clouds used in the assessment. It is defined in terms of the size of the plane segment \((\bar{L}_1, \bar{L}_2)\) and the number of points on the main diagonal \( N_{diag} \):

    $$\begin{aligned} ss = \frac{\sqrt{(\bar{L}_1)^2 + (\bar{L}_2)^2}}{N_{diag}} \end{aligned}$$

    where \( N_{diag} = 30 \); this value was selected seeking a compromise between quality of the reconstruction and computational complexity.

  • The misalignment tolerance \( \tau \) was set to 50 mm taking into account errors reported in works that employ the HoloLens 2 sensor to reconstruct wide areas [50].

  • The acceptance threshold \( th_{ADD} \) was set to 0.5.
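The procedure and parameters above can be sketched in code as follows. This is an illustrative reading of Eqs. 7 and 8, not the authors' implementation. It assumes that each pose is given as a pair (R, t) with R a 3x3 rotation matrix, that the synthetic cloud is a regular grid on the local plane \(z = 0\) with \(\bar{L}_1\) along x and \(\bar{L}_2\) along y, that the index \(N-i\) of Eq. 8 corresponds to traversing the estimated cloud in reverse order, and that lengths are in millimetres. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]
Pose = Tuple[Mat3, Vec3]  # (rotation matrix, translation vector)


def synthetic_plane(l1: float, l2: float, n_diag: int = 30) -> List[Vec3]:
    """Regular grid on z = 0, centred at the origin, with spacing ss."""
    ss = math.hypot(l1, l2) / n_diag

    def axis(length: float) -> List[float]:
        n = int(length / ss) + 1
        return [(i - (n - 1) / 2.0) * ss for i in range(n)]

    return [(x, y, 0.0) for x in axis(l1) for y in axis(l2)]


def transform(R: Mat3, t: Vec3, p: Vec3) -> Vec3:
    """Apply the rigid transformation R p + t."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))


def rotation_error_deg(R_hat: Mat3, R_bar: Mat3) -> float:
    """Eq. 7, using R^-1 = R^T, so Tr(R_hat R_bar^T) = sum of elementwise products."""
    trace = sum(R_hat[i][j] * R_bar[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))


def e_add_p(pose_hat: Pose, pose_bar: Pose, l1: float, l2: float,
            tau: float = 50.0, n_diag: int = 30) -> float:
    """Eq. 8: fraction of synthetic points displaced by more than tau."""
    cloud = synthetic_plane(l1, l2, n_diag)
    p_hat = [transform(pose_hat[0], pose_hat[1], p) for p in cloud]
    p_bar = [transform(pose_bar[0], pose_bar[1], p) for p in cloud]
    if rotation_error_deg(pose_hat[0], pose_bar[0]) >= 90.0:
        p_hat.reverse()  # reversed correspondence (index N - i in Eq. 8)
    return sum(math.dist(a, b) > tau for a, b in zip(p_bar, p_hat)) / len(cloud)
```

Because the grid is symmetric about the origin, reversing the point order is equivalent to rotating the plane by 180 degrees about its normal, so an estimate that differs from the reference only by such a rotation yields an error of 0. An estimate is then accepted when the returned value is below \( th_{ADD} = 0.5 \).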

3.4.4 Computing TPs, FPs, FNs

The calculation of true positives (TP), false positives (FP), and false negatives (FN) per frame is performed using the matrix \( \mathbf {E_{i,j}}(k) \), with \( i = 1, \ldots , N_e(k) \) as the index of estimations and \( j = 1, \ldots , N_{gt}(k) \) as the index of plane segments in the global annotation map \({\overline{\textbf{gp}_{\mathbf {s(k)}}}}\). Each cell of the matrix stores the result of the comparison \( e_{ADD_p} ( \hat{P}, \bar{P} ) \ge th_{ADD} \). Cells with a false value are therefore interpreted as estimations with an error less than the acceptance threshold \( th_{ADD} \), so each column \( j \) of \( \mathbf {E_{i,j}}(k) \) that contains a false entry is counted as a true positive. Cells with a true value are interpreted as estimations with an error greater than or equal to the acceptance threshold, so each row \( i \) of \( \mathbf {E_{i,j}}(k) \) where all entries are true is counted as a false positive. Finally, each column \( j \) of the matrix \( \mathbf {E_{i,j}}(k) \) where all entries are true corresponds to a false negative. This analysis yields the precision, recall, and F1 score for each image frame (per-frame metrics). The per-session metrics are calculated as the average of the per-frame metrics along the session.
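The counting rules above can be sketched as follows, with the matrix stored as a list of rows (one per estimation) whose cells are true when the error is not below the acceptance threshold, following the interpretation of true and false cells given in the paragraph. The number of annotated plane segments is passed explicitly so that frames without estimations are handled; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def count_tp_fp_fn(E: Sequence[Sequence[bool]], n_gt: int) -> Tuple[int, int, int]:
    """Count TP, FP, and FN from the error matrix of one frame.

    E[i][j] is True when estimation i does NOT match annotated segment j.
    """
    columns = [[row[j] for row in E] for j in range(n_gt)]
    tp = sum(1 for col in columns if not all(col))  # column with a False entry
    fn = sum(1 for col in columns if all(col))      # column with only True entries
    fp = sum(1 for row in E if all(row))            # row with only True entries
    return tp, fp, fn
```

For two estimations and three annotated segments, where only the first estimation matches the first segment, the function returns one true positive, one false positive, and two false negatives.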

4 Results and discussion

4.1 Tracking of top-plane segments

The purpose of this test is to calculate and compare the metrics \( \overline{precision}_i^{top} \), \( \overline{recall}_i^{top} \), \( \overline{F1-score}_i^{top} \), \( precision\_std_i^{top} \), \( recall\_std_i^{top} \), \( F1-score\_std_i^{top} \).

4.1.1 Precision

Fig. 10 Mean precision for top-plane segments. \(V-s\) indicates camera speed − session ID

Figure 10 presents the precision achieved at each block during the tracking of top-plane segments. The graph shows the superiority of the \(A_{wpk}\) algorithm compared to \(A_{woutpk}\) in all sessions. This behavior supports the hypothesis that including prior knowledge for filtering detected planes is associated with performance increases in the tracker.

The maximum precision (72%) is achieved by the algorithm \(A_{wpk}\) in session 27, while the minimum precision (17%) is recorded in session 13 with the algorithm \(A_{woutpk}\). The main differences between these sessions are the number of frames, the average camera speed, and the quantity of boxes. These values are (325 frames, 0.46 m/s, 12 boxes) for session 27 and (91 frames, 0.64 m/s, 18 boxes) for session 13, as presented in Table 4. The difference in precision between these sessions is associated with the presence of four instances of box type 1 and type 7 in session 13. These boxes exhibited high deviation among estimates from various frames (see the group of plane segments labeled as Box 1 in Fig. 7). Box type 1 has the smallest size in the dataset (125 × 150 × 125 mm in height, width, and depth). Consequently, we expect a mapping with low point density per area, which reduces the precision of the estimates. In the case of box type 7, the deviation among estimates is associated with deformation due to box usage. This combination of box types makes session 13 the most challenging session for the tracker in terms of precision.

The data in Fig. 10 show a high standard deviation per session (indicated by the black vertical line over each bar) for both algorithms. For the \(A_{woutpk}\) algorithm, the deviation is in the range [12, 18]%, while for \(A_{wpk}\) it is in the range [14, 27]%. These deviation levels are explained by significant changes in visible plane segments between consecutive frames. These changes result from three characteristics of the packing process and mapping: (i) the limited field of view of the camera relative to the scanned area, (ii) the continuous movement of the camera, and (iii) the dynamic nature of the packing process. This combination means that a different portion of the scene is continuously being mapped, with a different number of boxes, different levels of occlusion, and a different number of plane segments visible per frame. The increased deviation in the \(A_{wpk}\) algorithm is explained by the fact that, in addition to the variations in plane segments mentioned above, this algorithm discards plane segments in each frame through the filtering by prior knowledge stage described in section 3.1.3.

A statistical analysis was performed to test whether the superiority of the \(A_{wpk}\) algorithm illustrated in Fig. 10 is significant. First, the assumptions of normality and equal variances for the precision residuals were confirmed. Subsequently, an ANOVA showed that neither the speed-algorithm interaction (p value of 0.25) nor the camera speed (p value of 0.76) has a significant effect on the precision of the tracking system. The only factor with a significant effect is the algorithm used (p value of 0.00). Pairwise Tukey comparisons showed that precision increases with the algorithm \(A_{wpk}\), which records a mean precision of 57.20% across sessions, an increase of 21.97 percentage points over the mean precision reached with \(A_{woutpk}\) (35.23%).

4.1.2 F1 score

Fig. 11 Mean F1 score for top-plane segments. \(V-s\) indicates camera speed − session ID

Figure 11 compares the F1 score metric between the evaluated treatments during the detection of top-plane segments. This figure reiterates the superiority of the \(A_{wpk}\) algorithm over the \(A_{woutpk}\) algorithm in all sessions. This behavior was expected due to the definition of F1 score as the harmonic mean between precision (where \(A_{wpk}\) was superior) and recall (where the two algorithms were equivalent as presented in section 4.1.3).

The standard deviation is in the range [12, 18]% for the \(A_{woutpk}\) algorithm and [11, 25]% for \(A_{wpk}\). As with precision, these values are associated with significant changes in the number of visible plane segments between consecutive frames during tracking.

Statistical analysis of the F1 score residuals verified normality and equality of variances. Likewise, the ANOVA determined that neither the speed-algorithm interaction (p value of 0.67) nor the camera speed (p value of 0.83) has a significant effect on the system’s F1 score. Again, the only factor with a significant effect is the algorithm used (p value of 0.00). The pairwise Tukey comparison shows that the F1 score increases when using the \(A_{wpk}\) algorithm, reaching a mean value of 58.76% across sessions. This value is 14.59 percentage points above the F1 score recorded with the \(A_{woutpk}\) algorithm (44.17%).

4.1.3 Recall

Fig. 12 Mean recall for top-plane segments. \(V-s\) indicates camera speed − session ID

Figure 12 compares the recall achieved by the treatments in the tracking of top-plane segments. The equivalence in recall and standard deviation for both algorithm versions is evident. This behavior suggests that both algorithms are equally effective in detecting relevant plane segments in the scene, and that filtering by prior knowledge does not affect the algorithm’s ability to find these relevant plane segments. This behavior can be explained by considering that:

  1. The recall metric is calculated as a function of true positives and false negatives; it is independent of false positives.

  2. The only difference between the compared algorithms lies in applying the prior knowledge filtering block in the \(A_{wpk}\) algorithm. This block eliminates planes that do not correspond to boxes in the scene, reducing the number of false positives from the tracker. This block does not affect the number of false negatives.

Therefore, the filtering does not affect the recall metric, and the equivalence in recall achieved by the two algorithms is consistent.
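The argument above can be checked numerically with a minimal sketch. It assumes the standard definitions of precision, recall, and F1 score from true positives (TP), false positives (FP), and false negatives (FN), and an idealized filter that removes only false positives. The class `Counts`, the helper `filter_false_positives`, and the counts used are hypothetical and are not taken from the experimental data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # detections matched to a ground-truth plane segment
    fp: int  # detections with no ground-truth match
    fn: int  # ground-truth plane segments that were missed


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def filter_false_positives(c: Counts, removed_fp: int) -> Counts:
    """Idealized prior-knowledge filter (assumption): drops only false positives."""
    return Counts(tp=c.tp, fp=max(c.fp - removed_fp, 0), fn=c.fn)


# Hypothetical counts for a single frame.
without_prior = Counts(tp=6, fp=10, fn=4)
with_prior = filter_false_positives(without_prior, removed_fp=7)

for name, c in (("woutpk", without_prior), ("wpk", with_prior)):
    print(f"{name}: precision={precision(c):.3f} "
          f"recall={recall(c):.3f} F1={f1_score(c):.3f}")
```

In this sketch, recall is identical before and after filtering because neither TP nor FN changes, while precision and F1 score rise as false positives are removed. A real filter could also discard some true positives, which would lower recall; the equivalence reported in Fig. 12 suggests this effect is small in the evaluated sessions.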

Figure 12 also shows high standard deviations, in the range [11, 28]%. Again, this behavior is associated with significant changes in visible plane segments between consecutive frames. The maximum recall was achieved in session 33 (94%), and the minimum recall was recorded in session 13 (28%). The main differences between these sessions are the number of frames and the number of boxes: (37 frames, 8 boxes) for session 33 and (91 frames, 18 boxes) for session 13. It is worth noting that the session with the lowest precision (session 13) also exhibits the lowest recall, which reinforces the poor performance observed in sessions with small boxes (type 1) and boxes showing deformation from use (type 7).

4.2 Tracking of lateral-plane segments

The purpose of this test is to calculate and compare the metrics \( \overline{precision}_i^{lateral} \), \( \overline{recall}_i^{lateral} \), \( \overline{F1-score}_i^{lateral} \), \( precision\_std_i^{lateral} \), \( recall\_std_i^{lateral} \), \( F1-score\_std_i^{lateral} \).
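As an illustration of how such session-level metrics can be obtained, the sketch below summarizes a metric over the frames of one session. It assumes that the mean and standard deviation are taken over per-frame values and that the population standard deviation is used; the choice between population and sample standard deviation is an assumption here. The function name `session_summary` and the per-frame values are hypothetical.

```python
from statistics import mean, pstdev


def session_summary(per_frame_values):
    """Return (mean, standard deviation) of a metric over one session's frames."""
    values = list(per_frame_values)
    if not values:
        raise ValueError("a session needs at least one frame")
    # pstdev is the population standard deviation (assumption, see lead-in).
    return mean(values), pstdev(values)


# Hypothetical per-frame precision values for lateral-plane segments.
frame_precision = [0.2, 0.6, 0.4, 0.8]
precision_mean, precision_std = session_summary(frame_precision)
print(f"mean={precision_mean:.3f} std={precision_std:.3f}")
```

Large swings between consecutive frames, such as those in the hypothetical list above, translate directly into a wide standard deviation for the session.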

4.2.1 Precision

Fig. 13 Mean precision for lateral-plane segments. \(V-s\) indicates camera speed − session ID

Figure 13 compares the precision achieved by the treatments while tracking lateral box planes. The superiority of the \(A_{wpk}\) algorithm over the \(A_{woutpk}\) algorithm is maintained; however, the magnitudes of precision are evidently lower than those achieved in the tracking of top planes (see section 4.1.1). This behavior is associated with more detections of lateral planes than of top planes, as evidenced in Fig. 7. The behavior of the standard deviations remains, with \(A_{woutpk}\) recording values in the range [8, 17]% and \(A_{wpk}\) recording values in the range [9, 21]%.

The maximum precision (51%) is obtained in session 19 with the \(A_{wpk}\) algorithm, while the minimum (14%) is recorded in session 17 with the \(A_{woutpk}\) algorithm. These sessions differ in the number of frames recorded and in camera speed: (189 frames, 0.49 m/s) for session 19 and (50 frames, 0.69 m/s) for session 17. This pair of sessions suggests an inverse relationship between camera speed and precision in detecting lateral planes, although the ANOVA reported next does not find camera speed to be a significant factor.

The statistical analysis validated the assumptions of normality and equality of variances for the precision residuals. Consequently, an ANOVA was applied to test the superiority indicated by Fig. 13. This analysis determined that neither the camera speed–algorithm interaction (p value of 0.67) nor the camera speed (p value of 0.80) has a significant effect on the recorded precision. Again, the only factor with a significant effect is the algorithm used (p value of 0.00). Pairwise Tukey comparisons showed that precision increases when using the \(A_{wpk}\) algorithm, reaching a mean value of 31.71% across sessions; this value is 5.96 percentage points above the mean value recorded with the \(A_{woutpk}\) algorithm (25.75%).

4.2.2 F1 score

Fig. 14 Mean F1 score for lateral-plane segments. \(V-s\) indicates camera speed − session ID

Figure 14 compares the F1 score achieved with the evaluated treatments while tracking lateral planes. In this figure, the superiority of \(A_{wpk}\) is preserved in all evaluated sessions.

The maximum F1 score (62%) is recorded for session 33 with the \(A_{wpk}\) algorithm, while the minimum (21%) is reached in session 17 with the \(A_{woutpk}\) algorithm. The main difference between those sessions is the number of frames, 37 in session 33 and 50 in session 17, as presented in Table 4.

The ANOVA repeated the trends identified earlier: the F1 score is independent of the speed–algorithm interaction (p value of 0.81) and of the camera speed (p value of 0.62), and it depends on the tracking algorithm (p value of 0.00). Pairwise comparisons showed that the F1 score increases with the \(A_{wpk}\) algorithm, reaching a mean value of 40.04% across sessions. This value is 5.02 percentage points above the mean value recorded when using \(A_{woutpk}\) (35.02%).

4.2.3 Recall

Fig. 15 Mean recall for lateral-plane segments. \(V-s\) indicates camera speed − session ID

Figure 15 presents the comparison of recall achieved during the tracking of lateral planes. This figure preserves the equivalence in magnitude and standard deviation of recall between the evaluated algorithms. The maximum recall (93%) is reached in session 33, while the minimum (34%) is recorded in session 3. The main differences between the sessions are the number of frames, camera speed, and the number of boxes, as indicated in Table 4. It is worth highlighting that session 3, which contains small boxes, reached the lowest recall, reinforcing the poor performance observed in sessions with small boxes (type 1).

4.3 Comparative analysis

Table 5 Mean values of the computed metrics across sessions

The mean values of the precision, recall, and F1 score metrics recorded with the evaluated algorithms are presented in Table 5. The analysis on top and lateral planes confirmed the superiority of the \(A_{wpk}\) algorithm for all evaluated sessions. The results indicate that integrating prior knowledge of box size increases the precision and F1 score of visual tracking systems due to a significant reduction in the number of false positives in top- and lateral-plane segments. This increase in precision can benefit the tracking of objects composed of plane segments, such as boxes.

The precision of top-plane tracking was higher than the precision of lateral-plane tracking (by 9.48 percentage points without prior knowledge and 25.49 percentage points with prior knowledge). This behavior can be explained by considering that several elements in the scene that did not correspond to boxes (such as walls and structures limiting the working area) were mapped as lateral-plane segments (as illustrated by the irrelevant detections in Fig. 7). These detections increased the false positives in the evaluation of lateral planes, reducing both precision and F1 score relative to the same metrics for top planes.
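The differences quoted in this section follow directly from the mean values reported in Sections 4.1 and 4.2. The short sketch below reproduces that arithmetic, expressing every difference in percentage points. The dictionary layout and the helper names `gain_from_prior` and `top_minus_lateral` are illustrative choices; only the numeric means come from the text.

```python
# Mean values (in %) quoted in Sections 4.1 and 4.2.
means = {
    ("top", "precision"): {"woutpk": 35.23, "wpk": 57.20},
    ("top", "f1"): {"woutpk": 44.17, "wpk": 58.76},
    ("lateral", "precision"): {"woutpk": 25.75, "wpk": 31.71},
    ("lateral", "f1"): {"woutpk": 35.02, "wpk": 40.04},
}


def gain_from_prior(plane: str, metric: str) -> float:
    """Increase (percentage points) obtained by adding box size priors."""
    m = means[(plane, metric)]
    return round(m["wpk"] - m["woutpk"], 2)


def top_minus_lateral(metric: str, algorithm: str) -> float:
    """Gap (percentage points) between top-plane and lateral-plane tracking."""
    top = means[("top", metric)][algorithm]
    lateral = means[("lateral", metric)][algorithm]
    return round(top - lateral, 2)


for plane in ("top", "lateral"):
    for metric in ("precision", "f1"):
        print(f"{plane:8s}{metric:10s} gain: {gain_from_prior(plane, metric):.2f}")

for algorithm in ("woutpk", "wpk"):
    gap = top_minus_lateral("precision", algorithm)
    print(f"precision gap for {algorithm}: {gap:.2f}")
```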

The recall in top-plane tracking was higher than in lateral-plane tracking (by 4.44 percentage points for both algorithms). This behavior can be explained by considering that the mapping conditions associated with distance and orientation between the camera and the object are better for top-plane segments. To support this relationship, Table 6 shows that the point density per area was higher in top planes when contrasting plane segments with a similar camera-to-object distance \(d_{co}\).

Table 6 Quality factor and other descriptors for the local map of plane segments in session 10

The statistical analysis shows that the performance of the assessed algorithms is invariant to camera speed. This property is desirable because the speed of a human operator varies during repetitive operations such as the packing process.

According to our literature review, this is the first analysis of a 6D tracking system conducted in a manual intralogistics packing environment (real or simulated). In this sense, our results extend previous findings in the field of visual tracking systems by quantifying their performance in real load packing environments. The results indicate that the precision achieved is 19.4 percentage points lower than that reported for other experimental environments [36]. This outcome is associated with the following characteristics of the packing process and scene mapping:

  1. Lower quality of estimations associated with images with high noise content from a mobile device attached to an operator’s head.

  2. Increased odometric errors associated with large tracking areas (\(>10\,m^2\)).

  3. Greater variability of detected objects between consecutive frames when considering the combination of dynamic environments, reduced camera field of view relative to the scanned area, and a moving camera.

None of the identified works has approached the specific problem of 6D pose multi-tracking in manual packing applications using a mobile camera. Therefore, the results presented in this work are not directly comparable with other published results. The closest work is that of Ref. [36], which reports a precision of 76.6% in a task of tracking cube-shaped objects from synthetic RGB images with a fixed camera, a resolution of 1920 × 1080 pixels, and dynamic scenes. The main differences with our work are:

  1. Plane segment tracking: In Ref. [36], tracking is performed on box-shaped objects without focusing on detecting plane segments in intermediate instances. In our work, tracking is performed on plane segment-like objects. Detecting boxes involves grouping plane segments and eliminating those plane segments that do not have enough geometric relationships to be classified as box components. Therefore, precision is expected to increase in a box tracker fed by the plane segments detected by our algorithms, due to the elimination of plane segments that do not correspond to boxes.

  2. Variability of box properties: The precision of 76.6% reported by Ref. [36] was calculated for cube-shaped objects; however, the variability in sizes, the number of boxes per scene, the heterogeneity between box sizes in the scene, and the textures were limited compared to our work.

  3. Metrics: The authors of Ref. [36] do not report recall or F1 score metrics.

  4. Larger scanned area: The scanned area in our work is approximately 21 \(m^2\) (91 times larger than that used in [36]). Consequently, the odometric error in estimates made from captured images tends to increase.

  5. Images with low resolution and high noise content: The images used in our work come from the HoloLens 2 mobile sensor and, thus, exhibit high noise content as well as limitations in field of view (\(75^{\circ } \times 65^{\circ }\)), resolution (320 × 288 pixels), sampling frequency (1–5 fps), and functional depth-sensing distance, among others.

The camera path in this work is determined by the human operator, who considers internal factors (such as their experience) and external factors (such as operating procedures, layout of boxes in the consolidation zone, and physical packing sequence). This path impacts the number of redundant detections, as discussed in Sect. 3.1.4. The proposed algorithm is designed to address this, but other effects on the mapping stage of the pipeline must be considered. In the case of under-mapping, where the path avoids a specific area within the consolidation zone, objects in this area may be inadequately mapped or not mapped at all, reducing the precision in representing objects within that area. Conversely, in the case of over-mapping, where the path traverses an area multiple times under favorable conditions of distance and orientation between the camera and objects, objects within such an area will have a higher mapping density, leading to more precise estimations of their properties. A solution to avoid under-mapping issues is to explore the potential of the augmented reality system, which can suggest paths to the operator. This approach aims to improve spatial mapping by encouraging the exploration of unexplored zones and identifying more efficient paths to reach previously detected boxes.

5 Conclusion

This work addresses the problem of 6D pose tracking of plane segments from point clouds acquired from a mobile camera. Our literature review indicates that this study is a new exploration of 6D pose tracking that can potentially improve the performance of manual packing operations. The proposed algorithm includes a RANSAC fitting stage applied to the point cloud to identify and fit geometric planes, novel strategies to compute the 2D size and 6D pose of plane segments from the fitted point cloud, integration of local detections into a single global map by combining redundant detections based on a novel quality factor, and a particle management system to track and eliminate obsolete plane segments in dynamic scenes. A variant of the algorithm incorporates prior knowledge of box sizes in the scene to filter out irrelevant detections.

The contribution of this prior knowledge is assessed through an experimental design that compares the performance of a plane segment tracking system, considering variations in the tracking algorithm (\(A_{woutpk}\), \(A_{wpk}\)) and camera speed (\(V_{high}\), \(V_{low}\)). The analysis was performed separately for the boxes’ top- and lateral-plane segments. The results confirmed increased precision (21.97 percentage points for top planes, 5.96 for lateral planes) and F1 score (14.59 percentage points for top planes, 5.02 for lateral planes) when prior knowledge of the scene is included. In addition, the designed algorithms are robust to camera speed variation. This robustness is relevant for applications in manual operations characterized by the dynamic speeds of the operator (to whom the camera is attached). Furthermore, higher values of precision, recall, and F1 score were observed in tracking top-plane segments compared to lateral ones: with the \(A_{wpk}\) algorithm, the superiority of top-plane tracking was recorded at 25.49, 4.44, and 9.15 percentage points for the precision, recall, and F1 score metrics, respectively. This behavior relates to better mapping conditions associated with distance and orientation for top-plane segments. These implications are relevant to designing the strategy to group plane segments that belong to a single box in a box tracking system.

Compared with other developments reported in the literature on box tracking, the test scenarios used in this work are more complex and realistic. They involve greater variability of boxes in terms of quantity, size, and texture, occlusion between boxes, a larger consolidation area for boxes, mapping of a wide working area using a camera with a reduced field of view, noise in images due to the moving camera, and a dynamic environment with sequential output of boxes from the consolidation zone. This complexity in test conditions led to lower performance metrics in the presented work than those in the consulted literature. The proposed algorithms are limited to tracking plane segments that compose boxes with full support on surfaces parallel to the ground plane and without stacking.

Future work is proposed in four directions: exploring strategies to infer non-visible plane segments in stacking configurations using additional priors such as the physical packing sequence, geometric constraints between planes of a single box, and the number of boxes in the consolidation zone; developing a methodology for box tracking based on plane segment tracking, considering differences in precision and F1 score metrics between top and lateral planes; investigating the impact on metrics of using higher-resolution depth sensors, such as LiDAR; and enabling comparisons with other techniques by complementing them with the missing stages. For example, dynamic scene and 6D pose estimation features can be added to related works in 3D modeling based on geometric fitting. In addition, data-driven reconstruction and 6D pose estimation proposals, such as those in Ref. [51], can be modified to address the problem under multi-tracking and dynamic conditions.

Availability of data and materials

The datasets analyzed during the current study are available in the Dryad repository, https://datadryad.org/stash/share/EE0sEaiydea9XLlTy1lXY-SMDmRAWef-IppfoYwQAYM. The source code developed during the current study is available from the corresponding author on reasonable request.

Abbreviations

ADD: Average distance for distinguishable objects

ANOVA: Analysis of variance

defByUse: Deformed by use

diag: Diagonal

FN: False negatives

FP: False positives

ID: Identifier

LSD: Least significant difference

MoCap: Motion capture

RANSAC: Random sample consensus

RGB: Red, green, blue

RGB-D: Red, green, blue, depth

SLAM: Simultaneous localization and mapping

STD: Standard deviation

TP: True positives

highTxt: High texture

higPer: High perimeter

woutpk: Without previous knowledge

wpk: With previous knowledge

3D: Three dimensional

6D: Six dimensional

References

  1. D. Cuellar-Usaquen, G.A. Camacho-Muñoz, C. Quiroga-Gomez, D. Álvarez Martínez, An approach for the pallet-building problem and subsequent loading in a heterogeneous fleet of vehicles with practical constraints. Int. J. Ind. Eng. Comput. 12, 329–344 (2021). https://doi.org/10.5267/j.ijiec.2021.1.003

  2. A. Bortfeldt, G. Wäscher, Constraints in container loading - a state-of-the-art review. Eur. J. Oper. Res. 229, 1–20 (2013). https://doi.org/10.1016/j.ejor.2012.12.006

  3. A. Trivella, D. Pisinger, Bin-packing problems with load balancing and stability constraints. INFORMS Transportation and Logistics Society 2017 (2017)

  4. A.G. Ramos, J.F. Oliveira, M.P. Lopes, A physical packing sequence algorithm for the container loading problem with static mechanical equilibrium conditions. Int. Trans. Oper. Res. 23, 215–238 (2016). https://doi.org/10.1111/itor.12124

  5. P.G. Mazur, N.S. Lee, D. Schoder, T. Janssen, in Computational Logistics, ed. by M. Mes, E. Lalla-Ruiz, S. Voß (Springer International Publishing, Cham, 2021), pp. 627–641. https://doi.org/10.1007/978-3-030-87672-2_41

  6. B. Maettig, F. Hering, M. Doeltgen, Development of an intuitive, visual packaging assistant, vol. 781 (Springer International Publishing, Orlando, Florida, USA, 2019), pp. 19–25. https://doi.org/10.1007/978-3-319-94334-3

  7. V. Kretschmer, T. Plewan, G. Rinkenauer, B. Maettig, Smart palletisation: cognitive ergonomics in augmented reality based palletising. Adv. Intell. Syst. Comput. 722, 355–360 (2018). https://doi.org/10.1007/978-3-319-73888-8_55

  8. F. Lorson, A. Fügener, A. Hübner, New team mates in the warehouse: human interactions with automated and robotized systems. IISE Trans. 55, 536–553 (2023). https://doi.org/10.1080/24725854.2022.2072545

  9. Unocero. Centro de distribución mercado libre - así funciona (2019). https://www.youtube.com/watch?v=8eFhnpvaRB0&t=653s

  10. Z. Hashemifar, K.W. Lee, N. Napp, K. Dantu, in 2017 IEEE 11th International Conference on Semantic Computing (ICSC) (2017), pp. 526–531. https://doi.org/10.1109/ICSC.2017.78

  11. M. Mishima, H. Uchiyama, D. Thomas, R.I. Taniguchi, R. Roberto, J.P. Lima, V. Teichrieb, Incremental 3D cuboid modeling with drift compensation. Sensors (Switzerland) 19, 1–20 (2019). https://doi.org/10.3390/s19010178

  12. M. Ramamonjisoa, S. Stekovic, V. Lepetit, in Computer Vision - ECCV 2022, ed. by S. Avidan, G. Brostow, M. Cissé, G.M. Farinella, T. Hassner (Springer Nature Switzerland, 2022), pp. 161–177

  13. N. Olivier, H. Uchiyama, M. Mishima, D. Thomas, R.I. Taniguchi, R. Roberto, J.P. Lima, V. Teichrieb, Live structural modeling using rgb-d slam. Proceedings - IEEE International Conference on Robotics and Automation pp. 6352–6358 (2018). https://doi.org/10.1109/ICRA.2018.8460973

  14. R. Roberto, J.P. Lima, H. Uchiyama, V. Teichrieb, R.I. Taniguchi, Geometrical and statistical incremental semantic modeling on mobile devices. Computers & Graphics 84, 199–211 (2019). https://doi.org/10.1016/j.cag.2019.09.003

  15. D. Schmalstieg, T. Hollerer, Augmented Reality: Principles and Practice (Pearson Education, Los Angeles, 2016)

  16. R.F. Salas-Moreno, R.A. Newcombe, H. Strasdat, P.H.J. Kelly, A.J. Davison, in 2013 IEEE Conference on Computer Vision and Pattern Recognition (2013), pp. 1352–1359. https://doi.org/10.1109/CVPR.2013.178

  17. C. Zhang, Y. Hu, Cufusion: Accurate real-time camera tracking and volumetric scene reconstruction with a cuboid. Sensors (Switzerland) 17 (2017). https://doi.org/10.3390/s17102260

  18. T. Pöllabauer, F. Rücker, A. Franek, F. Gorschlüter, in Image Analysis, ed. by R. Gade, M. Felsberg, J.K. Kämäräinen (Springer Nature Switzerland, 2023), pp. 569–585

  19. S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, N. Navab, Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7724 LNCS, 548–562 (2013). https://doi.org/10.1007/978-3-642-37331-2_42

  20. J. Paulo, R. Roberto, F. Simões, M. Almeida, L. Figueiredo, J. Marcelo, V. Teichrieb, Markerless tracking system for augmented reality in the automotive industry. Expert Syst. Appl. 82, 100–114 (2017). https://doi.org/10.1016/j.eswa.2017.03.060

  21. L.C. Wu, I.C. Lin, M.H. Tsai, in Proceedings of the 20th ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (Association for Computing Machinery, New York, NY, USA, 2016), I3D ’16, p. 95-102. https://doi.org/10.1145/2856400.2856416

  22. S. Huang, W. Huang, Y. Lu, M. Tsai, I. Lin, in VISIGRAPP - Proc. Int. Jt. Conf. Comput. Vis., Imaging Comput. Graph. Theory Appl. (SciTePress, Prague, Czech Republic, 2019), pp. 375–382. https://doi.org/10.5220/0007692603750382

  23. Y. Wang, S. Zhang, S. Yang, W. He, X. Bai, Y. Zeng, A line-mod-based markerless tracking approach for ar applications. Int. J. Adv. Manuf. Technol. 89, 1699–1707 (2017). https://doi.org/10.1007/s00170-016-9180-5

  24. E. Fontana, W. Zarotti, D.L. Rizzini, in 2021 European Conference on Mobile Robots (ECMR) (2021), pp. 1–6. https://doi.org/10.1109/ECMR50962.2021.9568825

  25. T.P. Nguyen, S. Kim, H.G. Kim, J. Han, J. Yoon, in 2022 IEEE Eighth International Conference on Big Data Computing Service and Applications (BigDataService) (2022), pp. 22–26. https://doi.org/10.1109/BigDataService55688.2022.00011

  26. J. Yoon, J. Han, T.P. Nguyen, Logistics box recognition in robotic industrial de-palletising procedure with systematic rgb-d image processing supported by multiple deep learning methods. Eng. Appl. Artif. Intell. 123, 106311 (2023). https://doi.org/10.1016/j.engappai.2023.106311

  27. G. Zhang, Y. Kong, W. Li, X. Tang, W. Zhang, J. Chen, L. Wang, Lightweight deep learning model for logistics parcel detection. Vis. Comput. 40, 2751–2759 (2024). https://doi.org/10.1007/s00371-023-02982-z

  28. T. Chen, D. Gu, CSA6D: channel-spatial attention networks for 6D object pose estimation. Cogn. Comput. 14, 702–713 (2022). https://doi.org/10.1007/s12559-021-09966-y

  29. T. Chen, D. Gu, in IFAC-PapersOnLine, vol. 56 (Elsevier B.V., 2023), pp. 8048–8053. https://doi.org/10.1016/j.ifacol.2023.10.930

  30. F. Duffhauss, S. Koch, H. Ziesche, N.A. Vien, G. Neumann. SyMFM6D: Symmetry-aware multi-directional fusion for multi-view 6D object pose estimation (2023). https://doi.org/10.48550/arXiv.2307.00306

  31. H. Liu, G. Liu, Y. Zhang, L. Lei, H. Xie, Y. Li, S. Sun, A 3D keypoints voting network for 6DoF pose estimation in indoor scene. Machines 9 (2021). https://doi.org/10.3390/machines9100230

  32. L. Tian, C. Oh, A. Cavallaro, Test-time adaptation for 6D pose tracking. Pattern Recognition p. 110390 (2024). https://doi.org/10.1016/j.patcog.2024.110390

  33. F. Wang, X. Zhang, T. Chen, Z. Shen, S. Liu, Z. He, Kvnet: an iterative 3D keypoints voting network for real-time 6-dof object pose estimation. Neurocomputing 530, 11–22 (2023). https://doi.org/10.1016/j.neucom.2023.01.036

  34. Z. Liu, Q. Wang, D. Liu, J. Tan, Pa-pose: partial point cloud fusion based on reliable alignment for 6D pose tracking. Pattern Recogn. 148, 110151 (2024). https://doi.org/10.1016/j.patcog.2023.110151

  35. W. Zhu, H. Feng, Y. Yi, M. Zhang, Fcr-tracknet: towards high-performance 6D pose tracking with multi-level features fusion and joint classification-regression. Image Vis. Comput. 135, 104698 (2023). https://doi.org/10.1016/j.imavis.2023.104698

  36. A.S. Periyasamy, M. Schwarz, S. Behnke, in 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE) (2021), pp. 488–493. https://doi.org/10.1109/CASE49439.2021.9551599

  37. Y. Labbe, J. Carpentier, M. Aubry, J. Sivic, in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, ed. by Springer (Springer, 2020), pp. 574–591

  38. T. Nguyen, G. Reitmayr, D. Schmalstieg, Structural modeling from depth images. IEEE Trans. Vis. Comput. Graph. 21, 1230–1240 (2015). https://doi.org/10.1109/TVCG.2015.2459831

  39. F. Kluger, H. Ackermann, E. Brachmann, M.Y. Yang, B. Rosenhahn, Cuboids revisited: Learning robust 3D shape fitting to single RGB images. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition pp. 13065–13074 (2021). https://doi.org/10.1109/CVPR46437.2021.01287

  40. Z. Landgraf, R. Scona, T. Laidlow, S. James, S. Leutenegger, A.J. Davison, Simstack: A generative shape and instance model for unordered object stacks. Proceedings of the IEEE International Conference on Computer Vision pp. 12992–13002 (2021). https://doi.org/10.1109/ICCV48922.2021.01277

  41. M. Sundermeyer, Z.C. Marton, M. Durner, R. Triebel, Augmented autoencoders: implicit 3D orientation learning for 6D object detection. Int. J. Comput. Vis. 128, 714–729 (2020). https://doi.org/10.1007/s11263-019-01243-8

  42. H. Hu, F. Immel, J. Janosovits, M. Lauer, C. Stiller, in 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE) (2021), pp. 1097–1103. https://doi.org/10.1109/CASE49439.2021.9551449

  43. R. Schnabel, R. Wahl, R. Klein, Efficient ransac for point-cloud shape detection. The Eurographics Association and Blackwell Publishing 2007(26), 214–226 (2007). https://doi.org/10.1111/j.1467-8659.2007.01016.x

  44. R.B. Rusu, Z.C. Marton, N. Blodow, M. Dolha, M. Beetz, Towards 3D point cloud based object maps for household environments. Robotics and Autonomous Systems 56, 927–941 (2008). https://doi.org/10.1016/j.robot.2008.08.005. Semantic Knowledge in Robotics

  45. S.C. Stein, F. Wörgötter, M. Schoeler, J. Papon, T. Kulvicius, in 2014 IEEE International Conference on Robotics and Automation (ICRA) (2014), pp. 3213–3220. https://doi.org/10.1109/ICRA.2014.6907321

  46. C.H. Rodriguez-Garavito, G. Camacho-Munoz, D. Álvarez-Martínez, K.V. Cardenas, D.M. Rojas, A. Grimaldos, in Applied Computer Sciences in Engineering, ed. by J.C. Figueroa-García, J.G. Villegas, J.R. Orozco-Arroyave, P.A. Maya Duque (Springer International Publishing, Cham, 2018), pp. 453–463. https://doi.org/10.1007/978-3-030-00353-1_40

  47. G.A. Camacho-Muñoz, J.C.M. Franco, S.E. Nope-Rodríguez, H. Loaiza-Correa, S. Gil-Parga, D. Álvarez-Martínez, 6D-ViCuT: Six degree-of-freedom visual cuboid tracking dataset for manual packing of cargo in warehouses. Data in Brief p. 109385 (2023). https://doi.org/10.1016/j.dib.2023.109385

  48. L. Meier, ANOVA and Mixed Models: A Short Introduction Using R (Chapman and Hall - CRC, 2022), vol. 1, 1st edn., chap. 7. https://doi.org/10.1201/9781003146216

  49. T. Hodaň, F. Michel, E. Brachmann, W. Kehl, A. Glent Buch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, C. Sahin, F. Manhardt, F. Tombari, T.K. Kim, J. Matas, C. Rother, BOP: Benchmark for 6D object pose estimation. European Conference on Computer Vision (ECCV) (2018)

  50. S. Teruggi, F. Fassi, Hololens 2 spatial mapping capabilities in vast monumental heritage environments. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLVI-2/W1-2022, 489–496 (2022). https://doi.org/10.5194/isprs-archives-XLVI-2-W1-2022-489-2022

  51. B. Wen, J. Tremblay, V. Blukis, S. Tyree, T. Müller, A. Evans, D. Fox, J. Kautz, S. Birchfield, in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), pp. 606–617. https://doi.org/10.1109/CVPR52729.2023.00066

Acknowledgements

This work would not have been possible without the support of PSI Laboratory at Universidad del Valle and Voxar Laboratory at Universidade Federal de Pernambuco.

Funding

This work was supported by (1) the internal call for mobility support (CIAM) of Universidad del Valle, and (2) the General System of Royalties of the Department of Cauca (Colombia) through Cluster CreaTIC and its project “Fortalecimiento de las capacidades de las EBT-TIC del Cauca” (grant 1-2019).

Author information

Contributions

GAC: undertook the conception, data curation, work design, analysis, data interpretation, and algorithm development; drafted the work; and substantively revised and edited it. SENR: contributed to supervision, work design, conception, analysis, and data interpretation, and substantively revised and edited the work. HLC: contributed to work design, supervision, conception, analysis, and data interpretation, and substantively revised and edited the work. JPSML: contributed to data analysis and interpretation, and substantively revised the work. RAR: contributed to data analysis and interpretation, and substantively revised the work. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Guillermo A. Camacho-Muñoz, Sandra Esperanza Nope Rodríguez or Humberto Loaiza-Correa.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Camacho-Muñoz, G.A., Nope Rodríguez, S.E., Loaiza-Correa, H. et al. Evaluation of the use of box size priors for 6D plane segment tracking from point clouds with applications in cargo packing. J Image Video Proc. 2024, 17 (2024). https://doi.org/10.1186/s13640-024-00636-1

  • DOI: https://doi.org/10.1186/s13640-024-00636-1

Keywords