Semi-automated computer vision-based tracking of multiple industrial entities: a framework and dataset creation approach

This contribution presents the TOMIE framework (Tracking Of Multiple Industrial Entities), a framework for the continuous tracking of industrial entities (e.g., pallets, crates, barrels) over a network of, in this example, six RGB cameras. This framework makes use of multiple sensors, data pipelines, and data annotation procedures, and is described in detail in this contribution. With the vision of a fully automated tracking system for industrial entities in mind, it enables researchers to efficiently capture high-quality data in an industrial setting. Using this framework, an image dataset, the TOMIE dataset, is created, which at the same time is used to gauge the framework’s validity. This dataset contains annotation files for 112,860 frames and 640,936 entity instances that are captured from a set of six cameras that perceive a large indoor space. This dataset out-scales comparable datasets by a factor of four and is made up of scenarios drawn from industrial applications in the warehousing sector. Three tracking algorithms, namely ByteTrack, BoT-SORT, and SiamMOT, are applied to this dataset, serving as a proof-of-concept and providing tracking results that are comparable to the state of the art.


Introduction
The continuous, real-time tracking of entities of interest plays a crucial role in industrial settings from production facilities to warehouses [1]. In light of future challenges, automated, vision-based tracking of industrial entities helps increase process transparency [2]. The application potentials for the industry are manifold. With emerging needs in digitization and automation, industrial entities need to be continuously tracked in real time to increase the adaptability of logistics systems to different conditions in terms of layouts, conveyors, etc. The available but still unused information about these entities could be leveraged to automate and efficiently design the subsequent interfaces in the process.
The adoption of object tracking in the industry would facilitate the creation of a future-proof, scalable, and flexible infrastructure for monitoring processes. Process steps that rely on manual object identification or scanning equipment could then be eliminated and replaced by comparatively inexpensive cameras that function in an environment-agnostic manner. Given these requirements, our vision of a fully automated tracking of multiple entities in the industry can be articulated as follows: In an industrial environment, such as a warehouse, all entities should be continuously tracked, classified, and identified in real time. As a consequence, their location, 6D pose, and identity are known at all times. This remains the case when multiple entities are present at once, might be in motion, and might occlude one another. The sensors used for this purpose are comparatively inexpensive, do not need to meet a narrow set of criteria, do not need to be mounted in a very specific manner, and are easily obtainable. An example of such a sensor could be an RGB camera with a standard lens and resolution. The information that is inferred from the sensor data is used to monitor and optimize existing processes and to increase their transparency (e.g., in the form of a digital twin). Thanks to some of this information, novel processes might emerge. A visualization of how this vision might look when put into practice in a warehousing scenario can be seen in Fig. 1.
Besides the task of object tracking [2], research on the tracking of humans has also been performed [3][4][5]. We distinguish between (industrial) entities and (human) subjects as follows: objects have simple and predictable movement patterns with only a brief and limited motion profile, if any. Subjects, like humans, on the other hand, possess dynamic structures that are prone to self-occlusion, along with unpredictable and unrepeatable movement patterns. In our work, we only refer to object tracking; hence, we take only industrial entities into account.

Problem Statement
To put the vision of a fully automated tracking system described herein into practice, the following challenges have to be addressed: Realistic scenarios, demonstrating the movement of industrial entities throughout a common industrial environment, have to be chosen and planned. For this purpose, a viable data foundation has to be established, in the sense of entities that are commonly used in industrial settings and that are moved in the way in which they would be moved in such settings. Out of these scenarios, a dataset has to be created. This dataset needs to contain annotated recordings that can be used as trustworthy ground truth training data for a computer vision algorithm. A set of such algorithms has to be selected and applied to the recorded data, and subsequently be compared to one another based on pre-defined evaluation metrics. Describing all these challenges, however, reveals the challenge that is at the core of this undertaking: the lack of a recording framework that enables researchers to efficiently record and (semi-)automatically annotate data.

Goal of the Contribution
The goals of this contribution are the following: We aim to provide a framework for the continuous tracking of industrial entities over a network of cameras. The provision of such a framework for the research community is motivated by the increase in efficiency and the reduction of laborious annotation work that it entails. We will describe the process of creating this framework in detail. Further, we aim to create a dataset with high-quality ground truth data that can be used as a benchmark for subsequent research. This dataset will comprise multiple scenarios that we will establish and describe in this contribution and that closely resemble industrial scenarios. We subsequently aim to apply a set of algorithms to the dataset, so as to provide a proof-of-concept for our framework.

Structure and Methodological Approach
The next sections are structured as follows: Section 2 outlines and contextualizes the related work on computer vision-based tracking of entities. This is followed by an explanation of the conducted experiments and the methodology used in section 3. Section 4 shows the corresponding results. Finally, in section 5, the results are summarized and discussed, and an outlook is given on what further research in tracking industrial entities can look like. All in all, we want to provide a transparent approach to how state-of-the-art object tracking can be used as a benchmark for others. Our framework realizes tracking with a concrete approach that is also applicable to the industry and is a key element for practical application.

Related Work
Computer vision-based tracking is a research field that has gained attention in recent years. The rapid developments in this field of study have led to the emergence of numerous multi-object tracking algorithms and frameworks as well as datasets. Therefore, this section briefly presents the relevant literature related to camera-based object tracking techniques and frameworks, existing computer vision datasets, and methods of dataset creation. We also discuss existing tracking approaches in different application domains.

Camera based Object Tracking
The points of interest for camera-based object tracking are the detection of particular entities and the estimation of their movement trajectories while maintaining a distinct identification for each item within the camera view. In the current state of the art, Multi Object Tracking (MOT) is one of the computer vision-based tracking concepts that has been widely implemented in diverse application fields. The tendency to use MOT can be seen across various algorithms and benchmarks, and MOT can be applied to single camera or multi camera systems. However, the deployment of MOT applications faces some real-world challenges, e.g., the occlusion of entities over long time periods and the task of re-identification after the occlusion. Therefore, in this subsection, we review both systems to provide insights into their respective drawbacks and advantages.

Single Camera Systems
The single camera system is a fundamental system architecture for the development of tracking algorithms. Ciaparrone et al. [6] conducted a survey emphasizing the usage of Deep Learning (DL) in MOT for 2D data using the Single Camera Tracking (SCT) technique. The survey specifies that most MOT algorithms that are developed to be used with a single camera have four steps/stages in common: detection, feature extraction / motion prediction, affinity, and association. The implied aim is to implement DL at every stage and to evaluate the given algorithms as a whole on a MOTChallenge dataset [7]. The datasets mostly consist of benchmarks for pedestrian tracking. Deep learning is mostly used for the first two stages, while only a few contributions implement DL approaches for affinity and association.
In this survey [6], the authors emphasize three important parameters for deploying MOT algorithms: (i) the detection quality, (ii) the Convolutional Neural Network (CNN) used for feature extraction, and (iii) Single Object Tracking (SOT) trackers. In terms of detection quality, appropriate detectors must be carefully selected to reduce the number of False Negatives (FN) in the Multi-Object Tracking Accuracy (MOTA) score. According to the survey, the best-performing DL-based detector is the Faster Region-based Convolutional Neural Network (Faster R-CNN) from [8]. In contrast, the Single-Shot Detector (SSD) performs worse, as presented in [9,10]. However, SSD was almost able to work in real time (4.5 FPS), including the detection step.
For the feature extraction stage [6], the best-performing method, GoogLeNet [11], is applied to the MOT15 [12], MOT16, and MOT17 [13] datasets. Approaches that do not use appearance (whether they are deep or conventional methods) typically perform worse. Visual features alone, however, are insufficient to compute affinity; many of the better-performing algorithms additionally include other characteristics, particularly motion features. The integration of SOT trackers with private MOT detectors, along with DL, is considered to yield well-performing online trackers.
The authors of [14][15][16] have investigated a DL approach for affinity using the MOT16 [13] dataset. The works of [14,16] both demonstrate reliable similarity measures that support person re-identification after occlusions, and they reach the highest MOTA score of 49.3%. The survey also mentions that only a few works have used DL, such as Recurrent Neural Networks (RNN) [17], deep Multi Layer Perceptrons (MLP) [9], and Reinforcement Learning (RL) [18], to enhance the association process beyond classical association methods like the Hungarian algorithm. However, the usage of DL to directly guide the association algorithm and to perform tracking is still in its early stages.
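To make the classical association step more concrete, the following minimal Python sketch assigns tracks to detections so that the summed cost is minimal. It is an illustration only: the function name `optimal_assignment` and the example cost values are hypothetical, and an exhaustive search is used as a stand-in for the Hungarian algorithm, which finds the same optimum far more efficiently.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Return (track, detection) pairs that minimize the summed cost.

    cost[i][j] is the cost of assigning track i to detection j.
    Brute-force stand-in for the Hungarian algorithm; only practical
    for a handful of tracks. Assumes len(cost) <= len(cost[0]).
    """
    n_tracks, n_dets = len(cost), len(cost[0])
    if n_tracks > n_dets:
        raise ValueError("sketch assumes no more tracks than detections")
    best = min(
        permutations(range(n_dets), n_tracks),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    return list(enumerate(best))


# Hypothetical cost matrix (e.g., 1 - IoU) for two tracks and two detections.
pairs = optimal_assignment([[0.9, 0.1], [0.2, 0.8]])
print(pairs)
```

In practice, a polynomial-time solver would replace the exhaustive search, and pairs whose cost exceeds a gating threshold would typically be rejected.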
The Simple Online and Realtime Tracking (SORT) [19] algorithm is regarded as the foundation for the online and real-time application of MOT. This approach uses a Kalman Filter (KF) to predict the tracklet bounding box between frames, with a constant-velocity model as the motion model. One of the limitations of SORT is that it accumulates estimation errors of the entity position over time due to obstacles or non-linear motion. To overcome this issue, the BoT-SORT tracker [20] was developed by combining the benefits of camera-motion correction, motion and appearance information, and a more precise Kalman filter state vector. In addition, this tracker provides a novel, straightforward, and compelling technique that fuses Intersection over Union (IoU) and re-identification cosine distance in order to obtain stronger correlations between detections and tracklets. The authors [20] further integrate BoT-SORT into the novel ByteTrack [21], which uses the backbone of the high-performance detector YOLOX. Both BoT-SORT and ByteTrack are evaluated using the datasets from the MOT17 and MOT20 challenges. The trackers outperform all current trackers in the MOTChallenge, with results on the MOT17 test set of 80.2 IDF1 (the ratio of correctly identified detections to the average number of ground truth and computed detections), 65.0 HOTA (Higher Order Tracking Accuracy), and 80.5 MOTA.
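The two similarity cues mentioned above can be sketched in a few lines of Python. The IoU and cosine distance computations follow their standard definitions; the fusion rule in `fused_cost` (taking the smaller of the two distances) is a simplifying assumption for illustration and should not be read as the exact procedure published in [20]. All function names are hypothetical.

```python
import math


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine_distance(u, v):
    """1 - cosine similarity of two appearance feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def fused_cost(box_a, box_b, feat_a, feat_b):
    """Illustrative fusion (assumption): keep the smaller of both distances."""
    return min(1.0 - iou(box_a, box_b), cosine_distance(feat_a, feat_b))
```

A cost computed this way for every tracklet-detection pair would then form the cost matrix that is passed to the association step.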

Multi Camera Systems
As shown in the aforementioned survey results, SCT is a promising solution for handling MOT tasks. Nevertheless, SCT covers only the finite view of a single camera, which leads to inadequate detection, tracking, and re-identification robustness due to occlusions over longer time spans [22][23][24]. To resolve the occlusion problem, a Multi-Target Multi-Camera Tracking (MTMCT) approach is proposed by several contributions [25][26][27][28]. MTMCT combines the different perspectives from multiple networked cameras to detect and track entities. Common MTMCT pipelines begin with an SCT step based on tracking-by-detection for each camera [24][26][27][28]. Tracklets are then generated from the detections and serve as the input for obtaining the single-camera tracks of each entity in each camera. All tracked targets of each single camera are furthermore associated by using a camera clustering approach [28][29][30]. The final output of MTMCT is a set of Multi Camera Tracks (MCTs) in a high-dimensional space, which are obtained by the clustering step.
Zhang et al. [27] introduce a challenging benchmark for MOT on pedestrians that is comprised of two main modules: intra- and inter-camera tracking. Their dataset consists of non-overlapping video recordings from six to eight cameras with a resolution of 640 × 480. Intra-camera tracking generates tracklets for each individual camera using the SCT algorithm. SCT's output is then forwarded to the inter-camera module, where the data association takes place in the MTMCT system. Tracking Length (TL), Crossing fragments (XFrag), and Crossing ID-switches (XIDS) are three possible evaluation metrics. For scenarios two to six, TL results (percentage of the correctly tracked object) vary from 70% to 80%, XFrag results (number of times for a linked pair of tracks) range from 29 to 42 links, and XIDS results show 23 to 44 tracks that lack a link to the ground truth trajectories.
A survey on intelligent multi-camera video surveillance is carried out by the authors of [26]. Their work introduces key technologies: multi-camera calibration, computation of camera network topology, multi-camera tracking, object re-identification, and multi-camera activity analysis. The survey looks at ways of estimating 3D camera calibration, including intrinsic and extrinsic parameters, common ground plane, automatic calibration, and two cameras with substantial overlap. The survey also emphasizes the topology of multi-camera networks, which explains the handover of objects and the computation of the topology. There are multiple methods for topology computation, including correspondence-based, correspondence-free, and topology inferred by non-overlapping camera networks. The survey discusses in depth the ideas of inter-camera tracking based on multi-camera calibration, inter-camera tracking with appearance cues, and solving correspondence views across multiple cameras.
Specker et al. [28] define an occlusion-aware MTMCT approach for vehicle tracking and re-identification that enhances both SCT and Multi Camera Tracks (MCTs) operation. Furthermore, the authors adopt the global feature learning model from [31] to handle vehicle re-identification. To improve the resulting accuracy, a multiple re-identification network is applied. The SCT setup introduces an occlusion handling strategy and additional modules for filtering faulty detections. These steps can be achieved by using temporal information from tracks. The MCTs setup uses a novel pipeline that includes a scene model, filtering of tracks, re-identification distance calculation, and hierarchical clustering. The hierarchical cross-camera clustering based on vehicle re-identification features is adapted from the works of [32,33] to merge the multi-camera tracks by leveraging topological and temporal constraints of the tracks of each camera in the network. The authors [28] propose that, in order to decrease the negative influence of overlapping vehicles, one should improve re-identification by excluding boxes in the background or with occlusion.

Computer Vision Datasets
Successful deployment of DL-based computer vision applications relies on relevant and high-quality datasets [34]. Nowadays, datasets aim to encompass diverse and specific use cases, and current trends tend to be dominated by outdoor applications, e.g., MOTChallenge (MOT15 [12], MOT16, MOT17 [13], MOT20 [35]), KITTI [36], and MS COCO [37] (Common Objects in Context). The MOTChallenge is a popular framework containing a large collection of multiple people-tracking datasets in dense pedestrian scenarios and the evaluation benchmark for various tracker algorithms.
The MOTChallenge uses different metrics to evaluate the performance of MOT methods. Standard evaluation metrics include multi object tracking accuracy (MOTA) [38], higher order tracking accuracy (HOTA) [39], Identity F1 Score (IDF1) [40], and Identity switches (IDs) [21]. Metrics differ in their consideration of the causes of errors. The IDs metric counts the number of swapped object identities during tracking. The MOTA metric combines three sources of errors (false negatives, false positives, and identity switches) and is defined as follows:
\[ \mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDs}_t \right)}{\sum_t \mathrm{GT}_t} \]
where t is the frame index and GT_t is the number of visible ground truth objects in frame t [13].
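As a minimal sketch of how such a score is obtained in practice, the following Python function computes MOTA from per-frame error counts, assuming the commonly used form in which false negatives, false positives, and identity switches are summed over all frames and divided by the total number of ground truth objects. The function name and the dictionary keys are hypothetical, and the counts in the example are made up purely for illustration.

```python
def mota(frames):
    """MOTA from a list of per-frame counts.

    Each frame is a dict with the keys 'fn' (false negatives),
    'fp' (false positives), 'ids' (identity switches), and
    'gt' (number of ground truth objects in that frame).
    """
    errors = sum(f["fn"] + f["fp"] + f["ids"] for f in frames)
    total_gt = sum(f["gt"] for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground truth objects")
    return 1.0 - errors / total_gt


# Made-up counts for two frames with five ground truth objects each.
example = [
    {"fn": 1, "fp": 0, "ids": 0, "gt": 5},
    {"fn": 0, "fp": 1, "ids": 1, "gt": 5},
]
print(round(mota(example), 3))
```

Note that MOTA can become negative when the number of errors exceeds the number of ground truth objects.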
Alongside the TP, FP, and FN measures, the HOTA metric considers the classification of associations. Given a TP c, the set of True Positive Associations (TPAs) is the set of TPs with the same ground truth and predicted identities as c [39]. The HOTA metric with a localization threshold α is defined as:
\[ \mathrm{HOTA}_\alpha = \sqrt{\frac{\sum_{c \in \mathrm{TP}} \mathcal{A}(c)}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|}} \]
with
\[ \mathcal{A}(c) = \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|} \]
where FNA(c) and FPA(c) denote the sets of false negative and false positive associations of c, respectively [39]. The IDF1 Score considers the assignment of objects to their ground truth identities and is defined as:
\[ \mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}} \]
where IDTP, IDFP, and IDFN are the numbers of identity true positives, identity false positives, and identity false negatives [40].
An autonomous-driving related dataset is presented in KITTI [36], which covers various traffic scenarios. The published dataset contains six hours of video from the cameras and sensor measurements, which are captured at rates of 10-100 Hz. Moreover, the MS COCO [37] (Common Objects in Context) dataset provides daily life scenes with over 80 object classes and 200,000 labeled images. Despite its size, MS COCO does not cover industry-related computer vision applications.
MVTec ITODD [41] accommodates realistic industrial setups for 3D object detection and pose estimation. The dataset consists of 28 asset classes (e.g., engine parts, metal plates, bearings, injection pumps) that are sorted into more than 800 scenes and labeled using approximately 3,500 rigid 3D transformations as the ground truth [41]. Luo et al. [42] present a benchmark dataset for industrial tools (ITD) to identify different types of tools at the level of usage. This dataset aims to accurately forecast how a robot would interact with various industry settings. ITD includes more than 11,000 hand-labeled RGB images in eight tool categories with 24 general industrial tools in total, as well as multi-perspective views of every tool. Regardless of the various scenario views, this dataset only focuses on small industrial tools such as safety goggles, wrenches, screwdrivers, etc.
Synthetic industrial object datasets are created, e.g., in the works of [43,44]. The authors of [43] develop both real-world and synthetic data of industrial metal or reflective objects that are arranged as multi-view RGB images with 6D object pose labels. The real-world dataset contains 600 scenes with 31,200 RGB images, and the synthetic data provides 42,600 synthetic scenes containing 553,800 images. The twin resemblance of the synthetic and real-world datasets, including a controlled environment, facilitates simulation-to-real-world research. In this manner, computer vision-based simulations with scalable scenarios can be conducted. Akar et al. [44] propose synthetic datasets of industrial objects for object detection applications. The datasets comprise 200,000 photo-realistic generated images with precise bounding box annotations that are categorized into 8 industrial objects in 32 scenarios. The warehouse environment model as well as the datasets are rendered using NVIDIA Omniverse. The goal of these synthetic datasets is to automatically generate training data for multiple-object detectors that operate on genuine, real-world camera feeds.
The Logistics Objects in Context (LOCO) [34] dataset presents an indoor environment dataset for warehousing logistics. However, the LOCO dataset does not contain timestamps for the recorded image streams, which renders it unsuitable for object-tracking algorithms. This type of logistics or industry related dataset is rare to encounter in research [45][46][47]. The authors [34] intend to accelerate computer vision-based research for logistics by emphasizing the creation of objects and scenes of warehousing entities and privacy protection of image acquisition. The LOCO dataset has 39,101 images comprising 151,428 annotated logistics entities such as pallets, pallet trucks, and forklifts.

Dataset Creation Methods
The creation of industry related datasets is the topic of this subsection. Obtaining and annotating such datasets in an industrial environment can be difficult, as the process is time-consuming, susceptible to human mistakes, and constrained by various privacy and security regulations [34,43,44]. Therefore, using a semi- or fully automated pipeline for the dataset creation should be considered. All setups of the related industrial dataset papers are summarized in Table 1.
Semi-manual annotation of the 3D images of industrial objects is adopted in MVTec ITODD [41]. For each object, three types of scenes are captured: (i) those with only one instance of the object and no extra items, (ii) those with multiple instances of the object and no extra items, and (iii) those with both multiple instances of the object and additional clutter. Each scene is recorded once using a 3D industrial camera and twice using grayscale cameras: once with a randomly projected pattern and once without it. Both the grayscale and the 3D cameras are located on top of the shelf setup and are calibrated beforehand with regard to their position relative to the object. The recorded object is positioned on a calibrated turntable under the cameras, which allows multiple scenes to be captured automatically. In this manner, the ground truth 3D object poses are transferred directly for every rotation. Instead of using a bounding box as the correctness measure, the authors [41] implement a 3D pose based evaluation. The datasets are evaluated using 3D pose based methods: Shape-Based 3D Matching (S2D), Point-Pair Voting (PP3D), Point-Pair Voting with 3D edges (PP3D-E), Point-Pair Voting with 3D edges and 2D refinement (PP3D-E-2D), and RANSAC. Although S2D outperforms other methods when estimating the image results, a majority of the results are false positives. PP3D-E performs the prediction well, with a top-1 detection rate of 68% at the given threshold of 5%, but its running time is higher (by 0.1 s), which must be improved for industrial use.
The Industrial Tool Dataset (ITD) [42] is gathered utilizing a Kinect 2.0 sensor that can generate 30 RGBD frames per second, featuring a resolution of 1024 × 575 px, as well as 512 × 424 px depth frames. To collect the data, the tools are positioned at a distance of 1 m to 5 m from the camera. The tools are placed in their typical positions and industrial settings, while the camera is positioned at the same point of view as that of the worker's eyes. The worker walks smoothly around the target tool while maintaining a consistent focus on it. The labeling process is conducted manually by experts. Each worker is tasked with identifying the name of the tool, the category it belongs to, and its potential usage. The task requires a total of approximately 200 h to complete. The performed evaluations demonstrate that cluttered backgrounds and inconsistent ambient lighting impact tool detection. Moreover, the performance suffers from the visual blur induced by the worker's motion. To meet industrial requirements, the detection methods need to be refined.
The dataset for industrial metal objects, described in [43], is recorded in two parts: real-world and synthetic data. An industrial grasping robot, the Fanuc M20ia, is equipped with the data acquisition setup listed in Table 1 (except the 360° camera) to record multi-view images of various scenes in the real world. The real-world scene is captured by each camera from 13 different viewpoints to obtain the 6D poses of each object. Six different metal objects with different lighting setups are also considered during the recording. In addition, the objects are recorded in three different types of carriers: metal plates, small bins, and cardboard boxes. The labeling of 6D poses from object models is carried out semi-manually using a proprietary tool. The synthetic datasets are generated by mimicking the real-world scenes, i.e., poses, lighting, models, and textures, in Unity, for which the virtual environment uses an HDRI environment map. This map is constructed from images captured by a 360° camera using different exposures. Finally, all real-world and virtual scenes are compiled into the dataset, which contains subfolders for each camera ID and individual subfolders corresponding to each CAD model of the respective object. To evaluate the labeling performance, De Roovere et al. calculate the pose errors using the Maximum Symmetry-Aware Surface Distance (MSSD).
A fully synthetic dataset for warehousing environments is rendered in NVIDIA Omniverse based on the Universal Scene Description (USD) method [44]. Akar et al. employ Material AI tools to transform the captured images from real-world cameras and material scanners into realistic virtual models. The scene recording setups are emulated as authentic factory representations that have many assets and instances. For each scene recording, randomized locations and rotations are assigned to the camera in order to capture the scene's randomness from diverse perspectives. Subsequently, synthetic image generation is initiated to automatically and accurately annotate the images in each scene up to the pixel level. FRCNN ResNet50 surpasses the SSD DL model in detecting stillages, transport robots, dollies, and pallets, with Average Precision (AP) values at 0.5 of 69.90%, 89.93%, and 48.60%, respectively.
The recordings of the LOCO [34] dataset are captured using different types of cameras with diverse fields of view and resolutions in a real warehousing environment. The cameras are set up on a mobile unit with a special arm, thus enabling the re-adjustment of the cameras' point of view. The mobile unit moves around the warehouse while changing the cameras' perspectives. The captured images are recorded and stored at a frequency of 1 Hz. The LOCO annotator uses the backbone of the COCO annotator with additional features, such as an automated bounding box tool and new hotkeys. To ensure the privacy of the warehouse workers in the dataset, Mayershofer et al. utilize a neural network to automatically perform pixelization of all detected faces during the annotation phase. The evaluated models exhibit a lower performance compared to the COCO benchmark, with an mAP at 0.5 of approximately 20% to 40% on the LOCO benchmark.

Methodology
Due to the lack of publicly available object tracking datasets in the logistics and industrial domains, we collect a custom dataset and annotate it in a semi-automated fashion. The following section describes our dataset recording procedure, our dataset structure, and the annotation process. The word entity is used in this work to refer to the recorded objects. This does not apply to established terms from the literature, such as object pose estimation, object tracking, and object detection.

Planning and Execution of the Dataset Recording
We derive two scenarios from the warehousing sector that represent processes occurring in actual industrial use cases, namely a goods reception scenario and a block storage scenario. In order to ensure realistic circumstances, two different loading degrees of the pallets were recorded. In the first stage, only empty pallets are moved. The second stage involves fully loaded pallets. To ensure a realistic environment, we use six different industrial entities (small load carriers, pallets, barrels, cardboard boxes, forklifts, and a mesh box, as shown in Fig. 2). Pallets of different types were used. We define a pallet to be fully loaded if it is stacked with three layers of small load carriers on top of one another. In addition, entities such as barrels and cardboard boxes have been used and were not stacked. The first scenario, shown in Fig. 3, mimics an inbound material flow scenario that starts with an empty loading area, with the pallets set up to fill said area along the process. The dotted lines represent the spots that the pallets are placed in during this scenario. In one variant of this scenario, they are placed apart from one another, while in the other variant, they are placed more closely together. In the block warehouse scenario, shown in Fig. 4, the recordings begin with a block of pallets that is already set up. Subsequently, individual pallets are pulled out and moved outside of the field of view of the cameras. For this scenario, a 2 × 2 block of pallets has been used in the first stage, and a 3 × 3 one in the second stage. In total, seven recordings are performed, as shown in Fig. 5.
Fig. 5a shows scenario 1, stage 1, during which the pallets are arranged with a considerable distance between them. The inspiration for this scenario is that the two lanes that are built in this way could be found in the goods-receiving area of a warehouse, e.g., to unload trucks. The pallets are then unloaded, e.g., from a truck, and are placed far apart to allow warehouse workers to inspect the newly arrived goods. In Fig. 5b, scenario 1, stage 1 with the closely placed pallets is shown. This scenario mirrors the loading process as it could be expected to be performed to load a truck. Fig. 5c and Fig. 5d show the first scenario in its second stage, i.e., with loaded pallets. Lastly, Fig. 5e, 5f and 5g show the second scenario, which mimics a block warehouse, in the above-mentioned stages. During the recording of these scenarios, varying lighting conditions were used.

Setup and Data Collection
The area that is used to record the data proposed in this work is a former warehouse that has been transformed into an applied research facility. Its recording space is covered by six monocular RGB cameras providing parallel video streams. The area is also covered by a marker-based motion capture system [45] comprised of 52 infrared cameras. These cameras provide accurate poses of the tracked entities with respect to a common reference frame. This setup is shown in Fig. 6. The dataset is collected by deploying industrial entities within the recording space, according to the configuration of the scenarios mentioned in section 3.1. The entities are moved around by human operators to simulate inbound and outbound operations, again according to the previously described scenarios. While doing so, a video stream is captured through the RGB cameras. Simultaneously, the ground truth pose information for all tracked entities is acquired through the motion capture system.

Data Processing
The data collected by the motion capture system and the RGB camera system are processed on separate computers. The aim is to reduce the processing time necessary to request pose frames from the motion capture system and thus to increase the frames per second (FPS) of the streamed images from the RGB camera system. The frames from each of the six RGB cameras are collected on one computer along with their timestamps. The second computer collects information on entity IDs, entity poses, and timestamps from the motion capture system. The start and stop of collection from each of the systems are triggered manually. Each system's streams are synchronized in a post-processing phase.
In terms of hardware, six Genie Nano C2590 RGB cameras with 2 MP resolution are used. Each camera is fitted with a Kowa LM8HC-SW lens with a 79.4° × 63.0° field angle. All six cameras are connected to a 10 Gigabit Ethernet switch, which passes the streamed data to a data collection computer via an optical fiber network connection. The motion capture system, consisting of 52 cameras, uses a mixture of Vicon Vero and Vicon Vantage cameras that are mounted on the ceiling and at different elevations in our research facility.
RGB camera settings such as brightness and white balance values were allowed to update periodically throughout the recordings. Illumination in the recording space was kept constant throughout each individual recording and changed only between recordings; there was no significant color hue variation in the scene. Images were stored in raw BMP format and distortion was preserved.

Synchronization
The recording of video streams is event-triggered for each camera. However, to guarantee an equal number of retrieved images from all cameras, simultaneous capturing is necessary. Synchronized, simultaneous capturing also has the advantage of preserving the instantaneous state of the scene. Recording in such a manner can facilitate hand-offs between the different perspectives for multi-camera tracking algorithms. It also enables more accurate re-identification of entities from different viewpoints.
Simultaneous capturing is done for all cameras by triggering a single image capture on each camera, followed by trigger locking to prevent further capturing. The software lock is released on all cameras simultaneously only when image retrieval on all cameras has ended. Thus, for each capturing trigger, the slowest camera determines the overall FPS of the system. An average of approximately 20 FPS per scenario is achieved.
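The trigger-and-lock scheme described above can be illustrated with a thread barrier. The following is only a minimal sketch under stated assumptions, not the capture software used for the recordings: the function names and the `grab` callback are hypothetical, and each camera is simulated by one thread.

```python
import threading


def capture_synchronized(num_cameras, num_triggers, grab):
    """Collect `num_triggers` frames per camera in lock-step.

    `grab(cam, k)` is a hypothetical callback that returns frame k of camera cam.
    """
    frames = [[] for _ in range(num_cameras)]
    # The barrier stands in for the software lock: no camera may start
    # capture k+1 before every camera has finished retrieving capture k.
    barrier = threading.Barrier(num_cameras)

    def worker(cam):
        for k in range(num_triggers):
            frames[cam].append(grab(cam, k))  # single image capture
            barrier.wait()                    # locked until all cameras are done

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(num_cameras)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return frames


def system_fps(retrieval_times_s):
    """The slowest camera bounds the frame rate of one synchronized capture."""
    return 1.0 / max(retrieval_times_s)
```

With this scheme every camera ends up with the same number of frames, and a single slow retrieval (for example 0.05 s) limits the whole system to about 20 FPS for that trigger.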
Beyond achieving synchronization amongst the RGB cameras, it is necessary to synchronize the RGB camera system with the motion capture system due to differences in their data capturing rates. During our experiments, the motion capture system had a fixed pose update rate of 200 Hz. We match image frames to their respective poses based on the smallest timestamp difference between both instances. Since entities in the scene move at less than 1 m/s and the update rate of the motion capture system is high, pose differences between consecutive pose frames are insignificant. The synchronization between both streams is accomplished as a post-processing step.
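The nearest-timestamp matching can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, the sign convention of the returned offset, and the assumption of evenly spaced poses without dropouts are ours.

```python
import bisect


def match_nearest(image_ts, pose_ts):
    """For each image timestamp, return (pose index, offset) of the closest pose.

    `pose_ts` must be non-empty and sorted ascending; timestamps are in seconds.
    The offset is pose time minus image time (an assumed convention).
    """
    matches = []
    for t in image_ts:
        i = bisect.bisect_left(pose_ts, t)
        # Candidates: the pose just before and the pose just after the image.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(pose_ts)]
        best = min(candidates, key=lambda j: abs(pose_ts[j] - t))
        matches.append((best, pose_ts[best] - t))
    return matches


POSE_RATE_HZ = 200.0   # pose update rate stated in the text
MAX_SPEED_M_S = 1.0    # upper bound on entity speed stated in the text

# Worst case under the even-spacing assumption: the image falls exactly
# halfway between two consecutive poses.
max_offset_s = 0.5 / POSE_RATE_HZ
max_displacement_mm = MAX_SPEED_M_S * max_offset_s * 1000.0
```

Under these assumptions the largest possible offset is 2.5 ms, which corresponds to an entity displacement of at most 2.5 mm and supports the claim that the residual pose error is insignificant.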

Data Structure
Since the currently available datasets for object tracking lack the combination of systems used in this work, we collect our data and process it into a custom data structure. The final annotation data structure of our custom dataset is shown in Table 2.
The Image Path refers to the relative image path with respect to each camera view. Images are converted to JPG format for efficient storage. Entity Name refers to the entity ID as retrieved via the motion capture system. It is worth noting that initially an entry is preserved for all entities in each captured image, regardless of their presence in the captured scene. During the annotation phase, as discussed in section 3.4, invalid projections of the entities' 3D models are removed. Position and Orientation are 3 × 1 vectors defining the relative pose of the entities in 3D space with respect to each camera. Position data are provided in mm and orientation data are provided in radians in intrinsic XYZ Euler format. The position is obtained with respect to the motion capture system's global reference frame. The reference frames of the motion capture system and the RGB camera system are unified to enable the calculation of the transformation chain generating the entity's relative pose. The entry Delta Time is the smallest calculated time offset between the capturing time of the RGB image and its corresponding pose. The Bounding Box is the 4 × 1 vector defining the pixel coordinates of the top-left corner (x and y), along with the width and height of the box. The Visible flag indicates whether an entity is perceived in the field of view of the respective camera. The flag is generated automatically as part of the post-processing step of the annotation pipeline. This is accomplished by disregarding invalid entity 3D model projections when rendered at their ground truth pose, as discussed in section 3.4. Bounding boxes that correspond to entities that are invisible in the relevant camera view are denoted with coordinates of −1. Invalid data from the motion capture system, such as those obtained when an entity is outside the system's region of operation, are filtered out in a post-processing step.
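One annotation entry as described above could be represented as follows. This is a hedged sketch: the class name, the field names, and the example values are hypothetical and only mirror the columns named in the text; the unit of Delta Time is not stated in the text and is therefore left open.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Convention from the text: invisible entities get bounding box coordinates of -1.
INVISIBLE_BOX = (-1, -1, -1, -1)


@dataclass
class AnnotationRecord:
    image_path: str                               # relative path per camera view
    entity_name: str                              # entity ID from motion capture
    position_mm: Tuple[float, float, float]       # 3 x 1 position in mm
    orientation_rad: Tuple[float, float, float]   # intrinsic XYZ Euler angles
    delta_time: float                             # image-to-pose time offset
    bounding_box: Tuple[int, int, int, int]       # top-left x, y, width, height
    visible: bool

    def corners(self) -> Optional[Tuple[int, int, int, int]]:
        """Return (x_min, y_min, x_max, y_max), or None for an invisible entity."""
        if not self.visible or self.bounding_box == INVISIBLE_BOX:
            return None
        x, y, w, h = self.bounding_box
        return (x, y, x + w, y + h)
```

Keeping the box as top-left corner plus width and height follows the description in the text; the `corners` helper merely converts it to the corner form that many detection toolkits expect.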

Annotation
To maximize image capturing throughput, we separate the data collection phase from the annotation phase. In the annotation phase, we generate image annotations in an automated fashion by leveraging the 3D models' projection at the ground truth poses collected from the motion capture system. The annotation pipeline fits bounding boxes to the 2D image projections of the 3D models at their obtained poses in the scene relative to the camera of interest.
The annotation pipeline comprises several phases. Initially, the RGB images and motion capture system poses are collected simultaneously. Then the reference frames of the motion capture system and the RGB camera system are unified, and incoming streams from both systems are synchronized. This is followed by the main phase, during which the relative transformations are calculated between the tracked entities of interest and each camera. Finally, the 3D models are projected at their calculated relative transformations, where they are fitted with bounding boxes to generate the final image annotations.
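The final phase, fitting a bounding box to a projected 3D model, can be illustrated with a simplified stand-in. The sketch below does not reproduce the rendering-based pipeline; it assumes an ideal pinhole camera without lens distortion, projects model vertices that are already expressed in the camera frame, and uses hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`) and image size.

```python
def project_point(p, fx, fy, cx, cy):
    """Project a 3D point (camera frame, z forward) to pixel coordinates."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def fit_bounding_box(points_cam, fx, fy, cx, cy, width, height):
    """Return (x, y, w, h) around the projected points, or None if invalid.

    Points behind the camera are skipped and the box is clipped to the image.
    A projection that ends up empty or outside the image counts as invalid.
    """
    pixels = [project_point(p, fx, fy, cx, cy) for p in points_cam if p[2] > 0]
    if not pixels:
        return None
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    x0, y0 = max(min(us), 0.0), max(min(vs), 0.0)
    x1, y1 = min(max(us), float(width)), min(max(vs), float(height))
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1 - x0, y1 - y0)
```

Returning `None` for an empty projection corresponds to the invalid projections that are removed during annotation; a production pipeline would additionally have to account for the preserved lens distortion.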

Results
The TOMIE dataset presented herein includes a total of 112,860 images and 640,936 entity instances. In comparison to similar datasets, the number of captured images exceeds that of the biggest dataset [13] by a factor of 4, while the number of captured entity instances is approximately 25% smaller.
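The reported totals can be cross-checked against other figures stated in this contribution (six cameras, approximately 20 FPS, approximately 16 min of recordings, 1.5 s of annotation time per instance). The short calculation below is only a consistency check; it assumes that the images are distributed equally across the cameras, as the synchronized capturing implies.

```python
TOTAL_IMAGES = 112_860
TOTAL_INSTANCES = 640_936
NUM_CAMERAS = 6
APPROX_FPS = 20                 # average frame rate reported for the recordings
ANNOTATION_S_PER_INSTANCE = 1.5 # annotation time reported per object instance

frames_per_camera = TOTAL_IMAGES / NUM_CAMERAS           # 18810.0
duration_min = frames_per_camera / APPROX_FPS / 60       # 15.675
instances_per_image = TOTAL_INSTANCES / TOTAL_IMAGES     # roughly 5.68
annotation_s_per_image = ANNOTATION_S_PER_INSTANCE * instances_per_image  # roughly 8.5
```

The resulting duration of about 15.7 min agrees with the approximately 16 min of recordings, and about 8.5 s of annotation time per image is in line with the reported average of roughly 9 s.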
The annotations were generated using a computer equipped with an Intel Core i9 processor with 28 cores and 128 GB of RAM. The renderer deployed, VisPy [50], uses the onboard Nvidia Titan Xp GPU with 12 GB of VRAM throughout the annotation process. Samples of annotated images are shown in Fig. 7. We provide the source code for our automated annotation pipeline for public usage, as well as the source code for our data collection phase. During the annotation process, an average of 1.5 s was spent on each object instance in the recording. This amounts on average to 9 s spent per image for the annotation of all visible entities. The annotation speed achieved through the use of automated annotation is significantly higher than that of comparable manual annotation, like the one described in [51]. Dataset statistics per camera and per entity are shown in Table 3 and Table 4.
To evaluate how far our custom dataset can be used for training classifiers that achieve a performance sufficient for industrial applications, multiple experiments were conducted. For these experiments, three of the currently best-performing models for the MOT20 [35] dataset, namely ByteTrack [20], SiamMot [52], and Bot-SORT [20], were chosen. Publicly available, official implementations of all models were used during the evaluation. The ByteTrack and Bot-SORT models rely on YoloX [53] as a backbone for object detection. To this end, one YoloX model was pre-trained on our custom dataset to be used for both evaluation models. The average precision and recall of the resulting model were measured and are shown in Table 5. The resulting object detections are also visualized on samples of our custom dataset in Fig. 8. All models were trained and evaluated on our custom dataset in accordance with their respective work. For evaluation, the CLEAR metrics [38], including MOTA, as well as IDF1 and HOTA were used. These metrics evaluate different aspects of the detection and tracking performance. The results are displayed in Table 6.
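For readers unfamiliar with the metrics, the standard definitions of MOTA and IDF1 can be written down compactly. The sketch below uses the commonly cited formulas; the function names are ours, and the counts in the usage note are made up for illustration and are not results from this work. HOTA is omitted because it requires averaging over localization thresholds.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Identity F1 score: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```

For example, 100 misses, 50 false positives, and 10 identity switches over 1,000 ground truth boxes give a MOTA of 0.84. MOTA can become negative when the number of errors exceeds the number of ground truth boxes, whereas IDF1 always lies between 0 and 1.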
While the TOMIE dataset is composed of more data, the results show that the performance of the tracking algorithms does not match that achieved on similar datasets. This deficit could be the result of the change in observed entities compared to MOT20, as well as limitations in the dataset itself.

Conclusion and Outlook
In this contribution, a novel framework and approach for the efficient computer-vision-based tracking of multiple industrial entities was presented. Using a space of approximately 16 m × 8 m in a warehousing environment, with 52 infrared cameras and six RGB cameras mounted on the ceiling and railings of this warehouse, a tracking space was defined. In this space, six types of industrial entities, namely small load carriers, pallets, barrels, cardboard boxes, forklifts, and a mesh box, were tracked using reflective markers, infrared tracking hardware, and tracking software. With this tracking setup, the TOMIE dataset presented herein was recorded, including 112,860 frames of RGB images and annotation files that amount to approximately 16 min of recordings after data synchronization and filtering. The recordings were subdivided into distinct logistical scenarios drawn from industrial applications (e.g., setting up pallets in lanes to be loaded into trucks). Three commonly used tracking algorithms, namely ByteTrack, SiamMot, and Bot-Sort, were applied to the dataset developed herein, performing overall worse than on comparable state-of-the-art datasets.
While developing the recording setup, during the process of recording itself, and while evaluating the resulting data and its use, additional limitations and challenges were encountered.

Limitations and Challenges
While setting up the camera network for recording, a major challenge arose in marking the industrial entities in a way that made them detectable and distinguishable for the infrared cameras. As previously described, the marking tape needed to be distributed along the faces of the entities in such a unique way that they would be distinguishable by virtue of the resulting point cloud. When working with a limited number of entities that have large surface areas, this typically does not cause any trouble. However, applying the same approach to a multitude of entities, especially smaller ones (e.g., the small load carriers in our dataset), causes the infrared cameras to yield suboptimal tracking results.
In addition, the proximity of the tracked entities to one another further complicated the tracking process. When the markers on the edges of one entity came too close to those of another, one or both entities tended to disappear in the tracking software, resulting in frames that provide users with no positional ground truth. However, both the ground truth and a realistic positioning of the entities that resembles industrial applications are of importance.
Furthermore, the software used in the tracking setup presented herein does not enable the tracking of human motion. The operators in the recorded tracking scenarios were therefore not tracked and come with no labeled ground truth in our dataset. The addition of such data might be of interest for researchers in the field of human activity recognition or person re-identification.
Once recorded, the data proved challenging to interpret for the purpose of frame-wise object detection, due to the use of multiple RGB cameras and the underlying ground truth being based on infrared cameras. This is because the ground truth is calculated based on the markers on the given entity in combination with its rendered 3D model. This set of data provides no information on visual occlusion by other entities present in the recording. As a result, the 2D bounding boxes created as ground truth are accurate in free space but would yield poor IoU results when compared with the output of common object detection algorithms, which would only detect the non-occluded parts of the entities. In addition, when using more than one RGB camera, the notion of occlusion becomes even more complicated to deal with, as an entity that is occluded in one perspective might be entirely visible in another. This results in bounding boxes being created for entities that are entirely occluded in some perspectives, which would lead to an IoU of 0% if the data were put to the test.
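The effect of occlusion on the IoU can be made concrete with a small example. The sketch below uses the (x, y, width, height) box format of the annotation files; the function name and the example boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


ground_truth = (100, 100, 200, 200)   # full extent of the entity
visible_half = (200, 100, 100, 200)   # detector only sees the right half
half_occluded_iou = iou(ground_truth, visible_half)
```

In this example, an entity whose left half is hidden yields an IoU of only 0.5 even if the detector localizes the visible part perfectly, which would just meet a typical matching threshold of 0.5; any stronger occlusion falls below it.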
Subsequently, once an industrial entity has been detected, the interest would lie in the classification and identification of said entity. While classification is in part feasible with the recording setup presented herein, the identification of specific entities, analogous to the work presented in [46], would necessitate an altered sensor use. More specifically, this would entail the use of cameras at a level close to the ground and closer to the recorded entities, so as to capture their surface structure in more detail. This, however, might lead to further occlusions due to camera positioning.
Looking back at the vision for a tracking system that was established at the beginning of this contribution, some limitations still persist. One such limitation is the above-mentioned occlusions, which do occur in industrial scenarios and cannot be controlled. In addition, since this work was conducted in only a single recording environment, it is yet to be evaluated whether the selected algorithms would perform similarly in another environment.
Finally, while handling the recorded data, synchronization problems occurred, in which the RGB and infrared frames were not overlapping as they should. The reason for this has yet to be explored further. Additionally, the volume of the data that is generated using this recording setup is not to be underestimated. An efficient way of handling such large amounts of data is also of great importance to increase the efficiency and applicability of our recording approach.

Follow-up Research
Taking the limitations mentioned in the previous subsection and our results in general into account, we identified the following ways in which our contribution could be expanded upon. The recorded scenarios could be expanded in terms of their diversity (i.e., different versions of the same scenarios or more scenarios to begin with) and their duration. Furthermore, the complexity of the scenarios could be increased by including a greater number of industrial entities and a greater number of entity classes, including human operators.
The way in which the industrial entities are marked with reflective tape could be analyzed once more, creating a system that would allow for a more reliable marking of a larger number of entities. In doing so, reproducibility and result quality could be enhanced.
Finally, the tracking software that was used thus far could be replaced by a self-developed one, which could be tailored to a multi-camera setup. This tracking software might then be able not only to provide bounding boxes that take occlusions into account but also to provide 3D bounding boxes, including information on the entity's orientation in space. The use of depth information (e.g., by virtue of RGBD cameras) might be necessary to accomplish this task.

Fig. 1: Visualization depicting the RGB camera based tracking of multiple industrial entities in a warehousing environment.

Fig. 3: Schematic illustration of scenario 1 with its two degrees of pallet proximity.

Fig. 4: Schematic illustration of scenario 2 with its two block warehouse pallet ordering structures.

Fig. 5: Frames taken from the two scenarios and their respective stages, used for our recordings.

Fig. 6: Data collection setup. (a) shows the RGB images of the same scene as viewed from the six cameras, (b) shows entities as perceived in the motion capture system (obtained for a different scene). Rays show the retro-reflective markers detected by the system.

Fig. 7: Samples of annotated images from a single view and different scenarios from our custom dataset. Bounding box colors are unique to each entity class.

Fig. 8: Samples of images from different scenarios of our custom dataset, annotated by the chosen object detector.

Table 1: Comparison of industry-based dataset creation setups.

Table 2: Sample entries in post-processed annotated data.

Table 3: Dataset statistics per camera.

Table 4: Dataset statistics per entity class.

Table 5: Average precision (AP) and average recall (AR) for bounding box estimation of industrial entities.

Table 6: Results on our validation data.