Comparison of synthetic dataset generation methods for medical intervention rooms using medical clothing detection as an example

The availability of real data from areas with high privacy requirements, such as the medical intervention space, is low, and its acquisition is complex in terms of data protection. To enable research on assistance systems in the medical intervention room, new methods for data generation in these areas must be investigated. Therefore, this work presents a way to create a synthetic dataset for the medical context, using medical clothing object detection as an example. The goal is to close the reality gap between synthetic and real data. Methods of 3D-scanned clothing and designed clothing are compared in a Domain-Randomization and a Structured-Domain-Randomization scenario using two different rendering engines. Additionally, a Mixed-Reality dataset recorded in front of a greenscreen and a target domain dataset were used, the latter serving to evaluate the different datasets. The experiments are designed to show whether scanned clothing or designed clothing produces better results in Domain Randomization and Structured Domain Randomization. Likewise, a baseline is generated using the Mixed-Reality data. A further experiment investigates whether the combination of real, synthetic and Mixed-Reality image data improves the accuracy compared to real data only. Our experiments show that Structured-Domain-Randomization of designed clothing together with Mixed-Reality data provides a baseline achieving 72.0% mAP on the test dataset of the clinical target domain. When additionally using 15% (99 images) of the available target domain train data, the gap towards 100% (660 images) target domain train data could be nearly closed: 80.05% mAP compared to 81.95% mAP. Finally, we show that when additionally using 100% target domain train data, the accuracy could be increased to 83.35% mAP. In conclusion, the presented modeling of health professionals is a promising methodology to address the challenge of missing datasets from medical intervention rooms. We will further investigate it on various tasks, like assistance systems, in the medical domain.


Introduction
For computer vision challenges in the medical intervention space, such as object detection, person detection or more sophisticated challenges such as activity detection, few datasets exist [1,19]. These datasets focus on 2D and 3D detection of the human pose. Additionally, although multiple cameras are used in the datasets, only around 700 frames per camera are annotated. Furthermore, datasets besides the aforementioned are institution-specific and not publicly available, possibly due to data protection regulations and ethics requirements [18,28]. Likewise, the datasets may lack the necessary variance for transferability to other locations. Moreover, other use cases besides the detection of the human pose exist. These include AI-based sterility detection of health professionals, detecting whether and where certain medical devices are located, or action recognition of health professionals [14]. Among the works using camera-based systems during real interventions, to the best of our knowledge, no datasets besides those mentioned are publicly available.
The successes of deep learning in recent years are due, among other things, to the availability of large datasets such as ImageNet [16] for image classification or MS COCO [8] for object bounding box detection. In addition, research on new methods and architectures such as [4,9,21,2] and on high-performance hardware for parallel computing should be mentioned. The authors of [20,7] analyze the hardware topic in depth. In this work, however, special focus is put on the availability of datasets and on methods for dataset generation in order to reduce the necessary amount of real data from the target domain.
Several works already deal with the generation of synthetic data and with the goal of reducing the reality gap between synthetic and real data. Among these are the work on Domain Randomization (DR) and Structured Domain Randomization (SDR) [24,25,12]. In addition to other work, it has already been shown that the use of synthetic image data can decrease the required amount of real data [25]. Likewise, synthetic data of persons exist [5]. However, the challenge lies in the specific characteristics of a medical intervention space. Health professionals wear special clothing with sometimes multiple layers, wear sterile gloves, masks and hairnets. The differences between conventional human data and data from the medical field are large. Nevertheless, DR techniques sound promising for use in research questions around medical interventions.
This work presents a comparison in terms of detection accuracy and generalizability of different methods for synthetic clothing generation using either 3D clothing scans (SCANS) or designed CAD clothing (CAD) with the Skinned Multi-Person Linear Model (SMPL) [10]. The comparison is performed using the example of medical clothing object detection. To generate synthetic training data, both methods (SCANS, CAD) are incorporated into a DR environment called NVIDIA Deep Learning Dataset Synthesizer (NDDS) [23] and an SDR environment implemented in Unity, based on [22]. Likewise, the aim of the presented methodology is to explore a pipeline for the generation of synthetic data for the medical field, so that further research questions from the intervention

Related work
With the rise of synthetic data generation methods, for example DR [24], it has already been shown that synthetic data can reduce the amount of real data required [25]. However, one focus of research is the reduction of the reality gap between the synthetic data and the target domain.
Here, the aforementioned DR has turned out to be one way to reduce the gap. One idea of DR is that if enough variance can be generated in the synthetic data, reality represents another variance of the target domain [24].
The work of Tobin et al. [24] and Tremblay et al. [25] showed that an object detection network for robot grasping or car detection can be trained solely from synthetic images with random positioning, random lighting, random backgrounds, distractor objects and non-realistic textures. In addition, the work of Tremblay et al. showed that the necessary amount of real target domain data can be reduced while maintaining adequate accuracy when the network is pretrained with DR-generated images.
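To make the randomization idea more concrete, the following minimal Python sketch draws one scene configuration per image from simple random distributions. It is an illustration only: the field names, value ranges and pool sizes are assumptions and are not taken from [24] or [25], and a real DR pipeline would pass such parameters to a rendering engine.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneSample:
    object_position: tuple   # (x, y, z), hypothetical range in metres
    light_intensity: float   # arbitrary units, hypothetical range
    background_id: int       # index into a pool of background images
    texture_id: int          # index into a pool of non-realistic textures
    num_distractors: int     # number of distractor objects in the scene


def sample_scene(rng, num_backgrounds=100, num_textures=500, max_distractors=10):
    """Draw one randomized scene configuration (all ranges are assumptions)."""
    position = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
    return SceneSample(
        object_position=position,
        light_intensity=rng.uniform(0.2, 2.0),
        background_id=rng.randrange(num_backgrounds),
        texture_id=rng.randrange(num_textures),
        num_distractors=rng.randint(0, max_distractors),
    )


# A seeded generator keeps the sampled dataset reproducible.
rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(5)]
```

Each rendered image would then correspond to one such sample, so that the variance in the dataset comes from the sampling ranges rather than from hand-built scenes.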
The work of Borkman et al. [3] also showed that, when using Unity Perception for synthetic data generation, the amount of real-world data could be reduced to 10% when used together with the synthetic data, while achieving a better Average Precision (AP) score than with all real-world data alone.
DR has already been successfully applied in various fields. In addition to the mentioned areas of car detection and robot grasping, the work of Sadeghi et al. [17] on flying a quadrocopter through indoor environments, Zhang et al. [31] on a table-top object reaching task through clutter, and James et al. [6] on grasping a cube and placing it in a basket can be named.
This leads us to believe that DR is a suitable approach for the medical intervention room domain, where real data are largely unavailable and access to the domain is widely restricted.
Ablation studies in [25] and [24] showed that high-resolution textures and higher numbers of unique textures in the scene improve performance. Also, the authors of [31] conclude, after testing their hypothesis, that using complex textures yields better performance than using random colors.
In contrast to the DR approach is the photorealistic rendering of the scene and objects. A number of datasets have been created for this purpose in recent years. Here the works of [26,5,29] or [27] are to be mentioned. Some of these works combine real image data with DR and photorealistic rendered image data.
In [26], a photorealistically rendered dataset was created for 21 objects of the YCB dataset. Here, the objects are rendered in different scenes with collision properties when falling down. The dataset is intended to accelerate progress in the area of object detection and pose estimation. In [27], DR is combined with photorealistically rendered image data for robotic grasping of household objects. Using the data generated in this way, the authors have managed to explore a real-time system for object detection and robot grasping with sufficient accuracy. They also showed that the combination of both domains improved performance compared to either one alone.
In the field of human pose estimation, the works of [5] and [29] need to be mentioned. Both works were able to show that the performance of networks can be increased by using synthetic and animated persons, respectively.
The work of [29] generates photorealistic synthetic image data and their ground truth for body part classifications.
In [5], animated persons are integrated into mixed reality environments. The movements were recorded by actors in a motion capture scenario and transferred to 3D scanned meshes. In their experiments, they were able to achieve a 20% increase in performance compared to the largest training set available in this domain.
State-of-the-art models for realistic human body shapes are the SMPL models introduced by [10] and improved by STAR in [11]. According to the authors, the SMPL model is a skinned vertex-based model which represents a wide variety of human shapes. In their work they learn male and female body shape from the CAESAR dataset [13]. Their model is compatible with a wide variety of rendering engines like Unity or Unreal and therefore highly suited to be used in synthetic data generation for humans. There also exist extensions to the SMPL model like MANO and SMPL-H which introduce a deformable hand model into the framework. MANO [15] is learned from 1000 high-resolution 3D scans of various hand poses.

Methods
As previously mentioned, real-world data collection in medical intervention rooms is complex, costly, and requires approval from an ethics board and the persons involved. As shown in the previous section, DR/SDR can help train an object detection network with sufficient performance in real-world applications.
However, one challenge in dataset generation for the medical intervention space is domain-specific clothing. We argue that randomizing the clothing with random textures would help improve the detection rates of the clothing types, but in real-world applications a colored T-shirt, for example, would then not be distinguishable from the targeted blue area clothing. For the general detection of cars as in [25] the randomization technique makes sense, but in our opinion a different approach should be used for the domain-specific use case presented here.
The questions we try to address in this work are:
1. How can health professionals be modeled for synthetic data generation?
2. Which techniques are best suited for SDR/DR clothing generation?
3. Can we close the reality gap further by including greenscreen data (Mixed Reality, MR)?
4. Can the required amount of real data be reduced by using SDR/DR/MR?
5. Can the accuracy be improved when combining real and synthetic data?
For point (1), we argue for using a deformable human shape model such as the SMPL models. This provides sufficient variance of different human shapes and sizes. For point (2), we explore two different methods of clothing generation. First, we 3D scan various persons wearing medical clothing and generate a database of different medical clothing scans for each clothing type, which we call SCANS. Second, we commission a professional graphics designer to create assets based on the area clothing, which we call CAD. Regarding point (3), we take images of different persons wearing medical clothing in front of a greenscreen and label them by hand. For point (4), we investigate whether the required amount of real data can be reduced at consistent accuracy by mixing real and synthetic data. Finally, for point (5), we investigate whether the combination of synthetic image data and a percentage of real data improves the accuracy over real data alone.
To address these questions, we set up experiments in which we detect the following classes with the Scaled-YOLOv4 object detector [30].
The classes to be detected are:
• humans
• area clothing shirt
• area clothing pants
• sterile gown
• medical face mask
• medical hairnet
• medical gloves
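As an illustration of how these seven classes might be encoded for a YOLO-style detector, the following Python sketch maps class names to integer identifiers and converts a pixel bounding box into a normalized label line. The identifier names, the class order and the "class x_center y_center width height" text format are assumptions made for this example; the label format actually used in the experiments is not specified here.

```python
# Hypothetical class identifiers; the order simply follows the list above.
CLASS_NAMES = [
    "human",
    "area_clothing_shirt",
    "area_clothing_pants",
    "sterile_gown",
    "medical_face_mask",
    "medical_hairnet",
    "medical_gloves",
]
CLASS_TO_ID = {name: index for index, name in enumerate(CLASS_NAMES)}


def to_yolo_line(class_name, box_xyxy, image_w, image_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) into a normalized
    'class x_center y_center width height' label line."""
    x_min, y_min, x_max, y_max = box_xyxy
    x_center = (x_min + x_max) / 2.0 / image_w
    y_center = (y_min + y_max) / 2.0 / image_h
    width = (x_max - x_min) / image_w
    height = (y_max - y_min) / image_h
    class_id = CLASS_TO_ID[class_name]
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Example: a gown box in an 896 x 896 image.
line = to_yolo_line("sterile_gown", (224, 112, 672, 784), 896, 896)
print(line)  # 3 0.500000 0.500000 0.500000 0.750000
```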
Examples of the medical clothing are given in Fig. 1.
The following section describes the character creation process and why specific tools and models are used.

Character creation
The medical characters we use in SDR/DR are built through a combination of SMPL body models, textures, animations and clothing assets. In the following section, each of these components is presented together with the reason for its use.
A body model is required for the creation of synthetic humans. As the base of our characters we use the male and female model of the SMPL+H model from [15]. The models cover a huge variety of realistic human shapes, which can be randomized through ten blend shapes. We decided to use the extended SMPL+H model instead of the original SMPL model [10] because one of our clothing items is gloves, and the hand rig of the SMPL+H model allows us to create more deformations of the glove asset.
The SMPL models alone are surface models without texture. For the generation of humans a human texture is needed. To add more variation and realism to the appearance of the characters, the texture maps from [29] are used. Out of the 930 textures, only 138 (69 per gender) have been used. Because we created our own clothing assets, only the textures of people in undergarments were relevant. Those texture maps were created from 3D body scans of the CAESAR dataset [13] and cover a variety of skin colors and identities; however, all of the faces have been anonymized [29]. When working with synthetic humans in rendering engines, the human pose has to be modified to create a variety of realistic humans. To provide a variety of realistic body poses, the models were animated through Motion Capture (MoCap) data, which was captured within our laboratory. We track the movement of 74 joints down to the fingertips. We use an inertial Motion Capture suit with the hand gloves add-on called Perception Neuron Studio. In order to keep the dataset simple, we only used one animation in our experiments. However, more varied animations can be added.
After defining the body model, body textures and body poses, the medical clothing is needed. Two different approaches are investigated here. One is the generation of medical clothing using a 3D scanner and the other is the generation of designed clothing by a graphic designer. The 3D-scanned clothing assets, which we call SCANS, are created with a 3D scanner called Artec Leo. A 3D resolution of 0.2 mm was used to capture the medical clothing. For our synthetic training dataset we used clothing scans of 4 male and 4 female models. In this way, variations of the real-world textures, including reflections, wrinkles, colors and surface texture information, are collected. After building an initial model from the 3D scanner, we adapt the clothing to fit the standard male and female SMPL+H character using 3D modeling techniques. According to our research, medical clothing usually comes in the colors blue, green and light pink. To cover this variation in our dataset, we augmented the texture maps. Examples of the scanned and rigged clothing assets can be seen in Fig. 2.
To evaluate the performance of 3D-scanned clothing assets, we compare them to hand-designed clothing assets, which we call CAD. We first evaluated to what extent freely available assets from asset stores can be used for this purpose. However, no available assets fully match our specific clothing. Therefore, we commissioned a designer on Fiverr to model the clothes. Examples of those assets can be seen in Fig. 3. The designed assets have been processed in the same way as our scanned assets. They are also deformable and are bound to the same rig.
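The components described in this section (body model, shape, body texture, clothing source and clothing color) can be thought of as one sampled configuration per character. The following Python sketch shows this idea in a hedged form: the dictionary keys, the range of the shape coefficients and the assumption that every clothing source is available in all three colors are illustrative choices, not details of the actual pipeline.

```python
import random

NUM_SHAPE_COEFFS = 10        # ten blend shapes, as stated in the text
TEXTURES_PER_GENDER = 69     # 138 undergarment textures, 69 per gender
CLOTHING_SOURCES = ("SCANS", "CAD")
CLOTHING_COLORS = ("blue", "green", "light_pink")


def sample_character(rng, clothing_source):
    """Draw one hypothetical character configuration."""
    if clothing_source not in CLOTHING_SOURCES:
        raise ValueError(f"unknown clothing source: {clothing_source}")
    return {
        "gender": rng.choice(("male", "female")),
        # The coefficient range [-2, 2] is an assumption for illustration.
        "shape": [rng.uniform(-2.0, 2.0) for _ in range(NUM_SHAPE_COEFFS)],
        "body_texture": rng.randrange(TEXTURES_PER_GENDER),
        "clothing_source": clothing_source,
        "clothing_color": rng.choice(CLOTHING_COLORS),
    }


character = sample_character(random.Random(42), "CAD")
```

Keeping the clothing source as an explicit argument mirrors the comparison in this work, where separate datasets are generated for SCANS and CAD assets.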
The creation of the synthetic persons is done by means of the rendering engines Unreal Engine 4 and Unity. The NDDS plugin for the Unreal Engine is used to generate the DR image data and a Unity plugin is used to generate the SDR image data.
For the synthetic data generation of DR, an Unreal Engine 4 plugin called NDDS [23] is used. This allows the generation of RGB images at rates similar to real cameras, as well as depth image data and segmentation masks of the scene within Unreal Engine 4. The plugin also creates bounding box labeling data for each object in the scene in 2D and 3D. The tool was specifically developed for DR and therefore provides tools for scene randomization such as object or camera position, lighting and distractor objects, among others. Using a modular character blueprint, NDDS enables the generation of synthetic datasets for sterile clothing using 3D-scanned clothing or designed clothing. Example images are given in the top row of Fig. 4. We create two separate DR datasets, one with SCANS assets and another with CAD assets. An activity diagram, which represents the blueprint for modular character creation in NDDS, is given in Fig. 5.
Fig. 2 Examples of our 3D-scanned clothing assets with color augmentation
For dataset generation using SDR, we used a Unity plugin called ML-ImageSynthesis [22] as a base and adapted it to work with the Universal Render Pipeline (URP) for quality improvement. Using Unity 2020.3.32f1, additional components have been added to export further metadata for each generated image, such as camera parameters, bounding boxes and world position. SDR is made possible by a variety of custom-made components which allow the randomization of parameters such as lighting, material, texture and position. The ProBuilder plugin provided by Unity was used to build an intervention room based on the target domain of the real dataset (Klinikum). Scene randomization is achieved by utilizing the aforementioned randomization components. An activity diagram, which represents the blueprint for modular character creation in Unity, is given in Fig. 6.
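The per-image metadata export mentioned above can be pictured as one structured record per rendered frame. The following Python sketch shows one possible shape of such a record and its serialization to JSON. All field names, units and values are hypothetical; the actual export format of the adapted Unity plugin is not described in this work.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class BoundingBox:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class FrameMetadata:
    image_file: str
    camera_position: tuple        # world coordinates, hypothetical units
    camera_rotation_euler: tuple  # degrees, hypothetical convention
    focal_length_mm: float
    boxes: list = field(default_factory=list)


record = FrameMetadata(
    image_file="sdr_000001.png",
    camera_position=(0.0, 1.6, -3.0),
    camera_rotation_euler=(10.0, 0.0, 0.0),
    focal_length_mm=35.0,
    boxes=[BoundingBox("sterile_gown", 224.0, 112.0, 672.0, 784.0)],
)

# asdict converts nested dataclasses, so the record serializes directly.
text = json.dumps(asdict(record), indent=2)
restored = json.loads(text)
```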

Datasets
To investigate the potential accuracy difference between SCANS, CAD and the combination with real data, different datasets were generated. First, synthetic datasets of DR and SDR were generated for both SCANS and CAD clothing, using the presented pipelines in Unreal Engine and Unity. These datasets are used to determine whether scanned clothing or designed clothing gives better results.
Second, a dataset in front of a greenscreen was collected which we call Mixed-Reality (MR). It consists of 8 persons in the training dataset and 2 persons in the validation dataset. The recorded persons move in front of the green screen with a certain grasping motion, which is also used as motion animation for the synthetic data. This dataset aims to further close the reality gap between the synthetic image data and the real data by introducing real data in a mixed reality scenario without having to record data in the target domain.
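The background exchange behind the MR data can be illustrated with a very simple chroma-key rule. The following pure-Python sketch replaces pixels in which green clearly dominates by the corresponding pixels of a new background image. The dominance rule, its thresholds and the example colors are assumptions chosen for illustration; they are not the keying method used to produce the MR dataset.

```python
def is_green_screen(pixel, dominance=1.3, min_green=90):
    """Heuristic: treat a pixel as greenscreen when the green channel is
    bright enough and clearly dominates red and blue (assumed thresholds)."""
    r, g, b = pixel
    return g >= min_green and g > dominance * max(r, b)


def replace_background(foreground, background):
    """Return a new image in which greenscreen pixels of `foreground` are
    replaced by `background`. Images are lists of rows of (r, g, b) tuples."""
    same_shape = len(foreground) == len(background) and all(
        len(fg_row) == len(bg_row) for fg_row, bg_row in zip(foreground, background)
    )
    if not same_shape:
        raise ValueError("foreground and background must have the same size")
    return [
        [bg if is_green_screen(fg) else fg for fg, bg in zip(fg_row, bg_row)]
        for fg_row, bg_row in zip(foreground, background)
    ]


GREEN = (20, 200, 30)          # greenscreen pixel
CLOTHING_BLUE = (40, 90, 160)  # pixel on blue area clothing
WALL = (180, 180, 175)         # pixel of the new background

foreground = [[GREEN, CLOTHING_BLUE], [GREEN, GREEN]]
background = [[WALL, WALL], [WALL, WALL]]
composite = replace_background(foreground, background)
```

A hard threshold like this one leaves greenish fringes on clothing edges untouched, which is one way to picture the greenscreen spill effect discussed in the experiments.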
Finally, a dataset of the target domain was recorded, which we call Klinikum. It serves as a baseline comparison for all models and also provides the test data, which comprise 331 labeled test images. In order to obtain a sufficient amount of test data from the available recordings, we decided to use a different split here than for the other datasets.
In the following sections, the entries Klinikum(100) and Klinikum(15) denote all available Klinikum train data and 15% randomly chosen train data, respectively. The entries real(100) and real(15) have the same meaning.
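A random 15% subset such as Klinikum(15) can be drawn reproducibly with a seeded sampler. The sketch below is a minimal illustration: the file names and the seed are hypothetical, while the sizes follow the numbers reported in this work (660 train images, of which 15% are 99 images).

```python
import random


def sample_subset(items, fraction, seed=0):
    """Return a reproducible random subset containing `fraction` of `items`."""
    count = round(len(items) * fraction)
    rng = random.Random(seed)
    return sorted(rng.sample(items, count))


# Hypothetical file names standing in for the 660 Klinikum train images.
train_images = [f"klinikum_{i:04d}.png" for i in range(660)]
subset = sample_subset(train_images, 0.15)
print(len(subset))  # 99
```

Fixing the seed means that every training run which uses Klinikum(15) sees the same 99 images, so differences between runs are not caused by a different subset.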
All datasets are divided into training and validation data. Examples of real data in front of the green screen with exchanged background can be seen in Fig. 7. Examples of the synthetic data can be seen in Fig. 4 and finally examples from the clinical test data can be seen in Fig. 8. Table 1 gives a breakdown of the sizes and distributions of the datasets.

Experiments
Experiments were performed to investigate whether and how well SCANS compare to CAD clothing for detection in the medical environment. Additionally, experiments were carried out to determine whether a percentage of real data together with synthetic data can achieve sufficient accuracy or even surpass real data alone. Finally, MR data were included in the experiments to determine whether they could further close the reality gap.
For our experiments, we used the Scaled-YOLOv4 [30] implementation from GitHub. First, 6 different baseline networks were trained to provide a basic comparison of the different methods and to determine whether SCANS or CAD clothing provides better results. These baseline models include trainings with synthetic (DRscans, DRcad, SDRscans, SDRcad), mixed-reality (MR-DR) and real data from the clinic domain (Klinikum train).
Training was conducted with the yolov4-p5 network, the provided pretrained YOLOv4-p5 weights and the default finetune parameters of the Scaled-YOLOv4 GitHub repository. The image size was set to 896 in training and test. Only the Mosaic Augmentation ratio parameters α and β were increased from 8.0 to 20.0 in order to weaken the image blending effect in the augmentation of the used implementation. Additionally, a green-channel augmentation was used when MR data were present in the training dataset in order to reduce the greenscreen spill effect, which caused problems in some classes. Here, we try to establish a baseline for the MR-DR data. We found experimentally that the green-channel augmentation improves the accuracy. All networks were trained for 300 epochs and achieved convergence. All trained models were tested on the Klinikum test set with an IoU threshold of 0.5 and a confidence threshold of 0.2. The results of the baseline models are displayed in Table 2.
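To illustrate how the two test thresholds interact, the following Python sketch filters detections by confidence and then matches them greedily to ground-truth boxes by IoU. It is a simplified, single-class, single-image illustration under our own assumptions and is not the evaluation code of the Scaled-YOLOv4 repository; the function names and the example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, ground_truth, iou_thr=0.5, conf_thr=0.2):
    """Return (true positives, false positives, false negatives).

    detections: list of (box, confidence); ground_truth: list of boxes.
    Detections below conf_thr are dropped; the rest are matched greedily,
    highest confidence first, and each ground-truth box is used at most once.
    """
    kept = sorted(
        (d for d in detections if d[1] >= conf_thr),
        key=lambda d: d[1],
        reverse=True,
    )
    unmatched = list(ground_truth)
    true_positives = 0
    for box, _confidence in kept:
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= iou_thr:
            unmatched.remove(best)
            true_positives += 1
    return true_positives, len(kept) - true_positives, len(unmatched)


ground_truth = [(0, 0, 10, 10)]
detections = [
    ((0, 0, 10, 8), 0.9),     # IoU 0.8 with the ground truth
    ((0, 0, 10, 4), 0.5),     # duplicate detection of the same object
    ((20, 20, 30, 30), 0.1),  # below the confidence threshold, dropped
]
result = match_detections(detections, ground_truth)
print(result)  # (1, 1, 0)
```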
The results show that CAD-based synthetic data generally give better results than SCANS-based data in this experiment. This is why we use the SDRcad dataset for all follow-up experiments.
To investigate by how much the amount of real data can be reduced when used together with synthetic or MR data while maintaining sufficient accuracy, experiments were conducted with a percentage of the Klinikum training data. Our main goal here is to find out whether using synthetic data together with MR data and a percentage of real data surpasses the accuracy of real data alone. We chose 15% of the real data because this results in 99 remaining training images, which we argue is an amount of image data that can reasonably be labeled by hand. We decided to use the mosaic augmentation during these experiments as well and to use all datasets jointly as training data instead of running a finetuning experiment. We argue that the network can learn relevant features better, while maintaining the advantages of the additional synthetic data, when it sees all used datasets mixed together with mosaic augmentation than when it is only finetuned. During these experiments, we included the aforementioned green-channel augmentation in all trainings. Additionally, the real data runs were trained with the same number of optimization steps in order to ensure that the model converges while using less training data. The follow-up results with SDRcad, MR-DR as well as real data are shown in Table 3.
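Matching the number of optimization steps between runs with different dataset sizes is a matter of simple arithmetic. The Python sketch below shows the calculation for the two Klinikum subset sizes. The batch size of 8 and the choice of the 660-image run with 300 epochs as the reference are assumptions for illustration only; the settings actually used to equalize the steps are not stated here.

```python
import math


def optimizer_steps(num_images, batch_size, epochs):
    """Total optimizer steps when every epoch iterates over all images once."""
    return math.ceil(num_images / batch_size) * epochs


def epochs_for_steps(num_images, batch_size, target_steps):
    """Smallest number of epochs that reaches at least `target_steps`."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return math.ceil(target_steps / steps_per_epoch)


BATCH_SIZE = 8  # assumption, not taken from the experiments

# Hypothetical reference run: 660 images trained for 300 epochs.
reference_steps = optimizer_steps(660, BATCH_SIZE, 300)
# Epochs a 99-image run would need to see the same number of steps.
epochs_small = epochs_for_steps(99, BATCH_SIZE, reference_steps)
print(reference_steps, epochs_small)  # 24900 1916
```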

Results
The results of the first experiment can be seen in Table 2. When comparing the SCANS clothing and the CAD clothing in the DR and SDR scenarios, the datasets with CAD clothing provide better results in both cases. This was surprising to us at this point. The possible reasons for this are discussed in the discussion chapter. Similarly, when looking at the results of the individual classes in Table 4, it can be seen that, with the exception of the Mask class, the CAD clothing gives better results than the SCANS clothing in every case.
It is also clear from the results that SDR is superior to DR. This was to be expected based on previous work in this area, since in the presented experiments of SDR the environment is enriched with the objects present in the clinic and thus the network is better adapted to distractions.
While the MR-DR results are inferior to the SDR results in many classes, they are superior to the DR results except for the Gown class. The reasons for the poor performance of the Gown class in experiments with SDR have already been mentioned in chapter 3.1. Here, the greenscreen spill had a particularly negative effect, which is why the additional green-channel augmentation was applied.
The evaluation metric used is the mean Average Precision (mAP) with 2 different Intersection over Union (IoU) thresholds. The two thresholds are 0.5:0.95 for mAP and 0.5 for mAP50 as used in the Scaled-Yolov4 implementation [30].
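The relation between the two reported metrics can be sketched as follows: mAP50 is the class-averaged AP at an IoU threshold of 0.5, while mAP averages the class-averaged AP over a range of IoU thresholds from 0.5 to 0.95. The step size of 0.05 (ten thresholds, following the COCO convention) and the toy AP values below are assumptions for illustration; they are not results from this work.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds 0.5, 0.55, ..., 0.95 (the step size is an assumption)."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]


def summarize(ap_per_threshold):
    """ap_per_threshold maps an IoU threshold to the class-averaged AP."""
    thresholds = iou_thresholds()
    mean_ap = sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)
    return {"mAP50": ap_per_threshold[0.5], "mAP": mean_ap}


thresholds = iou_thresholds()
# Toy values only: AP typically drops as the IoU threshold becomes stricter.
toy_ap = {t: round(1.0 - t, 2) for t in thresholds}
summary = summarize(toy_ap)
```

Because stricter thresholds penalize loosely fitting boxes, mAP is more sensitive than mAP50 to small localization or labeling differences, which is relevant for the class-wise discussion that follows.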
For the Pants class, the MR-DR achieved the best results in this experiment for the mAP. For the mAP50, on the other hand, the situation is the same as for the other classes. This difference can probably be attributed to the inaccurate border of the pants under the shirt. Here, a possible different labeling of the real image data (Klinikum, MR-DR) compared to the automated labeling with the synthetic image data explains the difference.
In general, the two classes Mask and Glove deliver the worst results. In contrast to the effect described for the Pants class, this is also the case for mAP50. This can be attributed to the relatively small size of these classes. The test data include difficult cases which show the persons from the side. In these cases, the mask or the gloves are only just visible from the side and the bounding box area covers a few pixels. This effect can be seen in the real data as well. Here, the Klinikum train dataset achieves an accuracy of 53.66% at mAP, whereas mAP50 is again at 95.85%. This effect can also be seen in the other training datasets but is less pronounced.
The results of the follow-up experiment, which examines the comparison of synthetic image data along with mixed reality data and a percentage distribution of real data, are shown in Table 3.
Here, the joint training dataset from SDR+MR improves on the accuracy of the two individual datasets from the first experiment. However, a difference compared to 100% real data and even to 15% real data remains for mAP and is smaller for mAP50. Nevertheless, this result is of great interest for future work and experiments, as it points to a way of avoiding real data from the target domain altogether. The possibilities of mixed reality together with synthetic data should therefore be further investigated. Furthermore, it can be seen that adding SDR+MR data to 15% and 100% real data increases the accuracy of the detections compared to real data alone. For 15% real data this is an increase of 2.53 percentage points and for 100% real data an increase of 1.4 percentage points in mAP.
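The stated gains can be checked against the mAP values reported in this work with a few lines of arithmetic. In the Python sketch below, the three dictionary entries are the reported values; the mAP of real(15) alone is not quoted in the text and is only implied here by subtracting the stated gain of 2.53 percentage points, so it should be read as a derived value and compared against Table 3.

```python
# mAP values (in %) as reported in this work.
reported_map = {
    "Klinikum(100)": 81.95,
    "SDR+MR+Klinikum(15)": 80.05,
    "SDR+MR+Klinikum(100)": 83.35,
}

# Gain from adding SDR+MR data to 100% real data, in percentage points.
gain_100 = round(
    reported_map["SDR+MR+Klinikum(100)"] - reported_map["Klinikum(100)"], 2
)

# Remaining gap between SDR+MR+15% real data and 100% real data alone.
remaining_gap = round(
    reported_map["Klinikum(100)"] - reported_map["SDR+MR+Klinikum(15)"], 2
)

# Implied (not directly quoted) mAP of 15% real data alone.
implied_real_15 = round(reported_map["SDR+MR+Klinikum(15)"] - 2.53, 2)

print(gain_100, remaining_gap, implied_real_15)  # 1.4 1.9 77.52
```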
Looking at the results of the individual classes, which are shown in Table 5, the dataset with SDR+MR+Klinikum(100) gives the best results for all classes except Mask.
Regarding the classes Mask and Glove which gave the worst results in the first experiment, the accuracy can be improved when merging SDR+MR data. This is another indication of the potential of synthetic and mixed reality data for applications in the medical field to significantly reduce the amount of real data required.
When looking at the results of the Pants class, which in the first experiment achieved 55.32% on SDR in mAP, this can be improved to 80.22% by combining with MR data. Likewise, the influence of the MR data with the greenscreen spill effect can be seen with the Gown class. The combination of SDR+MR data largely eliminates the negative influence of the MR data from the first experiment. This is an indicator for the noticeable low accuracy of this class, possibly due to the greenscreen spill. Inference result images of the training with SDR+MR+real(100) data can be seen in Fig. 9. Only for the visualization of the image result here, we used a slightly higher confidence threshold of 0.4 compared to the presented results of all tables (0.2).

Conclusion
We were able to show that the use of SMPL models together with scanned or designed medical clothing is a suitable method for modeling healthcare professionals for artificial intelligence questions in the intervention space, using the example of medical clothing detection. During our experiments we found that the designed clothing generally performed better on our test dataset than the 3D-scanned clothing. This result surprised us, as we expected the potentially more accurate textures of the 3D scan to have a positive impact on detection rates. However, according to the results, it cannot be ruled out that undetected artifacts in the rendering pipeline or pre-processing pipeline have an influence on this. Additionally, further work can investigate whether scanned clothing should be designed to be more deformable, as this could combine the advantage of scanned textures with realistic movement of the fabric. In order to make a final statement about the potential of 3D-scanned clothing for the modeling of health professionals, further experiments should be conducted. Using Mixed-Reality data together with the synthetic data closed the gap further, and while the margin is quite small, we could show that when using synthetic, mixed-reality and 15% real data, the remaining gap towards 100% real data could be nearly closed. In general, we could show that using synthetic and mixed-reality data together with a percentage of real data surpasses real data alone. This is a good sign for the potential of synthetic and mixed-reality data in questions around medical interventions, as they contain enough information to close the reality gap. A trajectory with multiple percentages of real data together with SDR+MR data is of interest for a larger test dataset with multiple healthcare professionals and is the subject of future work.
In the results shown, it has already been demonstrated that the fusion of SDR+MR data together with the real data improves the accuracy.
For questions in the intervention space, mixed-reality in particular allows data to be acquired outside the target domain to minimize privacy challenges. In future work, methods should be explored to reduce the greenscreen spill effect during data generation and to visualize the resulting data in more complex scenes similar to SDR. For this purpose, the use of deep learning networks for image enhancement is interesting to investigate.
In conclusion, the presented modeling of health professionals is a promising method to solve the problem of missing datasets from medical intervention rooms. We will further investigate it for various tasks in the medical field.

Appendix
In this appendix chapter, the full table results of the baseline experiment and the follow-up experiments are provided. In addition to the all category, we report the results for all detection classes (Body, Gown, Shirt, Pants, Hat, Mask, Glove).

Full results
Table 4 shows the results of the baseline experiments and Table 5 shows the results of the follow-up experiments. The bold values generally represent the best results from the respective columns. *Additional green-channel augmentation used.

Abbreviations
Table 6 gives an overview of abbreviations used in the presented work. They are sorted in alphabetical order with sections for starting letters present in the work.