In our first blog, we reviewed why we chose to focus on hands-on-wheel detection and how synthetic data impacted our outcomes. Here we will discuss the data we used in our benchmark.
What data did we use?
We chose the DMD dataset as our real dataset because it contains a large number of drivers, driving scenarios, and camera angles, and has a wide variety of tagged behaviors, including whether or not the hands are on the wheel. We split the dataset into train, validation, and test sets based on the identity of the drivers. In total, the dataset contains 651k frames, of which we use 531k for training, 47k for validation, and the rest for testing. The drivers are recorded using three cameras: one facing the driver’s head, one facing the driver’s body, and one facing the driver’s hands. A side view of the wheel offered a clearer perspective on whether the hands are on the wheel. The probability of the left and right hands being off the wheel is not the same: drivers in countries that drive on the left side of the road will typically perform other actions with their right hand while the left hand remains on the wheel. This bias can be seen in Table 1.
| Left Hand | Right Hand | Synthetic | Real |
|-----------|------------|-----------|------|
| On | On | 5,642 (50%) | 214,192 (32.8%) |
| On | Off | 3,546 (31.4%) | 304,102 (46.7%) |
| Off | On | 2,014 (17.8%) | 122,416 (18.8%) |
| Off | Off | 82 (0.7%) | 10,579 (1.6%) |
Table 1. Label distribution in real and synthetic datasets.
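To make the driver-based split concrete, here is a minimal sketch of splitting frames by driver identity so that no driver appears in more than one split. The frame pairs and driver ID sets are illustrative assumptions, not the actual DMD split code:

```python
from collections import defaultdict

def split_by_driver(frames, train_drivers, val_drivers):
    """Split frames into train/val/test sets by driver identity,
    so no driver appears in more than one split.

    `frames` is a list of (frame_id, driver_id) pairs; the driver
    ID sets are hypothetical examples."""
    splits = defaultdict(list)
    for frame_id, driver in frames:
        if driver in train_drivers:
            splits["train"].append(frame_id)
        elif driver in val_drivers:
            splits["val"].append(frame_id)
        else:
            splits["test"].append(frame_id)
    return dict(splits)
```

Splitting on driver identity (rather than randomly on frames) prevents near-duplicate frames of the same driver from leaking between train and test sets.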
How did we prepare the synthetic data?
We used the Datagen synthetic data platform to generate a diverse video dataset composed of different drivers performing various actions in different vehicles. Among the multiple camera views available, we rendered the scene using a camera focused on the driver’s body, a viewpoint similar to that of the real data. Each scene is 10 seconds long and rendered at 15 frames per second. Each image has a resolution of 256×256 and includes hand, body, and wheel keypoints. See Figure 1 for some RGB examples from the synthetic dataset.
Figure 1. Sample images from our synthetic dataset
To maximize variance in our dataset we generated diverse sequences:
- Environment – Our dataset included various car types: large and medium SUVs and sedan-type vehicles. The car interiors varied to add diversity, including seat types, wall colors, and, especially important for our task, different wheel types.
- Demographics – We used ten different drivers of varying ethnicities, ages, and genders.
- Behaviors – We generated multiple behaviors such as falling asleep, turning around, texting, one-handed driving, and two-handed driving.
- Scene – We generated all sequences with random background and lighting conditions – daylight, evening light, or night.
In total, we generated 146 sequences.
For each frame, we separately labeled each hand as being on or off the steering wheel. The availability of 3D keypoints from our platform makes hands-on-wheel labeling almost trivial. In total, the synthetic dataset contained 11,284 unique images. We split our train, validation, and test sets based on driver identity. Our training set contained 8,834 images, and the validation set contained 2,450 images, following the same label proportions as the train split (we undersampled frames with both hands on the wheel to match the real data distribution).
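As a sketch of why 3D keypoints make this labeling nearly trivial: a hand can be marked as on-wheel whenever any of its keypoints lies within a distance threshold of a wheel keypoint. The 5 cm threshold below is an illustrative assumption, not the value used in the benchmark:

```python
import math

def hand_on_wheel(hand_keypoints, wheel_keypoints, threshold=0.05):
    """Label a hand as on-wheel if any of its 3D keypoints lies within
    `threshold` metres of any wheel keypoint.

    Keypoints are (x, y, z) tuples in metres; the 5 cm threshold is an
    assumed value for illustration."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return any(dist(h, w) < threshold
               for h in hand_keypoints
               for w in wheel_keypoints)
```

Because the platform provides exact 3D positions, no manual annotation or 2D heuristics are needed for the synthetic frames.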
Did we use any pre-processing?
To eliminate background distractions, we manually cropped both the real and synthetic images around the wheel, leaving only the wheel and hands visible without any extra details. See Figure 2 for examples from the real and synthetic datasets.
Figure 2. Examples from the real and synthetic datasets after cropping around the wheel
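A minimal sketch of this cropping step, assuming the wheel’s bounding box in pixel coordinates is already known; the margin value is an assumption for illustration:

```python
def crop_around_wheel(image, wheel_box, margin=20):
    """Crop an image (rows of pixels) to the wheel region plus a margin,
    so only the wheel and hands remain visible.

    `wheel_box` is (x_min, y_min, x_max, y_max) in pixels; the margin
    is an assumed value, not the one used in the benchmark."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = wheel_box
    # Expand the box by the margin and clamp to image bounds.
    x0, y0 = max(0, x0 - margin), max(0, y0 - margin)
    x1, y1 = min(w, x1 + margin), min(h, y1 + margin)
    return [row[x0:x1] for row in image[y0:y1]]
```

The same function works on NumPy arrays via slicing; it is written over nested lists here only to stay dependency-free.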
What was the model architecture?
We chose the lightweight MobileNetV3 architecture as the backbone for all our experiments, considering the real-time nature of our task. We replaced the classification head with two binary classification heads, each containing two fully connected layers with ReLU activations and a final fully connected layer with a sigmoid activation. The two classification heads predict, respectively, whether the left or right hand is on the wheel.
Our final blog will discuss the experiment and the results.
CVPR 2022 is happening soon. The previous CVPR saw groundbreaking discoveries that pushed the boundaries of computer vision. This year is no exception. With topics running the gamut from deep learning architecture to image synthesis, CVPR 2022 will be a culmination of cutting-edge ideas in all areas of computer vision.
Here, we present six papers to watch from CVPR 2022:
1. SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing
Recent advancements in Generative Adversarial Networks (GANs) gave rise to impressive progress in image synthesis. Acting as powerful generative priors, GANs lay a solid foundation for downstream synthesis tasks.
One well-known work is the StyleGAN series, where each generated image is conditioned on a set of latent codes. These latent codes are highly informative but also difficult to interpret. Yet, researchers remain steadfast in their exploration of these codes, in hopes of discovering methods of improving the controllability of GANs.
Such exploration yielded methods to control the global attributes of the generated images from StyleGANs. Others found that training GANs from scratch with additional supervision resulted in better control.
Despite these refinements, practitioners lamented the limited control they had over the local features generated by GANs. This is where SemanticStyleGAN shines.
SemanticStyleGAN provides more controllable image synthesis because its latent space is factorized based on semantic parts (like face, hair, or nose). Put simply, SemanticStyleGAN’s local latent codes can be used to control the structure and texture of features like the face, hair, and nose (Figure 1).
Figure 1. A comparison of (a) StyleGAN against (b) SemanticStyleGAN. Notice the latent codes of SemanticStyleGAN are factorized over different regions while those of StyleGAN are associated with different granularities. (Source)
SemanticStyleGAN achieved exceptional disentanglement between different spatial areas. In the context of generating a face, spatial areas include the face, eyes, mouth, and hair (Figures 2 and 3). Users of SemanticStyleGAN could also disentangle foreground and background, allowing for precise control over the location and size of the foreground.
Figure 2. SemanticStyleGAN achieves exceptional disentanglement between different spatial areas.
Figure 3. SemanticStyleGAN holds precise control over the location and size of the foreground (Source)
StyleGAN could mix styles between generated images at different granularities. SemanticStyleGAN goes further by allowing styles to be transferred between specific local areas rather than across granularity levels (Figure 4).
Figure 4. SemanticStyleGAN could transfer styles locally. The style of the images from the first row (e.g. face, eyes, eyebrows, mouth, and hair) are transferred to those in the first column. Notice how only the style of a specific feature (not the entire image) is transferred.
SemanticStyleGAN represents a step toward more controllable image synthesis. Having discovered the means to have greater control over image synthesis tasks, we are inching towards a fully customizable and scalable image generator.
Meet us at booth #813 at CVPR 2022. Schedule a meeting here.
2. StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
The realm of image generation has seen unprecedented growth within the short span of a few years. Unfortunately, the video generation community saw far fewer advancements. The challenge lies in the complexity of the underlying video data. It also does not help that modern video generators treat videos as sequences of discrete images and thus have intensive resource demands.
StyleGAN-V challenges that status quo by treating videos as time-continuous signals. The researchers represented continuous motion with positional embeddings that have time-varying parameters. With that, they argued (and demonstrated) that as few as two frames per video are sufficient to train a meaningful video generator.
Built on top of the image-based StyleGAN2, StyleGAN-V can produce long videos at a high frame rate efficiently. In particular, StyleGAN-V is only 5% more expensive to train than StyleGAN2 with only slightly worse (~10%) image quality. As a result, StyleGAN-V could be trained directly on 1024×1024 videos, a feat unparalleled by previous state-of-the-art video generators.
The authors compared their results with a few benchmarks. StyleGAN-V managed to produce a realistic hour-long video from the RainbowJelly dataset. On the other hand, other video generators like VideoGPT, DIGAN, and MoCoGAN-HD struggle to produce meaningful results after one hour. (Figure 5)
Figure 5. A snapshot of the videos generated by different video generators at the 1-hour mark using Rainbow Jelly 256×256 benchmark
On the MEAD 1024×1024 dataset, StyleGAN-V could preserve the identity of a speaker, while MoCoGAN-HD produced noisy results after 1 second. (Figure 6)
Figure 6. Given the face image of a speaker from MEAD 1024×1024, StyleGAN-V synthesized an hour-long video that preserved the identity of the speaker (bottom row). MoCoGAN-HD could not do the same (top row) [Source]
The output of StyleGAN-V can be controlled by tuning the latent space. To the delight of the researchers, StyleGAN-V inherits the perks of StyleGAN2, including the ability of semantic manipulation. To do so, the researchers edited the first frame of a video with CLIP and animated it with the StyleGAN-V (Figure 7).
Figure 7. StyleGAN-V, like StyleGAN2, has the ability of semantic manipulation.
Admittedly, StyleGAN-V still has some limitations to overcome. In particular, the model might produce repeated periodic motions at times and handle certain new content poorly. That said, StyleGAN-V is still a solid foundation for even more powerful video generators in the future.
3. Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data
Many facial recognition techniques assume that the input facial images have no occlusions. When that assumption is not met, the performance of the model degrades. One such occlusion is eyeglasses, and being able to remove them from portraits could handle such cases. (Figure 8)
Figure 8. Portrait Eyeglasses and Shadow Removal [Source]
The task of removing eyeglasses (and their shadows) is not trivial. Not only do the facial features have to be recovered, but the lighting effects (like shadows, reflections, and distortions) associated with the glasses also have to be removed. Conceivably, modeling these effects is computationally expensive and difficult.
Thus, the authors proposed a detect-then-remove algorithm that first detects the presence of eyeglasses before removing them.
Due to a lack of real-world data, the authors trained both the detection and removal algorithm with a synthetic portrait dataset. The benefit of using a synthetic dataset is the ability to generate a diverse dataset spanning various identities, expressions, eyeglasses, and illuminations. As a result, the model can generalize across different lighting conditions, head poses, and eyeglasses (Figure 9).
Figure 9. Removing glasses across various settings [Source]
Notably, the authors also used synthetic images that cannot be captured in the real world for intermediate supervision. In particular, portraits with eyeglasses but no shadows (and vice versa) were used to train the de-glassing and de-shadow networks.
To bridge the gap between synthetic and real data, the authors used a domain adaptation network, which converts input images – whether synthetic or real – into a uniform feature map.
The results from this paper are promising. A quick comparison of the authors’ results with state-of-the-art eyeglasses removal methods (Figure 10) reveals the limitations of current methods. For example, existing methods like CycleGAN, StarGAN, and pix2pix cannot completely remove the eyeglasses or their shadows in some cases.
Figure 10. Qualitative comparisons of different methods across the multiple images. [Source]
The authors achieved state-of-the-art results despite the lack of real-world data. We would even argue that the variety and customizability of synthetic data is an important ingredient in addressing the weakness of previous models. For many tasks beyond eyeglasses removal, synthetic data might be the key to pushing the boundary of other computer vision tasks.
4. Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data
Many computer vision models rely on pre-trained models trained on large real-world datasets. As demonstrated by the paper Exploring the Limits of Weakly Supervised Training, the downstream performance of such pre-trained models increases with the size of their training sets. It is not a surprise then that researchers are training models with ever-larger datasets in hopes of pushing the boundary of the state-of-the-art.
Yet, real-world data comes with strings attached. It is expensive and laced with privacy and ethical issues. An alternative to real-world data, synthetic data could be used to pre-train vision models in a scalable and privacy-preserving manner.
One question remains–what kind of synthetic data is best suited for a specific computer vision task? That was the question that a group of researchers from Boston University, Watson AI Lab, and Cornell University aimed to answer.
These researchers found that each downstream task has a favorable configuration of simulation parameters (like lighting, object pose, and backgrounds), and no one-size-fits-all solution exists. Building on that finding, they proposed Task2Sim, a unified model that finds the optimal simulation parameters for generating pre-training data for specific tasks.
The results from Task2Sim are impressive. Vision models trained on task-adaptive pre-training data generated by Task2Sim performed significantly better than those trained on non-adaptive data across 20 distinct classification tasks.
The researchers demonstrated that Task2Sim is essentially a “one-shot learner”. Specifically, once trained on a set of seen tasks, Task2Sim can then predict parameters for novel unseen tasks. Future-gazing, Task2Sim could be used as part of a pipeline to automatically generate pre-training data when given some downstream examples.
We have covered the technical implementation of Task2Sim in detail in another blog post. There, we described its architecture, training process, and performance on various tasks.
5. BANMo: Building Animatable 3D Neural Models from Many Casual Videos
3D shape reconstruction often relies on specialized sensors. This is especially true for deformable objects, like humans or animals.
With the rise of virtual reality comes the burgeoning demand for creating animatable assets from images and videos captured by a regular camera. Doing so for humans was difficult without the help of specialized sensors. The same can be said for animals and any other 3D deformable objects (as opposed to rigid scenes like buildings).
Researchers from Meta AI and Carnegie Mellon University pondered that problem and proposed BANMo, a Builder of Animatable 3D Neural Models. Instead of relying on specialized sensors to animate humans and animals, BANMo requires only casual videos of an object (Figure 11).
Figure 11. Input videos and BANMo’s reconstructions (Source)
When given multiple videos of a deformable object (e.g. a cat), BANMo could reconstruct an animatable 3D model without the need for any predefined shapes (Figure 12). The object can then be reconstructed with different poses.
Figure 12. An overview of BANMo’s reconstruction process
More concretely, BANMo consolidates 2D cues from thousands of images into a fixed canonical space. The output of BANMo is a model with neural blend skinning, which can be animated by manipulating bone transformations. Neural blend skinning handles pose variations and deformations better than dynamic NeRF approaches.
From the experiments, the researchers found that BANMo could reconstruct humans and animals from real and synthetic datasets with high fidelity. By taking better advantage of the large number of frames from multiple videos, BANMo produced finer 3D shape details and motions than state-of-the-art approaches like ViSER. (Figure 13)
Figure 13. A comparison of BANMo with the state-of-the-art (ViSER)
6. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields
Two years since its inception in 2020, Neural Radiance Fields (NeRF) is still one of the leading methods to synthesize novel views of complex scenes. Given a set of sparse input images of a scene, NeRF optimizes a volumetric representation of the scene. With that, it achieved state-of-the-art results in generating complicated geometry and appearance.
That said, NeRF is not perfect. In particular, it struggles with “unbounded scenes” where the scene could extend very far into the distance from any angle. Under such settings, the renderings are often blurry. Such a challenge arises due to three critical issues – parameterization, efficiency, and ambiguity.
Thus, Google and Harvard researchers present an extension of mip-NeRF (a variant of NeRF) called mip-NeRF 360. It uses a combination of techniques including non-linear scene parameterization, online distillation, and a distortion-based regulariser to overcome such challenges. Overall, the mip-NeRF 360 could synthesize unbounded scenes with high fidelity.
Kubric is an open-source Python package that uses Blender and PyBullet to generate realistic scenes with rich annotations. It is scalable, as it can distribute the computational load over thousands of machines. The paper demonstrated Kubric’s ability to generate datasets for a wide range of tasks, including optical flow estimation and studying 3D NeRF models (Figure 14). Interested practitioners can use the Kubric package, which is currently available on GitHub.
Figure 14. An example of a scene and annotations generated by Kubric
Real-world datasets often fail to capture edge cases. Thus, vision models trained on such datasets might fail to recognize such rare occurrences and cause harm to their users. For example, an autonomous vehicle that fails to recognize occluded stop signs might fail to react appropriately.
Synthetic datasets could address such robustness issues. That is the motivation of the researchers behind SHIFT, which is the largest synthetic dataset for training self-driving cars. (Figure 15)
Figure 15. The SHIFT dataset
The dataset includes scenarios spanning multiple weather conditions (cloudy, rainy, or foggy), times of day, and pedestrian intensities. It also includes a comprehensive sensor suite and annotations, which makes it suitable for a range of perception tasks. Further, SHIFT’s videos are also longer than those in real-world datasets, making them ideal candidates for test-time learning.
CVPR 2022 will not disappoint
If there is one thing we can learn from CVPR 2022, it is that the state of synthetic image generation is rapidly advancing. In particular, we see the refinement of technologies like NeRF and GAN in producing higher fidelity synthetic images. We also witness an influx of papers that are using synthetic data to produce state-of-the-art results on a multitude of vision tasks. Looking forward, we foresee advancements to translate into tangible, practical use cases of synthetic data in real-world applications.
What we’ve covered here is only the tip of the iceberg of what CVPR 2022 has to offer. There are dozens of papers to peruse, hundreds of workshops and tutorials to engage with, and plenty of computer vision experts to learn from. If you’re eager to learn more, you can study the conference proceedings, which will be available via CVF and IEEE Xplore after the conference.
According to Gartner, we’re cool. Gartner® recently named Datagen as a cool vendor in their AI for Computer Vision report.
Our product allows organizations to generate high-quality, domain-specific images, both 2D and 3D, for computer vision use cases.
Our platform is unique in providing granular customization for our customers. Customers have control over camera lens specifications, lighting, environmental factors, demographics, and metadata.
Our platform can provide domain-specific synthetic data for home security, in-cabin, smart office, and more, with our data covering both humans and object perception. Customers also rely on us to alleviate bias and privacy concerns, because synthetic data is not linked to any individual and minimizes the risk of leaking PII.
Visit our booth #831 at CVPR 2022 to enter to win a drone! Schedule a meeting here.
What’s up next for Datagen? A whole host of new products able to generate synthetic data.
With our new smart office model, users can generate an entire conference room, including objects such as post-it notes and whiteboards, and create gaze detection and interaction with those objects.
Our interactive home security model allows users to generate a delivery person approaching the door, dropping off a package, or exhibiting suspicious activity, helping users iterate on their models and improve performance.
Cool stuff, right?
*Gartner, “Cool Vendors™ in AI for Computer Vision”, Shubhangi Vashisth, Arun Chandrasekaran, et al. May 26, 2022.
GARTNER and COOL VENDORS are registered trademarks and service marks of Gartner, Inc. and/or its affiliates in the U.S. and internationally and are used herein with permission. All rights reserved. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.