Preparing the Data for Hands-on-Wheel Detection

In our first blog, we reviewed why we chose to focus on hands-on-wheel detection and how synthetic data impacted our outcomes. Here we will discuss the data we used in our benchmark. 

What data did we use? 

We chose the DMD dataset as our real dataset because it contains a large number of drivers, driving scenarios, and camera angles, and has a wide variety of tagged behaviors, including whether or not the hands are on the wheel. We split the dataset into train, validation, and test sets based on the identity of the drivers. In total, the dataset contains 651k frames, of which we use 531k for training, 47k for validation, and the rest for testing. The drivers are recorded using three cameras: one facing the driver’s head, one facing the driver’s body, and one facing the driver’s hands. A side view of the wheel offered the clearest perspective on whether the hands are on the wheel. The probability of the left and right hands being off the wheel is not the same: in left-hand-drive vehicles, such as those in the DMD recordings, drivers typically perform other actions with their right hand while the left hand remains on the wheel. This bias can be seen in Table 1.

 

Left Hand | Right Hand | Synthetic     | Real
On        | On         | 5,642 (50%)   | 214,192 (32.8%)
On        | Off        | 3,546 (31.4%) | 304,102 (46.7%)
Off       | On         | 2,014 (17.8%) | 122,416 (18.8%)
Off       | Off        | 82 (0.7%)     | 10,579 (1.6%)
Total     |            | 11,284        | 651,289

Table 1. Label distribution in real and synthetic datasets.
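The driver-disjoint split described above is important for avoiding identity leakage between training and evaluation data. As a rough illustration (not the benchmark’s actual tooling), here is a minimal sketch using scikit-learn’s GroupShuffleSplit on a hypothetical frame-level table with a driver_id column; the file names, driver count, and split ratios are placeholders.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical frame-level table: one row per frame, tagged with the driver it came from.
frames = pd.DataFrame({
    "frame_path": [f"frame_{i:06d}.png" for i in range(1000)],
    "driver_id": [i % 10 for i in range(1000)],  # 10 fictional drivers
})

# First carve out the training drivers, then split the remaining drivers into val/test.
# Every frame of a given driver lands in exactly one split, so no identity leaks across splits.
gss_train = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, rest_idx = next(gss_train.split(frames, groups=frames["driver_id"]))
rest = frames.iloc[rest_idx]

gss_val = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=0)
val_idx, test_idx = next(gss_val.split(rest, groups=rest["driver_id"]))

train_df = frames.iloc[train_idx]
val_df, test_df = rest.iloc[val_idx], rest.iloc[test_idx]
print(len(train_df), len(val_df), len(test_df))
```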

Read the full benchmark report — Hands Up: Leveraging Synthetic Data for Hands-on-Wheel Detection

How did we prepare the synthetic data? 

We used the Datagen synthetic data platform to generate a diverse video dataset composed of different drivers performing various actions in different vehicles. Among the multiple camera views available, we rendered the scene using a camera focused on the driver’s body, a viewpoint similar to that of the real data. Each scene is 10 seconds long and is rendered at 15 frames per second. Each image has a resolution of 256×256 pixels and comes with hand, body, and wheel keypoints. See Figure 1 for some RGB examples from the synthetic dataset.

Figure 1. Sample images from our synthetic dataset

To maximize variance in our dataset, we generated diverse sequences:

  1. Environment – Our dataset included various car types: large and medium SUVs and sedan-type vehicles. The car interiors also varied, including seat types, wall colors, and, especially important for our task, different wheel types.
  2. Demographics – We used ten different drivers of varying ethnicity, age, and gender.
  3. Behaviors – We generated multiple behaviors such as falling asleep, turning around, texting, one-handed driving, and two-handed driving.
  4. Scene – We generated all sequences with random backgrounds and lighting conditions: daylight, evening light, or night.

In total, we generated 146 sequences.

For each frame, we separately labeled each hand as being on or off the steering wheel. The availability of 3D keypoints from our platform makes hands-on-wheel labeling an almost trivial task. In total, the synthetic dataset contained 11,284 unique images. We split our train, validation, and test sets based on driver identity. Our training set contained 8,834 images, and the validation set contained 2,450 images with the same label proportions as the training split (we undersampled frames with both hands on the wheel to match the real data distribution).
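The post does not spell out the exact labeling rule, but with 3D keypoints for the hands and the wheel, a simple distance threshold is enough to derive a per-hand label. The sketch below illustrates that idea; the keypoint coordinates and the 5 cm threshold are made up for the example and are not the values used in the benchmark.

```python
import numpy as np

def hand_on_wheel(hand_keypoints: np.ndarray,
                  wheel_keypoints: np.ndarray,
                  threshold_m: float = 0.05) -> bool:
    """Label a hand as on the wheel if any hand keypoint lies within
    threshold_m meters of any wheel keypoint (illustrative rule)."""
    # Pairwise distances between (H, 3) hand points and (W, 3) wheel points.
    dists = np.linalg.norm(hand_keypoints[:, None, :] - wheel_keypoints[None, :, :], axis=-1)
    return bool(dists.min() <= threshold_m)

# Toy example with made-up 3D coordinates (meters, camera frame).
wheel = np.array([[0.0, 0.0, 0.5], [0.2, 0.0, 0.5], [-0.2, 0.0, 0.5]])
left_hand = np.array([[0.19, 0.01, 0.5]])    # resting on the wheel rim
right_hand = np.array([[0.40, -0.30, 0.6]])  # away from the wheel

print(hand_on_wheel(left_hand, wheel), hand_on_wheel(right_hand, wheel))  # True False
```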

Did we use any pre-processing?

To eliminate background distractions, we manually cropped both the real and synthetic images around the wheel so that only the wheel and hands are visible, without any extra details. See Figure 2 for some examples from the real and synthetic datasets.

 

Figure 2. Examples from the real and synthetic datasets after cropping around the wheel
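As an illustration of this pre-processing step, the sketch below crops a frame to a fixed box around the wheel and resizes it to the network’s input size. The image path and box coordinates are placeholders; as noted above, the actual crop regions were chosen manually for each camera setup.

```python
from PIL import Image

def crop_around_wheel(image_path: str,
                      box: tuple[int, int, int, int],
                      out_size: tuple[int, int] = (256, 256)) -> Image.Image:
    """Crop the frame to a (left, upper, right, lower) box around the wheel
    and resize it to the model's input resolution."""
    img = Image.open(image_path)
    return img.crop(box).resize(out_size)

# Hypothetical crop box for one camera setup, keeping only the wheel and hands visible.
WHEEL_BOX = (320, 380, 960, 1020)
cropped = crop_around_wheel("frame_000001.png", WHEEL_BOX)
cropped.save("frame_000001_cropped.png")
```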

What was the model architecture?

We chose the lightweight MobileNetV3 architecture as the backbone for all our experiments, given the real-time nature of the task. We replaced the classification head with two binary classification heads, each containing two fully connected layers activated with ReLU and a final fully connected layer with a sigmoid activation. The two classification heads predict, respectively, whether the left and right hands are on the wheel.
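The hidden layer sizes are not given in the post, so the sketch below is one plausible PyTorch realization of the described architecture, assuming a torchvision MobileNetV3-Small backbone and illustrative head dimensions: two per-hand heads, each with two ReLU-activated fully connected layers and a sigmoid output.

```python
import torch
import torch.nn as nn
from torchvision import models

class HandsOnWheelNet(nn.Module):
    """MobileNetV3 backbone with one binary head per hand (illustrative sizes)."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        backbone = models.mobilenet_v3_small(weights=None)
        feat_dim = backbone.classifier[0].in_features  # 576 for MobileNetV3-Small
        backbone.classifier = nn.Identity()            # keep pooled features, drop original head
        self.backbone = backbone

        def make_head() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid(),
            )

        self.left_head = make_head()   # P(left hand on wheel)
        self.right_head = make_head()  # P(right hand on wheel)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        return self.left_head(feats), self.right_head(feats)

model = HandsOnWheelNet()
left_prob, right_prob = model(torch.randn(2, 3, 256, 256))
print(left_prob.shape, right_prob.shape)  # torch.Size([2, 1]) torch.Size([2, 1])
```

Each head can then be trained with an independent binary cross-entropy loss on its hand’s label.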

Our final blog will discuss the experiment and the results. 


Datagen at CVPR

Our team recently returned from CVPR 2022, and it was an incredible experience. We were blown away by the people, the technology, and the papers, and we even made time to visit New Orleans, eat a few beignets, and listen to some jazz.

The People

It takes a village to run a booth while also experiencing everything CVPR has to offer. Our Datagen team included our CTO, algorithm engineers, product and marketing staff, and our VP of R&D, to name a few. We met people from all over the globe – Saudi Arabia, Mexico, South Korea, Japan – and enjoyed spending time with each of them. It was exciting to learn how they see and use synthetic data, what their challenges are, and what they need from synthetic data in their everyday work.

We had over 500 conversations with both academics and engineers working in industry. We discussed with our visitors where synthetic data is needed: humans in context, face recognition (including face liveness), generative algorithms for adding attributes to faces, and verticals such as in-cabin automotive, home security, metaverse AR/VR, smart office, and fitness. We also spoke about the challenges academics face when using synthetic data, such as the difficulty of rendering at scale, the lack of 3D artists, and the effort it takes to create the many identities needed for their research.

The Technology

Synthetic data was definitely trending at CVPR, from how to use it most effectively to its role in data-centric AI, and everyone was interested and curious about what we do. But there were also many other cool ideas and papers presented. NeRF seemed to dominate and is making progress toward closing the gap in a variety of ways; there were over 50 papers on NeRF alone! Some of the most exciting advances: fewer images and faster training, accurate reflections, light control, material control, HDR, and large-scale (whole-block) scenes. See even more on our blog about 6 papers to watch at CVPR 2022 here.

The Research

Our CTO and co-founder, Gil Elbaz, presented at the 7th Workshop on Benchmarking Multi-Target Tracking: How Far Can Synthetic Data Take Us?, speaking about synthetic data and our recent benchmark on leveraging synthetic data for hands-on-wheel detection, and how this can be used to train driver monitoring systems. Gil demonstrated the use of synthetic photorealistic in-cabin data, created on the Datagen platform, to train a Driver Monitoring System (DMS) that detects whether the driver’s hands are on the wheel. The experiment used synthetic data to train a lightweight neural network to detect when the driver removes their hands from the wheel, achieving results similar to training on real data. This showcases the ability of human-centric synthetic data to generalize well to the real world and help train algorithms in computer vision settings where data from the target domain is scarce or hard to collect.

Jonathan Laserson, Datagen’s Head of AI Research, presented at the Machine Learning with Synthetic Data workshop on Applying StyleGAN on Top of Synthetically Generated Data. Neural generators like StyleGAN can generate photorealistic images in many domains after learning their distribution “bottom-up” from large image datasets. Even though it is possible to manipulate the generated images in various ways, controlling the generated content is a hard task, as it requires reverse-engineering the latent space of the StyleGAN.

To bridge this gap between controlled diversity and photorealism, Jonathan proposed creating an initial version of the desired image using the top-down synthetic pipeline and then inverting this image into the latent space of a StyleGAN trained on real images. He showed that the inversion maintains the same person identity, but adds photorealism and provides access to new modes of diversity. This enables us to generate synthetic, photorealistic image datasets that can be used to train computer vision models, such as face recognition, while retaining full control over the distribution of the data.
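As a rough, framework-agnostic illustration of optimization-based inversion (not the specific method presented in the talk), the sketch below fits a latent code so that a stand-in generator reproduces a target image. A real setup would use a pretrained StyleGAN generator and typically a perceptual loss; the tiny generator, image size, and plain MSE loss here are placeholders.

```python
import torch
import torch.nn as nn

# Stand-in generator mapping a latent vector to a small RGB image.
# In practice this would be a pretrained StyleGAN trained on real images.
latent_dim = 64
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 3 * 32 * 32), nn.Tanh(),
)
generator.requires_grad_(False)

# Target: the "top-down" synthetic render we want to invert (random stand-in here).
target = torch.rand(1, 3 * 32 * 32) * 2 - 1

# Optimize the latent code so the generator's output matches the target image.
w = torch.zeros(1, latent_dim, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(generator(w), target)
    loss.backward()
    opt.step()

inverted = generator(w)  # in a real StyleGAN setup, the photorealistic counterpart
print(f"final reconstruction error: {loss.item():.4f}")
```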

The Fun

New Orleans is definitely a fun city! There was plenty to do and see in the city, at the expo, and at the conference itself. Here are some of the highlights:

Autonomous Vehicles

Project Aria Glasses from Meta

Synthetic Mice

New Orleans

Karine Regev is Datagen’s VP of Marketing. She has over 17 years of experience in marketing, especially in scaling security and AI tech startups. Karine has a track record of growing brand and market share, and she specializes in driving lead generation and building marketing teams for B2B growth. She works to bring Datagen’s innovation to the global market and to share the possibilities of synthetic data.