Preparing the Data for Hands-on-Wheel Detection

In our first blog, we reviewed why we chose to focus on hands-on-wheel detection and how synthetic data impacted our outcomes. Here we will discuss the data we used in our benchmark. 

What data did we use? 

We chose the DMD dataset as our real dataset because it contains a large number of drivers, driving scenarios, and camera angles, and has a wide variety of tagged behaviors, including whether or not the hands are on the wheel. We split the dataset into train, validation, and test sets based on the identity of the drivers. In total, the dataset contains 651k frames, of which we use 531k for training, 47k for validation, and the rest for testing. The drivers are recorded using three cameras: one facing the driver’s head, one facing the driver’s body, and one facing the driver’s hands. A side view of the wheel offered a clearer perspective on whether the hands are on the wheel. The probabilities of the left and right hands being off the wheel differ: drivers who sit on the left side of the car typically perform other actions with their right hand while the left hand remains on the wheel. This bias can be seen in Table 1.

 

| Left Hand | Right Hand | Synthetic | Real |
| --- | --- | --- | --- |
| On | On | 5,642 (50%) | 214,192 (32.8%) |
| On | Off | 3,546 (31.4%) | 304,102 (46.7%) |
| Off | On | 2,014 (17.8%) | 122,416 (18.8%) |
| Off | Off | 82 (0.7%) | 10,579 (1.6%) |
| Total | | 11,284 | 651,289 |

Table 1. Label distribution in real and synthetic datasets.
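Both splits above are made at the driver level. For context, here is a minimal sketch of such a driver-identity split; the driver_id field, the number of held-out drivers, and the random assignment are illustrative assumptions rather than the actual DMD protocol.

```python
import random
from collections import defaultdict

def split_by_driver(frames, n_val_drivers=2, n_test_drivers=2, seed=0):
    """Group frames by driver and assign whole drivers to train/val/test."""
    by_driver = defaultdict(list)
    for frame in frames:                      # each frame is a dict with a "driver_id" key (assumed)
        by_driver[frame["driver_id"]].append(frame)

    drivers = sorted(by_driver)
    random.Random(seed).shuffle(drivers)

    test_ids = set(drivers[:n_test_drivers])
    val_ids = set(drivers[n_test_drivers:n_test_drivers + n_val_drivers])

    splits = {"train": [], "val": [], "test": []}
    for driver_id, items in by_driver.items():
        if driver_id in test_ids:
            splits["test"].extend(items)
        elif driver_id in val_ids:
            splits["val"].extend(items)
        else:
            splits["train"].extend(items)
    return splits
```

Splitting by driver, rather than by frame, ensures that no driver seen during training appears in validation or test.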

Read the full benchmark report — Hands Up: Leveraging Synthetic Data for Hands-on-Wheel Detection

How did we prepare the synthetic data? 

We used the Datagen synthetic data platform to generate a diverse video dataset composed of different drivers performing various actions in different vehicles. Among the multiple camera views available, we rendered the scene using a camera focused on the driver’s body, a viewpoint similar to the real data. Each scene is 10 seconds long and is rendered at 15 frames per second. Each image has a resolution of 256×256 and includes hand, body, and wheel keypoints. See Figure 1 for some RGB examples from the synthetic dataset.

Figure 1. Sample images from our synthetic dataset

To maximize variance in our dataset we generated diverse sequences:

  1. Environment – Our dataset included various car types: large and medium SUVs and sedan-type vehicles. The car interiors also varied, including seat types, wall colors, and, especially important for our task, different wheel types. 
  2. Demographics – We used ten drivers of varying ethnicities, ages, and genders. 
  3. Behaviors – We generated multiple behaviors such as falling asleep, turning around, texting, one-handed driving, and two-handed driving.
  4. Scene – We generated all sequences with random backgrounds and lighting conditions – daylight, evening light, or night. 

In total, we generated 146 sequences.

For each frame, we separately labeled each hand as being on or off the steering wheel. The availability of 3D keypoints from our platform makes hands-on-wheel labeling almost trivial. In total, the synthetic dataset contained 11,284 unique images. We split our train, validation, and test sets based on driver identity. Our training set contained 8,834 images. The validation set consisted of 2,450 images following the same proportions as the train split (we undersampled both-hands-on-wheel frames to match the real data distribution).
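As an illustration of why 3D keypoints make this labeling nearly automatic, here is a hedged sketch; the keypoint arrays and the 5 cm distance threshold are assumptions, not the platform’s actual labeling rule.

```python
import numpy as np

def hand_on_wheel(hand_kpts_3d: np.ndarray, wheel_kpts_3d: np.ndarray,
                  threshold_m: float = 0.05) -> bool:
    """Label a hand as 'on wheel' if any hand keypoint lies within
    threshold_m meters of the nearest wheel keypoint."""
    diffs = hand_kpts_3d[:, None, :] - wheel_kpts_3d[None, :, :]   # (H, W, 3) pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)                         # (H, W) pairwise distances
    return bool(dists.min() <= threshold_m)

# Per-frame labels for both hands (hypothetical keypoint arrays):
# left_on = hand_on_wheel(left_hand_kpts, wheel_kpts)
# right_on = hand_on_wheel(right_hand_kpts, wheel_kpts)
```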

Did we use any pre-processing?

To eliminate background distractions, we manually cropped both the real and synthetic images around the wheel so that only the wheel and hands remain visible, without any extra details. See Figure 2 for some examples from the real and synthetic datasets.
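A minimal sketch of such a crop, assuming a manually chosen square window (center and half-size in pixels) per camera view; the 256×256 output size mirrors the synthetic image resolution mentioned above.

```python
from PIL import Image

def crop_around_wheel(image: Image.Image, cx: int, cy: int,
                      half_size: int, out_size: int = 256) -> Image.Image:
    """Crop a square window centered on the wheel and resize it."""
    box = (cx - half_size, cy - half_size, cx + half_size, cy + half_size)
    return image.crop(box).resize((out_size, out_size))

# Example with assumed coordinates for a body-camera frame:
# cropped = crop_around_wheel(Image.open("frame.jpg"), cx=640, cy=420, half_size=200)
```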

 

Figure 2. Examples from the real and synthetic datasets after cropping around the wheel

What was the model architecture?

We chose the lightweight MobileNetV3 architecture as the backbone for all our experiments, considering the real-time nature of our task. We replaced the classification head with two binary classification heads, each containing two fully connected layers activated with ReLU and a final fully connected layer with a sigmoid activation. The two classification heads predict, respectively, whether the left or right hand is on the wheel.
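A minimal PyTorch sketch of this kind of two-headed classifier; the MobileNetV3-Small variant, hidden layer widths, and pooling choice are our own assumptions, as the post does not specify them.

```python
import torch.nn as nn
from torchvision import models

class HandsOnWheelNet(nn.Module):
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        backbone = models.mobilenet_v3_small(weights=None)   # or ImageNet weights
        self.features = backbone.features                    # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        in_dim = backbone.classifier[0].in_features          # 576 for MobileNetV3-Small

        def make_head():
            # two fully connected layers with ReLU, then a sigmoid-activated output
            return nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1), nn.Sigmoid(),
            )

        self.left_head = make_head()    # P(left hand on wheel)
        self.right_head = make_head()   # P(right hand on wheel)

    def forward(self, x):
        z = self.pool(self.features(x)).flatten(1)
        return self.left_head(z), self.right_head(z)
```

Each head can be trained with a binary cross-entropy loss against its per-hand label.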

Our final blog will discuss the experiment and the results. 

Read the full benchmark report — Hands Up: Leveraging Synthetic Data for Hands-on-Wheel Detection

Datagen at CVPR

Our team recently returned from CVPR 2022, and it was an incredible experience. We were blown away by the people, the technology, and the papers, and we even made time to visit New Orleans, eat a few beignets, and listen to some jazz.

The People

It takes a village to run a booth and experience everything that CVPR has to offer at the same time. Our Datagen team included our CTO, algo engineers, product, marketing and our VP R&D, to name a few. We met people from all over the globe – Saudi Arabia, Mexico, South Korea, Japan – and enjoyed spending time with each of them. It was exciting to understand how they see and use synthetic data, what the challenges are and what they need from synthetic data for their everyday work. 

We had over 500 conversations with both academics and engineers working in industry. We discussed with our visitors where synthetic data is needed: humans in context, face recognition (including face liveness), generative algorithms that add attributes to faces, and verticals like in-cabin automotive, home security, metaverse AR/VR, smart office, and fitness. We also spoke about the challenges of using synthetic data in academia, such as the difficulty of rendering at scale, the lack of 3D artists, and how much effort it takes to create the many identities needed for research.

The Technology

Synthetic data was definitely trending at CVPR, from showing how to use it in the best way to using it for data-centric AI. Everyone was interested and curious about what we do. But there were also many other cool ideas and papers presented. NeRF seemed to dominate and is making progress towards closing the gap in a variety of ways; there were over 50 papers on NeRF alone! Some of the most exciting advances: fewer images and faster training, accurate reflections, light control, material control, HDR, and large-scale (whole-block) scenes. See even more on our blog about 6 papers to watch at CVPR 2022 here.

The Research

Our CTO and co-founder, Gil Elbaz, presented at the 7th Workshop on Benchmarking Multi-Target Tracking: How Far Can Synthetic Data Take Us?, speaking about synthetic data, our recent benchmark on leveraging synthetic data for hands-on-wheel detection, and how it can be used to train driver monitoring systems. Gil demonstrated the use of synthetic photorealistic in-cabin data, created on the Datagen platform, to train a Driver Monitoring System (DMS) that detects whether the driver’s hands are on the wheel. The experiment used synthetic data to train a lightweight neural network to detect when the driver removes their hands from the wheel, achieving results similar to training on real data. This showcases the ability of human-centric synthetic data to generalize well to the real world and to help train algorithms in computer vision settings where data from the target domain is scarce or hard to collect.

Jonathan Laserson, Datagen’s Head of AI Research, presented at the Machine Learning with Synthetic Data workshop on applying StyleGAN on top of synthetically generated data. Neural generators like StyleGAN can generate photorealistic images in many domains after learning their distribution “bottom-up” from large image datasets. Even though it is possible to manipulate the generated images in various ways, controlling the generated content is a hard task, as it requires reverse-engineering the latent space of the StyleGAN.

To bridge the domain gap between diversity and photorealism, Jonathan proposed generating an initial version of the desired image with the top-down synthetic pipeline, and then inverting this image into the latent space of a StyleGAN trained on real images. He showed that the inversion maintains the same person’s identity, but adds photorealism and provides access to new modes of diversity. This enables us to generate synthetic, photorealistic image datasets that can be used to train computer vision models, such as face recognition, while retaining full control over the distribution of the data.

The Fun

New Orleans is definitely a fun city! There were lots of fun things to do and see in the city, at the Expo and the conference itself. Here are some of the highlights:

Autonomous Vehicles

Project Aria Glasses from Meta

Synthetic Mice

New Orleans

Karine Regev is Datagen’s VP of Marketing. She has over 17 years of experience in marketing, especially scaling security and AI tech startups. Karine has a track record of growing brand and market share, and specializes in driving lead generation and developing marketing teams for B2B growth. She works to bring Datagen’s innovation to the global market and to share the possibilities of synthetic data.

6 Papers to Watch at CVPR 2022

CVPR 2022 is happening soon. The previous CVPR saw groundbreaking discoveries that pushed the boundaries of computer vision. This year is no exception. With topics running the gamut from deep learning architecture to image synthesis, CVPR 2022 will be a culmination of cutting-edge ideas in all areas of computer vision. 

Here, we present six papers to watch from CVPR 2022:

1. SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing

Recent advancements in Generative Adversarial Networks (GANs) gave rise to impressive progress in image synthesis. Acting as powerful generative priors, GANs lay a solid foundation for downstream synthesis tasks.

One well-known work is the StyleGAN series, where each generated image is conditioned on a set of latent codes. These latent codes are highly informative but also difficult to interpret. Yet, researchers remain steadfast in their exploration of these codes, in hopes of discovering methods of improving the controllability of GANs.

Such exploration yielded methods to control the global attributes of the generated images from StyleGANs. Others found that training GANs from scratch with additional supervision resulted in better control. 

Despite these refinements, practitioners lamented the limited control they had over the local features generated by GANs. This is where SemanticStyleGAN shines. 

SemanticStyleGAN provides more controllable image synthesis because its latent space is factorized based on semantic parts (like face, hair, or nose). Put simply, SemanticStyleGAN’s local latent codes can be used to control the structure and texture of features like the face, hair, and nose (Figure 1).

Figure 1. A comparison of (a) StyleGAN against (b) SemanticStyleGAN. Notice the latent codes of SemanticStyleGAN are factorized over different regions while those of StyleGAN are associated with different granularities. (Source)

SemanticStyleGAN achieved exceptional disentanglement between different spatial areas. In the context of generating a face, spatial areas include the face, eyes, mouth, and hair (Figures 2 and 3). Users of SemanticStyleGAN could also disentangle foreground and background, allowing for precise control over the location and size of the foreground.

Figure 2. SemanticStyleGAN achieves exceptional disentanglement between different spatial areas. 

Figure 3. SemanticStyleGAN holds precise control over the location and size of the foreground (Source)

StyleGAN could mix styles between generated images at different granularities. SemanticStyleGAN one-upped StyleGAN by allowing the transfer of styles of specific local areas, instead of entire granularity levels (Figure 4).

Figure 4. SemanticStyleGAN could transfer styles locally. The styles of the images in the first row (e.g. face, eyes, eyebrows, mouth, and hair) are transferred to those in the first column. Notice how only the style of a specific feature (not the entire image) is transferred.

SemanticStyleGAN represents a step toward more controllable image synthesis. Having discovered the means to have greater control over image synthesis tasks, we are inching towards a fully customizable and scalable image generator. 

Meet us at booth #813 at CVPR 2022. Schedule a meeting here.

2. StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2

The realm of image generation has seen unprecedented growth within the short span of a few years. Unfortunately, the video generation community saw far fewer advancements. The challenge lies with the complexity of the underlying video dataset. It also does not help that modern video generators treat videos as discrete images and thus have intensive resource demands.

StyleGAN-V challenges that status quo by thinking of videos as time-continuous signals. The researchers represented continuous motion with positional embeddings that have time-varying parameters. With that, they argued (and demonstrated) that as few as two frames per video are sufficient to build a meaningful video generator.

Built on top of the image-based StyleGAN2, StyleGAN-V can produce long videos at a high frame rate efficiently. In particular, StyleGAN-V is only 5% more expensive to train than StyleGAN2 with only slightly worse (~10%) image quality. As a result, StyleGAN-V could be trained directly on 1024×1024 videos, a feat unparalleled by previous state-of-the-art video generators.

The authors compared their results with a few benchmarks. StyleGAN-V managed to produce a realistic hour-long video from the RainbowJelly dataset. On the other hand, other video generators like VideoGPT, DIGAN, and MoCoGAN-HD struggle to produce meaningful results after one hour. (Figure 5)

Figure 5. A snapshot of the videos generated by different video generators at the 1-hour mark using Rainbow Jelly 256×256 benchmark

On the MEAD 1024×1024 dataset, StyleGAN-V could preserve the identity of a speaker, while MoCoGAN-HD produced noisy results after 1 second. (Figure 6)

Figure 6. Given the face image of a speaker from MEAD 1024×1024, StyleGAN-V synthesized an hour-long video that preserved the identity of the speaker (bottom row). MoCoGAN-HD could not do the same (top row)  [Source]

The output of StyleGAN-V can be controlled by tuning the latent space. To the delight of the researchers, StyleGAN-V inherits the perks of StyleGAN2, including the ability to perform semantic manipulation. To do so, the researchers edited the first frame of a video with CLIP and animated it with StyleGAN-V (Figure 7).

Figure 7. StyleGAN-V, like StyleGAN2, has the ability of semantic manipulation.

Admittedly, StyleGAN-V still has some limitations to overcome. In particular, the model might produce repeated periodic motions at times and handle certain new content poorly. That said, StyleGAN-V is still a solid foundation for even more powerful video generators in the future.

3. Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data

Many facial recognition techniques assume that the input facial images do not have occlusions. When that assumption is not met, the performance of the model degrades. One such occlusion is eyeglasses, and being able to remove eyeglasses from portraits could handle such cases. (Figure 8)

Figure 8. Portrait Eyeglasses and Shadow Removal [Source]

The task of removing eyeglasses (and their shadows) is not trivial. Not only do the facial features have to be recovered, but the lighting effects (like shadows, reflections, and distortions) associated with the glasses also have to be removed. Conceivably, modeling these effects is computationally expensive and difficult.

Thus, the authors proposed a detect-then-remove algorithm that first detects the presence of the eyeglasses before removing them.

Due to a lack of real-world data, the authors trained both the detection and removal algorithm with a synthetic portrait dataset. The benefit of using a synthetic dataset is the ability to generate a diverse dataset spanning various identities, expressions, eyeglasses, and illuminations. As a result, the model can generalize across different lighting conditions, head poses, and eyeglasses (Figure 9).

Figure 9. Removing glasses across various settings [Source]

Notably, the authors also used synthetic images that cannot be captured in the real world for intermediate supervision. In particular, portraits with eyeglasses but no shadows (and vice versa) were used to train the de-glassing and de-shadow networks.

To bridge the gap between synthetic and real data, the authors use a domain adaptation network. It converts the input images – whether synthetic or real – into a uniform feature map.

The results from this paper are promising. A quick comparison of the authors’ results with state-of-the-art eyeglasses removal methods (Figure 10) reveals the limitations of current methods. For example, existing methods like CycleGAN, StarGAN, and pix2pix cannot completely remove the eyeglasses or their shadows in some cases.

Figure 10. Qualitative comparisons of different methods across the multiple images. [Source]

The authors achieved state-of-the-art results despite the lack of real-world data. We would even argue that the variety and customizability of synthetic data are an important ingredient in addressing the weaknesses of previous models. For many tasks beyond eyeglasses removal, synthetic data might be the key to pushing the boundary of other computer vision tasks.

4. Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data

Many computer vision models rely on pre-trained models trained on large real-world datasets. As demonstrated by the paper Exploring the Limits of Weakly Supervised Training, the downstream performance of such pre-trained models increases with the size of their training sets. It is not a surprise then that researchers are training models with ever-larger datasets in hopes of pushing the boundary of the state-of-the-art.

Yet, real-world data comes with strings attached. It is expensive and laced with privacy and ethical issues. An alternative to real-world data, synthetic data could be used to pre-train vision models in a scalable and privacy-preserving manner.

One question remains–what kind of synthetic data is best suited for a specific computer vision task? That was the question that a group of researchers from Boston University, Watson AI Lab, and Cornell University aimed to answer. 

These researchers found that each downstream task has a favorable configuration of simulation parameters (like lighting, object pose, and backgrounds), and that no one-size-fits-all solution exists. They went on to propose Task2Sim, a unified model that finds the optimal simulation parameters for generating pre-training data for specific tasks.

The results from Task2Sim are impressive. Vision models trained on task-adaptive pre-training data generated by Task2Sim performed significantly better than those trained on non-adaptive data on 20 distinct classification tasks.

The researchers demonstrated that Task2Sim is essentially a “one-shot learner”. Specifically, once trained on a set of seen tasks, Task2Sim can then predict parameters for novel unseen tasks. Looking ahead, Task2Sim could be used as part of a pipeline to automatically generate pre-training data when given some downstream examples.

We have covered the technical implementation of Task2Sim in detail in another blog post. There, we described its architecture, training process, and performance on various tasks.

5. BANMo: Building Animatable 3D Neural Models from Many Casual Videos

3D shape reconstruction often relies on specialized sensors. This is especially true for deformable objects, like humans or animals. 

With the rise of virtual reality comes the burgeoning demand for creating animatable assets from images and videos captured by a regular camera. Doing so for humans was difficult without the help of specialized sensors. The same can be said for animals and any other 3D deformable objects (as opposed to rigid scenes like buildings). 

Researchers from Meta AI and Carnegie Mellon University pondered that problem and proposed BANMo, a Builder of Animatable 3D Neural Models. Instead of using specialized sensors to animate humans and animals, BANMo requires only casual videos of an object (Figure 11).

Figure 11. Input videos and BANMo’s reconstructions (Source)

When given multiple videos of a deformable object (e.g. a cat), BANMo could reconstruct an animatable 3D model without the need for any predefined shapes (Figure 12). The object can then be reconstructed with different poses.

Figure 12. An overview of BANMo’s reconstruction process

More concretely, BANMo consolidates 2D cues from thousands of images into a fixed canonical space. The output of BANMo is a neural blend-skinning model, which can be animated by manipulating bone transformations. Neural blend skinning handles pose variations and deformations better than dynamic NeRF approaches.

From the experiments, the researchers found that BANMo could reconstruct humans and animals from real and synthetic datasets with high fidelity. By taking better advantage of the large number of frames from multiple videos, BANMo produced finer 3D shape details and motions than state-of-the-art approaches like ViSER. (Figure 13)

Figure 13. A comparison of BANMo with the state-of-the-art (ViSER)

6. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields

Two years since its inception in 2020, Neural Radiance Fields (NeRF) is still one of the leading methods to synthesize novel views of complex scenes. Given a set of sparse input images of a scene, NeRF optimizes a volumetric representation of the scene. With that, it achieved state-of-the-art results in generating complicated geometry and appearance.

That said, NeRF is not perfect. In particular, it struggles with “unbounded scenes” where the scene could extend very far into the distance from any angle. Under such settings, the renderings are often blurry. Such a challenge arises due to three critical issues – parameterization, efficiency, and ambiguity.

Thus, Google and Harvard researchers present an extension of mip-NeRF (a variant of NeRF) called mip-NeRF 360. It uses a combination of techniques, including non-linear scene parameterization, online distillation, and a distortion-based regularizer, to overcome these challenges. Overall, mip-NeRF 360 can synthesize unbounded scenes with high fidelity.

Bonus Papers 

Kubric: A scalable dataset generator

Kubric is an open-source Python package that uses Blender and PyBullet to generate realistic scenes and rich annotations. It is scalable, as it can distribute the computational load over thousands of machines. The paper demonstrated Kubric’s ability to generate datasets for a wide range of tasks, including optical flow estimation and studying 3D NeRF models (Figure 14). Interested practitioners can use the Kubric package, which is currently available on GitHub.

Figure 14. An example of a scene and annotations generated by Kubric

SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation

Real-world datasets often fail to capture edge cases. Thus, vision models trained on such datasets might fail to recognize such rare occurrences and cause harm to their users. For example, an autonomous vehicle that fails to recognize occluded stop signs might fail to react appropriately. 

Synthetic datasets could address such robustness issues. That is the motivation of the researchers behind SHIFT, which is the largest synthetic dataset for training self-driving cars. (Figure 15)

Figure 15. The SHIFT dataset 

The dataset includes scenarios across multiple weather conditions (cloudy, rainy, or foggy), times of day, and pedestrian intensities. It also includes a comprehensive sensor suite and annotations, which makes it suitable for a range of perception tasks. Further, SHIFT’s videos are also longer than those in real-world datasets, making them an ideal candidate for test-time learning.

CVPR 2022 will not disappoint

If there is one thing we can learn from CVPR 2022, it is that the state of synthetic image generation is rapidly advancing. In particular, we see the refinement of technologies like NeRF and GANs in producing higher-fidelity synthetic images. We also witness an influx of papers that use synthetic data to produce state-of-the-art results on a multitude of vision tasks. Looking forward, we foresee these advancements translating into tangible, practical use cases of synthetic data in real-world applications.

What we’ve covered here is only the tip of the iceberg of what CVPR 2022 has to offer. There are dozens of papers to peruse, hundreds of workshops and tutorials to engage with, and plenty of computer vision experts to learn from. If you’re eager to learn more, you can study the conference proceedings, which will be available via CVF and IEEE Xplore after the conference.

Meet us at booth #813 at CVPR 2022. Schedule a meeting here.

Datagen Named as a Gartner Cool Vendor

According to Gartner, we’re cool. Gartner® recently named Datagen as a cool vendor in their AI for Computer Vision report. 

Our product allows organizations to generate high-quality, domain-specific images, both 2D and 3D, for computer vision use cases.

Our platform is unique and provides granular customization for our customers. Customers have control over camera lens specifications, lighting, environmental factors, demographics, and metadata.

Our platform can provide domain-specific synthetic data, such as home security, in-cabin, smart office, and more, with our data for human and object perception. Customers also rely on us to alleviate bias and privacy concerns, because the use of synthetic data, which is not linked to any individual, minimizes the risk of leaking PII.

Visit our booth #813 at CVPR 2022 to enter to win a drone! Schedule a meeting here.

What’s up next for Datagen? A whole host of new products able to generate synthetic data. 

With our new smart office model, our users can generate an entire conference room, including objects such as post-it notes and whiteboards, and create gaze detection and interaction with these objects.

Our interactive home security model allows users to generate a delivery person approaching the door and dropping off a package, or to identify suspicious activity, so they can iterate on their models and improve performance.

Cool stuff, right?

*Gartner, “Cool Vendors™ in AI for Computer Vision”, Shubhangi Vashisth, Arun Chandrasekaran, et al. May 26, 2022.

GARTNER and COOL VENDORS are registered trademarks and service marks of Gartner, Inc. and/or its affiliates in the U.S. and internationally and are used herein with permission. All rights reserved. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Facial Landmarks and Face Reconstruction: An Intro

Face (facial) landmarks are essential attributes of a human face that allow us to distinguish between different faces. Facial landmarks are used to identify and represent key parts of a human face, such as the nose, eyebrows, mouth, or eye corners. A face image’s landmarks are usually given as the 2D coordinates of their positions in the image plane.

These landmarks are widely used in computer vision applications like 3D face morphing, head pose estimation, face reconstruction, and extraction of facial regions of interest. A common representation of facial landmarks using 68 predefined points is presented in Figure 1. Unfortunately, many important face characteristics cannot be adequately recovered solely from a set of sparse facial landmarks. It is difficult, for example, to tell whether a person has high cheekbones without landmarks on the cheeks. Therefore, creating a high-quality 3D model solely from a few facial landmarks does not seem possible.

Fig.1: 68 facial landmarks
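Purely as a practical illustration of the 68-point convention in Fig. 1 (not the method discussed in this post), here is a minimal sketch using the dlib library; the model file path is an assumption and the pretrained predictor must be downloaded separately.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # path assumed

def landmarks_68(image_rgb: np.ndarray) -> np.ndarray:
    """Return a (68, 2) array of 2D landmark coordinates for the first detected face."""
    faces = detector(image_rgb, 1)           # upsample once to help with small faces
    if not faces:
        return np.empty((0, 2), dtype=int)
    shape = predictor(image_rgb, faces[0])   # fit the 68-point model inside the face box
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
```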

To overcome this hurdle, additional characteristics of face images are used to build 3D face models. For example, previous works harnessed optical flow (between face images taken from different angles) and depth map information for this task. These characteristics (optical flow and depth) are usually leveraged in addition to landmarks as constraints on a 3D face model. For example, the optical flow computed from a 3D face model reconstructed from a face image sequence should be consistent with the ground-truth optical flow for that sequence. Unfortunately, relying on these characteristics for face reconstruction can be problematic, as they are often inaccurate, difficult to obtain, and expensive.

Facial landmarks are often combined with the aforementioned characteristics to build more accurate face models with the help of differentiable rendering techniques. Differentiable renderers are applied to determine the (usually quite complex) relationship between facial identity features (such as landmarks or 2D projections) and the 3D face model parameters for shape and texture. Such renderers allow for the construction of a differentiable loss function over the face model parameters, which is usually optimized by gradient descent algorithms. Unfortunately, differentiable rendering methods are highly complex and usually involve intricate neural network training procedures. Additionally, in many cases simplifying assumptions on face image properties are required to make these methods computationally feasible.

The authors of the paper ask the following question: is there a way to construct a high-quality 3D face model based ONLY on facial landmarks, without exploiting other image characteristics like depth maps or optical flow and without using differentiable rendering techniques? We already know that it is not possible to build an accurate face model solely from a sparse set of facial landmarks covering only the frontal part of the face. But perhaps we can do it if we have a dense set of landmarks covering the entire head (see Fig. 2). Such a rich set of landmarks seems to contain sufficient information to build an accurate 3D face model.

Fig.2: Given a single image (top), we first robustly and accurately predict 703 landmarks (middle). To aid visualization, we draw lines between landmarks. We then fit our 3D morphable face model to these landmarks to reconstruct faces in 3D (bottom).

However, annotating a human face with a dense set of landmarks is a very difficult task. Even though humans are capable of accurately labeling face images with a small number of landmarks, annotating them with hundreds of dense landmarks seems almost impossible. Several methods have been developed for predicting dense facial landmarks, but unfortunately these methods often suffer from high computational complexity and are incapable of predicting landmarks densely covering the whole head, including the eyes and teeth, and therefore do not suffice for the task of 3D head reconstruction.

Paper’s Idea: An Essence

The authors of this paper tackle this problem with a different approach. Instead of using real data, the researchers suggest using synthetically generated facial images with perfect annotations of dense landmarks. The proposed approach consists of three main stages:

1. Rendering 100k synthetic face images using the method proposed by Wood et al., with 700 perfectly consistent 2D landmarks fully covering the whole face region. The generated set covers human faces with different hairstyles, clothing, facial expressions, and textures, rendered from different camera locations.

Fig.3: Given an image, we first predict probabilistic dense landmarks L, each with position µ and certainty σ. Then, we fit our 3D face model to L, minimizing an energy E by optimizing model parameters Φ.

2. Training a convolutional neural network (CNN) to predict landmark locations along with the prediction uncertainty. Importantly, the supervision comes solely from the 2D landmarks of the rendered face images; no other annotations are used.

3. Fitting a morphable 3D face model to these dense landmarks.

It is noteworthy that the synthetically generated facial images with their dense landmarks are the primary enabler of the proposed scheme, as manual annotation of 100k images with 700 landmarks each is impractical.

Landmark Prediction Model Training:

The landmark prediction model is probabilistic. Namely, a neural network is trained to predict the expected position µ = (x, y) of each landmark together with its standard deviation σ, which measures the prediction uncertainty. The model is trained to predict small σ values (low uncertainty) for visible landmarks (e.g. on the front of the face), whereas high values of σ are predicted for hidden landmarks (e.g. behind hair) to represent their high uncertainty. Note that a rendered image provides only the landmark positions, not their uncertainties.

Fig. 4: When parts of the face are occluded by e.g. hair or clothing, the corresponding landmarks are predicted with high uncertainty (red), compared to those visible (green).

To enforce these requirements on the trained model, the authors used a Gaussian negative log-likelihood loss function:
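A hedged reconstruction of this loss (the exact constants and normalization are our assumption; µ_i and σ_i denote the predicted position and uncertainty of the i-th landmark and x_i its ground-truth 2D location):

$$\mathcal{L} = \frac{1}{L}\sum_{i=1}^{L}\left[\log \sigma_i^{2} + \frac{\lVert \mu_i - x_i \rVert_2^{2}}{\sigma_i^{2}}\right]$$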

where L denotes the number of landmarks. The first part of the loss penalizes the model for too high an uncertainty, while the second part controls the prediction accuracy.
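For readers implementing this, a minimal sketch of the same idea using PyTorch’s built-in Gaussian NLL loss; the tensor names, shapes, and random placeholders below are illustrative assumptions, not the paper’s code.

```python
import torch
import torch.nn as nn

# Illustrative shapes: batch of B images, L landmarks each.
B, L = 8, 703
pred_mu = torch.randn(B, L, 2)             # predicted landmark positions
pred_sigma = torch.rand(B, L, 1) + 1e-3    # predicted per-landmark uncertainty
target_xy = torch.randn(B, L, 2)           # ground-truth 2D landmarks

gnll = nn.GaussianNLLLoss(reduction="mean")
loss = gnll(pred_mu, target_xy, pred_sigma.pow(2).expand_as(pred_mu))
```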

3D face model fitting:

After training the probabilistic landmark prediction model, a 3D face model is reconstructed by minimizing a two-term loss function:

1. The first term, E_landmarks, fits the probabilistic landmarks from the first stage. This term penalizes the 3D model for inaccurately “explaining” the 2D landmarks and has the following form:
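A hedged reconstruction of this term (the 1/2 factor and exact weighting are our assumption):

$$E_{\text{landmarks}} = \sum_{i,j,k} \frac{\lVert x_{ijk} - \mu_{ijk} \rVert_2^{2}}{2\,\sigma_{ijk}^{2}}$$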

where, for the k-th landmark viewed by the j-th camera in the i-th frame, [µ_ijk, σ_ijk] are the 2D location and uncertainty predicted in the previous stage, while x_ijk is the 2D projection of that landmark from the 3D model. Note that the loss function has a reduced penalty for hidden landmarks (e.g., occluded by hair or clothes), which are given a high uncertainty by the first-stage model.

2. The second term is a regularizer imposing inherent face properties on the model output. It consists of five terms: E_identity, E_expression, E_joints, E_temporal, and E_intersect. Here, E_identity penalizes unlikely face shapes, E_temporal requires face mesh vertices to remain still between neighboring frames, and E_intersect aims at avoiding skin intersections with eyeballs or teeth. Finally, the objectives of E_expression and E_joints are to minimize joint rotation and facial expression (with respect to a “neutral” face position and expression) when describing the training data (landmarks).
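Putting the two parts together, the fitting energy minimized over the model parameters Φ takes a form along these lines (the per-term weights λ are our own illustrative notation; the paper may balance the terms differently):

$$E(\Phi) = E_{\text{landmarks}} + \lambda_1 E_{\text{identity}} + \lambda_2 E_{\text{expression}} + \lambda_3 E_{\text{joints}} + \lambda_4 E_{\text{temporal}} + \lambda_5 E_{\text{intersect}}$$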

Performance Improvement:

The performance of the proposed approach was compared with several recent methods on two benchmarks: the NoW Challenge and the MICC dataset. The evaluation covers both single-view and multi-view tasks: single-view means a face model is fit to each image separately, while multi-view means a face model is built from all images of a specific subject. On the former benchmark, the proposed technique outperformed state-of-the-art methods on both tasks.

Fig. 5: Results for the NoW Challenge. We outperform the state of the art on both single- and multi-view 3D face reconstruction.

Furthermore, the authors showed that 3D face reconstruction is more accurate when fitting with more landmarks.

Fig.6: Ablation studies on the NoW validation set confirm that denser is better: model fitting with more landmarks leads to more accurate results. In addition, we see that fitting without using σ leads to worse results.

Conclusion:

A large, dense set of facial landmarks has been shown to be a powerful signal for 3D face reconstruction. Constructing large datasets of face images with such a high number of perfectly consistent landmarks can only be accomplished using synthetic data, since manual annotation at this density is impossible.