Synthetic Data and Facial Landmark Detection

Can synthetic data lead to better performance on the computer vision task of facial landmark detection?

In our first two blog posts (here and here), we described our experiment combining synthetic data (created with the Datagen platform) with real-world data to train a facial landmark detection model.

In our first blog post, we discussed domain gaps and laid the groundwork for our experiment. Next, we introduced the hypothesis that “a landmark detection model trained on a combined real and synthetic data set will outperform one trained solely on real data”. To test it, we adopted a data-centric approach: using a fixed landmark detection model architecture, we experimented with five strategies for training the model, two of which (mixing and fine-tuning) combine real and synthetic datasets. The datasets we used are listed in Table 1 below.


Type | Name | Number of images | Purpose
Synthetic dataset | DGFaces (Figure 1) | 10k labeled images | Training set
Synthetic dataset | Microsoft Dataset | 100k labeled images (only 10k are used in this experiment) | Training set
Real dataset | 300-W Train (Figure 2) | 6296 labeled images | Training and validation set
Real dataset | 300-W Common (Figure 3) | 1108 labeled images | Test set only

Table 1: Datasets used


Figure 1

Figure 2: Sampled images from 300-W Train that are used in the validation set

Figure 3: Sampled images from 300-W Common that are used in the test set.


Read the full white paper: Facial Landmark Detection Using Synthetic Data


To compare the strategies fairly, we first had to ensure that the label adaptation model performs reasonably well. We trained a model with synthetic data only and then used model label adaptation to overcome the label gap. The performance of models using model label adaptation is shown in Table 2.

Train dataset | Amount of data points used | Pipeline implemented by | Error [NME] ↓
DGFaces | 10k | Datagen | 3.93
Microsoft | 10k | Datagen | 3.63
Microsoft | 100k | MS[1] | 3.09
300-W Train | 6296 | Datagen | 3.32
300-W Train | 6296 | MS[1] | 3.37

Table 2: The performance of models using model label adaptation

Comparing the 5 Strategies 

Here, we compare the performance of the five strategies for different amounts of real data, N_real (Figure 4).

Figure 4

We can extract three insights from Figure 4.

The model “fine-tuned” with 50 real images has approximately the same performance as the “real-data-only” model with 250 real images. This shows that the use of synthetic data reduces the amount of real data needed. (Figure 4A)

Figure 4A

The models trained on the combined dataset outperform those trained on real data alone. This demonstrates the value of combining real-world and synthetic data in landmark detection model training. (Figure 4B)

Figure 4B

The performance gap between the “real-world-only” model and the “fine-tuning” model is the largest when the amount of real data is small. This indicates that the value of synthetic data is potentially the largest when real data is limited. (Figure 4C)

Figure 4C


Synthetic data has the potential to drastically improve performance on computer vision tasks. Yet visual and label domain gaps stand in the way of that potential. Fortunately, these gaps can be overcome with label adaptation and data augmentation.

Having overcome these domain gaps, we found that training on combined synthetic and real-world data outperforms training on real data alone. We also found that fine-tuning was the best-performing strategy for training a landmark detection model with real and synthetic data.

In the future, we are excited to replicate this experiment on other computer vision tasks. We also have plans to explore more methods of improving the landmark detection benchmark. Stay tuned!


Task2Sim: Towards Effective Pre-Training and Transfer from Synthetic Data


Pre-training neural network (NN) models on large datasets such as ImageNet has become standard practice in computer vision in recent years. Pre-training is especially effective when only a small amount of data is available for the target task. In that case, training highly expressive models such as large-scale neural networks from scratch may lead to overfitting, and pre-training is one way to reduce it.

Synthetic data, generated by a graphics simulator, can be leveraged to enrich the datasets used for model pre-training. Naturally, the downstream task performance (e.g. accuracy) of a model pre-trained with synthetic data depends on the characteristics of that data. For example, synthetic data generated with different simulator parameters (e.g. lighting, object pose, backgrounds) usually leads to different downstream task performance. It is therefore worthwhile to customize the pre-training data to the downstream task.

The question, then, is whether the simulator parameters can be optimized to maximize the performance of the pre-trained model on a specific downstream task. The paper addresses this problem by proposing a technique called Task2Sim, which is designed to determine the best simulator parameters for a given downstream task. Task2Sim is trained on a set of downstream tasks (called “seen” tasks in the paper) and was shown to predict the best simulation parameters for “unseen” tasks without additional training.

Task2Sim Architecture:

Finding the optimal simulator parameter configuration is challenging because the number of possible configurations grows exponentially with the number of simulator parameters, which makes a brute-force search impractical. Instead, Task2Sim trains a neural network that maps a downstream task representation to an optimal configuration of simulator parameters.
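To make the scale of the search space concrete, here is a small arithmetic sketch. The parameter counts and the number of quantized levels are hypothetical values chosen for illustration, not figures from the paper.

```python
from math import prod

def count_configurations(levels_per_parameter):
    """Number of distinct simulator configurations when each parameter
    takes one of a fixed number of quantized levels."""
    return prod(levels_per_parameter)

# Hypothetical example: 8 simulator parameters with 4 levels each.
print(count_configurations([4] * 8))   # 4**8 = 65536
# Adding two more 4-level parameters multiplies the count by 16.
print(count_configurations([4] * 10))  # 4**10 = 1048576
```

Since evaluating a single configuration means generating a dataset and pre-training a model on it, even a few tens of thousands of configurations would be out of reach for an exhaustive search.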

To construct such a mapping, we need a downstream task representation, also known as a task embedding vector. This representation should reflect the semantic similarity between tasks. For example, the representation of a cat breed classification task should be much closer to that of a dog breed classification task than to that of a truck classification task. Fortunately, a technique for constructing task representations was developed in the paper TASK2VEC: Task Embedding for Meta-Learning. This technique, called Task2Vec, builds a fixed-dimensional embedding for a variety of visual tasks, where each task is represented by its dataset with ground truth labels. A major advantage of the Task2Vec embedding is that it does not rely on any knowledge of class label semantics and does not depend on the number of classes. Task2Vec can therefore be leveraged to represent a wide spectrum of visual tasks.

Our objective is now to train a model that, given the Task2Vec representation of a downstream task, predicts the simulator configuration that maximizes accuracy on that task. Simply put, the vector representation of a downstream task is fed into an NN, called Task2Sim, which maps this vector to synthetic data generator parameters (e.g. blur amount, lighting direction, backgrounds, etc.). The data simulator (the authors used the ThreeDWorld platform) then generates a dataset of synthetic images based on these parameters. An NN-based classifier is pre-trained on these synthetic images, and its performance on the downstream task is used to update Task2Sim’s parameters.

The Task2Sim NN outputs a distribution over the quantized values (e.g. several blur amount levels or lighting directions) of each simulation parameter. Denoting the number of simulator parameters by M, the Task2Sim output consists of M distributions, one per simulator parameter.
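The following minimal sketch shows how a simulator configuration could be drawn from such an output. It assumes the network emits one vector of unnormalized scores (logits) per parameter and that the parameters are sampled independently; the logits, the level counts, and the function names are illustrative and are not taken from the paper's code.

```python
import numpy as np

def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_configuration(logits_per_parameter, rng):
    """Draw one quantized level index for each simulator parameter."""
    return [int(rng.choice(len(l), p=softmax(np.asarray(l, dtype=float))))
            for l in logits_per_parameter]

def mode_configuration(logits_per_parameter):
    """Pick the most probable level (argmax) for each simulator parameter."""
    return [int(np.argmax(l)) for l in logits_per_parameter]

# Hypothetical output for M = 3 parameters: blur (3 levels),
# lighting direction (4 levels), background (2 levels).
logits = [np.array([0.1, 2.0, -1.0]),
          np.array([0.0, 0.0, 1.5, 0.0]),
          np.array([-0.5, 0.5])]
rng = np.random.default_rng(0)
print(sample_configuration(logits, rng))  # one random level index per parameter
print(mode_configuration(logits))         # [1, 2, 1]
```

Sampling lets the training procedure explore different configurations, while the argmax configuration is the natural choice once training is finished.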

Task2Sim Training

Since the key objective of Task2Sim is to maximize downstream task accuracy after pre-training, this accuracy serves as the measure of training success. The paper treats downstream task accuracy as a non-differentiable function of the simulator parameters, which allows any graphics simulator to be used as a black box. As a result, the gradient cannot be obtained by backpropagating through the simulator. Instead, the paper leverages the REINFORCE algorithm to estimate the gradient of downstream task performance with respect to the Task2Sim parameters.

REINFORCE is a classic reinforcement learning (RL) algorithm belonging to the family of policy-gradient methods. The goal of an RL algorithm is to find the policy with the maximum reward, where a policy is a strategy for picking a particular action at a given time. Policy-gradient methods are iterative techniques for modeling and optimizing the policy directly. But what are the reward, policy, and action in our case? (See footnote 1.)

Naturally, the reward is the downstream task accuracy after pre-training on a dataset generated with a given configuration of simulator parameters. The simulator parameter configuration is the “action” that leads to the reward. An action is a vector of simulator parameter values, $a = (a_1, \dots, a_M)$, where $a_i$ is the value of the i-th simulator parameter. This configuration is fed to the simulator to build a synthetic dataset. The dataset is used to pre-train a model, and the downstream task performance (the reward) is then estimated with the pre-trained model.

The policy is the set of M distributions over simulator parameters generated by the Task2Sim NN. The i-th configuration parameter $a_i$ is sampled from the i-th distribution, written here as $\pi_i$.

In formal terms, an action $a = (a_1, \dots, a_M)$ is generated by sampling from these M distributions. The probability of an action is defined by

$$P(a) = \prod_{i=1}^{M} \pi_i(a_i),$$

where $\pi_i$ is the distribution of the i-th simulator parameter. The action distribution is computed by feeding the Task2Vec task embedding to the Task2Sim model. The following function is then maximized with respect to the Task2Sim NN parameters $\theta$:

$$J(\theta) = \mathbb{E}_{a \in \Omega}\left[R(a)\right] = \sum_{a \in \Omega} P(a)\, R(a),$$

where Ω denotes the space of all possible actions a and R(a) is the reward (downstream task accuracy) achieved when the training data is created with the action a (simulator parameters).

As already mentioned, the authors used the REINFORCE technique to update the Task2Sim NN weights $\theta$. REINFORCE updates these weights using the gradient of the expected reward with respect to $\theta$, which is estimated by the following expression:

$$\nabla_\theta J \approx \left(R(a) - R(a_c)\right) \nabla_\theta \log P(a),$$

where the expected reward is approximated using a single sample a drawn from the distribution over configurations P(a), and $a_c$ is the set of configuration parameters corresponding to the mode (argmax) values of the distributions $\pi_i$. (See footnote 2.)

NN weights θ are updated using mini-batches of downstream tasks sampled from a set of “seen” tasks. In addition, the paper updates these weights via “replaying” high-reward actions encountered in the past (self-imitation learning).
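The sketch below illustrates a REINFORCE-style update on a toy problem. It makes several simplifying assumptions that are not from the paper: the policy is a plain table of logits rather than a network conditioned on a task embedding, the reward is a made-up function that counts matches with a hidden target configuration, and the baseline is the reward of the mode (argmax) configuration. Replay of high-reward actions and exploration noise are omitted.

```python
import numpy as np

def softmax_rows(theta):
    """Row-wise softmax: one distribution per simulator parameter."""
    shifted = theta - theta.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def reinforce_gradient(theta, reward_fn, rng):
    """Single-sample REINFORCE gradient estimate with a mode baseline.

    theta: (M, K) logits, M parameters with K quantized levels each.
    reward_fn: black-box function mapping a configuration to a scalar.
    """
    probs = softmax_rows(theta)
    num_params, num_levels = theta.shape
    # Sample one level per parameter (the action).
    sampled = np.array([rng.choice(num_levels, p=probs[i])
                        for i in range(num_params)])
    # Baseline: reward of the most probable configuration.
    mode = probs.argmax(axis=1)
    advantage = reward_fn(sampled) - reward_fn(mode)
    # Gradient of log P(a) w.r.t. the logits: one_hot(a_i) - probs_i.
    grad_log_p = -probs
    grad_log_p[np.arange(num_params), sampled] += 1.0
    return advantage * grad_log_p

# Toy reward (hypothetical): fraction of parameters matching a hidden target.
target = np.array([2, 0, 1, 3])
reward_fn = lambda a: float(np.mean(a == target))

rng = np.random.default_rng(0)
theta = np.zeros((4, 4))
for _ in range(500):
    theta = theta + 0.5 * reinforce_gradient(theta, reward_fn, rng)  # gradient ascent
print(softmax_rows(theta).argmax(axis=1))  # most probable configuration after training
```

Note that the reward function is only ever called, never differentiated, which is what allows the simulator and the pre-training run to be treated as a black box.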

Performance Evaluation and Results:

Task2Sim performance on a downstream task is estimated by applying a 5-nearest-neighbors classifier to features produced by a backbone NN that was pre-trained on a dataset generated with the simulator parameters output by Task2Sim. In other words, the label of an example x is determined by the labels of the 5 labeled examples closest to x in the feature space. This evaluation is faster than commonly used techniques for evaluating transfer learning methods, such as linear probing or full-network fine-tuning.
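As a rough illustration of this evaluation step, the sketch below implements a k-nearest-neighbors classifier with majority voting over precomputed feature vectors. The Euclidean distance, the tie-breaking rule, and the tiny 2-D "features" are assumptions made for the example; the paper's exact settings may differ.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=5):
    """Label each query by majority vote among its k nearest training features.

    Labels must be non-negative integers. Ties go to the smallest label.
    """
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = query_feats[:, None, :] - train_feats[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    nearest = np.argsort(dist2, axis=1)[:, :k]
    votes = train_labels[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy features standing in for backbone outputs: two well-separated classes.
train_feats = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                        [0.5, 0.5], [0.2, 0.8],
                        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
                        [10.5, 10.5], [10.2, 10.8]])
train_labels = np.array([0] * 6 + [1] * 6)
queries = np.array([[0.1, 0.1], [9.9, 10.2]])
print(knn_predict(train_feats, train_labels, queries))  # [0 1]
```

No parameters are trained in this step, which is why it is cheaper than fitting a linear probe or fine-tuning the whole network.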

The tables below compare Task2Sim with several other pre-training methods.

Task2Sim significantly outperforms two “basic” pre-training approaches, “Random” (simulator parameters chosen at random) and “Domain Randomization” (all simulator variations used in each image), on 12 downstream tasks. This conclusion holds both for seen tasks (those used to train Task2Sim) and for unseen tasks (those not used to train Task2Sim). In addition, Task2Sim performance is almost as good as that of ImageNet pre-training on seen tasks, and it is even better on unseen tasks. The latter result is impressive: it suggests that pre-training on synthetic data generated with Task2Sim can yield models that transfer better to new tasks than pre-training on real data.


  1.  The RL framework, proposed by the paper, does not involve any state and thus can be viewed as either a stateless RL or contextual bandits problem.
  2.  In fact, the authors propose sampling from a noised version of the simulator parameter (action) distribution to encourage exploration of the “action space”.

The Pace of Progress in Academia and Industry

In a recent Unboxing AI podcast episode, Gil Elbaz, Datagen co-founder and CTO, sat down to speak with Lihi Zelnik-Manor, an associate professor in the faculty of electrical engineering at the Technion, and the former general manager of Alibaba DAMO Israel Lab.

Professor Zelnik-Manor holds a PhD and an MSc with honors in computer science from the Weizmann Institute of Science and a BSc in mechanical engineering from the Technion. Her main area of expertise is computer vision. She has contributed extensively to the community, serving as the general chair of CVPR 2021, and will serve as the general chair of ECCV 2022 and as the program chair of ICCV 2025.

This transcript has been edited for length and clarity. Listen to the full episode here.

Interplay between academia and industry

Zelnik-Manor: In industry, you have resources that you might not have in academia, but in academia, you have resources that you don’t have in industry. I think academia and industry have different roles. Industry picks up domains that are mature and have short term outcomes and short term impact. 

Academia has the option, but also, I think, the duty to think long-term. Deep learning grew in academia. Quantum grew in academia, and there are many other domains that grew there and then they shifted to industry. What I hope to see is that academia will find its focus on problems that need to be thought about or new things entirely.

Teaching advice

Zelnik-Manor: When I started my career as a professor, I went to ask for advice. How should I choose my students? People gave me different advice. Some said to choose the ones that are best in math, all kinds of stereotypical things.

The advice I really liked a lot was: choose someone who is excellent in something unrelated to work. So it can be excellence in music or excellence in sports or excellence in whatever. Choose those people, they will nourish you better. This is a good guideline because people have different preferences and different goals. 

How to create a team

Zelnik-Manor: If you go back to the 1990s, teams were organized by profession. For experts in graphics, there was a graphics team, and teams were generally organized by their expertise. If you wanted, for example, to create a product or a feature that required a mobile developer, a backend developer, an algorithm developer, and a product manager, you would take each of them from their individual teams. Each of these people would then have to communicate and align their plans. Squads, in contrast, are teams of people with diverse expertise. They’re like an independent unit that can deliver the feature or the product.

A squad could have one person from each of those teams. A squad can be formed for a certain task and you could disassemble them when the task is complete. If the task is continuous, they can remain as a squad and that makes it easier to manage and tie all the loose ends together in order to deploy in a timely manner. Their OKRs are aligned and they are working together for the same goal. 

Getting started in CV

Zelnik-Manor: To get started in computer vision, I would recommend going through the traditional path, studying, getting a strong bachelor’s degree and then a PhD. 

A PhD gives you a chance to become a world expert in something. Then you have this unique knowledge base to take with you for life. From my perspective, of having a career for many years, I see myself using different aspects of the things that I’ve learned. It used to be geometry, then it’s statistics and probability, and optimization and programming and different languages.

You can work now until you’re 80. So we’re talking about 50, 60 years of work. How do you prepare yourself for that? I would recommend building a very strong infrastructure in math, computer science, and physics. Build the infrastructure of the tools, and then you can always learn and evolve and continue to remain relevant.

It’s about the people

Zelnik-Manor: I think there’s a quote from Steve Jobs, I don’t remember the exact wording, but something like: it’s not about the technology, it’s about the people. Put your faith in the people. Communication is very challenging because people often think it has happened and maybe it hasn’t. At the end of the day, I don’t think the main barriers for the success of our products and technologies are the technologies. They are the people who do it.

If you recruit the right people, people who like the vision and mission and believe in it, and who are passionate about their profession, it doesn’t matter whether they are a mobile engineer or an algorithm researcher. If they’re passionate about what they’re doing, they join a company whose mission and vision they love, and they fit the culture, then you have everything in place.

Edge Cases in Autonomous Vehicle Production

“Because [the autonomous vehicle] is a product in the hands of customers, you are forced to go through the long tail. You cannot do just 95% and call it a day. The long tail brings all kinds of interesting challenges,” said Andrej Karpathy, the director of artificial intelligence and Autopilot Vision at Tesla, in his CVPR 2020 keynote.

Here, “long tail” refers to edge cases in autonomous vehicles (AVs). Edge cases are plausible scenarios that have a low probability of occurrence. Because they occur so rarely, they are easily overlooked and are often missing from datasets. While humans are naturally proficient at dealing with edge cases, the same cannot be said of AI, so edge cases have to be handled carefully.

Figure 1: A real image of a truck with a reflective surface (Source)

Figure 2: A real image of a chair flying across the road (Source)

The cost of ignoring edge cases is high. In 2016, a Google autonomous vehicle crashed into an oncoming bus when it detected sandbags surrounding a storm drain and had difficulty interpreting the situation correctly. In 2021, Toyota’s autonomous e-Palette collided with a visually impaired pedestrian. “It shows that autonomous vehicles are not yet realistic for normal roads,” Toyota CEO Akio Toyoda commented after the accident.

The pressure to deal with edge cases is mounting for car manufacturers. AVs which can better handle edge cases are safer and have a better prospect of becoming fully driverless quicker. 

Dealing with edge cases is not a trivial task. This blog post analyzes how synthetic data can handle edge cases in production. We will focus our attention on the development of autonomous vehicles, which remains a challenging feat due to the sheer number of possible edge cases.

Edge Cases Come in Multiple Forms

Kai Wang, the Director of Prediction at Zoox, shared that edge cases come in perceptual and behavioral forms. Perceptual edge cases involve rare sightings like trucks with protrusions. Behavioral edge cases are events that require more careful maneuvering of the vehicle, like making a tight three-point turn, navigating through a dense crowd, or dealing with jaywalking pedestrians. 

In the CVPR talk, Andrej Karpathy illustrated perceptual edge cases that Tesla observed when training its vehicles to stop. Some examples (Figure 3) include:

  • Conditional stop signs (e.g. “stop except right turn”)
  • Moving stop signs (e.g. on a school bus)
  • Temporary stop signs (e.g. held by a person)
  • Occluded stop signs (e.g. hidden by a vehicle)

Figure 3. A variety of stop signs that constitute edge cases (Source)

Challenges of Edge Cases in Production

Models in production are fallible and can perform poorly when faced with edge cases. That is why Tesla has a “Data Engine” that powers its development of autonomous vehicles (Figure 4).

The principle of the data engine is easy to understand. First, inaccuracies in existing models are detected. Tesla then adds such cases to its unit tests. It also collects more data of similar cases to retrain the model. Such an iterative approach allows it to catch as many edge cases as possible.

Collecting and labeling edge cases can be exorbitantly expensive and, in some cases, dangerous or impossible. Figures 5A and 5B show an example of an edge case that would be dangerous to replicate.

Figure 5A. A man emerging from a manhole remains undetected by the vehicle 

Figure 5B. A man emerging from a manhole is incorrectly classified. (Source: DRisk and BBC)

Training model in production on edge cases using synthetic data

This is where synthetic data can help. As early as 2018, Bewley et al. demonstrated that their autonomous vehicle, trained only on simulation data, could drive on public urban roads in the UK (Figure 6). Since then, more researchers have explored the use of synthetic data for training autonomous vehicles.

Figure 6. Comparison of simulated training and actual test environments in “Learning to Drive from Simulation without Real World Labels” (by Bewley et al.)

More recently, NVIDIA proposed a strategic approach named “imitation training” (Figure 7). In this approach, the failure cases of existing systems in the real world are replicated in a simulated environment. They are then used as training data for the autonomous vehicle. This cycle is repeated until the model’s performance converges.

Figure 7. The imitation training approach involves the “train, evaluate and simulate” cycle (Source)

Consider the following edge case involving a partially visible truck (Figure 8). The original model produced an incorrect bounding box (left, yellow rectangle). After training on synthetic imitation data (Figure 9), the improved model could produce the correct bounding box on this edge case (Figure 8, right).

Figure 8: The baseline model prediction of the truck (left, yellow) compared against the improved model trained with synthetic data (right, blue). The ground truth is in gray.

Figure 9: Examples of synthetically generated images imitating the false truck detection in Figure 8

In general, the models trained using the synthetic imitation data strategy are found to have superior performance as compared to those trained on real data only. 

Testing model in production on edge cases using synthetic data

Apart from training models in production, one can also create synthetic data to test existing models on edge cases. 

As AI matures, more data teams find value in developing test cases for their AI models. Much like software engineering has test-driven development (TDD) as part of its best practices, AI practitioners are likely to develop unit tests for their AI models, argued Yaron Singer from Robust Intelligence. To extend that principle, Yaron also suggested the use of fuzzing for AI, where random inputs are fed to the model in hopes of uncovering its vulnerabilities. Having a robust testing protocol reduces the possibility of AI accidents or AI systems breaking unexpectedly.
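To illustrate what fuzzing a model might look like in practice, here is a minimal sketch. The `toy_model` function is a hypothetical stand-in for a trained landmark detector, and the invariants checked (output shape, finite values, coordinates within the normalized image range) are assumptions chosen for the example rather than a prescribed test protocol.

```python
import numpy as np

def toy_model(image):
    """Hypothetical stand-in for a trained model: returns 5 (x, y) points
    in normalized image coordinates [0, 1]."""
    img = np.asarray(image, dtype=np.float64) / 255.0
    xs = np.linspace(0.2, 0.8, 5)
    ys = np.full(5, np.clip(0.5 + 0.3 * (img.mean() - 0.5), 0.0, 1.0))
    return np.stack([xs, ys], axis=1)

def fuzz_model(model, n_trials=200, shape=(32, 32), seed=0):
    """Feed random images to the model and collect the trials that violate
    basic output invariants."""
    rng = np.random.default_rng(seed)
    failures = []
    for trial in range(n_trials):
        image = rng.integers(0, 256, size=shape, dtype=np.uint8)
        out = np.asarray(model(image))
        ok = (out.shape == (5, 2)
              and bool(np.all(np.isfinite(out)))
              and bool(np.all((out >= 0.0) & (out <= 1.0))))
        if not ok:
            failures.append(trial)
    return failures

print(fuzz_model(toy_model))  # an empty list means no invariant was violated
```

Failing trials can then be saved as regression tests, so that a fixed bug stays fixed in later model versions.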

Testing model performance

To thoroughly evaluate the robustness of an AI system, the unit tests must include both the general and edge cases. Yet, certain edge cases might not be available from existing real-world datasets. To that end, AI practitioners can use synthetic test data. 

One example is the ParallelEye-CS, a synthetic dataset for testing the visual intelligence of autonomous vehicles. The benefit of creating synthetic data over using real-world data is the high degree of control over the scene of each image. Figures 10 and 11 below illustrate how the authors tweaked the scenes and environments to include different edge cases in the data set.

Figure 10: The same scene but in different environments.  (top left: rainy; top right: dark;  bottom left: sunny; bottom right: flood)

Figure 11. Scenes with varying levels of occlusions. (top left: with railing; top right: with divider; bottom left: with a hidden bus; bottom right: with all occlusions)

Testing model for fairness in edge cases

Apart from optimizing model performance, AI practitioners must also ensure that the AI models in production are fair, even in edge cases. Such a sentiment is echoed by Martin Fowler, who called for AI practitioners to validate their models in production for potential biases.

For instance, the model should not perform poorly on minority classes (like the facial recognition software which misidentifies black faces). Neither should a model discriminate against minority groups (like how Amazon’s AI recruiting tool penalized female candidates).

Companies can no longer afford to turn a blind eye to AI fairness as regulations like GDPR start imposing fairness requirements on decisions made by machine learning models. Yet, there are limited toolkits and publicly available datasets for AI practitioners to evaluate their AI on fairness. 

Synthetic data can help. Datagen recently published findings on the use of synthetic images to uncover population biases in facial landmark detection (Figure 12). In particular, we discovered that the performance of a facial landmark detection model differs significantly between two populations of a synthetic dataset (e.g. men vs. women). This is a telltale sign that the model in production is biased against a certain group.

We can extend this concept to AVs. For example, we can test the model on pedestrians of different races and ages to uncover its population biases. 

Synthetic data for edge cases

Synthetic data is here to stay as a viable solution to edge cases in production AV models. It supplements real-world datasets with edge cases, ensuring that AVs remain robust even under freak events. It is also more scalable, less prone to errors, and cheaper than real-world data.

With the rise of the data-centric AI movement, AI practitioners are increasingly focused on improving the quality of their data. At Datagen, we believe that the use of synthetic data in solving edge cases will not be restricted to autonomous vehicles. It can be extended to facial recognition, medical imaging, advanced manufacturing, and many other industries.

Using Synthetic Images To Uncover Biases

In January 2020, Robert Williams was arrested for shoplifting after investigators examined security footage. They had followed the lead of a facial recognition system that flagged Williams’ photo as a match against the grainy footage (Figure 1). The problem? Williams was nowhere near the crime scene when the incident happened.

Figure 1. A photo of the alleged shoplifting suspect from the security footage and the driver’s license photo of Robert Williams  

Williams’ wrongful arrest raised awareness of the fallibility of facial recognition, particularly for underrepresented minorities. The source of the problem is the biased and unbalanced datasets used to train many of today’s facial recognition systems.

Image datasets are problematic 

Unfortunately, even the most established datasets today propagate existing social biases. Pundits initially lauded the now-defunct Tiny Images Dataset as a comprehensive library of items, only to be disappointed later by revelations that it is rife with racist, misogynistic, and demeaning labels. Machines trained on such biased datasets perpetuate the biases in their outputs against vulnerable and marginalized individuals and communities.

Even if datasets do not contain harmful labels, they are likely to under-represent minority groups. Unsurprisingly, models trained on unbalanced datasets perform worse on the minority class, an effect dubbed bias amplification. In particular, Buolamwini et al. found that commercial gender classification systems performed much worse on dark-skinned females than on light-skinned males. Another study found that face recognition and age estimation models were biased towards faces under 18 and over 60 years old.

We cannot overstate the harmful effects of such underrepresentation. A passport robot that tells Asian applicants to open their eyes is a PR nightmare (Figure 3). An autonomous vehicle that cannot recognize dark-skinned individuals can cause a life-threatening tragedy.

Figure 3. A screenshot of an erroneous passport photo rejection notice (Source)

Unfortunately, many real-world applications inherit biases encoded in training images. Computer vision practitioners ought to catch such biases before they cause harm. Yet the question remains: how do we detect biased computer vision systems?

Detecting biased computer vision systems

One existing solution is to use balanced, richly annotated real-world datasets to detect the difference in performances across different groups. For example, KANFace features diverse ages, genders, and skin colors capable of uncovering biases for various face analysis tasks (Figure 4). 

Figure 4. KANFace is annotated for identity, exact age, gender, and kinship (Source)

Another solution is to use synthetic data for detecting biases against minority classes. This idea is not new. Tesla demonstrated its simulated environments (Figure 5) used to test their autonomous vehicles in edge cases.

Figure 5. An example of a rare scene simulated by Tesla (Source)

Detecting biases in facial landmark detection

The same idea can be applied to uncover biases in models trained for facial analysis tasks, as we demonstrated in our paper Using Synthetic Images to Uncover Population Biases in Facial Landmarks Detection.

Our hypothesis was that synthetic images can uncover population biases in a landmark detection model. In our experiments, we found that the performance of a facial landmark detection model differs significantly between two populations of a synthetic dataset (e.g. men vs. women). This leads us to conclude that synthetic images can be used to uncover population biases in landmark detection tasks.

Similar results are found when evaluating the same model on real datasets. This signifies that both synthetic images and real images are capable of testing the bias of a landmark detection model.

Figure 6: Steps for the experiment

Here are the steps we followed to validate this hypothesis (Figure 6):

  1. Use the DLIB package to detect facial landmarks on two datasets, CelebA (Figure 7) and FFHQ-Aging (Figure 8). These datasets are used because they come with facial landmark annotations and labels of appearance attributes (like gender, age, and skin color).
  2. Compute the normalized mean error (NME) between the model output and the ground truth. The NME reflects the deviation of the detected landmark positions from the ground truth.
  3. Stratify the errors by attribute to identify any biases in the model. For example, the model is biased against women if the NME for men is lower than the NME for women on average.
  4. Repeat steps 1 to 3 using a synthetic dataset created by Datagen (Figure 9).
  5. Compare the biases identified using the synthetic dataset against those identified using the real datasets.
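Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the code used in the paper: the normalizing distance (often the inter-ocular distance or the face bounding-box size) is passed in as an assumption, and the error values and group labels in the example are made up.

```python
import numpy as np

def nme(pred, gt, norm_dist):
    """Normalized mean error for one face.

    pred, gt: arrays of shape (n_landmarks, 2) with predicted and
    ground-truth (x, y) positions. norm_dist: normalizing distance,
    assumed here to be supplied by the caller.
    """
    per_point = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=1)
    return float(per_point.mean() / norm_dist)

def nme_by_group(errors, groups):
    """Average the per-face NME within each attribute value."""
    errors = np.asarray(errors, dtype=float)
    result = {}
    for g in sorted(set(groups)):
        mask = np.array([x == g for x in groups])
        result[g] = float(errors[mask].mean())
    return result

# One face with two landmarks, each off by 0.1 in y, normalized by 1.0.
print(nme([[0.0, 0.1], [1.0, 0.1]], [[0.0, 0.0], [1.0, 0.0]], norm_dist=1.0))

# Made-up per-face errors stratified by a made-up attribute.
errors = [0.03, 0.05, 0.04, 0.06]
groups = ["men", "women", "men", "women"]
print(nme_by_group(errors, groups))
```

A consistent gap between the group averages, observed on both the real and the synthetic datasets, is the kind of signal that steps 3 to 5 look for.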

Figure 7. Sample images from CelebA dataset

Figure 8. Annotated examples from FFHQ Dataset 

Figure 9. Example of face images from the Datagen synthetic dataset

Table 1 shows the results of the experiment. 

Table 1

1 FFHQ-Aging does not have the “beard” attribute annotation 

In summary, the model’s bias is apparent when tested on both real and synthetic datasets. For example, when tested on the CelebA dataset, the model’s mean NME on young faces is 0.40% lower than that on old faces. When tested on FFHQ-Aging and Datagen’s synthetic dataset, the model’s NME on young faces is similarly lower than that on older faces. Simply put, the model is shown to be biased against older individuals when tested on both real and synthetic datasets.

We observe the same results when testing the model on other attributes. The model favors men (vs. women), those without a beard (vs. those with a beard), and white-skinned (vs. black-skinned) individuals.

Using synthetic images to uncover population biases

This set of results gave us confidence that the synthetic face images generated can uncover weaknesses in a facial landmark detection model. 

This is good news for computer vision practitioners who want to detect biases in their models. It is often challenging to set aside sufficient images in the test set to detect biases against certain groups. This is no longer a problem if developers have access to synthetic data.

Synthetic images for fair AI

Computer vision models cannot be expected to make fair decisions when they are trained on datasets rife with existing social biases against underrepresented groups. AI practitioners bear the onus of detecting biased models and creating fair ones. Synthetic images will become a cornerstone in that process.

That is why we plan to test for biases against additional attributes in other trained models. Our future research aims to rectify such biases by adding synthetic data in the training process.

If you are interested in using synthetic data to detect potential biases in your systems, talk to us today. 


Real and Synthetic Data for Facial Landmark Detection

In part 1 of this series, we discussed domain gaps and laid the groundwork to proceed with our experiment.

This experiment hypothesizes that “a landmark detection model trained on a combined real and synthetic data set will outperform one trained solely on real data”.

To test that, we adopted a truly data-centric approach. Using a fixed landmark detection model architecture, we experimented with five strategies for training the model, two of which (mixing and fine-tuning) involve combining real and synthetic datasets, as described below.

Strategy 1: Training the model on real data only

A face-to-landmark model is trained only on real data.

Strategy 2: Model with label adaptation

In this strategy, the face-to-landmark model is trained on synthetic data. The labels are then adapted after training.

Strategy 3: Model with data adaptation

For the synthetic data, a label adaptation model (trained on real data) converts synthetic data labels to adapted labels. A face-to-landmark model is trained using only synthetic data, and the trained face-to-landmark model predicts the labels of real data. This strategy trains a face-to-landmark model on label-adapted synthetic data.

Strategy 4: Mixing (Training on combined real-world and label-adapted synthetic data)

In the first setup, a combined dataset is used to train the model.

We first select N_synthetic images from the Datagen synthetic dataset and pass them through the trained label adaptation model.

The resulting label-adapted synthetic images are combined with N_real images from the real dataset in the set ratio of α, where α is the batch synthetic sampling ratio.

This combined dataset is then used to train a face-to-landmark model. The predicted labels are then compared against the ground truth labels.
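As a rough illustration of the batch synthetic sampling ratio α, the sketch below draws each training batch so that a fraction α of its samples is synthetic. The function name `sample_mixed_batch`, the batch size, and the placeholder datasets are assumptions made for this example; the source does not describe the actual sampling code.

```python
import random


def sample_mixed_batch(real, synthetic, batch_size, alpha, rng):
    """Draw a batch in which a fraction `alpha` of the samples is synthetic."""
    n_synth = round(alpha * batch_size)
    n_real = batch_size - n_synth
    batch = rng.sample(synthetic, n_synth) + rng.sample(real, n_real)
    rng.shuffle(batch)  # avoid a fixed synthetic-then-real ordering
    return batch


rng = random.Random(0)

# Placeholder datasets: each item is a (source, index) pair standing in for an image.
real = [("real", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(100)]

batch = sample_mixed_batch(real, synthetic, batch_size=32, alpha=0.25, rng=rng)
n_synth_in_batch = sum(1 for source, _ in batch if source == "synthetic")
```

With α = 0.25 and a batch of 32, eight samples in every batch come from the label-adapted synthetic set and the remaining 24 from the real set.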

Strategy 5: Fine-tuning (Training on synthetic, fine-tuning on real data)

In this strategy, the model is first trained purely on the DGFaces synthetic dataset and then fine-tuned on real data. 

Experimental Design


The performance of each model is judged based on the normalized mean error (NME). To calculate the NME, one finds the (normalized) distance between the ground-truth and predicted labels for each facial landmark. These distances are then averaged to give the NME.
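A minimal sketch of the metric for a single face follows, assuming that the per-landmark Euclidean distances are averaged (the "mean" in the name) and then divided by a normalizing distance. The choice of normalizer is an assumption here (the inter-ocular distance is a common choice for 300-W); the text does not specify it.

```python
import numpy as np


def nme(pred, gt, norm):
    """Normalized mean error for one face.

    pred, gt: arrays of shape (num_landmarks, 2) in pixel coordinates.
    norm: normalizing distance in pixels (assumed: inter-ocular distance).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    per_landmark = np.linalg.norm(pred - gt, axis=1)  # Euclidean distance per landmark
    return float(per_landmark.mean() / norm)


# Toy example with three landmarks and per-landmark errors of 5, 0 and 5 pixels.
gt = np.array([[0.0, 0.0], [100.0, 0.0], [50.0, 60.0]])
pred = gt + np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 5.0]])
score = nme(pred, gt, norm=100.0)
```

The dataset-level figure is then the average of this value over all test images, usually reported as a percentage.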

Validation and test sets

The same real validation and test sets are used to compare each model. A 300-image subset of the 300-W Train dataset is used as the validation set, while the 300-W Common dataset is the test set.

In our next blog, we will talk about the results of the experiment.

VOS: Learning What You Don’t Know


Deep learning models’ safe deployment in real-life scenarios requires accurate detection of out-of-distribution (OOD) data. Deep neural networks (DNNs) are usually trained under the assumption that training and real-world data distributions coincide. Real-world tasks, however, fail to uphold this assumption, leading to erroneous, high-confidence predictions for OOD data. Simply put, the absence of OOD examples during training leaves the deployed model’s performance unpredictable on OOD inputs. Therefore, methods that can reliably identify OOD inputs are vital for safe deployment of DNN models in high-stakes applications such as healthcare and autonomous driving. This phenomenon is illustrated in Fig. 1, where a moose is mistakenly identified as a pedestrian with high confidence.

The paper develops an OOD detection mechanism that is trained during neural network optimization by constructing synthetic OOD samples. The authors propose to leverage these synthetic OOD samples to train the model to recognize OOD inputs while simultaneously optimizing it on the in-distribution (ID) task (classification or object detection).

Leveraging the developed OOD detection mechanism, we can specify and characterize edge cases and under-represented regions of the data. With this knowledge, we may be able to improve our coverage of these regions. For example, synthetic data can be generated to cover these problematic areas.

Main idea of the solution:

Previous approaches leverage generative models such as generative adversarial networks (GANs) to generate OOD samples in the image space. Such models are often difficult to train, and the selection of an optimal training set can be challenging as well. To cope with these flaws, the authors propose to generate OOD samples in a more tractable low-dimensional feature space rather than the original high-dimensional image space. Synthetic OOD samples are generated from low-likelihood regions in the feature space for every image category. Simply put, OOD samples are generated in the feature space areas located far away from the in-distribution samples for each class.

Then the network is trained to distinguish between the generated OOD samples and in-distribution samples and is simultaneously optimized to perform the ID task (classification or object detection). Note that the OOD detection task is framed as a binary classification problem performed in the feature space. Thus, in addition to a bounding box and a distribution of class scores, the neural net produces a so-called uncertainty score for every detected object of an input image. This uncertainty score attains small values for ID samples and larger values for OOD samples, allowing for effective detection of OOD samples.

Learn more about the power of synthetic data.

Technical Explanation:

Let’s discuss the synthetic outlier generation procedure. As already mentioned, these outliers are constructed in the low-dimensional feature (representation) space. To obtain an image representation, an image is first fed into a backbone neural network. Then the result is processed by a bounding box proposal generator to obtain feature (representation) vectors for every detected object.

Next, in order to generate outliers in the feature space, the training set feature distribution should be estimated. The feature distribution is assumed to be Gaussian (discussed in the next section) and is estimated separately for each class (category). Finally, synthetic outliers (OOD) are generated for each class in low-probability regions of the estimated class feature distribution. In simple terms, feature space vectors located far away from the average feature value for the samples from a given category are sampled randomly to form synthetic outliers.
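The procedure above can be sketched for a single class as follows. This is a simplified illustration, not the authors’ implementation: the function name `synthesize_outliers`, the toy two-dimensional features, and the "draw many candidates and keep the least likely ones" selection rule are assumptions made for this example.

```python
import numpy as np


def synthesize_outliers(features, n_candidates, n_outliers, rng):
    """Sample virtual outliers from the low-likelihood region of a class Gaussian."""
    mean = features.mean(axis=0)
    # Small diagonal term keeps the covariance matrix invertible.
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    candidates = rng.multivariate_normal(mean, cov, size=n_candidates)
    # Squared Mahalanobis distance: a larger distance means a lower Gaussian likelihood.
    diff = candidates - mean
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    keep = np.argsort(maha)[-n_outliers:]  # indices of the least likely candidates
    return candidates[keep], maha[keep], maha


rng = np.random.default_rng(0)

# Toy stand-in for the feature vectors of one ID class.
class_features = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(500, 2))

outliers, outlier_dist, all_dist = synthesize_outliers(
    class_features, n_candidates=1000, n_outliers=10, rng=rng
)
```

Repeating this for every class yields a pool of feature-space outliers that sit on the fringes of the ID clusters.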

Assuming the clusters formed by the features of samples from all in-distribution categories are sufficiently separated, these generated outliers “correspond” to samples that do not belong to any ID category in the original pixel space. The DNN is then trained to distinguish between the outliers and ID samples by giving high uncertainty scores to the outliers and low scores to the ID samples. A sample’s uncertainty score can be viewed as the reciprocal of the sum of the probabilities of this sample belonging to each ID category. If this sum is low, the sample is likely an outlier, whereas ID samples are characterized by higher values of this sum.
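One way to make such a score concrete, assumed here for illustration rather than taken from the text, is an energy-style score: the negative logarithm of the summed exponentiated class logits. When no ID class receives a large logit, the sum is small and the score is large. The logit values below are hypothetical.

```python
import numpy as np


def uncertainty_score(logits):
    """Energy-style score: negative log-sum-exp of the class logits.

    Larger values mean that no ID class explains the sample well.
    """
    logits = np.asarray(logits, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    m = logits.max(axis=-1, keepdims=True)
    log_sum_exp = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return -log_sum_exp


# Hypothetical logits: one class clearly fires for the ID sample, none for the outlier.
id_logits = np.array([8.0, 0.5, -1.0])
ood_logits = np.array([0.2, 0.1, -0.3])

id_score = uncertainty_score(id_logits)
ood_score = uncertainty_score(ood_logits)
```

Thresholding this score then separates the two groups: the hypothetical outlier above receives a higher score than the ID sample.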

Possible Shortcomings/Insights of VOS:

The main flaw of the proposed method is the Gaussian assumption on the distribution of feature vectors for every category. Using this method in cases where this assumption does not hold (e.g., multimodal distributions) may lead to the generation of “false” synthetic outliers. The use of “false” outliers to train the OOD detector can cause ID samples to be incorrectly identified as OOD, thereby degrading the performance of both the ID task and the OOD detector. A possible way to overcome this issue is to impose a Gaussian distribution on the category feature distributions. This can be achieved by adding to the loss function a distance between the feature distribution and a Gaussian distribution, such as the Kullback-Leibler divergence or the Earth Mover’s distance.

An important aspect of the proposed method is the dimension of the feature space. A dimension that is too low may make the feature space less representative, resulting in inadequate coverage by the generated outliers. On the other hand, if the feature space dimension is too high, a large number of synthetic outliers will be required to cover the entire OOD region. If we generate an insufficient number of such outliers, the boundary learned by the classifier won’t be good enough to distinguish ID samples from OOD samples.


The proposed method constitutes an efficient tool for detecting under-represented regions in the test data by learning rejection areas with respect to each category. Furthermore, the synthetic data generation process can be optimized by focusing on the problematic under-represented regions of the data distribution.

Final Note: Employing VOS approach for a trained network w/o retraining?

If you already have a trained network, synthetic outliers can be generated with the proposed approach. Then another model can be trained to differentiate between ID samples and these generated outliers.

We Just Raised $50M in Round B!

I’m excited and proud to announce today that Datagen has closed $50M in Series B financing led by our new investor Andy Vitus from Scale Venture Partners, with participation from our existing investors TLV Partners, Viola Ventures and Spider Capital. Additional investors taking part in the round include financial funds Vintage IP, Viola Growth and others. Thank you to Scale Venture Partners and to all our investors whose belief in us enables us to substantially accelerate our growth with our total funding reaching $70M to date, and to take the data market for Computer Vision AI by storm.

When we first started Datagen back in 2018, Gil, my partner and Datagen CTO, and I had a vision of teaching AI to see the world through AI-powered 3D simulations. We knew that performance increases with data in the deep learning era, and that manual data acquisition is the #1 bottleneck holding the industry back from expediting orders of magnitude faster. Thanks to the traction we presented, alongside our broad market understanding and maturity, we secured $18.5M in funding exactly 11 months ago. 

In these past 11 months, there were unbelievable leaps in Datagen’s traction, going hand-in-hand with the trend the industry has experienced. Our conversations with customers and domain experts have changed from ‘explain to me what synthetic data is’ to ‘how can I solve my tasks with synthetic data’. We no longer have to show the value of synthetic data to our customers; instead, they come to us to share their success stories.

And the progress this market has made ever since is simply enormous. 

According to Anthony Goldbloom and Gartner, the future of data is simulation. Goldbloom, the founder and CEO of Kaggle, says that “Synthetic data is an incredibly promising way to increase dataset size and diversity and allow us to build stronger models across all computer vision use cases.” Gartner predicts that by 2024, “60% of the data generated will be synthetic data” and “that it will completely overshadow real data in AI models” by 2030. 

Datagen is also leading the way for a new paradigm of AI development – Data-as-Code. Data-as-Code is to data what Infrastructure-as-Code is to infrastructure: It turns heavy operational processes into a seamless, easy-to-control programmable interface. Even more importantly, it’s an approach that data scientists and AI engineers are eagerly adopting.

With synthetic data, the training data is just an artifact of running code. The users gain full control over the content of their data and managing it becomes just as easy as managing code. Essentially, synthetic data means that AI models can be the product of pure code. Thus, Data-as-Code becomes an enabler for data-centric AI and a way to manage data programmatically in one unified streamline.

Synthetic data is not just another type of data; combined with additional streams of data, it will create better-performing ML models. Synthetic data is the new generation of data for AI. As Datagen’s product and technology mature, we see an increasing number of use cases addressed and solved exclusively using simulation and synthetically generated data. This is a strong signal that a new way of developing AI models is already here, changing the way we bring Computer Vision applications to production.

Datagen will continue to unlock the potential of visual AI and be a trusted advisor for our Fortune 500 customers to develop their future products in the worlds of AR/VR/Metaverse, in-cabin vehicle safety, security, robotics, IoT security and more. 

I always say to our team that Datagen is not about the code we develop or the product we build. Datagen is about people. And amongst the different types of DNAs companies build, at Datagen we’ve genuinely been proving that top-performing companies can be built in healthy ways, without compromising on culture, respect and personal growth mindset. People here want their colleagues to succeed as their top priority, which creates an extraordinary work environment and very special connections. When our people are happy, communicating and balanced – the organization overperforms as a whole.

Andy Vitus, Partner at Scale VP, who brings years of experience as an engineer, an entrepreneur and an investor, joins us on this journey and infuses substantial industry knowledge in creating products that developers love. 

Andy sums it up like this – “As we enter a new, data-centric age of machine learning, a streamlined, operationalized data pipeline is poised to be the most lucrative piece of the machine learning puzzle, this is why we are placing our bets on Datagen, who is creating a complete CV stack that will propel advancements in AI — fundamentally transforming the way computer vision applications are developed and tested. The potential impact of what Datagen has to offer, across a broad range of applications, is staggering.” 

We’re lucky to have investors on board who understand the future of AI and are thankful to have them on this journey with us.  

Onward and upward!

Read the press release.

Ofir Zuk (Chakon) is the co-founder and CEO of Datagen.

Applied AI in a Data-Centric World

How many times have we heard sentences such as “Data is the new oil” or “Data is food for AI”? Even though there is some truth to these catchphrases (after all, we data scientists/AI scientists do spend most of our time on the data), the vast majority of academic AI papers still revolve around new approaches that improve x% on famous dataset benchmarks such as ImageNet. Even in the industry, most data scientists often get their data as a given constraint for the problem they need to solve, and not as a variable they can optimize to better solve their task at hand.

The idea of prioritizing data over algorithms is not new. Actually, most leading AI companies (for example Tesla) are designing their AI workflows specifically to optimize data acquisition and tagging. However, this idea was brilliantly reintroduced and explained by Andrew Ng in March 2021, and this is where (as far as I know) the term Data-centric AI (DCAI) was coined for the first time. Since Ng’s presentation, more and more people are joining forces to make AI more data-focused. Actually, a great hub was recently created to gather the most important resources and breakthroughs on this topic.

In this blog post, I do not want to explain the concepts behind DCAI in detail, since you can easily learn them from the links above. So before diving into the core ideas of this blog post, here is below the main idea of DCAI, so that you can continue reading even if you are not familiar with the field:

Even though the concept of DCAI is quite easy to grasp, I do think the implementation of DCAI in practice is far from obvious, so I want to discuss in this blog post the actual steps involved to make a successful DCAI project, and explain how it will change our job as AI/data scientists (non-related note: please forgive my lack of consistency in job titles, there’s unfortunately no standard across the industry). Basically, I want to answer the following questions:

  • Why now? What makes data-centric AI more relevant today?

  • Aren’t data acquisition and model optimization two orthogonal tasks that can be performed in parallel by different people?

  • Why and how does it completely change our work as Applied Scientists?

  • What are the most important steps of a successful DCAI workflow?

  • Where is the DCAI industry going in the future?

In particular, here are the main points I want to address in this post:

  1. Iterating on the data is becoming much easier today, in particular thanks to the emergence of synthetic data (full disclosure: I work for a synthetic data provider called Datagen) and data labeling platforms.

  2. State-of-the-art models on the other hand are becoming a commodity.

  3. The DCAI methodology starts with engineering your TEST set (spoiler: I should have written test setS).

  4. In a DCAI world, the most important task AI scientists will have to perform is debugging their algorithm to find the optimal data their network needs (good news: it’s much more exciting than traditional debugging!).

  5. The future of DCAI will likely involve powerful AIs that would automatically debug neural networks, and derive from this debugging the kind of data that needs to be created.

Why is data-centric AI happening now?

You may have noticed that the idea of DCAI is overall quite simple, obvious and not even that new. This fact raises 2 important questions:

  • Why would the DCAI revolution happen now? After all, deep learning started to take off almost 10 years ago now, and it has always been clear that data was the core component there.

  • Aren’t data acquisition and model optimization two orthogonal tasks? In other words, maybe it makes sense that the AI scientists focus on the algorithm while someone else focuses on gathering more data?

The answer to the first question lies in the fact that up until now, data acquisition was a slow, complex, and expensive process, which involved a lot of operational challenges. In particular, there are 3 important operational (but not scientific!) challenges that need to be addressed when building a high-quality dataset:

  • Data diversity: you need to make sure your data is highly varied, has enough edge cases, but at the same time be careful to avoid biases in your dataset. For example, if all of your pictures come from a sunny country, your machine learning model may perform really badly on rainy days (side note: did you ever notice that most self-driving cars were all initially tested in sunny places?)

  • Labeling: this step is a difficult and approximate science: different conventions lead to different labels. Sometimes, having the exact labels is even an impossible task (for example it is impossible for humans to extract the exact 3D information from an image, since an image is by definition 2D).

  • Formatting: all data should be wrapped up in a format that contains all the information you need.

However, things have changed at many levels nowadays:

  • Creating state-of-the-art deep learning models has never been easier. Almost all the top AI papers now come out with open-source code, and many Python libraries such as PyTorch Lightning enable developers to train deep learning models with the latest architectures (for example, Transformers) and training tricks (e.g. one-cycle learning policy).

  • More and more companies are selling data (some datasets can even be downloaded and used for free). However, while this solution is faster, it can still be very expensive, and most importantly will rarely cover your edge cases.

  • The data labeling process has been considerably streamlined by products such as scale.com, Amazon SageMaker, or Dataloop. Things like finding labelers to work on your data, creating labeling redundancy (to improve labeling consistency), and managing data are now much easier.

  • The rise of synthetic data (computer-generated data) is a complete game-changer for the world of AI (in particular computer vision). This type of data enables companies to acquire realistic data with perfect labels and perfect control, at a fraction of the time (and cost) it takes to acquire real data. It has also been proven (in particular by Datagen and Microsoft) that synthetic data can significantly reduce the amount of real data you need to train a model. In practice, synthetic data gives a real superpower to AI scientists: being able to create the data they need.

Therefore, the same way the Internet made it so simple for entrepreneurs to create companies and iterate on their projects, the commoditization of data acquisition enables AI practitioners to quickly create a baseline for their model, and iterate efficiently on the data until it is ready to be deployed.

Let’s now address the second question mentioned at the beginning of the paragraph:

Is acquiring data part of an AI scientist’s job?

If you read the previous paragraph carefully, you probably already have the answer. A few years ago, acquiring data used to be an OPERATIONAL job, and had therefore no reason to be handled by an AI scientist (who would probably have done a very poor job anyway ^^).

However, the commoditization of data discussed above enables the AI scientist to fully engineer the data he or she uses to train a model. And as we will see, this engineering task is far from simple, and requires important research skills.

In other words, in most cases, I would say that YES, part of our job as AI scientists is to gather the data we need to train a model.

The 5 steps of a data-centric AI development

Ok, all this sounds very cool, but what does it mean in practice? If my job is not so much to train models anymore, what should I do then? How does this DCAI methodology get implemented in practice and what is our role in it as scientists?

(Rest assured, you will see that your scientific skills are critical to the success of a DCAI project!)

Step #1: Carefully engineer your test setS!!

This is something that few people talk about, but the implementation of data-centric AI should actually start with your TEST SET (or more precisely validation set as we’ll discuss later).

In principle, the test set (and the metrics achieved on this test set) is the basis for many very impactful business decisions. In particular, the results of your algorithm on your test set will likely be the key element in deciding whether or not it should be deployed to production. However, very often, data scientists create their test sets just by taking a random split of their training set. While this may be fine for a PoC, it will quickly show its limits if you want to build a robust algorithm that won’t fail every 2 days in production.

The first 2 things a data-centric AI scientist needs to build are:

  1. A generic test set, which represents as closely as possible the probability distribution of cases you expect to get in production.

  2. Several “unit test sets”, which are specific test sets designed to measure whether your algorithm is robust to specific cases (for example: can a car detect pedestrians in low-light conditions?). This approach was explained particularly well by Tesla’s Head of AI, Andrej Karpathy.
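To make the idea of a generic test set plus several unit test sets concrete, here is a small sketch. Everything in it is hypothetical: the sample format, the `lighting` metadata field, and the toy threshold "model" only illustrate how one score per named test set can be reported.

```python
def build_unit_test_sets(samples, slices):
    """Split labeled samples into named test sets using metadata predicates."""
    return {name: [s for s in samples if keep(s)] for name, keep in slices.items()}


def accuracy(model, samples):
    """Fraction of samples for which the model output matches the label."""
    correct = sum(1 for s in samples if model(s["input"]) == s["label"])
    return correct / len(samples)


# Hypothetical labeled samples with one metadata field.
samples = [
    {"input": 0.9, "label": 1, "lighting": "day"},
    {"input": 0.8, "label": 1, "lighting": "day"},
    {"input": 0.1, "label": 0, "lighting": "day"},
    {"input": 0.4, "label": 1, "lighting": "night"},
    {"input": 0.6, "label": 1, "lighting": "night"},
    {"input": 0.2, "label": 0, "lighting": "night"},
    {"input": 0.3, "label": 1, "lighting": "night"},
]

# The generic set keeps everything; each unit test set targets one specific case.
slices = {
    "generic": lambda s: True,
    "low_light": lambda s: s["lighting"] == "night",
}


def toy_model(x):
    """Stand-in detector: predicts the positive class above a fixed threshold."""
    return int(x > 0.5)


test_sets = build_unit_test_sets(samples, slices)
report = {name: accuracy(toy_model, subset) for name, subset in test_sets.items()}
```

In this toy data the generic score looks acceptable while the low-light unit test set exposes a weakness, which is exactly the kind of signal a single random split tends to hide.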

    Step #2: Get your end-to-end data-centric pipeline running

    Finish your test environment

    Although building the test set is generally the most laborious step to building your evaluation environment, you cannot achieve much if you just have data. Ideally, your evaluation process should also include:

    1. Proper metrics that describe the trade-offs you’ll need to make in production (for example, execution time vs accuracy). And yes, there ARE ALWAYS trade-offs.

    2. Debug visualizations that show things that your metrics cannot quantify, if possible in a way that allows you to understand the problem more specifically. For example, if you build an image denoising algorithm, you should have a visualization of the inputs/outputs/ground truths side-by-side so that you can compare them and detect artifacts such as blur.

    3. A one-push-button system (e.g. python script) to evaluate any given method according to all of the points described above.
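A one-push-button evaluation can be as small as a single function that takes any candidate method and returns every metric you care about. The sketch below reports accuracy and mean latency to expose the execution time vs accuracy trade-off; the `evaluate` helper and the toy methods are assumptions made for this example.

```python
import time


def evaluate(method, test_set):
    """Run `method` on every sample and report accuracy and mean latency in seconds."""
    correct = 0
    elapsed = 0.0
    for sample, label in test_set:
        start = time.perf_counter()
        prediction = method(sample)
        elapsed += time.perf_counter() - start
        correct += int(prediction == label)
    return {
        "accuracy": correct / len(test_set),
        "mean_latency_s": elapsed / len(test_set),
    }


# Toy task: classify integers as odd or even.
test_set = [(1, "odd"), (2, "even"), (3, "odd"), (4, "even")]


def parity(n):
    return "even" if n % 2 == 0 else "odd"


def always_odd(n):
    return "odd"


# One call per candidate method, all evaluated in exactly the same way.
results = {name: evaluate(fn, test_set) for name, fn in [("parity", parity), ("always_odd", always_odd)]}
```

Because every method goes through the same function, adding a new metric or a new unit test set later automatically applies to all of them.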

    As a side note, throughout my career, I’ve been struggling to find good tools to help me build test environments more quickly. While many MLOps solutions exist today (Weights & Biases, TensorBoard, etc.), they are usually very focused on machine learning (which imposes a constraint on the solution to your problem), and are often more about debugging your networks than evaluating a given solution. Today, I am increasingly using a tool called QA-Board, open-source software released by Samsung that provides a complete environment for combined quantitative and qualitative evaluation.

    Build your baseline

    In order to finish your end-to-end data-centric pipeline, you also need a baseline, i.e. the quickest solution you can find to your problem that can be evaluated in your environment. I talked about this in more depth in my first post “Fall in love with the problem”, but it’s worth saying again: your baseline SHOULD NOT take you a long time to build, and SHOULD NOT give good performance. The goal is to check that all the blocks of your pipeline are integrated, and to be able to “appreciate” the quality of the results you’ll get when you use more complex solutions. Very often, you actually don’t even need a training set to build a baseline: you can either go for non-ML solutions, or leverage the ocean of open-source code available online.

    Ok, now that you have your end-to-end pipeline running, it is time to improve on your baseline, and this is where things will spice up from an algorithmic perspective.

    Step #3: Discover the data you really need

    Thanks to the effort you put into steps 1 and 2, you may be able to get some initial insights into your baseline’s performance. For example, it may fail on specific unit tests, and this may give you a hint regarding the type of data you want to gather.

    But more often than not, defining accurately the data you need is much harder than it looks, because the failure modes of your algorithm are in general unknown in advance, and your “unit test sets” are only wild guesses of the kind of data that could cause an algorithm to fail.

    Surprisingly, I did not find a lot of academic papers answering the question “What are the 2-3 main characteristics in my data that cause my network to fail”. As far as I can see, there are 3 potential approaches there:

    Deploy the algorithm in shadow mode

Even though your baseline is far from robust enough, it may be a good idea to put it in production in “shadow mode”, meaning that its predictions will not affect anything, but you’ll be able to gather a lot of data and monitor the kind of corner cases you’ll have to face. Here is a quick list of pros and cons for this method:

Pros:

  • It forces the company to perform the integration work, which as discussed previously is one of the biggest failure risks of an AI system.

  • This gets you exposure to the EXACT production distribution, with all its corner cases and so on. You can therefore extract all the failure modes, and add them directly to your training/test sets until you are satisfied with the algorithm’s performance.

  • You can (theoretically) access all the production logs and context to understand precisely the context in which failure occurs. In contrast, when you work with “offline” data, you can only access the context which you have proactively collected at the dataset collection time.

Cons:

  • Detecting when neural networks fail is a hard open research problem. In a classification task, you can use the network’s confidence, but even then, it is known to be quite unreliable.

  • Integrating a network into production (especially in shadow mode) has a significant cost for your company, and the company may rightfully decide to wait until you’ve shown promising results to make this investment.

  • It requires the company to decide upfront HOW your algorithm will be integrated in production. This significantly reduces your algorithmic freedom afterwards, and may limit the performance eventually achievable.
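For a classification task, the confidence-based failure detection mentioned in the list can be sketched as follows: predictions whose top softmax probability falls below a threshold are flagged for review. The threshold of 0.6 and the logits are arbitrary assumptions, and as noted above, this signal is known to be unreliable on its own.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def flag_low_confidence(logits, threshold):
    """Return the indices whose top softmax probability is below `threshold`."""
    confidence = softmax(np.asarray(logits, dtype=float)).max(axis=1)
    return np.flatnonzero(confidence < threshold), confidence


# Hypothetical logits logged in shadow mode for three inputs.
logits = np.array([
    [6.0, 0.0, 0.0],   # confident prediction
    [1.0, 0.9, 0.8],   # nearly uniform: a candidate corner case
    [0.0, 4.0, 0.0],   # confident prediction
])

flagged, confidence = flag_low_confidence(logits, threshold=0.6)
```

The flagged inputs are candidates for manual inspection and, if they turn out to be real failures, for inclusion in your training and test sets.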

Data attributes inference

Whenever your test set is big (and it should be), it is often very hard to find interpretable characteristics of your data that correlate with your network’s failures. But if you enrich your unit test sets with metadata (such as number of people, location of objects, etc.), you’ll be able to leverage many classical data science techniques (KDE, clustering, etc.) across INTERPRETABLE dimensions of your data. For example, if you run an inference network on each of your test images to infer the gender of the subject for each image, you may suddenly realize that your algorithm performs much better on men than women, which is a directly actionable insight from a data perspective since you can add more women to fix this issue. In order to perform this enrichment, you can take advantage of all the off-the-shelf networks that exist today. For example, Google’s Mediapipe project makes it very easy for anyone to extract information (3D face landmarks, body pose, etc.) about images, with relatively high robustness and accuracy.

Another possibility is to use synthetic data as a test set. Since synthetic data are built by an algorithm, every metadata you wish to have about this data is theoretically accessible. For more details about this approach, I highly recommend reading Datagen’s Neurips paper.

Unstructured failure modes discovery

Unfortunately, in many cases, you won’t know beforehand the characteristics of your data that are causing a network to fail. Ideally, we would like to have a toolbox that analyzes the model’s performances, and gives us actionable insights on how to improve them.

While the academic literature is relatively scarce on this topic, here are 2 interesting works I came across recently:

  • StylEx: very recent and interesting paper by Google that shows how to automatically discover and visualize the main data characteristics that affect a classifier. In other words, their method will help you identify and visualize exactly the attributes that cause a dog vs cat classifier to classify an image as dog or cat.

Explaining a Cat vs. Dog Classifier: StylEx provides the top-K discovered disentangled attributes which explain the classification. Moving each knob manipulates only the corresponding attribute in the image, keeping other attributes of the subject fixed. Source: Google’s blog post

  • Virtual Outlier Synthesis: another amazing and recent paper. They basically found a way to generate images that are out-of-distribution, i.e. corner cases that the network hasn’t seen a lot during training, and is therefore struggling on. Although this paper doesn’t really bring an explanation as to WHY these outliers are failing, the fact that it can generate new failure case candidates by itself is pretty amazing.

As part of my work at Datagen, I’ve also developed such a method to detect neural network failure modes in an unstructured way. More precisely, I built a Variational Auto-Encoder that can transform an input failure case into the most similar image that would make the network succeed. When trained and tested on the facial landmark detection task (on the 300-W dataset), we observed for example that the VAE was removing glasses from people and transforming babies into grown-up faces, hinting that the network has more difficulty dealing with babies and people with glasses.

(As you may know, one of the famous problems with VAEs is that they create blurry pictures, as can be seen above. While this may be an issue in many cases, it is not in our “debugging” case, because these images are meant to be interpretable, not beautiful).

Step #4: Gather the new data and adapt your training pipeline

Once you’ve identified the type of data you need, you need to gather more of this data. This is typically where things may get harder: if you need real data, you may need to write a long list of precise requirements, and wait for 6 months until you get the new dataset back. This is exactly why I believe synthetic data is going to be a revolution. You can now use platforms like Datagen to create data with the exact characteristics of your failure cases, and a lot of variance on everything else.

However, keep in mind that your training procedure may need to be adapted if you have a new dataset. In particular, if you use synthetic data, both our team and Microsoft observed that 2 important changes need to happen in your training procedure:

  • More data augmentation is needed, because synthetic data is often “too clean”, and therefore does not represent well enough the real world.

  • Label adaptation: unless the labeling conventions of the synthetic data completely align with the ones you use in your test sets, you will need to find a way to adapt the output of your network so that it fits the ground truth of your test set, and not that of your synthetic data. We have found that the easiest and most effective way is to pretrain your network with synthetic data, and then fine-tune it on real data.
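
The pretrain-then-fine-tune idea can be illustrated with a deliberately tiny toy, sketched below. This is not a landmark detection pipeline: the "network" is a one-dimensional linear model, the label gap is assumed to be a constant offset of 0.5 between the synthetic and real labeling conventions, and all data, function names, and hyperparameters are invented for the example.

```python
def train(weights, data, learning_rate=0.05, epochs=200):
    """Fit y = w * x + b by plain gradient descent on the mean squared error."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(weights, data):
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Assumed toy label gap: synthetic labels are shifted by +0.5 versus real labels.
xs = [i / 10 for i in range(-10, 11)]
synthetic = [(x, 2.0 * x + 0.5) for x in xs]            # plentiful, shifted convention
real_train = [(x, 2.0 * x) for x in (-1.0, 0.0, 1.0)]   # scarce, target convention
real_test = [(x, 2.0 * x) for x in (-0.5, 0.25, 0.75)]

# Pretrain on synthetic data, then fine-tune briefly on the few real samples.
pretrained = train((0.0, 0.0), synthetic)
fine_tuned = train(pretrained, real_train, epochs=50)

print("real-test MSE after pretraining:", mean_squared_error(pretrained, real_test))
print("real-test MSE after fine-tuning:", mean_squared_error(fine_tuned, real_test))
```

In this toy, pretraining learns the slope from the plentiful synthetic data but inherits its shifted labels, and the short fine-tuning pass on real data mostly corrects that offset.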

Step #5: Scale it until you make it!

Once you have your data-centric process in place, you will need to repeat it as fast as possible, i.e. perform the following steps:

  1. Analyze the failures as described in step 2 → Conclude about the dataset to add

  2. Generate one or several new datasets based on these insights.

  3. For each new dataset found:

    1. Retrain the network with this new dataset (+the previous ones to avoid the catastrophic forgetting issue)

    2. If this new dataset improved the performance on your unit tests:

      1. Keep it for future iterations.

      2. Go back to step 1 with your freshly trained network

    3. If it did not:

      1. Drop this new dataset

      2. Return to step 1 with the previous network
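
One possible reading of the loop above, written as a small greedy procedure. All names here (data_centric_loop, propose, retrain, evaluate, the GAINS table) are hypothetical stand-ins: propose plays the role of failure analysis plus data generation, retrain trains on every kept dataset plus the candidate, and evaluate is assumed to return a higher-is-better score measured on the validation set.

```python
def data_centric_loop(model, kept, propose, retrain, evaluate, rounds=3):
    """Keep a candidate dataset only if retraining with it improves the score."""
    best_score = evaluate(model)
    for _ in range(rounds):
        for candidate in propose(model):
            # Retrain on the candidate plus everything kept so far,
            # to limit catastrophic forgetting.
            trial_model = retrain(kept + [candidate])
            trial_score = evaluate(trial_model)
            if trial_score > best_score:
                kept = kept + [candidate]
                model, best_score = trial_model, trial_score
                break  # go back to failure analysis with the fresh model
            # Otherwise the candidate is dropped and the previous model is kept.
    return model, kept, best_score


# Toy stand-ins: a "model" is just the tuple of dataset names it was trained on,
# and each dataset has an assumed, made-up effect on the validation score.
GAINS = {"base": 0.70, "glasses": 0.05, "babies": 0.03, "noise": -0.02}


def retrain(datasets):
    return tuple(datasets)


def evaluate(model):
    return sum(GAINS[name] for name in model)


def propose(model):
    return [name for name in ("noise", "glasses", "babies") if name not in model]


model, kept, score = data_centric_loop(
    retrain(["base"]), ["base"], propose, retrain, evaluate
)
print(kept, round(score, 2))
```

In this variant, a rejected candidate simply gives way to the next one in the same round rather than triggering a new analysis, which avoids proposing the same unhelpful dataset again. A real system would also need a stopping criterion beyond a fixed number of rounds.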

Important note: For the sake of simplicity, I’ve mostly talked about test sets so far. Needless to say, as soon as you start iterating and making decisions based on your test set’s results, your final algorithm runs a strong risk of overfitting to it. This is why this whole closed-loop optimization process should run on the validation set, not on the test set per se.

Long-term: AI-powered data-centric AI

At the end of the day, if figuring out the optimal data you need to solve a task is such a hard algorithmic task, maybe it would make sense to build a complete closed-loop system (some kind of next-gen AutoML) that not only optimizes your model’s hyperparameters, but also the data it gets fed with.

If you think about this for a second, this idea is actually very comparable to how a child learns. Of course, a child’s brain learns very quickly and efficiently from any data it comes across. However, the data he or she receives is not random at all: it is carefully selected by people who care for the child more than anyone else in the world, and are also (hopefully) themselves very smart: his or her parents.

However, in order to get there, a lot of very interesting problems need to be addressed:

  • On the AI side: as we’ve seen, how to discover the failure modes of your network, how to adapt your training procedure to new data, and how to optimally generate data based on those failure modes are still research questions, and I personally haven’t found a lot of papers dealing with them.

  • On the MLOps side: no matter how smart your algorithms are, you will always need a lot of trial and error to figure out the best training data and hyperparameters. This means that you will require automated systems that can run as many experiments as possible in parallel, while optimizing all of your hardware resources. Some of these experiments will be related to data gathering, others to neural network training, and they will need to communicate and be orchestrated in a very smart way. I actually expect the demand for MLOps tools for DCAI to grow very quickly in the coming years. I personally would love to have such tools.

Unfortunately, here again, the scientific literature on these questions (especially MLOps) is quite scarce, but an interesting paper called “AutoDC” (presented at the last NeurIPS data-centric AI workshop) gave some initial directions on how to approach these questions. I would personally be very interested to see these AutoDC concepts taken to the world of synthetic data, since we would then be able to have fully automated systems that can train neural networks to optimally solve a given task.

source: AutoDC paper


As you can see, data-centric AI is one of the most exciting areas to work on, as it is both THE most impactful way to perform AI, and it is actually itself full of fascinating AI challenges! If working on those challenges gets you excited, you should definitely talk to us!

This blog was originally published on the ImpactfulAI blog here.

Sebastien Derhy is the Applied Science Lead at Datagen, and has more than a decade of experience in the field of computer vision and 3D algorithms. After receiving his MSc from Ecole Polytechnique (#1 engineering school in France), he started as a visiting researcher at the Technion in 2011 with Prof. Ron Kimmel, where he contributed to the nascent world of computer vision from depth cameras. With fellow students from the Technion, he  co-founded the startup Fitterli. After this experience, he joined Samsung, in which he led the development of computer vision algorithms for one of the first worldwide event-based vision sensors (DVS). Finally, he led the development of several vision and data science-based tools to improve navigation quality at Via Transportation. In October 2020, he joined Datagen to take part in the revolution of computer vision with synthetic data.


Q&A with Anthony Goldbloom, Founder and CEO, Kaggle

Gil Elbaz, Datagen’s co-founder and CTO, sat down with Anthony Goldbloom, the co-founder and CEO of Kaggle. Kaggle hosts machine learning competitions, where data scientists download data and upload solutions to difficult problems. Before Kaggle, Anthony worked as an econometrician at the Reserve Bank of Australia, and before that at the Australian Treasury. He holds a first class honors degree in Econometrics from the University of Melbourne.

This interview has been edited for length and clarity.

Q:  There’s so much amazing talent on Kaggle.  What was your original vision for Kaggle? Did you have something else in mind? 

Anthony Goldbloom: I was a journalism intern at the Economist magazine, and I wrote an article about predictive analytics, and machine learning applied to business problems. I was interviewing companies, and was thinking, I would love to work on some of the problems I was interviewing people about. The idea behind Kaggle was to give companies access to people like me, and people like me access to the interesting problems that companies post.

And I think it’s really largely fulfilled that goal. How interesting and exciting to try a problem, get to a certain level of performance, then see what the winners did that you didn’t. It’s a unique learning opportunity. With each challenge you participate in, your performance gets better and better. I think Kaggle has largely fulfilled what I had hoped originally it would do. 

Q: How has Kaggle evolved over the years, what it started from and what is the future of Kaggle?

Anthony Goldbloom: One of the biggest changes we’ve made to Kaggle over the years has been the introduction of a hosted notebook product. We introduced it because we noticed that when people were competing in competitions, they were sharing code. The hosted notebook lets people in our community share that code, and this was a huge change. Over time, it has evolved into a real, beautiful, hosted notebook environment; really stable, very powerful, a very nice environment.

The second big part is what we call our public data platform. We allowed anybody in our community to share any dataset with each other without a challenge. The public data platform has been really powerful, and allows our users to actually create their own competitions.

We look at ourselves as really trying to provide learning by doing. 

Listen to the entire interview on the Unboxing AI podcast

Q: How do you see Kaggle evolving with the need for very large compute in order to train, let’s say generative methods or various reinforcement learning methods? And the need for access to data where we’re only scratching the surface of what’s possible? 

Anthony Goldbloom: One of the very exciting things happening on massive datasets is when it comes to solving pragmatic, real world problems.

Kaggle considers ourselves as not the place where new machine learning gets invented, but where it gets sifted and sorted. We figure out what’s real and what isn’t on real pragmatic problems. And so I think, in most cases, being able to fine tune an existing model and not having to spend money on training from scratch, ends up being the dominant strategy. 

Q: How would you describe the community as a whole and its ability to solve problems together?

Anthony Goldbloom: It’s open to anybody, right? If you have an internet connection, Kaggle is accessible to you as a learning opportunity and as a way to get credentialed. The fact that everybody is on the same playing field is another really nice feature of this community.

Q: Do you see Kaggle as a community that can be put together for positive impacts?

Anthony Goldbloom: We work with the radiology industry in North America on challenges ranging from using chest X-rays to diagnose COVID, to using CT scans to diagnose lung cancer, to a wide range of other medical challenges.

Raising awareness on public good type challenges is definitely an area that Kaggle has done good work on in the past. And I expect us to continue.

Q: What would you recommend to new people starting out in the machine learning space or the computer vision space? 

Anthony Goldbloom: My answer is probably somewhat predictable, but get on Kaggle. I’m very much a learning-by-doing type. I think it’s important to learn some basic Python as a starting point. Kaggle has some really nice courses where we try to teach you the basics of Python and the basics of supervised machine learning. They’re not supposed to be a really rigorous grounding in any of these topics, but they’re supposed to teach you just enough that you can start rolling up your sleeves and playing by yourself.

And challenges are a really good way to learn. You probably don’t want to spend more than half an hour or an hour a day on a challenge. And maybe one idea didn’t work, but then something you think of later makes an improvement. It’s a very nice way in my view to learn.

Listen to the entire interview on: