Saving Lives with Deep Learning & Robust Infrastructure

Gil Elbaz, Datagen CTO and co-founder, interviewed Idan Bassuk, VP AI at Aidoc, about computer vision, natural language processing, and AI explainability in the medical field. During the episode, Idan discusses the challenges that anyone building an AI team for scale should know. 

AI algorithms and software engineer Idan Bassuk joined Aidoc four and a half years ago and now serves as VP AI. He leads a 90-person team focused on all aspects of AI, from dataset development to algorithmic research to deployment and continuous monitoring in over 500 medical centers worldwide. Idan spent ten years in the IDF before joining Aidoc as its first employee. 

This transcript has been edited for length and clarity. Listen to the full podcast here.

What does Aidoc do?

Idan Bassuk: Aidoc is the leading provider of AI for medical imaging, such as the CT scans, X-rays, and MRIs that are interpreted by radiologists. Radiologists specialize in reading, interpreting, and diagnosing based on these medical images. 

We have built a system around the concept of always being on. It means that we are always running in the background. We don’t wait for the radiologist to send us a question about a scan; we are listening to the databases of the hospitals. In many cases, we are even connected directly to the scanners themselves. Once a new scan gets acquired, we automatically identify it and analyze it with AI algorithms to detect different types of medical conditions, such as brain hemorrhages, spine fractures, and strokes. These are critical and life-threatening medical conditions. We often do this analysis before the radiologist even opens the scan. Our goal is to drive the radiologists to get to the most important and most critical patients earlier.

How does Aidoc save lives? 

Idan Bassuk: We do retrospective studies on years of data, depending on what needs to be done to be statistically significant. And we’ve done it on millions of scans already, even as part of research. Research is conducted independently by these medical centers, and they’ve published dozens of academic papers on it.

We have seen in this controlled research that we are improving the patient outcomes by reducing the missed detection rate by radiologists. And we have a very good ratio between our sensitivity, or what is known as the recall, and the number of false positives that we provide.

One of the great things is, really on a daily basis, we get emails or WhatsApp messages from doctors that are using our products giving us examples of patients whose lives we saved or helped save. 

How do you build your team?

Idan Bassuk: I lead the AI group at Aidoc, and I also shaped the concept of what the AI group is. It’s not only a group of algorithm engineers or researchers; it contains all the teams that are responsible, in any way, for developing the AI or bringing it to production in the real world.

In parallel to this AI Operations Center, this group also contains several other teams, each of them part of the concept of holistically attacking the AI challenge from many directions. We have a data engineering team that is responsible for data engineering platforms: enabling data mining and enabling us to build datasets across dozens of types of data, including scans, reports, medical records, and so on. 

Our data engineering team builds platforms that enable us to utilize the data to the best extent possible. There is a notion in AI that more data is better, but more data is not necessarily better. Much of it isn’t interesting and doesn’t teach the algorithm anything new, since most patients are healthy. Our goal is to train on the smallest dataset possible that contains the most interesting scans, not on the largest dataset possible. We don’t think dataset size is the correct KPI. 


That’s one team. Another team is the dataset engineering or dataset development group. It’s a generalization of the concept that in many companies is just called data annotation, but we think that building a dataset on which you train and test your AI, especially given the challenges of the medical world, goes far beyond annotation, since you need a trained radiologist to do it. 

You need people who understand not only the medical conditions, not only the hemorrhages and where to find them, but also the data variability, the physical properties of the scans, and our customers’ needs, in order to choose the most important and most interesting scans for the algorithm.

Working Agile in AI 

Idan Bassuk: Being able to reproduce your algorithm, and the infrastructure around it, is a prerequisite for working agile in AI. What is agile? To be agile in AI, you need to ship to the customer, get feedback, and iterate, without spending weeks or months just resuming where you left off.

You don’t want to develop the algorithm that you will have five years from now and only then ship it to the customer. You want to get feedback from your customer on an early version to see what they think should be improved the most. But if you ship it to the customer and get feedback, and then you need to invest several weeks or months just to be able to continue where you left off, it’s an impediment to being agile in AI, because it encourages you to make longer iterations, since resuming is more painful. 

This specific aspect of agile, working iteratively, is I think a highly important pattern: not having impediments that encourage you to make iterations longer than what makes sense from the product perspective.

I think it’s generally a good idea to take the framework of agile and scrum, or something similar, but not stick to it too tightly, so that it can be adapted to the needs of your organization. That applies not only in AI, but in general.

Getting started in computer vision 

Idan Bassuk: I personally think that the most important thing for a junior engineer, someone relatively new to the industry, is not what project you are working on or what company you are working in, or what domain you’re working on, it’s the quality of the people that you will be working side by side with.

And by the way, even if you don’t necessarily get the sexiest tasks on this team, you will have the opportunity to watch experienced engineers tackle complex problems and to discuss it with them.

I think you will learn much more from it in many cases. As talented as you are, I think it’s very valuable for most people to have someone to learn from, not necessarily a formal mentor. I think that just the people on the team can really change your pace of growth. 

Listen to the full podcast here.

Synthetic Data and Facial Landmark Detection

Can synthetic data lead to better performance on the computer vision task of facial landmark detection?

In our first two blogs here and here, we learned about our experiment with combining synthetic data (created with the Datagen platform) with real-world data in training a facial landmark detection model.

In our first blog, we discussed domain gaps and laid the groundwork to proceed with our experiment. Next, we discussed the hypothesis that “a landmark detection model trained on a combined real and synthetic data set will outperform one trained solely on real data”. To test that, we adopted a truly data-centric approach. Using a fixed landmark detection model architecture, we experimented with five strategies for training the model, two of which (mixing and fine-tuning) involve combining the real and synthetic datasets listed in Table 1, below. 

 

Type                Name                      Number of images                                         Purpose
Synthetic dataset   DGFaces (Figure 1)        10k labeled images                                       Training set
Synthetic dataset   Microsoft Dataset         100k labeled images (only 10k used in this experiment)   Training set
Real dataset        300-W Train (Figure 2)    6296 labeled images                                      Training and validation set
Real dataset        300-W Common (Figure 3)   1108 labeled images                                      Test set only

Table 1: Datasets used

 

Figure 1: Sampled images from the DGFaces synthetic dataset

Figure 2: Sampled images from 300-W Train that are used in the validation set

Figure 3: Sampled images from 300-W Common that are used in the test set.

 

Read the full white paper: Facial Landmark Detection Using Synthetic Data

 

To compare the strategies fairly, we had to ensure that the label adaptation model performs reasonably well. First, we trained a model with synthetic data only. Then, we used model label adaptation to overcome the label gap. The performance of models using model label adaptation is shown in Table 2.

Train Dataset   Data points used   Pipeline implemented by   Error [NME] ↓
DGFaces         10k                Datagen                   3.93
Microsoft       10k                Datagen                   3.63
Microsoft       100k               MS[1]                     3.09
300-W Train     6296               Datagen                   3.32
300-W Train     6296               MS[1]                     3.37

Table 2: The performance of models using model label adaptation

Comparing the 5 Strategies 

Here, we compare the performance of these five strategies with different amounts of real data N_real (Figure 4).

Figure 4

We can extract three insights from Figure 4.

The model “fine-tuned” with 50 real images has approximately the same performance as the “real-data-only” model with 250 real images. This shows that the use of synthetic data reduces the amount of real data needed. (Figure 4A)

Figure 4A

The models trained on the combined dataset outperform those trained on real data alone. This demonstrates the value of combining real-world and synthetic data in landmark detection model training. (Figure 4B)

Figure 4B

The performance gap between the “real-world-only” model and the “fine-tuning” model is the largest when the amount of real data is small. This indicates that the value of synthetic data is potentially the largest when real data is limited. (Figure 4C)

Figure 4C

Conclusion

Synthetic data has the potential to drastically improve computer vision tasks. Yet, the hurdles of visual and label domain gaps stand in the way of that potential. Fortunately, these can be overcome with label adaptation and data augmentation. 

Having overcome such domain gaps, we find that combining synthetic and real-world data outperforms the strategy of using real data alone. We also learned that fine-tuning is the optimal strategy for creating the best landmark detection model with real and synthetic data.

In the future, we are excited to replicate this experiment on other computer vision tasks. We also have plans to explore more methods of improving the landmark detection benchmark. Stay tuned!

Read the full white paper: Facial Landmark Detection Using Synthetic Data

Task2Sim: Towards Effective Pre-Training and Transfer from Synthetic Data

Review:

Pre-training neural network (NN) models on large datasets, such as ImageNet, has become standard procedure in computer vision in recent years. Model pre-training is especially effective when only a small amount of data is available for training. In that case, training highly expressive models, for example large-scale neural networks, may lead to overfitting, and model pre-training is one of the means to reduce it.

Synthetic data, generated by a graphic simulator, can be leveraged to enrich datasets used for model pre-training. Naturally, downstream task performance (e.g. accuracy) of a model, pretrained with synthetic data, depends on the synthetic data characteristics. For example, synthetic data generated with different simulator parameters (e.g. lighting, object pose, backgrounds, etc.) usually leads to different downstream task performance. It is therefore worthwhile to customize pre-training data based on downstream tasks. 

The question, then, is whether it is possible to optimize synthetic data simulator parameters to maximize pre-trained model performance on a specific downstream task. The paper addresses this problem by proposing a technique, called Task2Sim, which is designed to determine the best simulator parameters for a given downstream task. Task2Sim is trained on a set of downstream tasks (called “seen” tasks in the paper) and was shown to predict the best simulation parameters for “unseen” tasks without the need for additional training. 

Task2Sim Architecture:

Finding the optimal simulator parameter configuration is a challenging problem due to the high number of possible simulator parameters. As a result, the number of possible parameter configurations is very large (and grows exponentially with every parameter added), making a brute-force approach impractical. Instead, Task2Sim trains a neural network that maps a downstream task representation to an optimal configuration of simulator parameters.

To construct such a mapping, we need a downstream task representation, i.e., a task embedding vector. Such a representation should reflect the semantic similarity between tasks. For example, the representation of a cat breed classification task should be much closer to that of a dog breed classification task than to that of a truck classification task. Fortunately, a technique for constructing task representations was developed in the paper TASK2VEC: Task Embedding for Meta-Learning. This technique, called Task2Vec, is capable of building fixed-dimensional embeddings for a variety of visual tasks, where each task is represented by its dataset with ground-truth labels. A major advantage of the Task2Vec embedding is that it does not rely on any knowledge of class label semantics and does not depend on the number of classes. Therefore, Task2Vec can be leveraged to represent a wide spectrum of visual tasks.

Our objective, then, is to train a model that predicts the simulator configuration achieving maximal accuracy for the Task2Vec representation of a given downstream task. Simply put, the vector representation of a downstream task is fed into a NN, called Task2Sim, which maps this vector to synthetic data generator parameters (e.g. blur amount, lighting direction, backgrounds, etc.). Then the data simulator (the authors used the Three-D-World platform) generates a dataset of synthetic images based on these parameters. A NN-based classifier is pretrained on these synthetic images, and its accuracy on the downstream task is used to update Task2Sim’s parameters. 

The Task2Sim NN outputs a distribution over the quantized values (e.g. several blur amount levels or lighting directions) of each simulation parameter. Denoting the number of simulator parameters by M, the Task2Sim output is M distributions P_1, …, P_M, one over the values of each simulator parameter.
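
As a rough illustration of this output structure, here is a minimal PyTorch-style sketch (the layer sizes, parameter names, and head design are our own assumptions, not the paper's code):

    import torch
    import torch.nn as nn

    class Task2SimHead(nn.Module):
        """Maps a Task2Vec task embedding to M categorical distributions,
        one over the quantized values of each simulator parameter."""
        def __init__(self, embed_dim, levels_per_param):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU())
            # one softmax head per simulator parameter (blur, lighting, pose, ...)
            self.heads = nn.ModuleList(
                [nn.Linear(256, n_levels) for n_levels in levels_per_param]
            )

        def forward(self, task_embedding):
            h = self.trunk(task_embedding)
            # distributions P_1 ... P_M over the quantized parameter values
            return [torch.softmax(head(h), dim=-1) for head in self.heads]

    # Example: four simulator parameters, each quantized into three levels
    task2sim = Task2SimHead(embed_dim=1024, levels_per_param=[3, 3, 3, 3])
    dists = task2sim(torch.randn(1, 1024))  # list of four probability vectors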

Task2Sim Training

As the key objective of Task2Sim is to maximize downstream task accuracy after pre-training, this accuracy serves as the measure of training success. The paper assumes that downstream task accuracy is not a differentiable function of the simulator parameters, which enables the use of any graphics simulator as a black box. Therefore, gradient-based training methods cannot be used for Task2Sim optimization. Instead, the paper leverages the REINFORCE algorithm to estimate the gradient of downstream task performance with respect to the Task2Sim parameters. 

REINFORCE is a classic reinforcement learning (RL) algorithm belonging to the family of policy-gradient methods. An RL algorithm's goal is to determine the policy with the maximum reward, where a policy is a strategy for picking a particular action at a given time. Policy-gradient methods are iterative techniques for modeling and optimizing the policy directly. But what are the rewards, policy, and actions in our case?[1]

Naturally, a reward is the downstream task accuracy after pre-training on a dataset generated with a given configuration of simulator parameters. A simulator parameter configuration is the "action" leading to a reward. An "action" is a vector of simulator parameter values a = (a_1, …, a_M), where a_i is the value of the i-th simulator parameter. This parameter configuration is fed to the simulator to build a synthetic dataset. This dataset is used to pretrain a model, and then the downstream task performance (the reward!) is estimated with the pretrained model. 

A policy is the set of M distributions over the simulator parameters generated by the Task2Sim NN; the i-th configuration parameter is sampled from the distribution P_i.

In formal terms, an action a = (a_1, …, a_M) can be generated by sampling from these M distributions. The probability of an action is defined by

P(a; θ) = ∏_{i=1..M} P_i(a_i; θ),

where P_i is the distribution over the i-th simulator parameter. The action distribution is computed by feeding the Task2Vec task embedding into the Task2Sim network. Then the following function is maximized with respect to the Task2Sim NN parameters θ:

J(θ) = Σ_{a ∈ Ω} P(a; θ) · R(a),

where Ω denotes the space of all possible actions a and R(a) is the reward (downstream task accuracy) achieved when the training data is created with the action a (simulator parameters).

As we already mentioned, the authors use the REINFORCE technique to update the Task2Sim NN weights θ. REINFORCE updates these weights using the gradient of the reward R with respect to θ, which is estimated by the following expression:

∇_θ J(θ) ≈ R(â) · ∇_θ log P(â; θ),

where the expectation over actions is approximated by a single sample from the distribution over configurations P(a; θ), and â is the set of configuration parameters corresponding to the mode (argmax) values of the distributions P_1, …, P_M.[2]

NN weights θ are updated using mini-batches of downstream tasks sampled from a set of “seen” tasks. In addition, the paper updates these weights via “replaying” high-reward actions encountered in the past (self-imitation learning).
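
A hedged sketch of a single REINFORCE-style update for this setup, reusing the Task2SimHead sketch above and assuming a pretrain_and_evaluate function that pre-trains a classifier on the simulated data and returns downstream accuracy (both are illustrative stand-ins, not the authors' code):

    import torch

    def reinforce_step(task2sim, optimizer, task_embedding, pretrain_and_evaluate):
        """One policy-gradient step: grad J(θ) ≈ R(â) · grad log P(â; θ)."""
        dists = task2sim(task_embedding)                # M categorical distributions
        action = [int(d.argmax()) for d in dists]       # mode (argmax) configuration â
        log_prob = sum(torch.log(d[0, a]) for d, a in zip(dists, action))
        reward = pretrain_and_evaluate(action)          # downstream accuracy; no gradient flows through it
        loss = -reward * log_prob                       # ascend the estimated gradient by minimizing -J
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return reward

In the paper, such updates are applied over mini-batches of seen tasks and combined with self-imitation replay; the sketch above shows only the basic gradient estimate.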

Performance Evaluation and Results:

Task2Sim performance on a downstream task is estimated by applying a 5-nearest-neighbors classifier to features generated by a backbone NN pretrained on a dataset produced with the simulator parameters output by Task2Sim. In other words, the label for an example x is determined by the labels of the 5 labeled examples that are closest to x in the feature space. Notably, this technique is faster than the commonly used techniques for evaluating transfer learning methods, such as linear probing or full network fine-tuning. 
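
As a sketch of this evaluation protocol (assuming features have already been extracted by the pretrained backbone; the helper name is ours):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    def evaluate_with_5nn(train_features, train_labels, test_features, test_labels):
        """Score pretrained features on a downstream task with a 5-nearest-neighbors classifier."""
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(train_features, train_labels)         # features from the pretrained backbone
        predictions = knn.predict(test_features)      # label of x = vote of its 5 nearest neighbors
        return accuracy_score(test_labels, predictions)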

The results below compare Task2Sim with several other pre-training methods.

Task2Sim significantly outperforms two “basic” pre-training approaches on 12 downstream tasks: “Random” (simulator parameters chosen at random) and “Domain Randomization” (all simulation variations used in each image). This conclusion holds both for the seen task configurations (downstream tasks on which Task2Sim was trained) and for the unseen ones (downstream tasks on which Task2Sim was NOT trained). In addition, Task2Sim performance is almost as good as that of ImageNet pre-training on the seen tasks, and it is even better on the unseen tasks. The latter is really impressive: it means that synthetic data generated with Task2Sim allows for building (pre-training) models that are more robust than those pre-trained even on real data.

Footnotes:

  1.  The RL framework, proposed by the paper, does not involve any state and thus can be viewed as either a stateless RL or contextual bandits problem.
  2.  In fact, the authors propose to sample from a noised version of the simulator parameters(action) distribution to encourage exploration of the “action space”.

The Pace of Progress in Academia and Industry

In a recent Unboxing AI podcast episode, Gil Elbaz, Datagen co-founder and CTO, sat down to speak with Lihi Zelnik-Manor, an associate professor in the faculty of electrical engineering at the Technion, and the former general manager of Alibaba DAMO Israel Lab. 

Professor Zelnik-Manor holds a PhD and an MSc with honors in computer science from the Weizmann Institute of Science, and a BSc in mechanical engineering from the Technion. Her main area of expertise is computer vision. Professor Zelnik-Manor has contributed extensively to the community, serving as general chair of CVPR 2021, and will serve as general chair of ECCV 2022 and program chair of ICCV 2025.

This transcript has been edited for length and clarity. Listen to the full episode here.

Interplay between academia and industry

Zelnik-Manor: In industry, you have resources that you might not have in academia, but in academia, you have resources that you don’t have in industry. I think academia and industry have different roles. Industry picks up domains that are mature and have short term outcomes and short term impact. 

Academia has the option, but also, I think, the duty to think long-term. Deep learning grew in academia. Quantum grew in academia, and there are many other domains that grew there and then shifted to industry. What I hope to see is that academia will focus on problems that need to be thought about, or on new things entirely.

Teaching advice

Zelnik-Manor: When I started my career as a professor, I went to ask for advice. How should I choose my students? People gave me different advice. Some said to choose the ones that are best in math, all kinds of stereotypical things.

The advice I really liked a lot was: choose someone who is excellent in something unrelated to work. So it can be excellence in music or excellence in sports or excellence in whatever. Choose those people, they will nourish you better. This is a good guideline because people have different preferences and different goals. 

How to create a team

Zelnik-Manor: If you go back to the 1990s, teams were organized by profession. For experts in graphics, there was a graphics team, and teams were generally organized by their expertise. If you wanted, for example, to create a product or a feature that required a mobile developer, a backend developer, an algorithm engineer, and a product manager, you would take each of them from their individual teams, and each of these people would then have to communicate and align their plans. The alternative is teams whose people have diverse expertise; they're like an independent unit that can deliver the feature or the product.

A squad could have one person from each of those teams. A squad can be formed for a certain task and you could disassemble them when the task is complete. If the task is continuous, they can remain as a squad and that makes it easier to manage and tie all the loose ends together in order to deploy in a timely manner. Their OKRs are aligned and they are working together for the same goal. 

Getting started in CV

Zelnik-Manor: To get started in computer vision, I would recommend going through the traditional path, studying, getting a strong bachelor’s degree and then a PhD. 

A PhD gives you a chance to become a world expert in something. Then you have this unique knowledge base to take with you for life. From my perspective, having had a career for many years, I see myself using different aspects of the things that I've learned. It used to be geometry; then statistics and probability, optimization, programming, and different languages.

You can work now until you’re 80. So we’re talking about 50, 60 years of work. How do you prepare yourself for that? I would recommend building a very strong infrastructure in math, computer science, and physics. Build the infrastructure of the tools, and then you can always learn and evolve and continue to remain relevant.

It’s about the people

Zelnik-Manor: I think there's a quote from Steve Jobs, I don't remember the exact wording, but something like: it's not about the technology, it's about the people. Put your faith in the people. Communication is very challenging, because people often think it has happened when maybe it hasn't. At the end of the day, I don't think the main barriers to the success of our products and technologies are the technologies. They are the people who build them.

If you recruit the right people, who like the vision and mission, they believe in it. They're passionate about their profession; it doesn't matter whether they're a mobile engineer or an algorithm researcher. If they're passionate about what they're doing, they join a company whose mission and vision they love, and they fit the culture, then you have everything in place.

Edge Cases in Autonomous Vehicle Production

“Because [the autonomous vehicle] is a product in the hands of customers, you are forced to go through the long tail. You cannot do just 95% and call it a day. The long tail brings all kinds of interesting challenges,” says Andrej Karpathy, the director of artificial intelligence and Autopilot Vision at Tesla, in his CVPR 2020 keynote.

Here, “long tail” refers to edge cases in autonomous vehicles (AVs). Edge cases are plausible scenarios that have a low probability of occurrence. These rare occurrences are easily missed and thus are often missing from datasets. While humans are naturally proficient at dealing with edge cases, the same cannot be said of AI, so edge cases have to be handled carefully.

Figure 1: A real image of a truck with a reflective surface (Source)

Figure 2: A real image of a chair flying across the road (Source)

The cost of ignoring edge cases is high. In 2016, a Google autonomous vehicle crashed into an oncoming bus when it detected sandbags surrounding a storm drain and had difficulty interpreting the situation correctly. In 2021, Toyota’s autonomous e-Palette collided with a visually impaired pedestrian. “It shows that autonomous vehicles are not yet realistic for normal roads,” Toyota CEO Akio Toyoda commented after the accident.

The pressure to deal with edge cases is mounting for car manufacturers. AVs that can better handle edge cases are safer and have a better prospect of becoming fully driverless sooner. 

Dealing with edge cases is not a trivial task. This blog post analyzes how synthetic data can handle edge cases in production. We will focus our attention on the development of autonomous vehicles, which remains a challenging feat due to the sheer number of possible edge cases.

Edge Cases Come in Multiple Forms

Kai Wang, the Director of Prediction at Zoox, shared that edge cases come in perceptual and behavioral forms. Perceptual edge cases involve rare sightings like trucks with protrusions. Behavioral edge cases are events that require more careful maneuvering of the vehicle, like making a tight three-point turn, navigating through a dense crowd, or dealing with jaywalking pedestrians. 

In the CVPR talk, Andrej Karpathy illustrated examples of perceptual edge cases that Tesla observed when training its vehicles to stop. Some of these (Figure 3) include:

  • Conditional stop signs (e.g. “stop except right turn”)
  • Moving stop signs (e.g. on a school bus)
  • Temporary stop signs (e.g. held by a person)
  • Occluded stop signs (e.g. hidden by a vehicle)

Figure 3. A variety of stop signs that constitute edge cases (Source)

Challenges of Edge Cases in Production

Models in production are fallible – and can perform poorly when faced with edge cases. That is why Tesla has a “Data Engine” that empowers its development of autonomous vehicles (Figure 4).

The principle of the data engine is easy to understand. First, inaccuracies in existing models are detected. Tesla then adds such cases to its unit tests. It also collects more data of similar cases to retrain the model. Such an iterative approach allows it to catch as many edge cases as possible.

Conceivably, the act of collecting and labeling edge cases can be exorbitantly expensive, and in some cases dangerous or even impossible. Here’s an example of an edge case that would be dangerous to replicate (Figures 5A and 5B).

Figure 5A. A man emerging from a manhole remains undetected by the vehicle 

Figure 5B. A man emerging from a manhole is incorrectly classified. (Source: DRisk and BBC)

Training production models on edge cases using synthetic data

This is where synthetic data can help address the problem. As early as 2018, Bewley et al. demonstrated that their autonomous vehicle, trained only on simulation data, could drive on public urban roads in the UK (Figure 6). Since then, more researchers have explored the use of synthetic data for training autonomous vehicles.

Figure 6. Comparison of simulated training and actual test environments in “Learning to Drive from Simulation without Real World Labels” (by Bewley et al.)

More recently, NVIDIA proposed a strategic approach named “imitation training” (Figure 7). In this approach, the failure cases of existing systems in the real world are replicated in a simulated environment. They are then used as training data for the autonomous vehicle. This cycle is repeated until the model’s performance converges. 

Figure 7. The imitation training approach involves the “train, evaluate and simulate” cycle (Source)

Consider the following edge case involving a partially visible truck (Figure 8). The original model produced an incorrect bounding box (left, yellow rectangle). After training on synthetic imitation data (Figure 9), the improved model could produce the correct bounding boxes on this edge case (Figure 8, right).

Figure 8: The baseline model prediction of the truck (left, yellow) compared against the improved model trained with synthetic data (right, blue). The ground truth is in gray.

Figure 9: Examples of synthetically generated images imitating the false truck detection in Figure 8

In general, the models trained using the synthetic imitation data strategy are found to have superior performance as compared to those trained on real data only. 

Testing model in production on edge cases using synthetic data

Apart from training models in production, one can also create synthetic data to test existing models on edge cases. 

As AI matures, more data teams find value in developing test cases for their AI models. Much like software engineering has test-driven development (TDD) as part of its best practices, AI practitioners are likely to develop unit tests for their AI models, argued Yaron Singer from Robust Intelligence. To extend that principle, Yaron also suggested the use of fuzzing for AI, where random inputs are fed to the model in hopes of uncovering its vulnerabilities. Having a robust testing protocol reduces the possibility of AI accidents or AI systems breaking unexpectedly.
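
As a minimal illustration of that fuzzing idea, the sketch below feeds random images to a hypothetical classification model and flags crashes or malformed outputs; the model interface and input shape are assumptions, not a specific library's API:

    import numpy as np

    def fuzz_test_model(model, num_trials=100, input_shape=(1, 3, 224, 224)):
        """Feed random inputs to the model and record trials that crash or return bad outputs."""
        failures = []
        for trial in range(num_trials):
            x = np.random.uniform(0.0, 1.0, size=input_shape).astype(np.float32)
            try:
                probs = model(x)  # hypothetical inference call returning class probabilities
                if not np.all(np.isfinite(probs)) or not np.isclose(np.sum(probs), 1.0, atol=1e-3):
                    failures.append(trial)
            except Exception:
                failures.append(trial)
        return failures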

Testing model performance

To thoroughly evaluate the robustness of an AI system, the unit tests must include both the general and edge cases. Yet, certain edge cases might not be available from existing real-world datasets. To that end, AI practitioners can use synthetic test data. 

One example is the ParallelEye-CS, a synthetic dataset for testing the visual intelligence of autonomous vehicles. The benefit of creating synthetic data over using real-world data is the high degree of control over the scene of each image. Figures 10 and 11 below illustrate how the authors tweaked the scenes and environments to include different edge cases in the data set.

Figure 10: The same scene but in different environments.  (top left: rainy; top right: dark;  bottom left: sunny; bottom right: flood)

Figure 11. Scenes with varying levels of occlusions. (top left: with railing; top right: with divider; bottom left: with a hidden bus; bottom right: with all occlusions)

Testing model for fairness in edge cases

Apart from optimizing model performance, AI practitioners must also ensure that the AI models in production are fair, even in edge cases. Such a sentiment is echoed by Martin Fowler, who called for AI practitioners to validate their models in production for potential biases.

For instance, the model should not perform poorly on minority classes (like the facial recognition software which misidentifies black faces). Neither should a model discriminate against minority groups (like how Amazon’s AI recruiting tool penalized female candidates).

Companies can no longer afford to turn a blind eye to AI fairness as regulations like GDPR start imposing fairness requirements on decisions made by machine learning models. Yet, there are limited toolkits and publicly available datasets for AI practitioners to evaluate their AI on fairness. 

Synthetic data can help. Datagen recently published our findings on the use of synthetic images in uncovering population biases in facial landmark detections (Figure 12). In particular, we discovered that the performance of a facial landmark detection model differs significantly between two populations of a synthetic dataset (e.g. men vs women). This is a telltale sign that the model in production is biased against a certain group. 

We can extend this concept to AVs. For example, we can test the model on pedestrians of different races and ages to uncover its population biases. 

Synthetic data for edge cases

Synthetic data is here to stay as a viable solution to edge cases in production AV models. It supplements real-world datasets with edge cases, ensuring that AVs remain robust even in rare, unexpected events. It is also more scalable, less error-prone, and cheaper than real-world data.

With the rise of the data-centric AI movement, AI practitioners are starting to focus on improving the value of their data. At Datagen, we believe that the use of synthetic data to solve edge cases will not be restricted to autonomous vehicles. It can be extended to facial recognition, medical imaging, advanced manufacturing, and many other industries. 

Using Synthetic Images To Uncover Biases

In January 2020, Robert Williams was arrested for shoplifting after investigators examined the security footage. They had followed the lead of a facial recognition system that flagged Williams’ photo as a match against the grainy footage (Figure 1). The problem? Williams was nowhere near the crime scene when the incident happened.

Figure 1. A photo of the alleged shoplifting suspect from the security footage and the driver’s license photo of Robert Williams  

Williams’ wrongful arrest raised awareness of the fallibility of facial recognition, particularly in underrepresented minorities. The source of the problem is the biased and unbalanced dataset used to train many existing facial recognition AI today.

Image datasets are problematic 

Unfortunately, even the most established datasets today propagate existing social biases. Pundits initially lauded the now-defunct Tiny Images Dataset as a comprehensive library of items, only to be disappointed later by revelations that it was rife with racist, misogynistic, and demeaning labels. Machines trained on such biased datasets perpetuate those biases in their outputs against vulnerable and marginalized individuals and communities. 

Even if datasets do not contain harmful labels, they are likely to under-represent minority groups. Unsurprisingly, models trained on unbalanced datasets perform worse on the minority class, an effect dubbed bias amplification. In particular, Buolamwini et al. found that commercial gender classification systems performed much worse on dark-skinned females than on light-skinned males. Another study found that face recognition and age estimation models were biased against faces under 18 and over 60 years old. 

We cannot understate the harmful effects of such underrepresentation. A passport robot that tells Asian applicants to open their eyes is a PR nightmare (Figure 3). An autonomous vehicle that cannot recognize dark-skinned individuals can cause a life-threatening tragedy.

Figure 3. A screenshot of an erroneous passport photo rejection notice (Source)

Unfortunately, many real-world applications inherit biases encoded in training images. Computer vision practitioners ought to catch such biases before they cause harm. Yet, the question remains–how do we detect biased computer vision systems?

Detecting biased computer vision systems

One existing solution is to use balanced, richly annotated real-world datasets to detect the difference in performances across different groups. For example, KANFace features diverse ages, genders, and skin colors capable of uncovering biases for various face analysis tasks (Figure 4). 

Figure 4. KANFace is annotated for identity, exact age, gender, and kinship (Source)

Another solution is to use synthetic data for detecting biases against minority classes. This idea is not new. Tesla demonstrated its simulated environments (Figure 5) used to test their autonomous vehicles in edge cases.

Figure 5. An example of a rare scene simulated by Tesla (Source)

Detecting biases in facial landmark detection

The same idea can be applied to uncover biases in models trained for facial analysis tasks, as we demonstrated in our paper Using Synthetic Images to Uncover Population Biases in Facial Landmarks Detection.

To validate this hypothesis, we tested whether the performance of a facial landmark detection model differs between two populations of a synthetic dataset (e.g. men vs. women), and found that it differs significantly. This leads us to conclude that synthetic images can be used to uncover population biases in landmark detection tasks. 

Similar results are found when evaluating the same model on real datasets. This signifies that both synthetic images and real images are capable of testing the bias of a landmark detection model.

Figure 6: Steps for the experiment

Here are the steps we took to validate this hypothesis (Figure 6); a simplified code sketch of steps 1 to 3 follows the list.

  1. Use the DLIB package to detect facial landmarks on two datasets–CelebA (Figure 7) and FFHQ-Aging (Figure 8). These datasets are used because they come with facial landmarks annotations and labels of appearance attributes (like gender, age, and skin color)
  2. Compute the normalized mean error (NME) between the model output and the ground truth. The NME reflects the deviation between the detected positions and the ground truth.
  3. Stratify the errors based on the attributes to identify any biases in the model. For example, the model is biased against women if the NME for men is lower than women on average.
  4. Repeat steps 1 to 3 using a synthetic dataset created by Datagen (Figure 9).
  5. Compare the biases identified using the synthetic dataset against those identified using real datasets.
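
For readers who want a concrete starting point, here is a simplified sketch of steps 1 to 3 (the predictor file path, the normalization choice, and the helper names are our assumptions, not the exact pipeline used in the paper):

    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

    def detect_landmarks(image):
        """Step 1: run the DLIB landmark detector on a single image."""
        face = detector(image, 1)[0]                      # take the first detected face
        shape = predictor(image, face)
        return np.array([[p.x, p.y] for p in shape.parts()])

    def nme(predicted, ground_truth, normalizer):
        """Step 2: normalized mean error between predicted and ground-truth landmarks."""
        distances = np.linalg.norm(predicted - ground_truth, axis=1)
        return distances.mean() / normalizer              # e.g. the inter-ocular distance

    def stratified_errors(errors, attributes):
        """Step 3: average NME per attribute value (e.g. 'male' vs. 'female')."""
        groups = {}
        for error, attribute in zip(errors, attributes):
            groups.setdefault(attribute, []).append(error)
        return {attribute: float(np.mean(values)) for attribute, values in groups.items()}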

Figure 7. Sample images from CelebA dataset

Figure 8. Annotated examples from FFHQ Dataset 

Figure 9. Example of face images from the Datagen synthetic dataset

Table 1 shows the results of the experiment. 

Table 1

1 FFHQ-Aging does not have the “beard” attribute annotation 

In summary, the model’s bias is apparent when tested on both real and synthetic datasets. For example, when tested on CelebA datasets, the model’s mean NME on young faces is 0.40% lower than that on old faces. When tested on FFHQ-Aging and Datagen’s synthetic dataset, the model’s NME on young faces is similarly lower than those on older faces. Simply put, the model is shown to be biased against older individuals when tested on both real and synthetic datasets.

We observe the same results when testing the model on other attributes. The model favors men (vs. women), those without a beard (vs. those with a beard), and white-skinned (vs. black-skinned) individuals.

Using synthetic images to uncover population biases

This set of results gave us confidence that the synthetic face images generated can uncover weaknesses in a facial landmark detection model. 

This is good news for computer vision practitioners who want to detect biases in their models. It is often challenging to set aside sufficient images in the test set to detect biases against certain groups. This is no longer a problem if developers have access to synthetic data.

Synthetic images for fair AI

Computer vision AI cannot be expected to make fair decisions when it is trained on datasets rife with existing social biases against underrepresented groups. The onus is on AI practitioners to detect biased models and create fair ones. Synthetic images will become a cornerstone in that process.

That is why we plan to test for biases against additional attributes in other trained models. Our future research aims to rectify such biases by adding synthetic data in the training process.

If you are interested in using synthetic data to detect potential biases in your systems, talk to us today. 

 

Real and Synthetic Data for Facial Landmark Detection

In part 1 of this series, we discussed  domain gaps and  laid the groundwork to proceed with our experiment.

This experiment hypothesizes that “a landmark detection model trained on a combined real and synthetic data set will outperform one trained solely on real data”.

To test that, we adopted a truly data-centric approach. Using fixed landmark detection model architecture, we experimented with five strategies of training the model; two of which (mixing and fine-tuning) involve combining real and synthetic datasets, as described below.

Strategy 1: Training the model on real data only

A face-to-landmark model is trained only on real data.

Strategy 2: Model with label adaptation

In this strategy, the face-to-landmark model is trained on synthetic data. The labels are then adapted after training.

Strategy 3: Model with data adaptation

For the synthetic data, a label adaptation model (trained on real data) converts synthetic data labels to adapted labels. A face-to-landmark model is trained using only synthetic data, and the trained face-to-landmark model predicts the labels of real data. This strategy trains a face-to-landmark model on label-adapted synthetic data.

Read the Facial Landmark Detection Using Synthetic Data Whitepaper

Strategy 4: Mixing (Training on combined real-world and label-adapted synthetic data)

In the first setup, a combined dataset is used to train the model.

We first select N_synthetic images from the Datagen synthetic dataset, which pass through the trained label adaptation model.

The resulting label-adapted synthetic images are combined with N_real images from the real dataset in a set ratio of α, where α is the batch synthetic sampling ratio. 

This combined dataset is then used to train a face-to-landmark model. The predicted labels are then compared against the ground truth labels.
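
A rough sketch of how such a mixed batch might be assembled (the dataset objects and the exact sampling scheme here are illustrative, not the precise training pipeline from the white paper):

    import random

    def sample_mixed_batch(real_samples, adapted_synthetic_samples, batch_size, alpha):
        """Build one training batch with a fraction alpha of label-adapted synthetic samples."""
        n_synthetic = int(round(alpha * batch_size))   # alpha is the batch synthetic sampling ratio
        n_real = batch_size - n_synthetic
        batch = (random.sample(adapted_synthetic_samples, n_synthetic)
                 + random.sample(real_samples, n_real))
        random.shuffle(batch)
        return batch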

Strategy 5: Fine-tuning (Training on synthetic, fine-tuning on real data)

In this strategy, the model is first trained purely on the DGFaces synthetic dataset and then fine-tuned on real data. 
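
A minimal sketch of this two-stage schedule (the optimizer, learning rates, and epoch counts are placeholders, not the values used in the experiment):

    import torch

    def finetune_strategy(model, synthetic_loader, real_loader, loss_fn, epochs=10):
        """Strategy 5: pretrain on synthetic data, then fine-tune on real data at a lower learning rate."""
        for loader, lr in [(synthetic_loader, 1e-3), (real_loader, 1e-4)]:
            optimizer = torch.optim.Adam(model.parameters(), lr=lr)
            for _ in range(epochs):
                for images, landmarks in loader:
                    optimizer.zero_grad()
                    loss = loss_fn(model(images), landmarks)
                    loss.backward()
                    optimizer.step()
        return model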

Experimental Design

Metric

The performance of each model is judged based on the normalized mean error (NME). To calculate the NME, one finds the (normalized) distance between the ground-truth and predicted positions of each facial landmark; these distances are then averaged to give the NME.
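
Written out, with d denoting the normalizing distance (the inter-ocular distance is one common choice; we do not restate the exact normalizer used in the white paper here):

NME = (1/N) · Σ_{i=1..N} ‖p_i − g_i‖ / d,

where p_i and g_i are the predicted and ground-truth positions of the i-th landmark and N is the number of landmarks.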

Validation and test sets

The same real validation and test sets are used to compare each model. A 300-image subset of the 300-W Train dataset is used as the validation set, while the 300-W Common dataset is the test set.

In our next blog, we will talk about the results of the experiment.

VOS: Learning What You Don’t Know

Motivation:

Safe deployment of deep learning models in real-life scenarios requires accurate detection of out-of-distribution (OOD) data. Deep neural networks (DNNs) are usually trained under the assumption that the training and real-world data distributions coincide. Real-world tasks, however, often violate this assumption, leading to erroneous, high-confidence predictions on OOD data. Simply put, the absence of OOD examples during training makes the deployed model's behavior on OOD inputs unpredictable. Therefore, methods that can reliably identify OOD inputs are vital for the safe deployment of DNN models in high-stakes applications such as healthcare and autonomous driving. This phenomenon is illustrated in Fig. 1, where a moose is mistakenly identified as a pedestrian with high confidence.

The paper develops an OOD detection mechanism that is trained during neural network optimization by constructing synthetic OOD samples. The authors propose to leverage these synthetic OOD samples to train the model to recognize OOD inputs while simultaneously optimizing it on an in-distribution (ID) task (classification and object detection).

Leveraging the developed OOD detection mechanism, we can specify and characterize edge cases and under-represented regions of the data. With this knowledge, we may be able to improve our coverage of these regions. For example, synthetic data can be generated to cover these problematic areas.

Main idea of the solution:

Previous approaches leverage generative models, such as Generative Adversarial Networks (GANs), to generate OOD samples in the image space. Such models are often difficult to train, and the selection of an optimal training set can be challenging as well. In order to cope with these flaws, the authors propose to generate OOD samples in the more tractable, low-dimensional feature space rather than the original high-dimensional image space. Synthetic OOD samples are generated from low-likelihood regions in the feature space for every image category. Simply put, OOD samples are generated in feature-space areas located far away from the in-distribution samples of each class. 

Then the network is trained to distinguish between the generated OOD samples and in-distribution samples and is simultaneously optimized to perform the ID task (classification and object detection). Note that the OOD detection task is framed as a binary classification problem performed in the feature space. Thus, in addition to a bounding box and class scores distribution, a neural net produces a so-called uncertainty score for every detected object of an input image. This uncertainty score attains small values for ID samples and larger values for OOD samples allowing for an effective OOD samples detection. 


Technical Explanation:

Let's discuss the synthetic outlier generation procedure. As already mentioned, these outliers are constructed in the low-dimensional feature (representation) space. To obtain an image representation, the image is first fed into a backbone neural network. The result is then processed by a bounding box proposal generator to obtain feature (representation) vectors for every detected object. 

Next, in order to generate outliers in the feature space, the training set feature distribution must be estimated. The feature distribution is assumed to be Gaussian (discussed in the next section) and is estimated separately for each class (category). Finally, synthetic outliers (OOD samples) are generated for each class in low-probability regions of the estimated class feature distribution. In simple terms, feature-space vectors located far away from the average feature value of a given category are sampled to form synthetic outliers. 

Assuming the clusters formed by the features of samples from all in-distribution categories are sufficiently separated, these generated outliers "correspond" to samples that do not belong to any ID category in the original pixel space. The DNN is then trained to distinguish between the outliers and ID samples by giving high uncertainty scores to the outliers and low scores to the ID samples. A sample's uncertainty score can be viewed as the reciprocal of the sum of the probabilities of that sample belonging to each ID category: if this sum is low, the sample is likely an outlier, whereas ID samples are characterized by higher values.
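
A heavily simplified sketch of this pipeline in feature space (class-conditional Gaussian fit, low-likelihood sampling, and the uncertainty score; the function names and the shared-covariance simplification are ours, not the paper's code):

    import numpy as np

    def fit_class_gaussians(features, labels):
        """Estimate a per-class mean and a shared covariance for in-distribution features."""
        classes = np.unique(labels)
        means = {c: features[labels == c].mean(axis=0) for c in classes}
        cov = np.cov(features.T)                          # shared covariance (a simplification)
        return means, cov

    def sample_outliers(mean, cov, n_candidates=10000, n_keep=100):
        """Draw candidates from the class Gaussian and keep only the lowest-likelihood ones."""
        candidates = np.random.multivariate_normal(mean, cov, size=n_candidates)
        diffs = candidates - mean
        # squared Mahalanobis distance: larger distance means lower likelihood under the Gaussian
        maha = np.einsum("ij,jk,ik->i", diffs, np.linalg.inv(cov), diffs)
        return candidates[np.argsort(maha)[-n_keep:]]

    def uncertainty_score(class_probs):
        """Reciprocal of the summed in-distribution probabilities: high for outliers, low for ID samples."""
        return 1.0 / (np.sum(class_probs) + 1e-12)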

Possible Shortcomings/Insights of VOS:

The main flaw of the proposed method is the Gaussian assumption on the feature vector distribution of every category. Using this method in cases where the assumption does not hold (e.g., multimodal distributions) may lead to the generation of "false" synthetic outliers. The use of "false" outliers to train the OOD detector can cause ID samples to be incorrectly identified as OOD, thereby degrading the performance of both the ID task and the OOD detector. A possible way to overcome this issue is to impose a Gaussian distribution on the category feature distributions. This can be achieved by adding a distance between the feature distribution and a Gaussian distribution, such as the Kullback-Leibler divergence or the Earth Mover's distance, to the loss function.

An important aspect of the proposed method is the dimension of the feature space. If it is too low, the feature space may not be representative enough, resulting in inadequate coverage by the generated outliers. On the other hand, if the feature space dimension is too high, a large number of synthetic outliers will be required to cover the entire OOD region. If we generate an insufficient number of such outliers, the boundary learned by the classifier won't be good enough to distinguish ID samples from OOD samples. 

Summary:

The proposed method constitutes an efficient tool for detecting under-represented regions in the test data by learning rejection areas with respect to each category. Furthermore, the synthetic data generation process can be optimized by focusing on the problematic, under-represented regions of the data distribution. 

Final Note: Employing the VOS approach on a trained network without retraining?

If you already have a trained network, synthetic outliers can be generated with the proposed approach. Then another model can be trained to differentiate between ID samples and these generated outliers.

Learn more about the power of synthetic data.

Q&A with Anthony Goldbloom, Founder and CEO, Kaggle

Gil Elbaz, Datagen's co-founder and CTO, sat down with Anthony Goldbloom, the co-founder and CEO of Kaggle. Kaggle hosts machine learning competitions, where data scientists download data and upload solutions to difficult problems. Before Kaggle, Anthony worked as an econometrician at the Reserve Bank of Australia, and before that at the Australian Treasury. He holds a first-class honors degree in Econometrics from the University of Melbourne. 

This interview has been edited for length and clarity.

Q:  There’s so much amazing talent on Kaggle.  What was your original vision for Kaggle? Did you have something else in mind? 

Anthony Goldbloom: I was a journalism intern at the Economist magazine, and I wrote an article about predictive analytics and machine learning applied to business problems. I was interviewing companies and thinking, I would love to work on some of the problems I was interviewing people about. The idea behind Kaggle was to give companies access to people like me, and people like me access to the interesting problems that companies post.

And I think it’s really largely fulfilled that goal. How interesting and exciting to try a problem, get to a certain level of performance, then see what the winners did that you didn’t. It’s a unique learning opportunity. With each challenge you participate in, your performance gets better and better. I think Kaggle has largely fulfilled what I had hoped originally it would do. 

Q: How has Kaggle evolved over the years, what it started from and what is the future of Kaggle?

Anthony Goldbloom: One of the biggest changes we've made to Kaggle over the years has been the introduction of a hosted notebook product. We introduced it because we noticed that when people were competing in competitions, they were sharing code, and we wanted people in our community to be able to share code in a hosted notebook. This was a huge change. Over time, it has evolved into a really beautiful hosted notebook environment; really stable, very powerful, a very nice environment.

The second big part is what we call our public data platform. We allowed anybody in our community to share any dataset with each other without a challenge. The public data platform has been really powerful, and allows our users to actually create their own competitions.

We look at ourselves as really trying to provide learning by doing. 

Listen to the entire interview on the Unboxing AI podcast

Q: How do you see Kaggle evolving with the need for very large compute in order to train, let’s say generative methods or various reinforcement learning methods? And the need for access to data where we’re only scratching the surface of what’s possible? 

Anthony Goldbloom: One of the very exciting things happening on massive datasets is when it comes to solving pragmatic, real world problems.

We consider Kaggle not the place where new machine learning gets invented, but the place where it gets sifted and sorted. We figure out what's real and what isn't on real, pragmatic problems. And so I think, in most cases, being able to fine-tune an existing model and not having to spend money on training from scratch ends up being the dominant strategy. 

Q: How would you describe the community as a whole and its ability to solve problems together?

Anthony Goldbloom: It's open to anybody, right? If you have an internet connection, Kaggle is accessible to you as a learning opportunity and as a way to get credentialed. The fact that everybody is on the same playing field is another really nice feature of this community. 

Q: Do you see Kaggle as a community that can be put together for positive impacts?

Anthony Goldbloom: We work with the radiology industry in North America on challenges ranging from diagnosing COVID from chest X-rays, to diagnosing lung cancer from CT scans, to a large range of other medical challenges.

Raising awareness on public good type challenges is definitely an area that Kaggle has done good work on in the past. And I expect us to continue.

Q: What would you recommend to new people starting out in the machine learning space or the computer vision space? 

Anthony Goldbloom: My answer is probably somewhat predictable, but get on Kaggle. I'm very much a learning-by-doing type. I think it's important to learn some basic Python as a starting point. Kaggle has some really nice courses where we try to teach you the basics of Python and the basics of supervised machine learning. They're not supposed to be a really rigorous grounding in any of these topics, but they're supposed to teach you just enough that you can start rolling up your sleeves and playing by yourself.

And challenges are a really good way to learn. You probably don’t want to spend more than half an hour or an hour a day on a challenge. And maybe one idea didn’t work, but then something you think of later makes an improvement. It’s a very nice way in my view to learn.

Listen to the entire interview on the Unboxing AI podcast.

Privacy Requirements Must Keep Up as Data Builds

In today’s world, upholding the right to privacy is challenging, to say the least. Data protection and privacy legislation exists in 70% of countries around the world. The need for data that feeds the voracious appetite of machine learning algorithms has made that data an indispensable part of doing business in the modern world. 

There have been many high-profile data breaches, including the watershed 2015 Cambridge Analytica scandal, in which users' data was collected without consent. Between 2017 and 2019, there were some 4,395 breaches, which exposed over 830 trillion records. 

One way that countries have tried to ensure privacy is to limit the collection and processing of personal data, even as the need for this data only grows. The US Congress passed privacy legislation as early as 1971 protecting "personally identifiable information," or PII. That legislation led to the creation of HIPAA for health data and the FCRA for finance. Europe famously passed the GDPR, which contains stricter regulations and requires obtaining explicit consent before collecting personal data. Images are covered by these laws, especially those that contain bio-identifiers.

But data keeps accumulating, and privacy requirements have a hard time keeping up. Companies have resorted to anonymization or k-anonymization, which removes all PII from a dataset. K-anonymity guarantees that each person in a dataset is indistinguishable from at least k-1 other people in the dataset. There are flaws with these anonymity approaches. One flaw is that anonymized datasets can be re-identified, meaning that one can work backwards with publicly available data to re-identify the actual person in the dataset. There have also been re-identification incidents, including the Human Genome Project, where re-identification was used to identify thousands of individuals in 2004, 2008, and 2013. 
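
For illustration, here is a small sketch (with pandas and made-up column names) of checking whether a table satisfies k-anonymity over a chosen set of quasi-identifiers:

    import pandas as pd

    def is_k_anonymous(df, quasi_identifiers, k):
        """Every combination of quasi-identifier values must appear at least k times."""
        group_sizes = df.groupby(quasi_identifiers).size()
        return bool((group_sizes >= k).all())

    # Hypothetical example: ZIP code, age bracket, and gender as quasi-identifiers
    records = pd.DataFrame({
        "zip": ["90210", "90210", "90210", "10001", "10001"],
        "age_bracket": ["30-39", "30-39", "30-39", "40-49", "40-49"],
        "gender": ["F", "F", "F", "M", "M"],
    })
    print(is_k_anonymous(records, ["zip", "age_bracket", "gender"], k=2))  # True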

Read Our New eBook: Solving Privacy with Synthetic Data 

Another problem is that the utility of a dataset drops the more its information is redacted. Distinct characteristics are lost, which raises the question: how can we share useful information while preserving privacy? 

The answer, in short, is synthetic data. Synthetic data holds the promise of being close enough to the original data to be useful, yet different enough from unique individuals to preserve privacy. Synthetic data can be created in multiple ways, from comprehensive 3D modeling to frameworks such as Generative Adversarial Networks (GANs). GANs have become more sophisticated, with more control over image synthesis and flexibility of generation. If you are creating a face, you can control the identity of the person, the lighting, and the background, for example. 

Synthetic data can alleviate privacy concerns because one can verify that generated data points are sufficiently different from original, real world data points. Because the synthetic data is not directly linked to any individual, the use of synthetic data points minimizes the risk of leaking PII.

A growing body of research attests to the effectiveness of synthetic data in addressing privacy concerns. For instance, when GANs are used to generate patient records, there are limited privacy risks in terms of both identity and attributes. This is especially true when noise is added to the training data points. 

Differential privacy – a rigorous mathematical definition of privacy – has changed what it means to be private. GANs that generate synthetic images can be differentially private. Research has demonstrated that one can guarantee that differentially private synthetic images do not contain any sensitive personal information. 

Synthetic data has benefits beyond privacy. With synthetic data there is no need to manually label data, which is slow and painful; large, labeled datasets can be generated with minimal lead time. 

Real data can be inaccurate, incomplete, and biased. The greater control one has over synthetic data generation leads to higher-quality data all around. Real data often lacks diversity and doesn't account for edge cases, or unusual cases, whereas synthetic data can fill such gaps in real-world datasets.

Synthetic data is the best bet for solving privacy concerns when it comes to data and PII. This, along with its other benefits is one of the reasons that synthetic data is pervasive in industries where privacy is a priority. 

Read the eBook: Solving Privacy with Synthetic Data