Types of Training Data Bias You May Not Know About

“AI algorithms are only as good as the data used to train them.” This cliche is bandied about in almost every conversation involving AI. But, like many cliches, its overuse doesn’t diminish its fundamental truth. Today, the vast majority of AI applications are built on deep learning algorithms and the patterns those algorithms find in training data. In many cases, these networks produce biased results, which can stem from three main issues:

  1. Framing the problem: The first thing computer scientists do when they create a deep-learning model is decide what they want it to achieve. Given an unclear directive, the algorithm may arrive at an unintended solution. 
  2. Collecting the data: Either the data we collect does not represent reality, or it reflects existing prejudices. 
  3. Preparing the data: Bias can be introduced during data preparation, the stage at which we select the attributes we want our algorithm to take into account. 

Dealing with these biases requires great awareness and strong action to counterbalance them. Some of the biggest powerhouses in tech have struggled to detect and correct such biases. Biases introduced while collecting and preparing training data are especially hard to notice, offset, and correct. Racial and gender biases in the field of computer vision, in particular, have been well documented. Today, I’d like to address some lesser-known, but equally important, biases. 

1. In a study of two popular public image datasets, researchers found that the U.S. and Great Britain accounted for nearly 45% of images. By contrast, China and India together contribute just 3% of ImageNet data, even though these countries represent 36% of the world’s population. This geographic misrepresentation can hamper our ability to train algorithms in many fields.

  • Clothing: If our machines are trained largely on Western images of clothing, their ability to recognize local clothing from areas underrepresented in the training data will suffer. For example, algorithms were much less likely to recognize a bride and groom from Ethiopia or India than from the U.S. or Australia.
  • Home Layout: If we want to train home robotics to be able to see and navigate their way around homes, we need to expose them to different images. A home in the U.S or Britain doesn’t resemble a home in other regions of the world. Imagine training a robot to work and navigate its way around a London apartment, and then to be of use in a Mongolian Yurt.
  • Transportation: In countries where motor vehicles are less prevalent, and other methods of transportation are commonly used, the ability of an algorithm to correctly recognize a horse-drawn carriage, for instance, is suspect.
  • BMI: Average BMI and the percentage of overweight people vary widely across countries. In a groundbreaking paper on using facial images to infer BMI, researchers tried to check their algorithm for bias but, by their own admission, didn’t have enough data to conclusively rule it out. If our datasets inaccurately represent real-life BMI distributions, our machines will have a harder time recognizing and accurately classifying outliers.
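A quick way to surface this kind of geographic skew is to compare each region's share of the dataset against its share of the world's population. Here is a minimal sketch using the dataset figures cited above; the population shares are my own rough estimates, included for illustration only:

```python
# Sketch: flagging geographic misrepresentation in an image dataset.
# Dataset shares are the figures cited in the text; the population
# shares are rough estimates for illustration only.

def representation_ratio(dataset_share, population_share):
    """Ratio of a region's share of the dataset to its share of the
    world population; 1.0 means proportional representation."""
    return dataset_share / population_share

# region -> (share of dataset images, approximate share of world population)
regions = {
    "US+GB": (0.45, 0.05),
    "China+India": (0.03, 0.36),
}

for region, (data_share, pop_share) in regions.items():
    ratio = representation_ratio(data_share, pop_share)
    status = "over-represented" if ratio > 1 else "under-represented"
    print(f"{region}: ratio {ratio:.2f} ({status})")
```

A ratio far above 1 signals over-representation; far below 1 signals the opposite.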

Read our Survey on Synthetic Data: The Key to Production-Ready AI in 2022

2. Another major misrepresentation in training data concerns weather. Many datasets skew toward fair weather, excluding rainy and overcast days. This lack of variety can impede building sufficiently robust models. An obvious real-life example is an autonomous car’s ability, or inability, to drive in adverse weather conditions.
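One way to catch this skew before training is a simple audit of the dataset's weather tags. The sketch below is hypothetical: the tag counts and the 10% floor are illustrative, not drawn from any real dataset:

```python
# Sketch: checking a driving dataset for fair-weather skew.
# Tag counts and the 10% floor below are hypothetical.

from collections import Counter

def underrepresented(tags, floor=0.10):
    """Return conditions whose share of the dataset falls below `floor`."""
    counts = Counter(tags)
    total = len(tags)
    return sorted(c for c, n in counts.items() if n / total < floor)

weather_tags = ["clear"] * 900 + ["overcast"] * 70 + ["rain"] * 30
print(underrepresented(weather_tags))  # ['overcast', 'rain']
```

Conditions flagged this way are candidates for targeted collection or synthetic augmentation.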

3. A little-known type of bias is object-based bias.

  • Duplicate Annotations: Double-labeling of images in datasets can lead to over-counting, causing algorithms to misjudge how prevalent certain objects are. For example, the same image may be annotated both as “mug” and as “cup”. This duplicate labeling skews the statistical distribution of a dataset and is especially problematic in free-form annotation datasets. 
  • Object size: Object size plays a key role in object recognition accuracy, and researchers have found that certain object classes tend to fill a disproportionately large fraction of the images they appear in. Airplanes, for instance, are overrepresented as very large in images, and there are few images of smaller, airborne airplanes.
  • Object co-occurrence: Object co-occurrence is a known contextual visual cue exploited by object detection models, but we can find surprising co-occurrence statistics. For example, we find that in COCO, a person appears in 43% of images containing the food category; however, a person appears in a smaller percentage of images containing broccoli (15%), carrot (21%), and orange (29%), and conversely a greater percentage of images containing cake (55%), donut (55%), and hot dog (56%). 
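Statistics like these can be reproduced with a few lines over any COCO-style label set. The sketch below uses a tiny, hypothetical annotation table; the function, not the numbers, is the point:

```python
# Sketch: measuring object co-occurrence in COCO-style labels.
# The annotation table below is hypothetical and deliberately tiny;
# a real audit would iterate over the full annotation file.

def cooccurrence_rate(annotations, category, other="person"):
    """Fraction of images containing `category` that also contain `other`."""
    with_cat = [labels for labels in annotations.values() if category in labels]
    if not with_cat:
        return 0.0
    return sum(other in labels for labels in with_cat) / len(with_cat)

# image id -> set of category labels present in that image (hypothetical)
annotations = {
    1: {"broccoli", "bowl"},
    2: {"broccoli", "person"},
    3: {"hot dog", "person"},
    4: {"hot dog", "person", "table"},
    5: {"carrot"},
}

print(cooccurrence_rate(annotations, "broccoli"))  # 0.5
print(cooccurrence_rate(annotations, "hot dog"))   # 1.0
```

Comparing these rates across categories exposes exactly the kind of contextual cues, like the person/hot dog pairing above, that a detector may silently come to rely on.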

Daniel Kahneman and Amos Tversky are famous for exposing the myth that humans operate and make decisions based purely on reason. Over their careers, they exposed dozens of cognitive biases in the way we process information and make decisions. 

So too, the importance of bias in artificial intelligence must be understood and respected. The harmful effects of biases in our AI can be felt today. Whether a wrongful arrest based on AI or bad PR as a result of racially tinged algorithms, these biases are affecting our society. And, more fundamentally, biased algorithms simply don’t work. They cannot deal with the endless variants that the real world conjures up. 

The solutions to these problems need to start with diverse geographic representation that accurately reflects the global population, analysis of existing datasets for problematic biases, and collection of data with enough edge cases. 


Corona is Making Training Data-Gathering Harder: Key Industries and Challenges

Covid-19 has affected every corner of the globe – taking lives, destroying livelihoods, and shaping our world for the foreseeable future. Passenger traffic at airports has plummeted and, with 81% of the global workforce living in countries with mandatory or recommended workplace closures, we have all seen the striking images of cities around the world empty and deserted. This seismic shift in human activity is making the acquisition of manual training data slower and harder, while also less representative than ever before. In this post, we’ll explore a few of the industries affected by these changes.

Autonomous vehicles

Autonomous vehicle companies rely on large teams of drivers and technicians to test vehicles and collect training data. First, stay-at-home orders and the related logistical challenges of these operations have made staffing difficult, forcing companies to lay off significant portions of their teams. They are simply unable to get their teams on the road, collecting data. Second, empty streets, with fewer cars and pedestrians, make driving data less representative and more biased. It has become nearly impossible for these companies to capture data of “normal” (aka pre-Corona) traffic volumes. Some of these companies have refocused on detailed data labeling, 3D mapping, and identification of overlooked scenarios from previous road sessions that can be used to train their systems. Out of necessity, they have shifted to more innovative approaches and long-term experimentation.


Drones

While car companies are struggling on the ground, drone companies are struggling in the air. Smart traffic analysis based on drone footage is becoming a reality in many cities and states. But in today’s world, with unprecedented drops in traffic volume, collecting training datasets with enough variance is a nearly impossible task. Drone applications in other industrial settings – shipping, warehousing, airports, etc. – are also hindered by broad economic slowdowns. When activity in these outdoor environments grinds to a halt, companies will have a hard time collecting unbiased data. With economists predicting a slow economic recovery, these challenges are likely to persist.

Human-Environment and Human-Human Interactions

The new reality of lockdowns has made life hard for companies that generate datasets by “Crowd Acting”. With this method, teams generate data by instructing workers, customers, models, or volunteers to record short video clips based on defined descriptions. But asking them to do crowd acting outside their homes has become untenable, and downright dangerous in some cases. Additionally, crowd acting of face-to-face interactions may be impossible while maintaining compliance with social distancing guidelines. 

Smart Stores

In February, Amazon opened its first Amazon Go supermarket in Seattle, using the retailer’s cashierless “Just Walk Out” technology that previously powered 25 Amazon Go convenience stores in a handful of major U.S. metros. Expanding this type of technology requires sophisticated computer vision systems that can interpret complex human-object interactions; grocery shoppers may examine produce carefully only to put it back, or discard items on shelves and in aisles where they don’t belong. For the technology to work optimally, huge amounts of training data with myriad variants and edge cases are needed. But with online grocery shopping skyrocketing, those variants and edge cases will be harder to capture.



VR/AR

VR/AR is an exciting field of development, but one of the most important stages in the development of new hardware is user trials. In a COVID-19 world, user trials suddenly become complicated affairs. Dozens of people donning the same headset in succession has become truly unimaginable, and recruiting and scanning hundreds of people on premises is out of the question. 

Age Bias

Covid-19 has disproportionately affected certain segments of society. Across the world, the elderly have been significantly more at risk and in some countries have been ordered to minimize leaving their homes. Even in countries without government orders limiting seniors’ movement, many public health organizations have recommended that they stay home if possible. This raises the risk that datasets collected manually during the Corona period will have significant age biases.


Masks

Masks will be with us for a while, but many facial recognition computer vision systems have not been trained to recognize faces with masks. So, new training data must be collected, a process that is difficult for the reasons mentioned above. But what happens when mask use decreases? Systems trained on data collected during Corona, which is likely to show a huge number of people wearing masks, may not operate well on mask-less faces.


In our follow-on post about the impacts of Corona on data-collection, we’ll dig deeper into the field of robotics and how some teams are combating the challenges we’ve discussed here.

A Path Forward

At its core, Corona is creating two intertwined challenges. First, it is making it hard to collect new data. Second, it is increasing the likelihood that any new data will be biased in a number of ways. The current challenge of acquiring data in the wild offers a significant opportunity for simulated and synthetic data to fuel advances in computer vision. 

It allows for generation of new datasets without compromising the health and safety of workers or navigating the logistical hurdles erected by Corona. And, its customizability ensures that datasets will not contain “Corona biases”. As the challenges posed by the COVID-19 era continue to reshape a range of industries, simulated visual data can help accelerate breakthroughs in computer vision and make sure that this virus does not stop training data collection in its tracks. 


Research Roundup: Training with Synthetic Data

At Datagen, we’re actively investing in academic collaborations and benchmarking research that can further establish the effectiveness of training computer vision algorithms with simulated or synthetic data. Fortunately, a large body of work already exists, exploring the advantages and challenges of this approach. We’ve collected some of them here, for anyone interested in learning more about this rapidly-expanding field of computer vision research.


  1. Learning from Simulated and Unsupervised Images Through Adversarial Training (2017)

From researchers at Apple, this paper was the company’s first research paper ever published and was named best paper at CVPR 2017. The research shows how labeled synthetic data, domain-adapted with unlabeled real data, solves eye gaze estimation. One year later, the iPhone X came out with eye gaze estimation embedded. It is highly likely that Apple used some of the synthetic data techniques detailed in the paper to develop this feature.

  2. Sim2real Transfer Learning for 3D Pose Estimation (2019)

DeepMind and Oxford University researchers detail how they achieved state-of-the-art results on 3D human mesh estimation by training exclusively with synthetic humans. The approach utilizes pose estimation and optical flow estimation to help bridge the domain gap between real and simulated data.


  3. SIDOD: A Synthetic Image Dataset for 3D Object Pose Recognition with Distractors (2019)

Nvidia shows that using photorealistic synthetic data can lead to strong results in object detection, 3D pose estimation, and tracking of objects in real-world scenes. This is a powerful approach, demonstrated here on a small set of items. Datagen is attacking the next challenge in this area: scaling this approach to thousands of objects while integrating humans and dynamic motion.

  4. 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation (2019)

Here, Nvidia presents a purely simulation-based approach to generating realistic grasping positions. The grasp distribution is diverse enough to find successful grasps even after discarding those that violate robot kinematics and collision constraints. The method’s diverse grasp sampling leads to higher success rates than a state-of-the-art 6-DOF grasp planner. Nvidia has published more about its research efforts in this field.

  5. 3D Face Reconstruction by Learning from Synthetic Data (2016)

A Technion research group, based in Israel, lays out how they achieved impressive results in reconstructing 3D face meshes from single face images by using synthetic data.


  6. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization (2018)

NVIDIA is clearly a research leader in this field. Here, a team of researchers trains an object detection network with synthetic data and shows that synthetic data does not necessarily need to be as photorealistic as real data to work as effectively, when using a technique called Domain Randomization. Domain Randomization trains the network on images from many kinds of simulated environments at the same time. While this approach requires a large amount of data (in this case, 100,000 data points), it can be valuable when the domain is hard to mimic. We recommend this paper for its insights into the potential of the Domain Randomization methodology.
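In spirit, the data-generation loop behind Domain Randomization is simple: sample a fresh random scene configuration for every training image rather than perfecting one photorealistic scene. A minimal sketch, with hypothetical parameter names and ranges:

```python
# Sketch of the domain-randomization idea: draw random scene
# parameters per image so the network learns features that survive
# the variation. Parameter names and ranges here are hypothetical.

import random

def sample_scene_params(rng):
    """Draw one random simulation configuration."""
    return {
        "texture_id": rng.randrange(1000),          # random object texture
        "light_intensity": rng.uniform(0.2, 2.0),   # random lighting
        "camera_azimuth_deg": rng.uniform(0, 360),  # random viewpoint
        "num_distractors": rng.randint(0, 10),      # random clutter
        "background_id": rng.randrange(500),        # random backdrop
    }

rng = random.Random(0)  # seeded for reproducibility
dataset_configs = [sample_scene_params(rng) for _ in range(100_000)]
print(len(dataset_configs))  # 100000
```

Each configuration would drive one render in the simulator; the breadth of the sampled distribution, not the realism of any single image, is what closes the reality gap.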

  7. GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB (2017)

In this paper, a research team from the Max Planck Institute for Informatics achieved impressive results in hand pose estimation by training a network on synthetic hand data. The synthetic hands are infused with textures from real-world hands via a style-transfer algorithm that uses a GAN-based architecture. The implications are particularly interesting for the VR/AR industry, where hand pose estimation may be able to eliminate the need for hand-held controllers. For more on state-of-the-art hand pose estimation data, please check out Datagen’s hands generation capabilities.

  8. Learning to Drive from Simulation without Real World Labels (2018)

Cambridge University researchers, working with a corporate team, teach a car to drive in a cartoon-like simulator. The novel idea was to teach the car to transcribe real-world data into its simulation-based understanding (real2sim) instead of attempting the reverse (sim2real).

We will continue to update and add to this post as new research becomes available. If there are relevant publications that you’d like to share with us, send them our way.


Simulated Data is Synthetic Data 2.0

The promise of synthetic data has been clear since its inception. By creating a fully “synthetic” set of data that mimics statistical patterns existing in “real” datasets, synthetic data could enable analysis and data science without compromising privacy. It also promised to be highly scalable and manipulable, faster to generate, and able to fill gaps where real data would be impossible or expensive to collect.

Since then, synthetic data has shown promise across a range of verticals, from medical research, where patient privacy is paramount, to fraud detection, where synthetic datasets can be used to test and increase the robustness of security solutions. More recently, synthetic data has been applied, with some success, to machine learning as a source of training data for algorithmic systems.


But, widespread (and healthy) skepticism remains. Can synthetic data accurately reflect the real world? Can it capture nuanced statistical realities, especially when there are very few real dataset reference points? As the applications have become broader and more sophisticated, these questions are even more relevant.

If a synthetic alternative to manual collection and annotation of data can address these concerns and provide equal (or better!) data, it has the potential to materially change the process of developing algorithms. Saved time and increased effectiveness could bring the promise of computer vision to the market even sooner. This is especially true in a computer vision context – which relies on image or video data that is generally harder, slower, and more expensive to collect. 


Datagen is among a number of teams, in a wide range of verticals, racing to apply synthetic data techniques to visual data simulation, to power the world of computer vision. To be sure, Datagen does create a type of visual synthetic data. But, because we take a fundamentally different approach to solving the challenges and proving the promise of synthetic data, we’ve decided to call it something different. At Datagen, we create Simulated Data.

Synthetic data is usually defined by what it is not; at the most basic level, it is data that isn’t manually gathered. There is a range of methods for creating it, from sophisticated Generative Adversarial Networks (GANs) to more basic strategies such as pasting a subject onto different backgrounds to create variation. Most of these approaches have serious limitations. Because they are fundamentally static, these datasets have to be constantly adjusted and regenerated to reflect new parameters, distributions, or training goals. They are cumbersome to work with, especially when the data is visual. They are generally designed to accomplish a single task or capture a single scenario at a time. They don’t enable machines to learn like humans do – responsively, in real time. As our neural networks become more and more robust, they will be able to learn more intuitively. But the data needs to keep up.
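The basic cut-and-paste strategy mentioned above can be sketched in a few dependency-free lines. Images here are plain 2D lists of pixel values, and all sizes are arbitrary; a real pipeline would use an image library and alpha masks:

```python
# Sketch of cut-and-paste augmentation: a labeled foreground patch is
# composited onto a background at random positions, turning one
# annotation into many images. Sizes and values are arbitrary.

import random

def paste(background, patch, top, left):
    """Return a copy of `background` with `patch` pasted at (top, left)."""
    out = [row[:] for row in background]
    for i, row in enumerate(patch):
        for j, pixel in enumerate(row):
            out[top + i][left + j] = pixel
    return out

rng = random.Random(42)
background = [[0] * 8 for _ in range(8)]   # blank 8x8 background
patch = [[255, 255], [255, 255]]           # 2x2 foreground "object"

variants = []
for _ in range(5):
    top = rng.randrange(0, 8 - 2 + 1)      # keep the patch inside the frame
    left = rng.randrange(0, 8 - 2 + 1)
    variants.append(paste(background, patch, top, left))

print(len(variants))  # 5
```

The limitation is visible even in this toy: every variant contains the identical patch, so the "variation" is shallow, which is exactly the static-dataset problem described above.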

Simulated Data brings synthetic data to life. It is synthetic data captured by virtual “cameras” operating in photorealistic, physics-based simulations. These simulations have spatial dimensions and a full range of annotations (physics, lighting, etc.). Simulated Data is also based on real 3D data. Datagen is building one of the largest libraries of 3D assets and human-centered scans in the world, ensuring that our simulations have enormous variation without deviating from realism. Simulated Data merges this hyper-realism with a unique level of flexibility and customizability. By placing these realistic assets into a 3D environment, we can “virtually” photograph a room at different angles, with different lenses, at different times of day. Simulations allow us to swap in new furniture, change layouts, or adjust lighting. We can change a model’s skin tone or age, take photos in infrared or with depth maps, and simulate edge cases that are nearly impossible to capture manually. And, because all of this variation is built on a foundation of high-quality, scanned 3D data, it retains its realism and plausibility. This enables us to avoid the cost of manual gathering at scale, the privacy concerns associated with human-centered data, and the bias inherent in manually-gathered datasets. Thanks to advances in algorithms and computer graphics, our Simulated Data offers a realistic, fully customizable 3D sandbox that exposes a system to a dynamic environment of people, objects, and space.
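To make that flexibility concrete, here is a toy enumeration of how a single scanned scene, crossed with a handful of controllable parameters, multiplies into many distinct simulated captures. All parameter names and values are hypothetical:

```python
# Toy illustration of the "3D sandbox" idea: one base scene crossed
# with controllable parameters yields many distinct captures.
# Parameter names and values are hypothetical.

from itertools import product

camera_angles = ["front", "side", "overhead"]
lighting = ["morning", "noon", "night"]
sensors = ["rgb", "infrared", "depth"]
layouts = ["layout_a", "layout_b"]

captures = [
    {"angle": a, "lighting": l, "sensor": s, "layout": lay}
    for a, l, s, lay in product(camera_angles, lighting, sensors, layouts)
]
print(len(captures))  # 3 * 3 * 3 * 2 = 54
```

Every added parameter multiplies the size of the capture space, which is why a simulation can cover variation that manual photography cannot.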

Obviously, for some applications, traditional synthetic data will be enough. But, the way we see it, synthetic data is just a building block, a stepping stone to the next generation of computer vision training data. That is why we use the term Simulated Data instead of synthetic data. It represents the ultimate goal of fully simulating the world around us in hyper-realistic, synthetic detail. As we continue to build this next-generation technology, we look forward to sharing Simulated Data’s full potential.
