Types of Training Data Bias You May Not Know About

“AI algorithms are only as good as the data used to train them.” This cliché is bandied about in almost every conversation involving AI. But, like many clichés, its overuse doesn’t diminish its fundamental truth. Today, the vast majority of AI applications are built on deep learning: algorithms that find patterns in training data. In many cases, these models produce biased results, which can stem from three main issues:

  1. Framing the problem: The first thing computer scientists do when they create a deep-learning model is to decide what they want it to achieve. Given an unclear directive, the algorithm might come up with an unintended solution.
  2. Collecting the data: Either the data we collect does not represent reality or it reflects existing prejudices.
  3. Preparing the data: It is possible to introduce bias during the data preparation stage, which consists of selecting the attributes we want our algorithm to take into account.

Dealing with these biases requires a great deal of awareness and strong action to counterbalance them. Some of the biggest powerhouses in Tech have suffered from an inability to deal with and correct such biases. In particular, biases introduced during the collection and preparation of training data are hard to notice, offset, and correct. Racial and gender biases in the field of Computer Vision, specifically, have been well documented. Today, I’d like to address some lesser-known, but equally important, biases.

1. In a study of two popular public image datasets, researchers found that the U.S. and Great Britain accounted for nearly 45% of the images. By contrast, China and India together contributed just 3% of ImageNet data, even though these countries represent 36% of the world’s population. This geographic misrepresentation can affect the ability to train our algorithms in many fields; a simple way to check for it is sketched after the examples below.

  • Clothing: If our machines are trained largely on Western images of clothing, their ability to recognize local clothing from areas underrepresented in the training data will suffer. For example, algorithms were much less likely to recognize a bride and groom from Ethiopia or India than one from the U.S. or Australia.
  • Home Layout: If we want home robots to be able to see and navigate their way around homes, we need to expose them to images of many different kinds of homes. A home in the U.S. or Britain doesn’t resemble a home in other regions of the world. Imagine training a robot to work and navigate its way around a London apartment, and then expecting it to be of use in a Mongolian yurt.
  • Transportation: In countries where motor vehicles are less prevalent, and other methods of transportation are commonly used, the ability of an algorithm to correctly recognize a horse-drawn carriage, for instance, is suspect.
  • BMI: There are large discrepancies in average BMI by country and in the percentage of overweight people across populations. In a groundbreaking paper on using facial images to infer BMI, researchers tried to check for biases in their algorithm but, by their own admission, didn’t have enough data to conclusively prove that there is no bias. If our datasets inaccurately represent real-life BMI distributions, our machines will have a harder time recognizing and accurately classifying outliers.
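
A first step toward surfacing this kind of skew is simply to measure where a dataset’s images come from. Here is a minimal sketch in Python, assuming a hypothetical metadata.csv with one row per image and a "country" column; most public datasets don’t ship this field, so in practice it would have to be collected or inferred separately.

```python
# Minimal sketch: auditing the geographic distribution of an image dataset.
# Assumes a hypothetical metadata.csv with one row per image and a "country"
# column; real datasets rarely include this field out of the box.
from collections import Counter
import csv

def country_distribution(metadata_path):
    counts = Counter()
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["country"]] += 1
    total = sum(counts.values())
    # Share of the dataset contributed by each country, largest first.
    return {country: n / total for country, n in counts.most_common()}

if __name__ == "__main__":
    shares = country_distribution("metadata.csv")
    for country, share in list(shares.items())[:10]:
        print(f"{country}: {share:.1%}")
```

Comparing these shares against population figures, or against the regions where a model will actually be deployed, makes the kind of imbalance described above immediately visible.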

2. Another major misrepresentation in training data involves weather. Many datasets skew towards fair weather, excluding rainy and overcast days. This lack of variety can impede the creation of sufficiently robust models. An obvious example of the real-life impact is the ability, or inability, of an autonomous car to drive in adverse weather conditions.

3. A little-known type of bias is object-based bias.

  • Duplicate Annotations: Double-labeling of images in datasets can lead to over-counting, so the prevalence of certain objects can be misjudged by algorithms. For example, the same image may be annotated both as “mug” and “cup.” This duplicate labeling can skew the statistical distribution of a dataset and is especially problematic in free-form annotation datasets.
  • Object Size: Object size plays a key role in object-recognition accuracy, and researchers have found categories that consistently take up a disproportionately large fraction of the image. Airplanes, for instance, are overrepresented as very large in images, and there are few images of smaller, airborne airplanes.
  • Object Co-occurrence: Object co-occurrence is a known contextual visual cue exploited by object detection models, but we can find surprising co-occurrence statistics. For example, we find that in COCO, a person appears in 43% of images containing the food category; however, a person appears in a smaller percentage of images containing broccoli (15%), carrot (21%), and orange (29%), and conversely a greater percentage of images containing cake (55%), donut (55%), and hot dog (56%).
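
Statistics like these are straightforward to reproduce for COCO-style annotations. Below is a minimal sketch using the standard pycocotools API; the annotation path (annotations/instances_train2017.json) is an assumption and should point at your local copy of the file.

```python
# Minimal sketch: person co-occurrence rates for selected food categories in
# COCO-style annotations, using the standard pycocotools API. The annotation
# path below is an assumption; point it at your local copy.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")
person_id = coco.getCatIds(catNms=["person"])[0]

for cat_name in ["broccoli", "carrot", "orange", "cake", "donut", "hot dog"]:
    cat_id = coco.getCatIds(catNms=[cat_name])[0]
    imgs_with_cat = coco.getImgIds(catIds=[cat_id])
    # getImgIds with multiple category ids returns images containing all of them.
    imgs_with_both = coco.getImgIds(catIds=[cat_id, person_id])
    share = len(imgs_with_both) / len(imgs_with_cat)
    print(f"person appears in {share:.0%} of images containing '{cat_name}'")
```

Running this kind of audit before training makes it easier to judge whether the contextual cues a model is likely to latch onto reflect the deployment environment or merely the quirks of the dataset.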

Daniel Kahneman and Amos Tversky are famous for debunking the myth that humans operate and make decisions based purely on reason. Over their careers, they exposed dozens of cognitive biases in the way we process information and make decisions.

So too, the importance of bias in Artificial Intelligence must be understood and respected. The harmful effects of biases in our AI can be felt today, whether in a wrongful arrest based on AI or in bad PR resulting from racially tinged algorithms. And, more fundamentally, biased algorithms simply don’t work: they cannot deal with the endless variants that the real world conjures up.

The solutions to these problems need to start with diverse geographic representation that accurately reflects the global population, with understanding and analyzing existing datasets for problematic biases, and with collecting data that includes enough edge cases.

Read our Survey on Synthetic Data: The Key to Production-Ready AI in 2022