As Computer Vision Explodes, Data Collection Needs To Change, Too.

Nowadays, there’s a lot of talk about computer vision and its potential impact on a wide range of fields and applications. And it’s not just talk; spending on computer vision R&D is growing rapidly, with an estimated CAGR of 31.6%. The ability of cameras and computers to not just see but understand the world around them can transform fields from IoT to smart cars and smart cities to manufacturing.

Imagine smart stores that understand what you’re grabbing from the shelf and calculate your purchases in the background. VR and AR technologies without clumsy hand-held controllers, responding more intuitively to your body. Systems that understand emotions and facial expressions, safety mechanisms that stop a car if there’s someone crossing the street, smart security systems that can understand when something’s amiss. The applications are unlimited and will change the way we live. 

But, despite enormous advances in computer vision research and GPU power, transformative technology seems to be arriving slowly and with plenty of serious challenges.

When we started Datagen, we began by asking, “Why? What is holding back progress in this field?” After conversations with engineers and researchers working on human-centered smart computer applications, a theme emerged. State-of-the-art algorithms and GPUs are just two of the three ingredients needed to bring this technology to market.

Without the third ingredient – data – these developments are destined to stall. Again and again, we heard that a lack of access to data is slowing the pace of development in computer vision. Teams simply don’t have enough high-quality data to train their growing neural networks.

Manual Data

Computer vision teams are spending too much time, money, and energy obtaining annotated datasets that, despite the effort required, are still deeply flawed. We call these datasets Manual Data; data points are manually captured from the real world and then, for the most part, manually annotated. Here are some of Manual Data’s main problems:

Read our Survey on Synthetic Data: The Key to Production-Ready AI in 2022

Manual Data is slow 

When you’re reliant on event frequency in the real world, or have to collect existing data from disparate sources with complicated privacy or access considerations, it can take a long time and lots of resources to build a sufficiently large and representative dataset. It can take months just to collect the data, and even longer to annotate it correctly. For highly complex use cases and very specific training needs, the process can be so slow as to be infeasible. The alternative – manufacturing scenarios through play-acting – presents a new level of logistical complication that many computer vision teams are ill-equipped to handle. And, as needs change and networks need to be retrained, this slow process must be repeated.

Manual Data is biased

High-level bias – over- or under-representation of key situations, items, or demographic groups – can limit the effectiveness of systems and is extremely hard to control when you are subject to resource and access limitations. Often you don’t know about these biases until you’ve collected your data or trained your system. But computer vision systems that lose accuracy in key population segments, environmental conditions, or situations cannot be put into production.
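One way to surface over- or under-representation early is to compare the share of each group in a dataset against the share you want. The sketch below assumes each sample carries a single tag (here, hypothetical lighting-condition tags); the function name, the target shares, and the tolerance are illustrative choices, not part of any particular tool.

```python
from collections import Counter


def representation_gaps(labels, target_shares, tolerance=0.05):
    """Return {group: (actual_share, target_share)} for groups whose
    share of the dataset differs from the target by more than tolerance."""
    counts = Counter(labels)
    total = len(labels)
    gaps = {}
    for group, target in target_shares.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if abs(actual - target) > tolerance:
            gaps[group] = (actual, target)
    return gaps


# Hypothetical lighting-condition tags for a 100-image dataset.
labels = ["day"] * 77 + ["dusk"] * 18 + ["night"] * 5
targets = {"day": 0.5, "dusk": 0.2, "night": 0.3}

gaps = representation_gaps(labels, targets)
for group, (actual, target) in gaps.items():
    print(f"{group}: {actual:.0%} of data, target {target:.0%}")
```

A check like this only reports on attributes that were tagged in the first place, which is part of why bias in manually collected data is hard to control.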

Manual Data cannot cover all existing variants and edge cases 

To train algorithms correctly, we need to capture rare and potentially dangerous occurrences. For instance, when training autonomous vehicles, we need training data that includes near accidents and driving in dangerous conditions. But these incidents do not happen frequently enough to supply enough data, nor are they cases that we can pay actors to recreate.

Manual Data is limited

Because annotations have to be added after data collection, and the process is often done manually at scale, the results are often inconsistent, slow to produce, and short of 100% ground truth. This is especially true for annotation of motion/video data, where annotations must be added frame by frame. There are certain data layers – for instance, detailed depth maps – that are impossible to add by hand.

Furthermore, obstructions, complex poses, and camera blocking may make accurate annotation impossible. The need for complexity within computer vision training data has grown sharply, but human annotators are not developing superpowers in tandem. As object detection has given way to semantic segmentation, instance segmentation, and panoptic segmentation, successive levels of annotation become more complex and time-consuming. New technologies are needed to keep up.

Datagen was founded to provide an alternative to the status quo. Let us explain.

Datagen creates high-quality, human-focused Simulated Data to fuel smart computer vision learning algorithms and unblock data bottlenecks. Our data is photorealistic, scalable, high-variance, unbiased, and has all kinds of superpowers. In other words, it’s a better, smarter alternative to Manual Data. When done well, Simulated Data can provide that third, missing ingredient to the computer vision puzzle and unleash the potential of this fast-growing field.

The promise of synthetic data has been clear since its inception for many of the reasons why Manual Data is flawed:

  • Speed: Simulated Data can be generated on demand with a full range of customization capabilities, eliminating the dependence on real-world event frequency. This allows for iterations to happen faster.
  • Bias: Simulated Data can provide full transparency into the contents of training data, meeting the exact distribution needs required to train an accurate network.
  • Edge cases: Whether simulating dangerous situations or hardware that’s not yet in production, Simulated Data can supplement and strengthen datasets by providing more variance.
  • Data richness: For the most advanced applications of computer vision, data needs to be annotated pixel-accurately. This cannot be done by humans. For visual recognition, manual annotation like class labels and bounding boxes can suffice. For dense tasks such as depth estimation or optical flow, manual annotation isn’t good enough. Using synthetic data, we have perfect knowledge and control of the data we create and therefore perfect ground truth annotations along with the images. And, since they’re built into the simulations from the beginning, they’re consistent throughout.
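To illustrate why simulated scenes come with exact annotations, here is a toy sketch in which the scene is just a list of flat, axis-aligned boxes at known depths. Because the geometry is known up front, a per-pixel depth map and instance mask fall out of the same loop that decides visibility. The scene format and function name are assumptions made for this example; a real pipeline would use a full renderer.

```python
def render_ground_truth(width, height, boxes, background_depth=float("inf")):
    """Rasterize axis-aligned boxes with a z-buffer.

    Each box is (instance_id, x0, y0, x1, y1, depth); x1 and y1 are exclusive.
    Returns (depth_map, instance_mask) as lists of rows.
    """
    depth = [[background_depth] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]  # 0 means background
    for instance_id, x0, y0, x1, y1, z in boxes:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if z < depth[y][x]:  # the nearer surface wins the pixel
                    depth[y][x] = z
                    mask[y][x] = instance_id
    return depth, mask


# Hypothetical scene: a far box partly hidden by a nearer one.
scene = [
    (1, 1, 1, 5, 4, 3.0),  # instance 1, far
    (2, 3, 2, 7, 5, 1.5),  # instance 2, near, overlaps instance 1
]
depth_map, instance_mask = render_ground_truth(8, 6, scene)
for row in instance_mask:
    print("".join(str(v) for v in row))
```

Occlusion is resolved exactly, per pixel, with no human judgment involved, which is the property that manual annotation cannot match for dense tasks.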

An additional note. Increasingly, there are many concerns about privacy in the context of computer vision data. In a post-GDPR world, companies need to be cautious about the data they collect and use. By creating a fully simulated dataset that mimics real-world statistical patterns, synthetic data enables training without compromising privacy.

Already, outside the realm of computer vision, where networks rely on non-visual data, synthetic data has shown promise across a range of industries, from medical research, where patient privacy is paramount, to fraud detection, where synthetic datasets can be used to test and increase the robustness of security solutions.

Its use in a computer vision context is just beginning to catch up. Recently, researchers have been demonstrating the efficacy of synthetic data. In this paper, researchers at Google trained an object detection model on synthetic data (supermarket items) that outperformed one trained on real data. Experiments with the SYNTHIA dataset for autonomous vehicles showed that a combination of manual data and synthetic data performed better than manual data alone. Engineers at Unity have made the case for synthetic data and shown its abilities and advantages. Leading companies in different fields have recognized the promise of synthetic data. Waymo drives 20 million miles a day in its Carcraft simulation platform, the equivalent of 100 years of real-world driving. Nvidia has trained robots to pick up objects using synthetic data. DermGAN, a generator of synthetic skin images with pathologies, has helped diagnose skin diseases.

We, at Datagen, are contributing to this research and validation as well.

Skepticism of synthetic visual data is still prevalent. The question is whether these visual datasets can effectively capture and reflect the physical world with enough accuracy and variance to achieve comparable results. Can simulated data express statistical minutiae, especially with few physical reference points? With computer vision applications becoming more common and more sophisticated, these questions are ever more pressing.

If synthetic data can present an alternative to manual data by addressing these concerns, and provide datasets that are at least equal (if not superior), it has the potential to dramatically affect the process of developing algorithms. Faster and more efficient algorithms can help deliver the promise of computer vision to our world even sooner. 

That’s why, at Datagen, we are building Simulated Data solutions for computer vision network training.

We’ve explained the advantages and promise of synthetic data and the drawbacks of manual data. As you might have noticed, we refer to our work as Simulated Data. So, let’s conclude with an explanation of Simulated Data and how we distinguish it from synthetic data. 

Synthetic data is usually defined by negation: data that is not manually gathered. There are numerous methods to create it, from advanced Generative Adversarial Networks (GANs) to simpler methods such as copying an image onto different templates to generate variety. These methods are not equal, and some are very limited. The visual datasets generated are huge and cumbersome to work with. Generally, they are designed for a specific task or single scene. They don’t allow our models to learn like we do – responsively and rapidly. Fundamentally, they are static and have to be constantly updated and regenerated to mirror new characteristics, targets, and domains. As neural networks keep improving, our data needs to keep pace.

Simulated data is synthetic data brought to life. Rapid developments in CGI enable more efficient creation of photorealistic imagery. Our simulations leverage the most sophisticated rendering techniques throughout the entire 3D production pipeline. Virtual cameras let us “photograph” physics-based photorealistic simulations, along with their ground truth annotations. We integrate layers of algorithmic work and custom models which allow us to create at scale without sacrificing realism.

Simulated data isn’t detached from reality. On the contrary, it is based on real-world physics and 3D modeling. To this end, Datagen is compiling and creating one of the largest libraries of 3D scans, with a special focus on humans. This helps us make sure that, together with variance and breadth of examples, we don’t lose realism. Simulations let us swap features like the lighting, backdrop, and time of day. We can simulate edge cases that cannot be captured manually. We can change the ethnicity and age of our models. Our images can be displayed as infrared, or with depth maps from different angles or with different capturing lenses. These graphics tools also avoid the pitfalls of manual data: the cost of data collection, privacy issues arising from human-focused data, and the inherent bias of manual datasets. Advances in computer graphics allow our simulated data to create a customizable, realistic 3D environment. This environment has a dynamic mix of people, objects, and space, all based on high-quality, photogrammetric 3D data.

At Datagen, synthetic data is an initial stage towards the next generation of computer vision training datasets. Simulated Data is the future. We intend to continue our work, and truly simulate the real world with photorealistic recreations of the environment around us. For us, synthetic data isn’t enough. We are excited by the potential that Simulated Data has, and look forward to sharing it.

Quality at Scale: How Advances in Computer Graphics Tools are Enabling Data Generation

In recent years, advances in synthetic data generation have promised a solution to the data bottleneck in machine learning network training. Rather than collecting and annotating data by hand, we’re getting better at creating it programmatically. But, while synthetic data has shown huge promise in fields like medical imaging, autonomous vehicles, and fraud detection, it has not yet proved its full value in computer vision contexts that require large datasets of human-focused data. Thanks to advances in computer graphics, 3D modeling, animating, and rendering technologies, this is beginning to change. Finally, we have the tools needed to create high-quality, photorealistic training data at scale.

In this article, we will explore how advances in visual synthetic data generation, especially in the areas of humans and environments, have been unlocked by the ongoing development of incredibly powerful computer graphics and 3D-rendering software. This growth may be able to solve some of synthetic data’s key challenges and make this data broadly available.

Quality vs. Quantity

At its core, generating synthetic computer vision training data has always confronted the trade-off between scale and quality. The training data must be photorealistic enough to effectively mimic the real world. At the same time, we need a large amount of data for supervised learning. Unfortunately, these needs are often in direct conflict with each other.

Historically, photorealism has been achieved with huge amounts of time and labor from 3D artists, painstakingly recreating scenes and manually adding variation to datasets using computer graphics tools.

The process is similar to the creation of high-quality CGI in the film and TV industry. A single frame can have millions of parts and requires a huge level of detail to generate high-quality images. For example, the character Sulley from Monsters, Inc. had over 2 million individually animated hairs on his body, and a single frame involving Sulley took 11-12 hours on average to render. CGI production is also extremely compute-intensive: a single frame can take hours to render and incurs computing costs as well. The average cost per second for 3D feature films can reach $25,000 USD.

In the computer vision context, when you need hundreds, thousands, or millions of data points for training, this process is impossible to repeat at scale. Training data does not necessarily need to be so photorealistic that it can fool the human eye, but it does need to be close enough to achieve sufficient accuracy; researchers have shown that non-realistic lighting, shading, and texturing damage the ability to train algorithms effectively. But, even if you sacrifice some level of detail in the service of scale, the process often remains cost-prohibitive.

This trade-off between the need for scalable data on the one hand, and high-quality photorealistic data on the other hand is exacerbated when we are dealing with sophisticated datasets. Synthetically creating a single human hand, for instance, is entirely different from creating a complex scene with humans, motion, and complex environments. And, even if you are decreasing photorealistic detail, you will still need to manually add enormous variation in race, age, body type, and other features. And we haven’t even talked about things like lighting conditions, skin coloration changes caused by grabbing, or muscle movement.

In short, when you rely on manual, artist-based generation of photorealistic training data, you need time and money. Lots of both.

For the vast majority of teams, this is simply not a feasible way to generate synthetic data. An ideal solution is one that would:

  1. Accelerate or automate some of the artist’s work without significantly sacrificing photorealism, decreasing the cost per image.
  2. Generate new images at scale without requiring an artist to craft each one, decreasing the number of images an artist must create.

Here, we’ll discuss how improvements in computer graphics tools are powering the first part of this solution. In a future post, we’ll talk about the second.

Increasingly Powerful Tools

Early on at Datagen, we saw the ability of emerging advances in 3D rendering technology to solve this quality vs. quantity trade-off. In other words, we identified the power of a solution that combines the work of artists with the scale of automated, algorithmically-governed image generation.

This hybrid approach would not be possible without open-source 3D computer graphics software like Blender, and game engines like Unreal Engine and Unity, which have made some key advances in the entire 3D pipeline – modeling, rigging, animating, simulating, rendering, compositing, and video editing. They also allow teams to integrate layers of automation and customization into the software’s core capabilities. With the proper configuration, this enables teams to automatically change variables at scale, while retaining high levels of photorealism that were previously only achievable with manual, custom work.

Let’s explore some key elements of photorealism that must exist in high-quality synthetic data and how Blender and other tools have made them much more attainable. 


Texturing

For photorealistic 3D data, objects and humans must have both the correct 3D shape and the correct 2D color and textural detail. Put simply, texturing is fitting a 2D image onto a 3D model to give it color and detail. The texture is a pixel image that is applied to the model in the graphics engine to give it color, along with parameters for reflection, refraction, and translucency. The primary aim of texturing is to convey the material from which the real object is made, including physical light properties such as smoothness or roughness. With advances in texturing tools, artists can more easily achieve stunning detail. In a cinematic context, advances in texturing, driven by developments in machine learning, have allowed teams to create increasingly realistic digital humans, with better simulation of pores, new representations of the human eye and its place in the eye socket, and much more. With more powerful computer graphics tools, artists can achieve better photorealism, faster.


Shading

To create a realistic object, shading approximates the behavior of light on the object’s surface. High-quality shading creates depth perception in 3D models by varying the level of darkness and brightness. Advances in shading algorithms have enabled us to create increasingly photorealistic imagery with less manual manipulation. The newest shading models in Unreal Engine have added yet another level of realism and accuracy, specifically improving translucent materials, anisotropic materials, and clear coat materials. The results are truly astounding.
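As a concrete, deliberately simple instance of shading, the sketch below uses the classic Lambertian (diffuse) model, in which brightness depends on the cosine of the angle between the surface normal and the direction to the light. This is a textbook model chosen for illustration, not the Unreal Engine shading models mentioned above; the function and parameter names are assumptions.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def lambert_shade(normal, to_light, albedo=1.0, light_intensity=1.0):
    """Diffuse brightness of a surface point under the Lambertian model.

    normal:   surface normal at the point (any non-zero length)
    to_light: vector from the point toward the light (any non-zero length)
    """
    n = normalize(normal)
    l = normalize(to_light)
    cos_theta = sum(a * b for a, b in zip(n, l))
    # Surfaces facing away from the light receive no direct light.
    return albedo * light_intensity * max(0.0, cos_theta)


facing = lambert_shade((0, 0, 1), (0, 0, 1))      # light directly overhead
grazing = lambert_shade((0, 0, 1), (1, 0, 0.05))  # light near the horizon
print(facing, grazing)
```

Varying brightness with orientation in this way is what gives a flat rendered image its sense of depth.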


Lighting

Computer graphics lighting is the collection of techniques used to simulate light in computer graphics scenes. While lighting techniques offer flexibility in the level of detail and functionality available, they also operate at different levels of computational demand and complexity. At its most basic level, digital lighting can be done by using a point light. A point light emits light from a fixed point in virtual space equally in all directions (this is also called omnidirectional lighting). A point light is characterized by its position relative to the object and its lighting power. A bare light bulb is a rough real-world approximation of a virtual point light.
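Because a point light radiates equally in all directions, its power spreads over a sphere whose area grows with the square of the distance. The sketch below computes the resulting inverse-square falloff from the two properties named above, position and power. It assumes an idealized light and a surface facing the light head-on; the function name and units are illustrative.

```python
import math


def point_light_irradiance(light_pos, power, surface_pos):
    """Irradiance at surface_pos from an ideal point light.

    The light's power is spread evenly over a sphere of radius r,
    so a surface facing the light receives power / (4 * pi * r^2).
    """
    dist_sq = sum((l - s) ** 2 for l, s in zip(light_pos, surface_pos))
    if dist_sq == 0.0:
        raise ValueError("surface point coincides with the light")
    return power / (4.0 * math.pi * dist_sq)


near = point_light_irradiance((0, 0, 1), 100.0, (0, 0, 0))
far = point_light_irradiance((0, 0, 2), 100.0, (0, 0, 0))
print(near / far)  # doubling the distance quarters the light received
```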

Another way to simulate real illumination is to use HDRIs (high dynamic range images), which capture a much broader range of light and shadow than regular images. While a regular digital image contains only 8 bits of information per color channel (red, green, blue), which gives 256 gradations per channel, the HDR image format stores RGB colors with floating-point accuracy. Thus the variation from dark to light per channel is virtually unlimited. Using HDR images in a 3D environment produces very realistic and convincing shadows, reflections, and brightness.
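The difference between 8-bit and floating-point storage is easy to see with a few numbers. In the sketch below, the radiance values are hypothetical and expressed relative to "paper white" = 1.0; the 8-bit path clamps and quantizes, while the floating-point path keeps the values as they are.

```python
def to_8bit(radiance):
    """Clamp to [0, 1] and quantize to one of 256 levels,
    as a single 8-bit color channel would."""
    clamped = min(max(radiance, 0.0), 1.0)
    return round(clamped * 255)


levels = 2 ** 8  # gradations available in one 8-bit channel

# Hypothetical scene radiances, relative to paper white = 1.0.
samples = {"shadow": 0.002, "wall": 0.6, "lamp": 40.0, "sun": 5000.0}
for name, value in samples.items():
    print(f"{name}: 8-bit={to_8bit(value)}, float={value}")
```

In the 8-bit channel, the lamp and the sun collapse to the same maximum value, so a renderer lit by that image could not tell them apart; the floating-point values preserve the ratio between them.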

The key advance of recent years has been the ability to achieve realistic lighting algorithmically. In particular, Global Illumination, a group of algorithms used in 3D graphics, has been able to make images much more realistic. These algorithms take into account not only the direct light mentioned above but also subsequent cases in which light rays from the same source are reflected and refracted by other surfaces in the image. These are incredibly difficult to add manually in photorealistic ways.

Additionally, computer graphics tools like Blender let teams build their own add-ons, designed for their specific needs: for instance, the ability to control and modify lighting post-render for fast, flexible, and largely automated modification of images.


Rigging

3D rigging is the process of creating an invisible skeleton that defines how objects move. A 3D rig is a system of helper objects that are visible in the editing viewports but concealed in the final render. Rigging at its most basic consists of nulls and joints. Joints are the bones of the skeleton, while the nulls are the cartilage that defines the range of movement. When correctly built, the joints and nulls create a logical moving element, and when bound to a humanoid mesh, the rig can produce a realistic range of movements.

Inverse and forward kinematics define the direction in which a rig works when being animated. Forward kinematics are controlled from the base joint of the chain and are relatively simple. Inverse kinematics begin with the endpoint of the motion and work backward through the rigging. Advances in these techniques have enabled us to move humans in realistic ways, without jerky and cartoonish movement. And creating animation-friendly rigs has become simpler than ever. Unreal Engine’s rigging system, Control Rig, can be used to create complex animation rigs for any character type.
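The two directions can be made concrete with the simplest possible rig: a planar arm with two bones, such as an upper arm and forearm. The sketch below is a standard textbook formulation, not how any particular engine implements its rigs; the function names are illustrative, and the inverse solver returns only one of the two possible elbow poses.

```python
import math


def forward_kinematics(l1, l2, theta1, theta2):
    """Joint angles in, positions out: start at the base joint and
    walk outward. theta2 is measured relative to the first bone."""
    elbow = (l1 * math.cos(theta1), l1 * math.sin(theta1))
    end = (
        elbow[0] + l2 * math.cos(theta1 + theta2),
        elbow[1] + l2 * math.sin(theta1 + theta2),
    )
    return elbow, end


def inverse_kinematics(l1, l2, x, y):
    """Target position in, joint angles out: start from where the
    endpoint should be and solve backward for the angles."""
    d_sq = x * x + y * y
    cos_t2 = (d_sq - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_t2 <= 1.0:
        raise ValueError("target is out of reach")
    theta2 = math.acos(cos_t2)  # picks the solution with theta2 >= 0
    theta1 = math.atan2(y, x) - math.atan2(
        l2 * math.sin(theta2), l1 + l2 * math.cos(theta2)
    )
    return theta1, theta2


# Ask the "hand" to reach a target, then confirm with forward kinematics.
target = (1.2, 0.5)
t1, t2 = inverse_kinematics(1.0, 1.0, *target)
_, hand = forward_kinematics(1.0, 1.0, t1, t2)
print(hand)
```

An animator using inverse kinematics only has to place the hand; the solver works out the shoulder and elbow angles.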

The result is that teams can generate simulated data with motion without investing in a huge team of specialized animators.


Animation

In addition to rigging, more scalable animating techniques have advanced tremendously. One great example is Motion Capture, which is the process of recording the movements of objects or people and then using that information to animate digital models. The advances in MoCap allow us to capture the subtlest of movements and transfer these captured motions into computer graphics simulations. Not only are production times being slashed, but a future in which Motion Capture is possible without any markers is fast approaching.


Rendering

3D rendering is the process of translating 3D models back into 2D images that can be used for training. The images are generated based on sets of data defining the color, texture, and material of given objects. Photography is an apt analogy: a rendering engine points a camera towards a 3D object and takes a photo. Advances in GPU power have made ray tracing engines more powerful, more widely available, and much faster, to the point of running in real time. Previously, most real-time rendering used rasterization, in which objects are built from a mesh of polygons that forms a 3D model and is then converted into pixels on a 2D screen. Each pixel can be assigned an initial color value from the data stored in the polygon vertices.

Ray tracing is different. In the physical world, objects are illuminated by different light sources, and photons can bounce between objects before reaching our eyes. Light is blocked by different things, creating shadows. Light is also reflected from object to object, as when we see an image of one object reflected in the surface of another.

There are also refractions, when light changes direction as a result of passing through transparent or translucent objects, like liquids or clouds. Ray tracing captures those effects by working back from the view camera. This technique was first described by Arthur Appel in a seminal work. Further developments explained how ray tracing could incorporate common film-making effects (including motion blur, depth of field, translucency, and fuzzy reflections) that until then were the province solely of cameras. The combination of this research with the power of modern GPUs has led to computer-generated images whose shadows, refractions, and reflections are truly photorealistic and can be indistinguishable from the real world.
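"Working back from the view camera" can be sketched in a few lines. The example below follows a single ray from the camera into a scene of spheres, finds the nearest surface it hits, and then sends a second ray toward the light to check whether anything blocks it. It is a minimal illustration under simplifying assumptions (spheres only, one light, no reflection or refraction), and all names are hypothetical.

```python
import math


def hit_sphere(origin, direction, center, radius):
    """Distance along a unit-length ray to the sphere's near surface,
    or None. Rays that start inside a sphere are treated as misses."""
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 1e-6 else None


def trace(origin, direction, spheres, light_pos):
    """Classify one camera ray as 'background', 'lit', or 'shadow'."""
    hit = None  # (distance, center) of the nearest sphere along the ray
    for center, radius in spheres:
        t = hit_sphere(origin, direction, center, radius)
        if t is not None and (hit is None or t < hit[0]):
            hit = (t, center)
    if hit is None:
        return "background"

    point = tuple(o + hit[0] * d for o, d in zip(origin, direction))
    normal = tuple(p - c for p, c in zip(point, hit[1]))
    to_light = tuple(l - p for l, p in zip(light_pos, point))
    if sum(n * l for n, l in zip(normal, to_light)) <= 0:
        return "shadow"  # the surface faces away from the light

    # Shadow ray: is any sphere between the hit point and the light?
    dist = math.sqrt(sum(c * c for c in to_light))
    shadow_dir = tuple(c / dist for c in to_light)
    for center, radius in spheres:
        t = hit_sphere(point, shadow_dir, center, radius)
        if t is not None and t < dist:
            return "shadow"
    return "lit"


camera = (0.0, 0.0, 0.0)
light = (0.0, 5.0, 0.0)
ball = ((0.0, 0.0, -5.0), 1.0)
blocker = ((0.0, 2.5, -2.0), 0.5)  # sits between the ball and the light

print(trace(camera, (0.0, 0.0, -1.0), [ball], light))
print(trace(camera, (0.0, 0.0, -1.0), [ball, blocker], light))
```

A full ray tracer repeats this per pixel and keeps following rays as they reflect and refract, which is why GPU power matters so much for doing it in real time.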

Generating Quality at Scale

Primarily, these advances in computer graphics have enabled video game creators, architectural rendering firms, and the movie industry to create mind-blowing visual simulations. But the scale of these simulations (even in epic gaming multiverses) is generally just a fraction of what’s needed for large-scale computer vision training.

Now, it’s time for the world of simulated data generation to harness these tools. Tools like Blender are designed to be integrated with and built upon, including through scripts and add-ons written in Python.
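To give a sense of what scripted variation can look like, here is a minimal sketch that samples scene descriptions from a set of parameters. It is plain Python and does not call Blender’s API; the parameter names, value lists, and ranges are all hypothetical. In practice, each sampled description would be handed to a rendering tool to configure the scene before rendering.

```python
import random

# Hypothetical parameter space for a human-centered indoor/outdoor scene.
LIGHTING = ["overcast", "direct_sun", "indoor_warm", "indoor_cool"]
BACKDROPS = ["kitchen", "office", "street", "warehouse"]


def sample_scene(rng):
    """Draw one scene description from the parameter space."""
    return {
        "lighting": rng.choice(LIGHTING),
        "backdrop": rng.choice(BACKDROPS),
        "hour_of_day": rng.randrange(24),
        "camera_yaw_deg": round(rng.uniform(-45.0, 45.0), 1),
        "subject_age": rng.randint(18, 80),
    }


def generate_scenes(count, seed=0):
    """Generate a reproducible list of scene descriptions."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(count)]


scenes = generate_scenes(1000, seed=42)
print(scenes[0])
```

Seeding the generator makes a dataset reproducible, and changing the value lists or ranges changes the distribution of the whole dataset without an artist touching individual images.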

As we mentioned before, we need to:

  1. Accelerate or automate some of the artist’s work without significantly sacrificing photorealism, decreasing the cost per image.
  2. Generate new images at scale without requiring an artist to craft each one, decreasing the number of images an artist must create.

Taken together, the remarkable advances in computer graphics over the past 40 years, from the famed Utah teapot to modern-day real-time photorealistic rendering, are truly breathtaking. Ray tracing and path tracing that used to take hours of computing time can now run in real time. The incredible rise of GPU power is enabling us to create truly photorealistic images in real time without cost-prohibitive artist labor requirements.

But, even though artist-created images can be generated more efficiently and with higher photorealism, it still is not feasible for any team of artists to cost-effectively generate them in the hundreds of thousands or millions without additional tools to solve challenges of scale. Soon, we’ll address challenge #2: how we can take these efficiently generated, photorealistic images, and multiply them at scale.

Robotics in the Age of Corona: More Interest, Less Data

We are only halfway through 2020 and, already, COVID-19 has reshaped how societies around the world navigate their dependence on face-to-face, close-proximity human interactions for everything from providing medical care to operating restaurants to offering social interaction to isolated populations. In an effort to minimize the risks to front-line workers while providing these critical services, companies and governments are increasingly looking to robots for help. From disinfecting hospitals to restocking grocery stores and more, AI-powered robotics can take over a wide range of critical tasks while maintaining social distancing. There has long been interest in the automation of work by robots – a result of both long-term technological development and acute historical events – but COVID-19’s global impact and severity have accelerated interest in devising new ways that robots can help.

In this post, we explore key areas in which COVID-19, while increasing demand for robot solutions, is simultaneously making these improvements harder to achieve. We’ll also look at some of the ways that teams are overcoming these challenges to meet the needs of the moment.

Computer Vision and Training Data

The ability of robotics to supplement or replace human workers in high-risk environments will largely depend on their level of sophistication. Some may serve as interfaces for humans operating them at a distance – for instance, robots that can wheel up to a patient’s bedside to perform simple tasks or allow a physician to speak to the patient through video chat without risking exposure. These robots are often remote-controlled and don’t operate autonomously. Others integrate some form of AI and computer vision technology – for instance, robots that detect human heads for infrared temperature screening. But, the most sophisticated robots use computer vision to enable navigation of complex environments and interaction with both the objects and humans in them.

Corona aside, computer vision systems that can handle real-world situations and dynamic settings are difficult to build, in part because of the challenges associated with collecting training data. The environments where robotics can be most impactful – hospitals, airports, grocery stores, factories – are highly complex. They contain a wide range of objects, a lot of variation in layouts and conditions, and large numbers of humans simultaneously interacting with the environment. First, this makes data messy; occlusion is highly likely, and the spaces are often large and difficult to capture cleanly with camera arrays. In turn, the messiness of the data and complexity of the environments can make the data difficult to annotate manually. Plus, the dynamism of these environments means that motion (video) data is likely necessary for sophisticated development, and it is harder to collect and manually annotate.

Another challenge is that data collection in hospitals, schools, and other non-commercial spaces is complicated by privacy concerns and the intrusiveness of data collection activities, leading some researchers to search for privacy-friendly ways to collect data.

Collecting sufficient amounts of data to capture these environments has always been a difficult and expensive task, and one that must be done well; computer vision system errors and biases in hospitals can be matters of life and death. But the opportunities for transformative innovation in these contexts are so great that teams continue to invest millions of dollars in building datasets big enough, diverse enough, unbiased enough, and annotated enough to effectively train their algorithms. All of this is to say that, even before Corona, the process of building computer vision systems designed to navigate these environments was a challenging, expensive, and crucial undertaking.

Corona Complications

Corona is creating a new sense of urgency in the development of smart robot “workers” at the same time as it is making this data collection even harder. Due to restrictions on movement and global lockdowns, teams may have a harder time accessing the locations where data can be collected. If teams are able to access these environments, the environments are likely to be operating abnormally, creating a risk of biased data. This bias trains robotics on data that does not reflect what these environments look like when social-distancing rules are not in effect. For instance, if lower occupancy limits at a supermarket lead to empty aisles and allow stores to maintain well-organized shelves, the data collected may be unable to train algorithms to operate in crowded or messier environments. Teams that are building wearables or other technologies that demand capture of human-object interaction may be out of luck completely. There is simply no way to reach and scan, photograph, or record enough people at a time when face-to-face interactions are prohibited.

This is pushing teams to find creative solutions to meet demand. Here are 3 ways that teams are forging ahead with data gathering, even during the COVID-19 pandemic:

Automated Data Augmentation

In this research, teams at Google and Waymo present ways to augment point cloud data automatically using evolutionary-based search algorithms. They point to prior research showing the efficacy of manually-designed augmentation methods, but suggest that these are not efficient enough to be practical, especially with point cloud object detection from data sources such as LiDAR.

Many companies and teams possess data gathered before the onset of COVID-19. Their ability to augment this data efficiently and effectively can allow them to proceed with training at times when manual collection of new data is impossible. However, there are a number of challenges. First, if key scenarios or edge cases don’t exist at all in a dataset, they cannot be created through augmentation. For instance, if you have thousands of hours of driving data but none of your cameras ever captured a car crash or an active construction site, no level of augmentation will be able to introduce these important events into the data. Second, adopting this tactic depends on the ability of teams to adapt these algorithmic methods to their specific use cases, a process which can divert critical R&D resources.
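To make the idea of augmentation concrete, here is a minimal sketch of a hand-designed point cloud augmentation: a random rotation about the vertical axis followed by a uniform scale. It is not the evolutionary search method from the research mentioned above; the function name, the angle limit, and the scale range are all assumptions chosen for illustration.

```python
import math
import random


def augment_point_cloud(points, rng, max_angle_deg=45.0, scale_range=(0.95, 1.05)):
    """Return a copy of `points` rotated about the z axis and uniformly scaled.

    `points` is a list of (x, y, z) tuples. The default angle and scale
    ranges are illustrative assumptions, not tuned values.
    """
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = rng.uniform(*scale_range)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    augmented = []
    for x, y, z in points:
        # Rotate in the ground plane, then scale all three coordinates.
        new_x = scale * (cos_a * x - sin_a * y)
        new_y = scale * (sin_a * x + cos_a * y)
        augmented.append((new_x, new_y, scale * z))
    return augmented


# A tiny, made-up "scan" of three points, augmented three times.
rng = random.Random(0)
original = [(1.0, 0.0, 0.5), (0.0, 2.0, 1.0), (-1.5, 1.0, 0.0)]
variants = [augment_point_cloud(original, rng) for _ in range(3)]
```

Each variant is a geometric transformation of points that were already recorded. This is why augmentation can add variety but cannot introduce a scenario, such as a crash or a construction site, that was never captured in the first place.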

Re-visit, Re-label, Re-use

Unable to collect new data, some companies are trying to extract even more value out of existing data. As this article reveals, this often requires a massive investment of time, either in-house or outsourced, in order to add new annotations or manually uncover new edge cases.

Again, this approach seems most common in the automotive industry, where companies are likely to have enormous amounts of data collected from cameras mounted on vehicles. With no new data, these efforts are necessarily limited. Occurrences that were never recorded or situations that can’t be labeled effectively are simply not accounted for.

Simulated Data

Simulated data generation empowers companies to create wholly new sets of data containing all of the necessary parameters and metadata without a reliance on new data collection efforts. Simulations are already in widespread use in the automotive sector. But, as the above-referenced articles imply, other sectors are lagging behind in their ability to obtain their necessary training data during the COVID-19 pandemic.

In robotics, for instance, teams are less likely to have huge datasets of complex indoor environments (especially with humans in them) that they can augment or re-visit. Additionally, it is likely that automotive companies will be able to get back on the road with cameras well before robotics-focused companies are able to collect new data in homes, factories, hospitals, or retail environments. Fortunately, simulated data relies on just a tiny sample of basis references to generate entire datasets with hundreds of asset classes and full inclusion of edge cases. This makes it a feasible option for many teams, not just those already sitting on huge mountains of data.

COVID-19 did not create the interest in robotic solutions across a range of contexts, but it has certainly accelerated it. Future events are likely to apply additional pressure in this direction. With this sense of urgency, especially as Corona continues to disrupt data-collection efforts, we expect Simulated Data to provide an increasingly attractive solution to teams rushing to usher in the next generation of computer-vision-powered automated robots.

Types of Training Data Bias You May Not Know About

“AI algorithms are only as good as the data used to train them.” This common cliche is bandied about in almost every conversation involving AI. But, like many cliches, its overuse doesn’t affect its fundamental truth. In today’s world, the vast majority of AI applications are based on deep learning algorithms and how these algorithms find patterns in training data. In many cases, these networks produce biased results, which in turn can stem from three main issues:

  1. Framing the problem: The first thing computer scientists do when they create a deep-learning model is to decide what they want it to achieve. Given an unclear directive, the algorithm might come up with an unintended solution.
  2. Collecting the data: Either the data we collect does not represent reality or it reflects existing prejudices.
  3. Preparing the data: It is possible to introduce bias during the data preparation stage, which consists of selecting the attributes we want our algorithm to take into account.

Dealing with these biases requires a great deal of awareness and strong action to counterbalance them. Some of the biggest powerhouses in tech have suffered from an inability to deal with and correct such biases. In particular, biases introduced during the collection and preparation of training data are hard to notice, offset, and correct. Racial and gender biases in the field of Computer Vision have been well documented. Today, I’d like to address some lesser-known, but equally important, biases.

1. In a study of two popular public image datasets, researchers found that the U.S. and Great Britain accounted for nearly 45% of images. By contrast, China and India together contribute just 3% of ImageNet data, even though these countries represent 36% of the world’s population. This geographic misrepresentation can affect the ability to train our algorithms in many fields.

  • Clothing: If our machines are trained largely on Western images of clothing, they will be worse at recognizing local clothing from areas underrepresented in the training data. For example, algorithms were much less likely to recognize a bride and groom from Ethiopia or India than from the U.S. or Australia.
  • Home Layout: If we want to train home robotics to be able to see and navigate their way around homes, we need to expose them to different images. A home in the U.S. or Britain doesn’t resemble a home in other regions of the world. Imagine training a robot to work and navigate its way around a London apartment, and then expecting it to be of use in a Mongolian yurt.
  • Transportation: In countries where motor vehicles are less prevalent, and other methods of transportation are commonly used, the ability of an algorithm to correctly recognize a horse-drawn carriage, for instance, is suspect.
  • BMI: There are large discrepancies in BMI average by country and the percentage of overweight people in different populations. In a groundbreaking paper on using facial images to infer BMI, researchers tried to check against biases in their algorithm, but, by their own admission, didn’t have enough data to conclusively prove that there is no bias. If our datasets inaccurately represent real-life BMI, then our machines will have a harder time recognizing and accurately classifying outliers.
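One simple way to surface this kind of geographic skew is to compare each country's share of a dataset with its share of the world population. The sketch below assumes every image carries a country tag, which many real datasets lack; the function name, the toy tags, and the population shares are made up for illustration and are not figures from the study cited above.

```python
from collections import Counter


def representation_gap(image_countries, population_share):
    """Return image share minus population share for each listed country.

    A positive value means the country is overrepresented in the dataset,
    a negative value means it is underrepresented.
    """
    counts = Counter(image_countries)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image_countries must not be empty")
    gaps = {}
    for country, pop_share in population_share.items():
        image_share = counts.get(country, 0) / total
        gaps[country] = image_share - pop_share
    return gaps


# Toy data: 100 hypothetical image tags and rough, illustrative population shares.
image_countries = ["US"] * 45 + ["GB"] * 10 + ["IN"] * 2 + ["CN"] * 1 + ["other"] * 42
population_share = {"US": 0.04, "IN": 0.18, "CN": 0.18}
gaps = representation_gap(image_countries, population_share)
```

With this toy input, the gap for "US" is strongly positive while the gaps for "IN" and "CN" are negative, which is the pattern the study describes. Population share is only one possible reference; a team might instead compare against the markets where the product will actually be deployed.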

2. Another major misrepresentation in training data is the weather. Many datasets skew towards fair weather, excluding rainy and overcast days. The lack of variety can impede creating sufficiently robust models. An obvious example of the real-life impact is the ability or inability of an autonomous car to drive in adverse weather conditions.

3. A little known type of bias is Object-based bias.

  • Duplicate Annotations: Double-labeling of images in datasets can lead to over-counting. Thus, the prevalence of certain objects can be misjudged by algorithms. For example, the same image is annotated both as “mug” and “cup”. This duplicate labeling can affect statistical distribution in a given dataset and is especially problematic in free-form annotation datasets. 
  • Object size: Object size plays a key role in object recognition accuracy, and researchers have found that some objects consistently take up a larger fraction of the image than others. Airplanes, for instance, are overrepresented as very large in images, and there are few images of smaller airplanes that are airborne.
  • Object co-occurrence: Object co-occurrence is a known contextual visual cue exploited by object detection models, but we can find surprising co-occurrence statistics. For example, we find that in COCO, a person appears in 43% of images containing the food category; however, a person appears in a smaller percentage of images containing broccoli (15%), carrot (21%), and orange (29%), and conversely a greater percentage of images containing cake (55%), donut (55%), and hot dog (56%). 
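Two of the checks described above, co-occurrence rates and duplicate labels, are easy to sketch when annotations are available as a mapping from image to label set. The code below is an illustration on made-up data: the helper names, the toy annotations, and the synonym table are assumptions, and the resulting numbers have nothing to do with the COCO statistics quoted above.

```python
def cooccurrence_rate(annotations, target, context="person"):
    """Fraction of images containing `target` that also contain `context`.

    Returns None when no image contains `target`.
    """
    with_target = [labels for labels in annotations.values() if target in labels]
    if not with_target:
        return None
    return sum(context in labels for labels in with_target) / len(with_target)


def canonical_labels(labels, synonyms):
    """Map synonym labels to one canonical name so an object is counted once."""
    return {synonyms.get(label, label) for label in labels}


# Toy annotations: image id -> set of labels (hypothetical data).
annotations = {
    "img1": {"person", "cake"},
    "img2": {"cake"},
    "img3": {"person", "broccoli"},
    "img4": {"broccoli"},
    "img5": {"broccoli"},
    "img6": {"mug", "cup"},
}
synonyms = {"mug": "cup"}  # assumed synonym table

cake_rate = cooccurrence_rate(annotations, "cake")
broccoli_rate = cooccurrence_rate(annotations, "broccoli")
merged = canonical_labels(annotations["img6"], synonyms)
```

Running the same rate over every category in a dataset gives a quick table of which objects a model may learn to associate with people. Merging synonyms before counting keeps a single object from being tallied under two names.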

Daniel Kahneman and Amos Tversky are famous for exposing the myth that humans operate and make decisions based purely on reason.  Over their careers, they exposed dozens of cognitive biases in the way we compute data and make decisions. 

So too, the importance of bias in Artificial Intelligence must be understood and respected. The malicious effects of biases in our AI can be felt today. Whether a wrongful arrest based on AI or bad PR as a result of racially tinged algorithms, these biases are affecting our society.  And, more fundamentally, biased algorithms simply don’t work. They cannot deal with the endless variants that the real world conjures up. 

The solutions to these problems need to start with diverse geographic representation that accurately reflects the global population, understanding and analyzing existing datasets for problematic biases, and collecting data with enough edge-cases. 

Corona is Making Training Data-Gathering Harder: Key Industries and Challenges

COVID-19 has affected every corner of the globe – taking lives, destroying livelihoods, and shaping our world for the foreseeable future. Passenger traffic at airports has plummeted and, with 81% of the global workforce living in countries with mandatory or recommended workplace closures, we have all seen the striking images of cities around the world empty and deserted. This seismic shift in human activity is making the acquisition of manual training data slower and harder, while also less representative than ever before. In this post, we’ll explore a few of the industries affected by these changes.

Autonomous vehicles

Autonomous vehicle companies rely on large teams of drivers and technicians to test vehicles and collect training data. First, stay-at-home orders and the related logistical challenges of these operations have made staffing difficult, forcing companies to lay off significant portions of their teams. They are simply unable to get their teams on the road, collecting data. Second, empty streets, with fewer cars and pedestrians, make driving data less representative and more biased. It has become nearly impossible for these companies to capture data of “normal” (aka pre-Corona) traffic volumes. Some of these companies have refocused on detailed data labeling, 3D mapping, and identification of overlooked scenarios from previous road sessions that can be used to train their systems. Out of necessity, they have shifted to more innovative approaches and long-term experimentation.


While car companies are struggling on the ground, drone companies are struggling in the air. Smart traffic analysis based on drone footage is becoming a reality in many cities and states. But in today’s world, with unprecedented drops in traffic volume, collecting training datasets with enough variance is a nearly impossible task. Drone applications in other industrial settings – shipping, warehousing, airports, etc. – are also hindered by the broader economic slowdown. When activity in these outdoor environments grinds to a halt, companies will have a hard time collecting unbiased data. With economists predicting a slow economic recovery, these challenges are likely to persist.

Human-Environment and Human-Human Interactions

The new reality of lockdowns has made life hard for companies which generate datasets by “Crowd Acting”. With this method, teams generate data by instructing workers, customers, models, or volunteers to record short video clips based on defined descriptions. But, asking them to do crowd acting outside their homes has become untenable and downright dangerous in some cases. Additionally, crowd acting of face-to-face interactions may be impossible while maintaining compliance with social distancing guidelines. 

Smart Stores

In February, Amazon opened its first Amazon Go Supermarket in Seattle using the retailer’s cashierless “Just Walk Out” technology that has previously powered 25 Amazon Go convenience stores in a handful of major U.S. metros. Expanding this type of technology requires sophisticated computer vision systems that can interpret complex human-object interactions; grocery shoppers may examine produce carefully only to put it back, or leave items on shelves or in aisles where they don’t belong. In order for the technology to work optimally, huge amounts of training data with myriad variants and edge cases are needed. But with online grocery shopping skyrocketing, those variants and edge cases will be harder to capture.


VR/AR is an exciting field of development, but one of the most important stages in the development of new hardware has been user trials. In a COVID-19 world, user trials suddenly become complicated affairs. Dozens of people donning the same headset in succession has become truly unimaginable. Recruiting and scanning hundreds of people on premises is out of the question. 

Age Bias

COVID-19 has disproportionately affected certain segments of society. Across the world, the elderly have been significantly more at risk and, in some countries, have been ordered to minimize leaving their homes. Even in countries without government orders limiting movement for seniors, many public health organizations have recommended that they stay at home if possible. This raises the risk that datasets manually collected during the Corona period will have significant age biases.


Masks will be with us for a while, but many facial recognition computer vision systems have not been trained to recognize faces with masks. So, new training data must be collected, a process that is difficult for reasons mentioned above. But, what happens when mask use decreases? Systems trained on data collected during Corona, which are likely to show a huge number of people wearing masks, may not be able to operate well on mask-less faces.


In our follow-on post about the impacts of Corona on data-collection, we’ll dig deeper into the field of robotics and how some teams are combating the challenges we’ve discussed here.

A Path Forward

At its core, Corona is creating two intertwined challenges. First, it is making it hard to collect new data. Second, it is increasing the likelihood that any new data will be biased in a number of ways. The current challenge of acquiring data in the wild offers a significant opportunity for simulated and synthetic data to fuel advances in computer vision.

It allows for generation of new datasets without compromising the health and safety of workers or navigating the logistical hurdles erected by Corona. And, its customizability ensures that datasets will not contain “Corona biases”. As the challenges posed by the COVID-19 era continue to reshape a range of industries, simulated visual data can help accelerate breakthroughs in computer vision and make sure that this virus does not stop training data collection in its tracks. 

Research Roundup: Training with Synthetic Data

At Datagen, we’re actively investing in academic collaborations and benchmarking research that can further establish the effectiveness of training computer vision algorithms with simulated or synthetic data. Fortunately, a large body of work already exists, exploring the advantages and challenges of this approach. We’ve collected some of them here, for anyone interested in learning more about this rapidly-expanding field of computer vision research.


  1. Learning from Simulated and Unsupervised Images Through Adversarial Training (2017)

From researchers at Apple, this paper was the company’s first research paper ever published and was named the best paper at CVPR 2017. The research shows how labeled synthetic data, domain adapted with unlabeled real data, can be used to train eye gaze estimation. One year later, the iPhone X came out with eye gaze estimation embedded. It is highly likely that Apple used some of the synthetic data techniques detailed in the paper to develop this feature.

  2. Sim2real Transfer Learning for 3D Pose Estimation (2019)

DeepMind and Oxford University researchers detail how they achieved state-of-the-art results on 3D human mesh estimation by training exclusively with synthetic humans. Their approach uses pose estimation and optical flow estimation to help bridge the domain gap between real and simulated data.

  3. SIDOD: A Synthetic Image Dataset for 3D Object Pose Recognition with Distractors (2019)

NVIDIA shows that using photorealistic synthetic data can lead to strong results in object detection, 3D pose estimation, and tracking applications of objects in real-world scenes. This is a powerful approach that was demonstrated on a small set of items. Datagen is attacking the next challenge in this area: working on ways to scale this approach to thousands of objects while integrating humans and dynamic motion.

  4. 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation (2019)

Here, NVIDIA presented a purely simulation-based approach to generating realistic grasping positions. The grasp distribution is diverse enough to find successful grasps even after discarding those that violate robot kinematics and collision constraints. The method’s diverse grasp sampling leads to higher success rates in comparison to a state-of-the-art 6-DOF grasp planner. More here about NVIDIA’s research efforts in this field.

  5. 3D Face Reconstruction by Learning from Synthetic Data (2016)

A Technion research group, based in Israel, lays out how they achieved impressive results in reconstructing 3D face meshes from single face images by using synthetic data.


  6. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization (2018)

NVIDIA is clearly a research leader in this field. Here, a team of researchers trains an object detection network with synthetic data and shows that, with a technique called Domain Randomization, synthetic data does not necessarily need to be photorealistic to work effectively. Domain Randomization trains the network on images from many kinds of simulated environments at the same time. While this approach requires a large amount of data – in this case, 100,000 data points – it can be valuable when the domain is hard to mimic. We recommend this paper for its insights into the potential of Domain Randomization methodology.
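The core of Domain Randomization is sampling scene parameters at random for every generated image, so that the network sees so much variation that the real world looks like just one more variant. The sketch below shows only that sampling step. The parameter names and ranges are hypothetical, not those used in the paper, and the renderer that would turn each sample into an image is not shown.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """One randomized scene configuration (all fields are illustrative)."""
    light_intensity: float
    camera_distance_m: float
    camera_yaw_deg: float
    texture_id: int
    num_distractors: int


def sample_scene(rng, num_textures=50):
    """Draw every scene parameter independently from a wide, assumed range."""
    return SceneParams(
        light_intensity=rng.uniform(0.2, 2.0),
        camera_distance_m=rng.uniform(0.5, 5.0),
        camera_yaw_deg=rng.uniform(0.0, 360.0),
        texture_id=rng.randrange(num_textures),
        num_distractors=rng.randint(0, 10),
    )


# A fixed seed keeps the generated dataset reproducible.
rng = random.Random(42)
scenes = [sample_scene(rng) for _ in range(1000)]
```

In a real pipeline, each SceneParams instance would be handed to a renderer, and the object's known position in the scene would supply the bounding-box label automatically.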

  7. GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB (2017)

In this paper, a Stanford University research team achieved impressive results in hand pose estimation by training a network on synthetic hand data. The synthetic hands are infused with textures from real-world hands captured by a style-transfer algorithm that uses a GAN-based architecture. The implications are particularly interesting for the VR/AR industry, where hand pose estimation may be able to eliminate the need for hand-held controls. For more on state-of-the-art hand pose estimation data, please check out Datagen’s hands generation capabilities.

  8. Learning to Drive from Simulation without Real World Labels (2018)

Cambridge University researchers, working with a corporate team, teach a car to drive in a cartoon-like simulator. The novel idea was to teach the car to transcribe real-world data into its simulation-based understanding (real2sim) instead of attempting the reverse (sim2real).

We will continue to update and add to this post as new research becomes available. If there are relevant publications that you’d like to share with us, send them our way.

Simulated Data is Synthetic Data 2.0

The promise of synthetic data has been clear since its inception. By creating a fully “synthetic” set of data that mimics the statistical patterns found in “real” datasets, synthetic data could enable analysis and data science without compromising privacy. It also promised to be highly scalable and manipulable, faster to generate, and able to fill data gaps where real data would be impossible or expensive to collect.

Since then, synthetic data has shown promise across a range of verticals, from medical research, where patient privacy is paramount, to fraud detection, where synthetic datasets can be used to test and increase the robustness of security solutions. More recently, synthetic data has been applied, with some success, to machine learning applications as a source of training data for algorithms.


But, widespread (and healthy) skepticism remains. Can synthetic data accurately reflect the real world? Can it capture nuanced statistical realities, especially when there are very few real dataset reference points? As the applications have become broader and more sophisticated, these questions are even more relevant.

If a synthetic alternative to manual collection and annotation of data can address these concerns and provide equal (or better!) data, it has the potential to materially change the process of developing algorithms. Saved time and increased effectiveness could bring the promise of computer vision to the market even sooner. This is especially true in a computer vision context – which relies on image or video data that is generally harder, slower, and more expensive to collect. 

Datagen is among a number of teams, in a wide range of verticals, racing to apply synthetic data techniques to visual data simulation, to power the world of computer vision. To be sure, Datagen does create a type of visual synthetic data. But, because we take a fundamentally different approach to solving the challenges and proving the promise of synthetic data, we’ve decided to call it something different. At Datagen, we create Simulated Data.

Synthetic data is usually defined by what it is not; at the most basic level, it is data that isn’t manually gathered. There is a range of methods for creating it, from sophisticated Generative Adversarial Networks (GANs) to more basic strategies such as pasting a subject onto different backgrounds to create variation. Most of these approaches have serious limitations. Because they are fundamentally static, these datasets have to be constantly adjusted and regenerated to reflect new parameters, distributions, or training goals. They are cumbersome to work with, especially when the data is visual. They are generally designed to accomplish a single task or capture a single scenario at a time. They don’t enable machines to learn like humans do – responsively in real time. As our neural networks become more and more robust, they will be able to learn more intuitively. But the data needs to keep up.

Simulated Data brings synthetic data to life. It is synthetic data captured by virtual “cameras” operating in photorealistic, physics-based simulations. These simulations have spatial dimensions and a full range of annotations (physics, lighting, etc.). Simulated Data is also based on real 3D data. Datagen is building one of the largest libraries of 3D assets and human-centered scans in the world, ensuring that our simulations have enormous variation without deviating from realism. Simulated Data merges this hyper-realism with a unique level of flexibility and customizability. By placing these realistic assets into a 3D environment, we can “virtually” photograph a room at different angles, with different lenses, at different times of day. Simulations allow us to swap in new furniture, change layouts, or adjust lighting. We can change a model’s skin tone or age, take photos in infrared or with depth maps, and simulate edge cases that are nearly impossible to capture manually. And, because all of this variation is built on a foundation of high-quality, scanned 3D data, it retains its realism and plausibility. This enables us to avoid the cost of manual gathering at scale, the privacy concerns associated with human-centered data, and the bias inherent in manually-gathered datasets. Thanks to advances in algorithms and computer graphics, our Simulated Data offers a realistic, fully customizable 3D sandbox that exposes a system to a dynamic environment of people, objects, and space.
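To illustrate what a virtual “camera” does, here is a minimal pinhole projection sketch. It is not Datagen's rendering pipeline, only the textbook model: a 3D point in camera space is projected to pixel coordinates, and changing the focal length plays the role of swapping lenses. The function name, the example point, the image size, and the focal lengths are assumptions.

```python
def project_point(point_3d, focal_px, image_size):
    """Project a camera-space 3D point to pixel coordinates (pinhole model).

    Axes: x to the right, y down, z forward, in meters. The principal point
    is assumed to sit at the image center. Returns None for points at or
    behind the camera plane.
    """
    x, y, z = point_3d
    if z <= 0:
        return None
    width, height = image_size
    u = focal_px * x / z + width / 2
    v = focal_px * y / z + height / 2
    return (u, v)


# The same hypothetical 3D keypoint seen through a wide and a long lens.
keypoint = (0.2, -0.1, 2.0)
wide = project_point(keypoint, focal_px=500.0, image_size=(640, 480))
tele = project_point(keypoint, focal_px=1000.0, image_size=(640, 480))
```

Because the 3D position of the keypoint is known inside the simulation, its 2D pixel location is computed rather than hand-labeled for every lens and camera position.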

Obviously, for some applications, traditional synthetic data will be enough. But, the way we see it, synthetic data is just a building block, a stepping stone to the next generation of computer vision training data. That is why we use the term Simulated Data instead of synthetic data. It represents the ultimate goal of fully simulating the world around us in hyper-realistic, synthetic detail. As we continue to build this next-generation technology, we look forward to sharing Simulated Data’s full potential.
