Simulated Data is Synthetic Data 2.0

By Datagen, April 14, 2020

The promise of synthetic data has been clear since its inception. By creating a fully “synthetic” set of data that mimicked the statistical patterns present in “real” datasets, synthetic data could enable analysis and data science without compromising privacy. It also promised to be highly scalable and manipulable, faster to generate, and able to fill data gaps where real data would be impossible or expensive to collect.

Since then, synthetic data has shown promise across a range of verticals, from medical research, where patient privacy is paramount, to fraud detection, where synthetic datasets can be used to test and increase the robustness of security solutions. More recently, synthetic data has been applied, with some success, to machine learning applications as a source of training data for algorithms.


But widespread (and healthy) skepticism remains. Can synthetic data accurately reflect the real world? Can it capture nuanced statistical realities, especially when there are very few real reference data points? As the applications have become broader and more sophisticated, these questions have become even more relevant.

If a synthetic alternative to manual collection and annotation of data can address these concerns and provide equal (or better!) data, it has the potential to materially change the process of developing algorithms. Saved time and increased effectiveness could bring the promise of computer vision to the market even sooner. This is especially true in a computer vision context – which relies on image or video data that is generally harder, slower, and more expensive to collect. 

Read our Survey on Synthetic Data: The Key to Production-Ready AI in 2022

Datagen is among a number of teams, in a wide range of verticals, racing to apply synthetic data techniques to visual data simulation to power the world of computer vision. To be sure, Datagen does create a type of visual synthetic data. But because we take a fundamentally different approach to solving the challenges and proving the promise of synthetic data, we’ve decided to call it something different. At Datagen, we create Simulated Data.

Synthetic data is usually defined by what it is not; at the most basic level, it is data that isn’t manually gathered. There is a range of methods for creating it, from sophisticated Generative Adversarial Networks (GANs) to more basic strategies such as pasting a subject onto different backgrounds to create variation. Most of these approaches have serious limitations. Because they are fundamentally static, these datasets have to be constantly adjusted and regenerated to reflect new parameters, distributions, or training goals. They are cumbersome to work with, especially when the data is visual. They are generally designed to accomplish a single task or capture a single scenario at a time. They don’t enable machines to learn like humans do – responsively and in real time. As our neural networks become more and more robust, they will be able to learn more intuitively. But the data needs to keep up.
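To make the "pasting a subject onto different backgrounds" strategy concrete, here is a minimal sketch in plain Python. It is an illustration only, not a description of any particular tool: the function names (`paste_subject`, `make_variations`), the nested-list image representation, and the tiny grayscale images are all assumptions chosen to keep the example self-contained.

```python
import random


def paste_subject(background, subject, mask, top, left):
    """Return a copy of `background` with `subject` pasted at (top, left).

    Images are lists of rows of grayscale ints. `mask` has the same shape
    as `subject`; a 1 marks a subject pixel, a 0 leaves the background as is.
    """
    out = [row[:] for row in background]  # copy so the background can be reused
    for r, (subject_row, mask_row) in enumerate(zip(subject, mask)):
        for c, (pixel, keep) in enumerate(zip(subject_row, mask_row)):
            if keep:
                out[top + r][left + c] = pixel
    return out


def make_variations(backgrounds, subject, mask, n, seed=0):
    """Create `n` samples by pasting the subject at random positions."""
    rng = random.Random(seed)
    height, width = len(subject), len(subject[0])
    samples = []
    for _ in range(n):
        background = rng.choice(backgrounds)
        top = rng.randint(0, len(background) - height)
        left = rng.randint(0, len(background[0]) - width)
        image = paste_subject(background, subject, mask, top, left)
        # The bounding box label is known because we chose the placement.
        samples.append({"image": image, "bbox": (top, left, height, width)})
    return samples


# Hypothetical toy data: two flat 6x6 backgrounds and a 2x2 subject.
backgrounds = [[[0] * 6 for _ in range(6)], [[50] * 6 for _ in range(6)]]
subject = [[255, 255], [255, 255]]
mask = [[1, 0], [1, 1]]  # the top-right subject pixel is transparent
samples = make_variations(backgrounds, subject, mask, n=5)
```

The sketch also shows the limitation described above: every sample is a fixed 2D composite, so changing the viewpoint, the lighting, or the subject itself means regenerating the dataset from new source images.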

Simulated Data brings synthetic data to life. It is synthetic data captured by virtual “cameras” operating in photorealistic, physics-based simulations. These simulations have spatial dimensions and a full range of annotations (physics, lighting, etc.). Simulated Data is also based on real 3D data. Datagen is building one of the largest libraries of 3D assets and human-centered scans in the world, ensuring that our simulations have enormous variation without deviating from realism.

Simulated Data merges this hyper-realism with a unique level of flexibility and customizability. By placing these realistic assets into a 3D environment, we can “virtually” photograph a room at different angles, with different lenses, at different times of day. Simulations allow us to swap in new furniture, change layouts, or adjust lighting. We can change a model’s skin tone or age, take photos in infrared or with depth maps, and simulate edge cases that are nearly impossible to capture manually. And because all of this variation is built on a foundation of high-quality, scanned 3D data, it retains its realism and plausibility.

This enables us to avoid the cost of manual gathering at scale, the privacy concerns associated with human-centered data, and the bias inherent in manually gathered datasets. Thanks to advances in algorithms and computer graphics, our Simulated Data offers a realistic, fully customizable 3D sandbox that exposes a system to a dynamic environment of people, objects, and space.
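As a rough illustration of what "photographing" a simulated room at different angles, with different lenses, at different times of day might look like in practice, the sketch below samples virtual camera settings and converts each into a camera position. It is not Datagen's API: `ShotConfig`, `sample_shot`, `camera_position`, and every parameter range are hypothetical, and a real pipeline would hand these values to a renderer.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ShotConfig:
    """One virtual photograph; all fields are illustrative assumptions."""
    azimuth_deg: float      # angle around the room, measured in the floor plane
    elevation_deg: float    # camera height angle above the floor plane
    focal_length_mm: float  # stands in for "different lenses"
    hour_of_day: float      # would drive sun position in a renderer
    modality: str           # "rgb", "infrared", or "depth"


def sample_shot(rng):
    """Draw one random shot configuration from assumed ranges."""
    return ShotConfig(
        azimuth_deg=rng.uniform(0.0, 360.0),
        elevation_deg=rng.uniform(5.0, 60.0),
        focal_length_mm=rng.choice([18.0, 35.0, 50.0, 85.0]),
        hour_of_day=rng.uniform(0.0, 24.0),
        modality=rng.choice(["rgb", "infrared", "depth"]),
    )


def camera_position(shot, radius_m=3.0):
    """Place the camera on a sphere of radius `radius_m` around the origin (z up)."""
    azimuth = math.radians(shot.azimuth_deg)
    elevation = math.radians(shot.elevation_deg)
    return (
        radius_m * math.cos(elevation) * math.cos(azimuth),
        radius_m * math.cos(elevation) * math.sin(azimuth),
        radius_m * math.sin(elevation),
    )


rng = random.Random(42)  # fixed seed so the same set of shots can be regenerated
shots = [sample_shot(rng) for _ in range(4)]
for shot in shots:
    print(shot, camera_position(shot))
```

Because each image is described by a handful of parameters rather than by a fixed file, a new distribution of angles, lenses, or lighting conditions only requires changing the sampling ranges and rendering again.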

Obviously, for some applications, traditional synthetic data will be enough. But the way we see it, synthetic data is just a building block, a stepping stone to the next generation of computer vision training data. That is why we use the term Simulated Data instead of synthetic data. It represents the ultimate goal of fully simulating the world around us in hyper-realistic, synthetic detail. As we continue to build this next-generation technology, we look forward to sharing Simulated Data’s full potential.
