Quality at Scale: How Advances in Computer Graphics Tools Are Enabling Data Generation

In recent years, advances in synthetic data generation have promised a solution to the data bottleneck in machine learning model training. Rather than collecting and annotating data by hand, we’re getting better at creating it programmatically. But, while synthetic data has shown huge promise in fields like medical imaging, autonomous vehicles, and fraud detection, it has not yet proved its full value in computer vision contexts that require large datasets of human-focused data. Thanks to advances in computer graphics, 3D modeling, animation, and rendering technologies, this is beginning to change. Finally, we have the tools needed to create high-quality, photorealistic training data at scale.

In this article, we will explore how advances in visual synthetic data generation, especially in the areas of humans and environments, have been unlocked by the ongoing development of incredibly powerful computer graphics and 3D-rendering software. This growth may be able to solve some of synthetic data’s key challenges and make this data broadly available.

Quality vs. Quantity

At its core, generating synthetic computer vision training data has always confronted the trade-off between scale and quality. The training data must be photorealistic enough to effectively mimic the real world. At the same time, we need a large amount of data for supervised learning. Unfortunately, these needs are often in direct conflict with each other.

Historically, photorealism has been achieved with huge amounts of time and labor from 3D artists, painstakingly recreating scenes and manually adding variation to datasets using computer graphics tools.

The process is similar to the creation of high-quality CGI in the film and TV industry. A single frame can contain millions of parts and requires a huge level of detail to produce a high-quality image. For example, the character Sulley from Monsters, Inc. had over 2 million individually named hairs on his body, and a single frame involving Sulley took 11-12 hours to render on average. CGI production is also extremely resource-intensive: a single frame can take hours to render and incurs significant computing costs. The average cost per second of a 3D feature film can reach $25,000 USD.

In the computer vision context, when you need hundreds, thousands, or millions of data points for training, this process is impossible to repeat at scale. Training data does not necessarily need to be so photorealistic that it can fool the human eye, but it does need to be close enough to achieve sufficient accuracy; researchers have shown that unrealistic lighting, shading, and texturing damage the ability to train algorithms effectively. But even if you sacrifice some level of detail in the service of scale, the process often remains cost-prohibitive.

This trade-off between the need for scalable data on the one hand and high-quality, photorealistic data on the other is exacerbated when we are dealing with sophisticated datasets. Synthetically creating a single human hand, for instance, is entirely different from creating a complex scene with humans, motion, and rich environments. And even if you are willing to reduce photorealistic detail, you will still need to manually add enormous variation in race, age, body type, and other features. And we haven’t even touched on things like lighting conditions, skin coloration changes caused by grabbing, or muscle movement.

In short, when you rely on manual, artist-based generation of photorealistic training data, you need time and money. Lots of both.

For the vast majority of teams, this is simply not a feasible way to generate synthetic data. An ideal solution is one that would:

  1. Accelerate or automate some of the artist’s work without significantly sacrificing photorealism, decreasing the cost per image.
  2. Generate new images at scale without requiring an artist to craft each one, decreasing the number of images an artist must create.

Here, we’ll discuss how improvements in computer graphics tools are powering the first part of this solution. In a future post, we’ll talk about the second.


Increasingly Powerful Tools

Early on at Datagen, we recognized that emerging advances in 3D rendering technology could resolve this quality-vs.-quantity trade-off. In other words, we identified the power of a solution that combines the work of artists with the scale of automated, algorithmically governed image generation.

This hybrid approach would not be possible without open-source 3D computer graphics software like Blender, and game engines like Unreal Engine and Unity, which have made key advances across the entire 3D pipeline: modeling, rigging, animating, simulating, rendering, compositing, and video editing. They also allow teams to integrate layers of automation and customization on top of the software’s core capabilities. With the proper configuration, this enables teams to change variables automatically at scale, while retaining levels of photorealism that were previously only achievable with manual, custom work.

Let’s explore some key elements of photorealism that must exist in high-quality synthetic data and how Blender and other tools have made them much more attainable. 

Texturing

For photorealistic 3D data, objects and humans must have both the correct 3D shape and the correct 2D color and textural detail. Put simply, texturing is the process of fitting a 2D image onto a 3D model to give it color and detail. The texture is a pixel image that is mapped onto the model’s surface in the graphics engine, along with parameters for reflection, refraction, and translucency. Its first and foremost aim is to convey the material the real object is made of and the physical light properties of the modeled object, such as smoothness or roughness. With advances in texturing tools, artists can more easily achieve stunning detail. In a cinematic context, these advances in texturing, driven by developments in machine learning, have allowed teams to create increasingly realistic digital humans with better simulation of pores, new representations of the human eye and its place in the eye socket, and much more. With more powerful computer graphics tools, artists can achieve better photorealism, faster.
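To make this concrete, here is a minimal sketch of what texturing looks like when scripted in Blender’s Python API (bpy): an image texture is plugged into a material’s base color, and surface parameters such as roughness are set alongside it. The material name, texture path, and parameter values are placeholder assumptions, not values from any particular pipeline.

```python
import bpy

# A minimal sketch: build a node-based material with an image texture and assign it
# to the active object. Names, paths, and values below are illustrative placeholders.
mat = bpy.data.materials.new(name="SkinMaterial")
mat.use_nodes = True
nodes = mat.node_tree.nodes
links = mat.node_tree.links

bsdf = nodes["Principled BSDF"]                 # default shader node in a new material
tex = nodes.new("ShaderNodeTexImage")           # image texture node
tex.image = bpy.data.images.load("//textures/skin_albedo.png")  # hypothetical texture file

# Feed the pixel image into the shader's base color, then tune how the surface responds to light
links.new(tex.outputs["Color"], bsdf.inputs["Base Color"])
bsdf.inputs["Roughness"].default_value = 0.45   # smoothness/roughness of the surface

# Assign the material to the currently selected object
obj = bpy.context.active_object
obj.data.materials.append(mat)
```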

Shading

In order to create a realistic object, shading approximates the behavior of light on the object’s surface. High-quality shading creates depth perception in 3D models by varying the level of darkness and brightness. Advances in shading algorithms have enabled us to create increasingly photorealistic imagery with less manual manipulation. The newest shading models in Unreal Engine add yet another level of realism and accuracy, specifically improving translucent, anisotropic, and clear-coat materials. The results are truly astounding.
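For intuition, the sketch below shows the simplest shading term of all, Lambertian diffuse shading, in which brightness falls off with the cosine of the angle between the surface normal and the direction to the light. Modern engine shading models layer far more sophisticated effects (anisotropy, clear coat, subsurface scattering) on top of this basic principle; the vectors and colors here are arbitrary example values.

```python
import numpy as np

def lambert_diffuse(normal, light_dir, light_color, albedo):
    """Lambertian diffuse shading: brightness scales with the cosine of the
    angle between the surface normal and the direction to the light."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    cos_theta = max(np.dot(n, l), 0.0)   # surfaces facing away from the light receive none
    return albedo * light_color * cos_theta

# Example: white light hitting a mid-grey surface at 45 degrees
shade = lambert_diffuse(np.array([0.0, 0.0, 1.0]),
                        np.array([0.0, 1.0, 1.0]),
                        np.array([1.0, 1.0, 1.0]),
                        np.array([0.5, 0.5, 0.5]))
```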

Lighting

Computer graphics lighting is the collection of techniques used to simulate light in computer graphics scenes. While lighting techniques offer flexibility in the level of detail and functionality available, they also operate at different levels of computational demand and complexity. At its most basic, digital lighting can be done with a point light. A point light emits light from a fixed point in virtual space equally in all directions (this is also called omnidirectional lighting). A point light is characterized by its position relative to the object and by its lighting power; it is a good approximation of a real-world light bulb.
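In a tool like Blender, adding such a light takes only a few lines of Python; the sketch below creates a point light with a position and wattage. The name, location, and energy value are placeholder assumptions.

```python
import bpy

# A minimal sketch: add a point light to the current scene.
# The name, position, and wattage are arbitrary example values.
light_data = bpy.data.lights.new(name="KeyLight", type='POINT')
light_data.energy = 1000.0                 # lighting power, in watts
light_obj = bpy.data.objects.new(name="KeyLight", object_data=light_data)
light_obj.location = (2.0, -3.0, 4.0)      # position relative to the scene origin
bpy.context.collection.objects.link(light_obj)
```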

Another way to simulate real illumination is the use of HDRIs (high-dynamic-range images), an image format that captures a broader spectrum of light and shadow. While a regular digital image contains only 8 bits of information per color channel (red, green, blue), giving 256 gradations per channel, the HDR format stores RGB colors with floating-point accuracy, so the variation from dark to light per channel is virtually unlimited. Using HDR images in a 3D environment produces very realistic and convincing shadows, reflections, and brightness.
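In Blender, for example, lighting a scene with an HDRI amounts to plugging an environment texture into the world’s background shader. A minimal sketch follows; the .hdr file path is a hypothetical placeholder.

```python
import bpy

world = bpy.context.scene.world
world.use_nodes = True
nodes = world.node_tree.nodes
links = world.node_tree.links

env = nodes.new("ShaderNodeTexEnvironment")                # HDRI environment texture node
env.image = bpy.data.images.load("//hdri/studio_4k.hdr")   # hypothetical HDR file
links.new(env.outputs["Color"], nodes["Background"].inputs["Color"])
nodes["Background"].inputs["Strength"].default_value = 1.0  # overall brightness of the environment
```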

The key advance of recent years has been the ability to achieve realistic lighting algorithmically. In particular, Global Illumination, a group of algorithms used in 3D graphics, has made images much more realistic. These algorithms account not only for the direct light described above but also for the subsequent bounces in which light rays from the same source are reflected and refracted by other surfaces in the scene. Such effects are incredibly difficult to add manually in a photorealistic way.

Additionally, computer graphics tools like Blender give teams the ability to build their own add-ons, designed for their specific needs. For instance, an add-on can control and modify lighting post-render, enabling fast, flexible, and largely automated modification of images.

Rigging

3D rigging is the process of creating an invisible skeleton that defines how objects move. A 3D rig is a system of invisible objects that are visible in the editor viewports but concealed in the final render. At its most basic, rigging consists of nulls and joints. Joints are the bones of the skeleton, while nulls are the cartilage that defines the range of movement. When correctly built, the joints and nulls form a logical moving element, and when bound to a humanoid mesh the rig can produce a realistic range of movements.

Inverse and forward kinematics define the direction in which a rig is driven during animation. Forward kinematics is controlled from the base joint of the chain outward and is relatively simple. Inverse kinematics begins with the endpoint of the motion and works backward through the rig. Advances in these techniques have enabled us to move humans in realistic ways, without jerky or cartoonish movement. And creating animation-friendly rigs has become simpler than ever: Unreal Engine’s rigging system, Control Rig, can be used to create complex animation rigs for any character type.
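For intuition, forward kinematics is just a chain of rotations and translations evaluated from the base joint outward. The sketch below computes the fingertip position of a simple two-joint planar arm from its joint angles; inverse kinematics would solve the opposite problem, finding the angles that place the fingertip at a desired target. The segment lengths and angles are arbitrary example values.

```python
import numpy as np

def forward_kinematics_2link(theta1, theta2, l1=0.3, l2=0.25):
    """End-effector position of a planar two-joint arm, evaluated from the
    base joint outward (forward kinematics). Angles in radians, lengths in meters."""
    elbow = np.array([l1 * np.cos(theta1), l1 * np.sin(theta1)])
    tip = elbow + np.array([l2 * np.cos(theta1 + theta2),
                            l2 * np.sin(theta1 + theta2)])
    return elbow, tip

# Example: shoulder at 45 degrees, elbow bent a further 30 degrees
elbow, tip = forward_kinematics_2link(np.deg2rad(45), np.deg2rad(30))
```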

The result is that teams can generate simulated data with motion without investing in a huge team of specialized animators.


Animation

In addition to rigging, animation techniques have become far more scalable. One great example is motion capture (MoCap), the process of recording the movements of objects or people and then using that information to animate digital models. Advances in MoCap allow us to capture the subtlest of movements and transfer them into computer graphics simulations. Not only are production times being slashed, but a future in which motion capture is possible without any markers is fast approaching.

Rendering

3D rendering is the process of translating 3D models back into 2D images that can be used for training. The images are generated based on sets of data defining the color, texture, and material of given objects. Photography is an apt analogy: a rendering engine points a camera at a 3D object and takes a photo. Advances in GPU power have made ray-tracing engines more powerful, more widespread, and much, much faster, even real-time. Previously, most real-time rendering used rasterization, in which objects are built from polygons connected to form a mesh, turned into a 3D model, and then converted into pixels on our 2D screen. Each pixel can be assigned an initial color value from the data stored in the polygonal vertices.
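The core of that conversion is perspective projection: each 3D vertex is divided by its distance from the camera and mapped to a pixel coordinate. The sketch below shows that single step in isolation, under simplifying assumptions (no clipping, no depth buffer, a hypothetical focal length and resolution).

```python
import numpy as np

def project_vertex(vertex, focal_length=1.0, width=640, height=480):
    """Perspective-project a 3D vertex (camera space, z < 0 in front of the camera)
    onto 2D pixel coordinates: the core step of rasterization."""
    x, y, z = vertex
    # Perspective divide: points farther away land closer to the screen center
    sx = (focal_length * x) / -z
    sy = (focal_length * y) / -z
    # Map from normalized screen coordinates [-1, 1] to pixel coordinates
    px = int((sx + 1.0) * 0.5 * width)
    py = int((1.0 - (sy + 1.0) * 0.5) * height)
    return px, py

# Example: a vertex two units in front of the camera, slightly up and to the right
pixel = project_vertex(np.array([0.5, 0.25, -2.0]))
```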

Ray tracing is different. In the physical world, objects are illuminated by different light sources, and photons can bounce between objects before reaching our eyes. Light is blocked by different things, creating shadows. Light is also reflected from object to object, as when we see the image of one object reflected in the surface of another.

There are also refractions, when light changes as a result of passing through transparent or translucent objects, like liquids or clouds. Ray tracing captures these effects by working backward from the view camera. The technique was first described by Arthur Appel in a seminal 1968 paper. Further developments showed how ray tracing could incorporate common filmmaking techniques, including motion blur, depth of field, translucency, and fuzzy reflections, that until then were solely the province of cameras. The combination of this research with the power of modern GPUs has led to computer-generated images that capture shadows, refractions, and reflections that are truly photorealistic, nearly indistinguishable from the real world.
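At its core, a ray tracer repeatedly asks: starting from the camera, what does this ray hit first? The sketch below shows the classic ray-sphere intersection test at the heart of that loop; a full renderer would then spawn secondary rays toward lights and other surfaces to capture shadows, reflections, and refractions. The scene values in the example are arbitrary.

```python
import numpy as np

def ray_sphere_intersect(origin, direction, center, radius):
    """Return the distance along the ray to the first hit on a sphere,
    or None if the ray misses. Solves |origin + t*d - center|^2 = radius^2."""
    d = direction / np.linalg.norm(direction)
    oc = origin - center
    b = 2.0 * np.dot(oc, d)
    c = np.dot(oc, oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0:
        return None                      # ray misses the sphere entirely
    t = (-b - np.sqrt(disc)) / 2.0       # nearest of the two intersection points
    return t if t > 0 else None

# Example: a ray from the camera origin straight down the -z axis toward a unit sphere
hit = ray_sphere_intersect(np.array([0.0, 0.0, 0.0]),
                           np.array([0.0, 0.0, -1.0]),
                           np.array([0.0, 0.0, -5.0]),
                           1.0)
```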

Generating Quality at Scale

Primarily, these advances in computer graphics have enabled video game creators, architectural rendering firms, and the movie industry to create mind-blowing visual simulations. But the scale of these simulations (even in epic gaming multiverses) is generally just a fraction of what’s needed for large-scale computer vision training.

Now, it’s time for the world of simulated data generation to harness these tools. Tools like Blender are designed to be integrated with and built upon, including through Python scripts and add-ons.
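As a rough illustration of what that integration looks like in practice, the sketch below uses Blender’s Python API to randomize a couple of scene variables (light energy and camera yaw) and render a small batch of images. The object names, ranges, batch size, and output path are all hypothetical placeholders, not a description of any particular pipeline.

```python
import bpy
import math
import random

# Assumes the scene already contains objects named "KeyLight" and "Camera";
# names, ranges, and paths here are illustrative placeholders.
light = bpy.data.objects["KeyLight"]
camera = bpy.data.objects["Camera"]

for i in range(10):  # a small batch; real datasets would scale this loop up
    light.data.energy = random.uniform(200.0, 2000.0)                 # vary lighting power
    camera.rotation_euler[2] = math.radians(random.uniform(-15, 15))  # vary camera yaw
    bpy.context.scene.render.filepath = f"//renders/sample_{i:04d}.png"
    bpy.ops.render.render(write_still=True)                           # render and save the frame
```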

As we mentioned before, we need to:

  1. Accelerate or automate some of the artist’s work without significantly sacrificing photorealism, decreasing the cost per image.
  2. Generate new images at scale without requiring an artist to craft each one, decreasing the number of images an artist must create.

Taken together, the advances in computer graphics over the past 40 years, from the famed Utah teapot to modern-day real-time photorealistic rendering, are truly breathtaking. Ray-tracing and path-tracing renders that used to take hours of computing time can now be produced in real time. The incredible rise of GPU power is enabling us to create truly photorealistic images in real time without cost-prohibitive artist labor.

But even though artist-created images can now be generated more efficiently and with higher photorealism, it is still not feasible for any team of artists to cost-effectively produce them in the hundreds of thousands or millions without additional tools that solve the challenge of scale. Soon, we’ll address challenge #2: how we can take these efficiently generated, photorealistic images and multiply them at scale.

Read our Survey on Synthetic Data: The Key to Production-Ready AI in 2022