Neural Radiance Field (NeRF): A Gentle Introduction
A neural radiance field (NeRF) is a fully-connected neural network that can generate novel views of complex 3D scenes, based on a partial set of 2D images. It is trained to use a rendering loss to reproduce input views of a scene. It works by taking input images representing a scene and interpolating between them to render one complete scene. NeRF is a highly effective way to generate images for synthetic data.
A NeRF network is trained to map directly from viewing direction and spatial location (5D input) to opacity and color (4D output), using volume rendering to render new views. NeRF is a computationally-intensive algorithm, and processing of complex scenes can take hours or days. However, new algorithms are available that dramatically improve performance.
In this article
- NeRF and Neural Rendering: Basic Concepts
- How Neural Radiance Fields Work
- Improving NeRF Performance
NeRF and Neural Rendering: Basic Concepts
In order to understand how NeRF works, let’s cover some basic concepts.
Rendering is the process of creating an image from a 3D model. The model will contain features such as textures, shading, shadows, lighting, and viewpoints, and the role of the rendering engine is to process these features to create a realistic image.
Three common types of rendering algorithms are rasterization, which projects objects geometrically based on information in the model, without optical effects; ray casting, which calculates an image from a specific point of view using basic optical laws of reflection; and ray tracing, which uses Monte Carlo techniques to achieve a realistic image in a far shorter time. Ray tracing is used to improve rendering performance in NVIDIA GPUs.
Volume rendering enables you to create a 2D projection of a 3D discretely sampled dataset.
For a given camera position, a volume rendering algorithm obtains the RGBα (Red, Green, Blue, and Alpha channel) for every voxels in the space through which rays from the camera are casted. The RGBα color is converted to an RGB color and recorded in the corresponding pixel of the 2D image. The process is repeated for every pixel until the entire 2D image is rendered.
View synthesis is the opposite of volume rendering—it involves creating a 3D view from a series of 2D images. This can be done using a series of photos that show an object from multiple angles, create a hemispheric plan of the object, and place each image in the appropriate place around the object. A view synthesis function attempts to predict the depth given a series of images that describe different perspectives of an object.
How Neural Radiance Fields Work
A NeRF uses a sparse set of input views to optimize a continuous volumetric scene function. The result of this optimization is the ability to produce novel views of a complex scene. You can provide input for NeRF as a static set of images.
A continuous scene is a 5D vector-valued function with the following characteristics:
- Its input is a 3D location x = (x; y; z) and 2D viewing direction (θ; Φ)
- Its output is an emitted color c = (r; g; b) and volume density (α).
Image Source: NeRF Paper (Mildenhall1, Srinivasan1, Tancik, et. al)
Here is how you can generate a NeRF from a specific viewpoint:
- Generate a sampled set of 3D points—by marching camera rays through the scene.
- Produce an output set of densities and colors—by inputting your sampled points with their corresponding 2D viewing directions into the neural network.
- Accumulate your densities and colors into a 2D image—by using classical volume rendering techniques.
Image Source: Matthew Tancik
The process above optimizes a deep, fully-connected, multi-layer perceptron (MLP) but does not require using convolutional layers. It uses gradient descent to minimize errors between each observed image and all corresponding views rendered from the representation.
Improving NeRF Performance
The original NeRF model had several drawbacks—it was slow to train and render, only able to handle static scenes. It is also inflexible, because a NeRF model trained on one scene cannot be used for other scenes.
Here are several methods that build on NeRF and attempt to resolve some of its problems.
Neural Sparse Voxel Fields (NSVF)
NSVF is a neural scene representation that enables fast, high-quality rendering that is not dependent on a specific viewpoint. It works by defining voxel-bounded implicit fields organized in a sparse network of cells, and progressively learns voxel structures in each cell of the network. It can render new views much faster by skipping voxels with no scene content—this technique makes NSVF over ten times faster than the original NeRF.
Mip-NerF extends the original NeRF model with the objective of reducing blurring effects and visual artifacts. NeRF used a single ray per pixel, which often caused blurring or aliasing at different resolutions. Mip-NeRF uses a geometrical shape known as a conical frustum to render each pixel, instead of a ray, which reduces aliasing, makes it possible to show fine details in an image, and reduces error rates by between 17-60%. The model is also 7% faster than NeRF.
KiloNeRF addresses the problem of slow rendering in NeRF, which is mainly related to the need to query a deep MLP network millions of times. KiloNeRF separates the workload among thousands of small MLPs, instead of one large MLP which needs to be queried many times. Each small MLP represents part of a scene, enabling 3X performance improvement with lower storage requirements, and comparable visual quality.
Plenoptic voxels (Plenoxels) replace the MLP in the center of NeRF with a sparse 3D grid. Each query point is interpolated from its surrounding voxels. New 2D views are hence rendered without running a neural network, which greatly reduces complexity and computational requirements. Plenoxels provide a similar visual quality to NeRF while being two orders of magnitude faster.