Since the release of NeRF at ECCV 2020, there have been more and more NeRF papers published each year. Let's have a look at two papers about rendering performance.
In this article, I'll detail two NeRF papers dealing with rendering performance that caught my attention during the CVPR 2023 conference.
Please note that this selection is highly subjective, and I want to emphasize that it was challenging to choose only two papers. Covering too many in the same article could overwhelm the reader, so I regretfully had to exclude notable papers despite their substantial contributions to the field.
I’m assuming you’re already familiar with NeRF. If not, I recommend referring to my Medium article for an introduction to NeRF (5 things you must know about Neural radiance fields ⚡).
1. MobileNeRF (award candidate)
(Zhiqin Chen, Thomas Funkhouser, Peter Hedman, Andrea Tagliasacchi)
Storing learned features
NeRF is highly effective at generating novel views, but its volumetric approach based on ray marching fundamentally differs from traditional rendering pipelines and cannot take advantage of the acceleration offered by widely available graphics hardware.
To overcome this, grid-based methods such as Instant-NGP or TensoRF speed up rendering by caching learned features at discrete voxel coordinates and trilinearly interpolating between them to evaluate continuous coordinates. This enables a smaller, and thus faster, MLP to predict density and radiance, since a significant amount of information is already stored in the features.
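To make this concrete, here is a minimal NumPy sketch of the grid-based recipe: features cached at voxel corners are trilinearly interpolated at a continuous query point, and the result is what a small decoder MLP would consume. Names and shapes are illustrative, not taken from either paper.

```python
import numpy as np

def trilinear_lookup(feature_grid, point):
    """Interpolate a feature vector at a continuous point in [0, 1]^3.

    feature_grid: (R, R, R, C) features cached at discrete voxel corners.
    point: (3,) continuous query coordinate.
    """
    R = feature_grid.shape[0]
    pos = point * (R - 1)               # position in grid units
    i0 = np.floor(pos).astype(int)      # lower corner of the enclosing voxel
    i1 = np.minimum(i0 + 1, R - 1)      # upper corner (clamped at the border)
    w = pos - i0                        # interpolation weights in [0, 1)

    out = 0.0
    # Blend the 8 corner features of the enclosing voxel.
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                corner = feature_grid[i1[0] if dx else i0[0],
                                      i1[1] if dy else i0[1],
                                      i1[2] if dz else i0[2]]
                weight = ((w[0] if dx else 1 - w[0])
                          * (w[1] if dy else 1 - w[1])
                          * (w[2] if dz else 1 - w[2]))
                out = out + weight * corner
    return out

# Toy usage: a 32^3 grid of 8-dim features and one query point.
grid = np.random.rand(32, 32, 32, 8).astype(np.float32)
feat = trilinear_lookup(grid, np.array([0.3, 0.7, 0.1]))
print(feat.shape)  # (8,) -- ready to be decoded by a small MLP
```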
MobileNeRF builds upon this idea and extends it to fully leverage the parallel polygon rasterization pipeline. As the image below illustrates, it introduces a novel NeRF representation that stores the features within a textured mesh instead of a voxel grid. A small MLP is then called once per pixel to transform the rasterized features into color. From a memory perspective, storing features on 2D surfaces also makes more sense than storing them in a sparse 3D volume that contains empty or unreachable space.
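In effect, rendering becomes a deferred shading pass: rasterization writes a feature (instead of a color) to each pixel, and a tiny view-dependent MLP turns it into RGB. Below is a hedged NumPy sketch of that per-pixel decoding step; the two-layer architecture and the weight shapes are illustrative assumptions, not the exact network from the paper.

```python
import numpy as np

def decode_pixel(feature, view_dir, W1, b1, W2, b2):
    """Deferred shading step: rasterized feature + view direction -> RGB.

    feature: (C,) feature rasterized from the mesh texture at this pixel.
    view_dir: (3,) unit viewing direction.
    W1, b1, W2, b2: weights of a tiny 2-layer MLP (illustrative shapes).
    """
    x = np.concatenate([feature, view_dir])
    h = np.maximum(W1 @ x + b1, 0.0)             # ReLU hidden layer
    rgb = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # sigmoid keeps colors in [0, 1]
    return rgb

# Toy usage: 8-dim features, 16 hidden units.
C, H = 8, 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(H, C + 3)), np.zeros(H)
W2, b2 = rng.normal(size=(3, H)), np.zeros(3)
print(decode_pixel(rng.random(C), np.array([0.0, 0.0, 1.0]), W1, b1, W2, b2))
```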
Implementation details
The z-buffer algorithm used to render polygons in parallel does not handle semi-transparency well, because it only stores the closest depth value per pixel and therefore cannot accurately blend partially transparent surfaces. Consequently, the opacity texture must be binarized, i.e., each texel (pixel in the texture) is either fully transparent or fully opaque. However, training directly with binary opacities would struggle to converge, and applying binarization after training would significantly degrade quality. The authors were mindful of this and proceeded in multiple optimization stages.
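One common trick for training through such a discretization step is a straight-through estimator: the forward pass uses the binarized opacity while gradients flow through the continuous value. The PyTorch snippet below only illustrates that general idea; MobileNeRF's actual multi-stage schedule is more elaborate.

```python
import torch

def binarize_ste(alpha):
    """Binarize opacity in the forward pass while keeping the gradient of
    the continuous value in the backward pass (straight-through estimator)."""
    hard = (alpha > 0.5).float()
    # Forward value is `hard`, but the detach makes d(output)/d(alpha) = 1.
    return hard - alpha.detach() + alpha

alpha = torch.tensor([0.2, 0.7, 0.49], requires_grad=True)
binary = binarize_ste(alpha)
binary.sum().backward()
print(binary.detach())  # tensor([0., 1., 0.])
print(alpha.grad)       # tensor([1., 1., 1.])
```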
To store the texture as a PNG, the floating-point features are also quantized to 8 bits within the range [0, 1].
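For reference, here is a minimal sketch of such an 8-bit quantization round trip; the exact packing of features into PNG channels may differ from what the authors ship.

```python
import numpy as np

def quantize_u8(features):
    """Map float features in [0, 1] to 8-bit integers for PNG storage."""
    return np.round(np.clip(features, 0.0, 1.0) * 255.0).astype(np.uint8)

def dequantize_u8(texels):
    """Recover approximate float features at render time."""
    return texels.astype(np.float32) / 255.0

feat = np.random.rand(4)
print(feat, dequantize_u8(quantize_u8(feat)))  # values match to about 1/255
```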
The authors found that supersampling provides a convincing anti-aliasing scheme against the high frequencies introduced by the binary opacities, with reasonable computational overhead. The features are rasterized at twice the image resolution so that each pixel in the final image has its corresponding 2×2 patch of features. Instead of inputting these four features individually into the MLP and averaging the colors, the authors propose pre-averaging the features and invoking the MLP only once. This approach reduces the number of MLP calls by a factor of four while achieving comparable visual outcomes.
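A hedged sketch of that optimization: average the four supersampled features first, then decode once per output pixel. The helper below is illustrative, and the commented-out call stands in for a batched version of the small decoding MLP sketched earlier.

```python
import numpy as np

def average_2x2(features_hi):
    """Average 2x2 patches of a supersampled feature image.

    features_hi: (2H, 2W, C) features rasterized at twice the output resolution.
    returns: (H, W, C) pre-averaged features, one per output pixel.
    """
    H2, W2, C = features_hi.shape
    blocks = features_hi.reshape(H2 // 2, 2, W2 // 2, 2, C)
    return blocks.mean(axis=(1, 3))

features_hi = np.random.rand(800, 1200, 8).astype(np.float32)
features = average_2x2(features_hi)  # (400, 600, 8): one MLP call per pixel
# rgb = decode_pixel_batch(features, view_dirs, ...)  # hypothetical batched decode
print(features.shape)
```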
The topology of the mesh is fixed. It starts as a regular Euclidean lattice, and only the vertex positions are optimized. Caution: the final mesh should not be interpreted as actual surface geometry. It primarily focuses on rendering accurate features and colors rather than representing the underlying geometry.
Each mesh triangle is assigned a texture of K×K texels, with K a fixed hyperparameter shared by all triangles. This means that, regardless of what happens during training, every triangle has the same representational capacity.
Device compatibility
Since it uses the standard GPU rasterization pipeline, MobileNeRF runs in real time on a wide range of devices from a simple HTML webpage, as illustrated in the image below. It only requires the following files:
- an OBJ for the mesh
- PNGs for the textures (features + binary opacities)
- a JSON for the weights of the small view-dependent MLP that will be implemented in a fragment shader
It runs at 20 FPS on a Chromebook and a Surface Pro, which are not particularly known for their impressive hardware performance.
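How the MLP weights end up in that JSON file is not detailed here, but a minimal export might look like the sketch below; the layer names and shapes are my assumptions, not the format actually shipped by the authors.

```python
import json
import numpy as np

def export_mlp_to_json(weights, path):
    """Serialize weight matrices and biases so the web viewer can embed
    them in its fragment shader at load time."""
    payload = {name: np.asarray(w).tolist() for name, w in weights.items()}
    with open(path, "w") as f:
        json.dump(payload, f)

# Toy 2-layer view-dependent MLP (shapes are illustrative).
rng = np.random.default_rng(0)
export_mlp_to_json(
    {"W1": rng.normal(size=(16, 11)), "b1": np.zeros(16),
     "W2": rng.normal(size=(3, 16)), "b2": np.zeros(3)},
    "mlp_weights.json",
)
```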
2. NeRFLight
FPS/MB
One remarkable aspect of the original NeRF paper is the incredibly small memory footprint needed to store an entire scene: approximately 5 MB, i.e., only the MLP weights. It's also incredibly slow, and accelerated variants tend to sacrifice memory efficiency for speed by storing explicit grids of features.
NeRFLight proposes a novel metric that measures the ratio between rendering speed (FPS) and memory footprint (MB), aiming to address the need for optimizing both aspects simultaneously. This optimization becomes crucial for deploying NeRF in applications with limited bandwidth. The diagram below illustrates that NeRFLight can run at 180 FPS using only 14 MB, i.e., roughly 13 FPS/MB.
At first glance, TensoRF appears to be a promising contender for this metric, since it is a feature-based NeRF that keeps the memory footprint in check through tensor decomposition. However, the decomposition adds complexity to feature retrieval, which slows down rendering.
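To see where that extra work comes from, here is a hedged sketch of feature retrieval under a CP-style factorization, a simpler cousin of TensoRF's vector-matrix decomposition: each query turns into several 1D interpolations whose results are multiplied and summed over the rank components.

```python
import numpy as np

def interp_1d(values, x):
    """Linearly interpolate into a 1D factor line, x in [0, 1]."""
    pos = x * (len(values) - 1)
    i0 = int(np.floor(pos))
    i1 = min(i0 + 1, len(values) - 1)
    w = pos - i0
    return (1 - w) * values[i0] + w * values[i1]

def cp_feature(vx, vy, vz, point):
    """CP-style lookup: sum over R rank-1 components, each the product of
    three 1D interpolations (one per axis).

    vx, vy, vz: (R, N) factor lines along the x, y and z axes.
    """
    x, y, z = point
    return sum(interp_1d(vx[r], x) * interp_1d(vy[r], y) * interp_1d(vz[r], z)
               for r in range(vx.shape[0]))

R, N = 16, 128
rng = np.random.default_rng(0)
vx, vy, vz = (rng.normal(size=(R, N)) for _ in range(3))
print(cp_feature(vx, vy, vz, np.array([0.2, 0.8, 0.5])))
```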
Similarly, Instant-NGP is penalized in speed by the multiple linear interpolations required by the multiresolution hash grid.
Shared feature grid
Since a scene usually contains a lot of empty space, allocating memory for every voxel would be inefficient. An acceleration structure such as an octree works well, but it requires prior knowledge of the scene geometry.
The main idea driving the use of a compact hash table in Instant-NGP is to let the model itself learn the optimal distribution of representational capacity across the scene under a fixed memory budget.
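For intuition, here is a sketch of the kind of spatial hash Instant-NGP uses at its finer resolution levels: integer grid-corner coordinates are multiplied by large primes, XOR-ed together and reduced modulo the table size, so colliding corners simply share one learned feature entry.

```python
import numpy as np

# Primes from the Instant-NGP spatial hash (the first axis uses 1).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_corner(ix, iy, iz, table_size):
    """Hash an integer grid corner into a fixed-size feature table.

    Colliding corners share one entry; training decides which regions of
    the scene deserve the limited capacity.
    """
    h = (np.uint64(ix) * PRIMES[0]) ^ (np.uint64(iy) * PRIMES[1]) ^ (np.uint64(iz) * PRIMES[2])
    return int(h % np.uint64(table_size))

T = 2 ** 19  # fixed table size = fixed memory budget
feature_table = np.random.rand(T, 2).astype(np.float32)
print(feature_table[hash_corner(12, 345, 6789, T)])
```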
NeRFLight also relies on an explicit feature grid, but a smaller shared dense feature grid is tiled across the entire scene, compelling the model to reuse the same features multiple times.
The image below illustrates how NeRFLight splits the scene into eight regular regions. Conflicts are resolved by learning a different density encoder for each sub-region, each providing predictions for density and an additional feature vector (used to predict the color). The color encoder, however, remains the same for the entire scene.
Instead of repeating the same feature grid side by side eight times, they arrange it symmetrically to avoid discontinuities at the boundaries, which creates a central symmetry around the origin.
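My reading of that symmetric tiling is sketched below: a point is mirrored across the axis planes into a single octant before querying the shared grid, while its sign pattern selects the per-octant density decoder. The mirroring scheme and all names are assumptions for illustration, and a nearest-corner lookup stands in for proper trilinear interpolation.

```python
import numpy as np

def shared_grid_query(point, feature_grid, density_decoders):
    """Query one shared feature grid that tiles the scene symmetrically.

    point: (3,) world coordinate in [-1, 1]^3.
    feature_grid: (R, R, R, C) dense grid covering a single octant, reused 8x.
    density_decoders: list of 8 per-octant decoders (callables).
    """
    # Mirror into the positive octant: reflecting across the axis planes
    # keeps the feature field continuous at the sub-region boundaries.
    local = np.abs(point)
    # The sign pattern picks which of the 8 sub-region decoders to use.
    octant = int(point[0] < 0) * 4 + int(point[1] < 0) * 2 + int(point[2] < 0)
    # Nearest-corner lookup for brevity; real code would interpolate trilinearly.
    idx = np.clip(np.round(local * (feature_grid.shape[0] - 1)).astype(int),
                  0, feature_grid.shape[0] - 1)
    feat = feature_grid[idx[0], idx[1], idx[2]]
    return density_decoders[octant](feat)

# Toy usage: one shared 32^3 grid and 8 tiny linear "decoders".
rng = np.random.default_rng(0)
grid = rng.random((32, 32, 32, 8), dtype=np.float32)
decoders = [(lambda f, w=rng.random(8): float(w @ f)) for _ in range(8)]
print(shared_grid_query(np.array([-0.3, 0.7, -0.1]), grid, decoders))
```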
Note that this can't easily be generalized to a number of sub-regions other than 8, even though an arbitrary number N is used in the paper.
As in DIVeR, deterministic volume integration is performed to improve accuracy without memory overhead. Instead of randomly sampling points for feature interpolation, this approach integrates the trilinear interpolation along the intervals formed by the intersections of a ray with the voxel grid.
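A hedged sketch of the per-interval computation: inside a single voxel, the trilinearly interpolated value is a cubic polynomial in the ray parameter, so Simpson's rule over the interval integrates it exactly. DIVeR derives its own closed-form expression; this SciPy-based version only illustrates the deterministic-integration idea.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def trilerp(grid, p):
    """Trilinearly interpolate a (R, R, R) density grid at p in [0, 1]^3."""
    coords = (p * (grid.shape[0] - 1)).reshape(3, 1)
    return map_coordinates(grid, coords, order=1)[0]

def integrate_interval(grid, p_in, p_out):
    """Deterministic density integral over one ray-voxel interval.

    Both endpoints are assumed to lie inside the same voxel (the ray is
    split at voxel boundaries). Within one voxel, trilinear interpolation
    is a cubic along the ray, so Simpson's rule is exact.
    """
    length = np.linalg.norm(p_out - p_in)
    mid = 0.5 * (p_in + p_out)
    return length / 6.0 * (trilerp(grid, p_in)
                           + 4.0 * trilerp(grid, mid)
                           + trilerp(grid, p_out))

grid = np.random.rand(16, 16, 16)
print(integrate_interval(grid, np.array([0.10, 0.20, 0.30]),
                         np.array([0.12, 0.25, 0.33])))
```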
Conclusion
I hope you enjoyed reading this article and that it gave you more insights on NeRF!
See more of my code on GitHub.