Real-time capabilities of DSNeRF

About

Software Project for the TUM course “Machine Learning for 3D Geometry”

Depth-supervised NeRF (DSNeRF) is a state-of-the-art deep neural network method for implicit 3D scene representation from multi-view inputs.

Recent research has shown that passing input points through high-frequency functions before feeding the data to the network enables the network to accurately depict high-frequency regions of a scene. We investigate the effect of different such embeddings on the quality of the final output. Specifically, we analyse Gaussian Fourier feature mappings as well as approaches that leverage periodic activation functions, namely SIREN and SINONE.

Furthermore, we examine the possibility of enhancing DSNeRF’s real-time capabilities by implementing the concepts of FastNeRF.

Goal

The overall goal is to produce a DSNeRF-based application that is better suited to real-world use: getting closer to real-time rendering performance, achieving higher-quality reconstructions of difficult scenes, and requiring fewer viewpoints.

Theoretical Background

Neural networks are biased towards learning lower-frequency functions more easily than higher-frequency ones. Consequently, most scene representation networks, such as NeRF, struggle to capture the high-frequency variations in color and geometry of a scene and thus produce blurry renderings. To counteract this issue, the authors of the original NeRF propose using high-frequency functions to map the original input to a higher-dimensional space before passing it to the network. We are interested in comparing the quality of the Fourier feature mapping approaches Positional Encoding and Gauss Mapping, as well as networks that mimic this high-frequency mapping by introducing periodic activation functions, like SIREN and SINONE.
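
To make Gauss Mapping concrete, here is a minimal PyTorch sketch of the Gaussian Fourier feature mapping (following the Fourier feature formulation of Tancik et al.); the scale `sigma` and feature count `m` below are illustrative hyperparameters, not the values used in this project:

```python
import math
import torch

def gaussian_fourier_features(x, B):
    """Map coordinates x (N, d) to Fourier features via a random
    projection B (m, d) whose entries are drawn from N(0, sigma^2)."""
    proj = 2.0 * math.pi * x @ B.T                                # (N, m)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (N, 2m)

# B is sampled once and kept fixed during training.
sigma, m = 10.0, 256          # illustrative hyperparameters
B = sigma * torch.randn(m, 3)
points = torch.rand(1024, 3)  # 3D sample positions
features = gaussian_fourier_features(points, B)  # shape: (1024, 512)
```

Positional Encoding can be seen as the special case of this mapping where the rows of `B` are deterministic, axis-aligned powers of two rather than Gaussian samples.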

In our experiments, Gauss Mapping produces smoother renderings with fewer artifacts and captures the overall object shapes better than the original Positional Encoding. The periodic activation functions of SIREN and SINONE are unstable during training in combination with DSNeRF and fail to capture the original 3D scene well.
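
For reference, a SIREN network replaces the usual ReLU activations with scaled sine activations and relies on a specific weight initialization to train stably. A minimal sketch of one such layer, assuming the standard frequency factor omega_0 = 30 from the SIREN paper:

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by sin(omega_0 * x), initialized as in the
    SIREN paper (Sitzmann et al.) to keep activations well-distributed."""

    def __init__(self, in_features, out_features, is_first=False, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features                       # first layer
            else:
                bound = math.sqrt(6.0 / in_features) / omega_0  # hidden layers
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))
```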

FastNeRF proposes a network split that separates the original DSNeRF into two networks. This new architecture requires a three-dimensional and a two-dimensional input instead of a five-dimensional one, which makes caching feasible. Since the volume renderer can fall back on the cache and does not have to query the network at rendering time, interactive rendering speeds of 30 fps can be achieved.
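
The split can be sketched as below: a position-dependent network outputs a density and several RGB components, a direction-dependent network outputs mixing weights, and the two are combined by an inner product. Layer widths, depths, and the softmax/sigmoid output activations are illustrative assumptions rather than the exact FastNeRF/DSNeRF configuration, and input encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class FactorizedNeRF(nn.Module):
    """FastNeRF-style split: f_pos depends only on position, f_dir only on
    viewing direction, so each network can be evaluated and cached over its
    own low-dimensional input instead of the joint 5D input."""

    def __init__(self, d=8, hidden=256):
        super().__init__()
        self.d = d
        # f_pos: (x, y, z) -> density sigma and d RGB components (u, v, w).
        self.f_pos = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3 * d),
        )
        # f_dir: (theta, phi) -> d mixing weights beta.
        self.f_dir = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, pos, view_dir):
        out = self.f_pos(pos)                                    # (N, 1 + 3d)
        sigma = out[..., :1]                                     # density
        uvw = out[..., 1:].reshape(*pos.shape[:-1], self.d, 3)   # (N, d, 3)
        beta = torch.softmax(self.f_dir(view_dir), dim=-1)       # (N, d)
        rgb = torch.sigmoid((beta.unsqueeze(-1) * uvw).sum(-2))  # (N, 3)
        return rgb, sigma
```

Because each sub-network takes only a low-dimensional input, both can be evaluated once on a dense grid and cached; at render time the volume renderer only looks up and combines cached values.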

The Team