One of the most intriguing challenges in computer vision is understanding dynamic scenes through the snapshots of a single moving camera. Imagine trying to digitally reconstruct, in 3-D, a lively street scene or the subtle movements of a dancer in full flow, all from a video or a series of snapshots taken from different angles. A model that could do this would be able to generate views from unseen camera angles, zoom in and out of the view, and create snapshots of the 3-D model at different time instances, unlocking a deeper understanding of the world around us in three dimensions.
Neural radiance fields (NeRFs), which use machine learning to represent 3-D scenes as color and density fields, have become a central technology for producing 3-D models from 2-D images. Even NeRFs, however, struggle to model dynamic scenes, because the problem is highly underconstrained: for a given set of snapshots, multiple dynamic scenes may be mathematically plausible, although some of them may not be realistic.
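At its core, a NeRF is a neural network that takes a 3-D position (and, in practice, a viewing direction) and returns a color and a volume density. The sketch below illustrates that basic idea in PyTorch; the layer sizes, positional-encoding depth, and omission of the viewing direction are simplifications chosen for illustration, not the configuration of any particular NeRF.

```python
# Minimal sketch of the core NeRF idea: a network that maps a 3-D position
# to a color and a volume density. Hyperparameters here are illustrative.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    """Lift coordinates to sines and cosines of increasing frequency."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * x))
        feats.append(torch.cos((2.0 ** k) * x))
    return torch.cat(feats, dim=-1)

class TinyNeRF(nn.Module):
    def __init__(self, num_freqs=6, hidden=128):
        super().__init__()
        in_dim = 3 * (1 + 2 * num_freqs)      # encoded (x, y, z)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # RGB + density
        )

    def forward(self, xyz):
        out = self.mlp(positional_encoding(xyz))
        rgb = torch.sigmoid(out[..., :3])      # colors in [0, 1]
        sigma = torch.relu(out[..., 3:])       # nonnegative density
        return rgb, sigma

rgb, sigma = TinyNeRF()(torch.rand(1024, 3))   # query 1,024 sample points
```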
In a recent breakthrough presented at the annual meeting of the Association for the Advancement of Artificial Intelligence (AAAI), we introduce a novel approach that significantly advances our ability to capture and model scenes with complex dynamics. Our work not only addresses previous limitations but also opens doors to new applications ranging from virtual reality to digital preservation.
Our method factorizes time and space in dynamic scenes, allowing us to model 3-D scenes with changing lighting and texture conditions more efficiently. In essence, we treat dynamic 3-D scenes as high-dimensional time-varying signals and impose mathematical constraints on them to produce realistic solutions. In tests, we’ve seen improvements in motion localization and in the separation of light and density fields, enhancing the overall quality and fidelity of the 3-D models we can produce relative to existing technologies.
Bandlimited radiance fields
The radiance field of a 3-D scene can be decomposed into two types of lower-dimensional fields: light fields and density fields. The light field describes the direction, intensity, and energy of light at every point in the visual field. The density field describes the volumetric density of whatever is reflecting or emitting light at the relevant points. Together, they amount to assigning each 3-D location in a scene a color value and a probability that an object occupies it. Classical rendering techniques can then easily create a 3-D model from this representation.
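To make "classical rendering techniques" concrete, here is a toy sketch of volume rendering along a single camera ray: per-sample colors are alpha-composited according to the densities, yielding one pixel of the rendered image. The sample count and segment lengths are arbitrary, for illustration only.

```python
# Toy volume rendering along one ray: composite per-sample colors and
# densities into a single pixel color. Numbers are illustrative.
import torch

def render_ray(rgb, sigma, deltas):
    """rgb: (N, 3) colors; sigma: (N,) densities; deltas: (N,) segment lengths."""
    alpha = 1.0 - torch.exp(-sigma * deltas)           # opacity of each segment
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)  # light surviving each segment
    trans = torch.cat([torch.ones(1), trans[:-1]])     # transmittance *before* each segment
    weights = alpha * trans                            # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)         # composited pixel color

n = 64                                                 # samples along one ray
pixel = render_ray(torch.rand(n, 3), torch.rand(n), torch.full((n,), 0.05))
```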
In essence, our approach models the light and density fields of a 3-D scene as bandlimited, high-dimensional signals, where “bandlimited” means that signal energy outside of particular frequency bands is filtered out. A bandlimited signal can be represented as a weighted sum of basis functions, or functions that describe canonical waveforms; the sinusoids of the Fourier decomposition are the most familiar basis functions.
Imagine that the state of the 3-D scene changes over time due to the dynamics of the objects within it. Each state can be reconstructed as a unique weighted sum of a particular set of basis functions. By treating the weights as functions of time, we can obtain a time-varying weighted sum, which we use to reconstruct the state of the 3-D scene.
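As a toy illustration of this idea, in one spatial dimension rather than the high-dimensional fields we actually model, the snippet below builds a bandlimited signal from a handful of fixed Fourier basis functions whose weights vary smoothly over time. The basis size and the weight trajectories are made up for illustration.

```python
# Toy bandlimited, time-varying signal: f(x, t) = sum_k w_k(t) * b_k(x).
# The basis and weight trajectories below are hand-picked illustrations.
import numpy as np

x = np.linspace(0.0, 1.0, 256)                                       # spatial coordinate
bases = np.stack([np.sin(2 * np.pi * k * x) for k in range(1, 5)])   # 4 low-frequency bases

def weights_at(t):
    """Hypothetical smooth weight trajectories w_k(t)."""
    return np.array([np.cos(t), 0.5 * np.sin(t), 0.25 * np.cos(2 * t), 0.1])

def signal_at(t):
    """Reconstruct the signal's state at time t."""
    return weights_at(t) @ bases

frames = np.stack([signal_at(t) for t in np.linspace(0, 2 * np.pi, 30)])
print(frames.shape)   # (30, 256): 30 time steps of a bandlimited signal
```

In this toy, both the basis functions and the weight trajectories are fixed by hand.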
In our case, we learn both the weights and the basis functions end-to-end. Another key aspect of our approach is that, rather than modeling the radiance field as a whole, as NeRFs typically do, we model the light and density fields separately. This allows us to model changes in objects’ shapes or movements independently of changes in lighting or texture.
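The sketch below shows one plausible way to realize this factorization; it reflects our assumptions for illustration, not the exact architecture in our paper. Each field is a weighted sum of learned spatial basis functions, with the weights produced by a small network that takes time as input, and the light and density fields get separate copies of this machinery (view dependence is omitted for brevity).

```python
# Hedged sketch of a factorized dynamic field: learned spatial bases b_k(x)
# combined with learned time-varying weights w_k(t), with separate models
# for density and light. This is an illustration, not the paper's code.
import torch
import torch.nn as nn

class FactorizedField(nn.Module):
    """One field (light or density) as a time-weighted sum of learned bases."""
    def __init__(self, out_dim, num_bases=8, hidden=128):
        super().__init__()
        self.basis_net = nn.Sequential(        # b_k(x): learned spatial bases
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bases * out_dim),
        )
        self.weight_net = nn.Sequential(       # w_k(t): learned time-varying weights
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bases),
        )
        self.num_bases, self.out_dim = num_bases, out_dim

    def forward(self, xyz, t):
        b = self.basis_net(xyz).view(-1, self.num_bases, self.out_dim)
        w = self.weight_net(t).unsqueeze(-1)   # (batch, num_bases, 1)
        return (w * b).sum(dim=1)              # sum_k w_k(t) * b_k(x)

density_field = FactorizedField(out_dim=1)     # volumetric density
light_field = FactorizedField(out_dim=3)       # RGB radiance
xyz, t = torch.rand(1024, 3), torch.rand(1024, 1)
sigma = torch.relu(density_field(xyz, t))
rgb = torch.sigmoid(light_field(xyz, t))
```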
In our paper, we also show that traditional NeRF technology, while providing exceptional results for static scenes, often falters with dynamics, conflating aspects of the signal such as lighting and movement. Our solution draws inspiration from the established field of nonrigid structure from motion (NRSFM), which has been refining our grasp of moving scenes for decades.
Specifically, we integrate robust mathematical priors from NRSFM, such as the temporal clustering of motion to restrict it to a low-dimensional subspace. Essentially, this ensures that the state of the 3-D scene changes smoothly over time, along very low-dimensional manifolds, instead of undergoing erratic changes unlikely to occur in real-world scenarios.
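Priors like these can be expressed as simple penalties on the matrix of time-varying weights. The sketch below is illustrative rather than a reproduction of our actual losses: a nuclear-norm penalty pushes the weights toward a low-dimensional subspace, and a finite-difference penalty keeps them changing smoothly over time.

```python
# Illustrative priors on time-varying basis weights W (T time steps x K bases):
# a low-rank penalty and a temporal-smoothness penalty. Not the paper's losses.
import torch

def low_rank_penalty(W):
    """Nuclear norm of W encourages the weights to lie in a low-dim subspace."""
    return torch.linalg.norm(W, ord="nuc")

def temporal_smoothness(W):
    """Penalize large frame-to-frame changes in the weights."""
    return ((W[1:] - W[:-1]) ** 2).mean()

W = torch.rand(30, 8, requires_grad=True)      # 30 time steps, 8 basis weights
loss = low_rank_penalty(W) + 0.1 * temporal_smoothness(W)
loss.backward()                                # both penalties are differentiable
```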
In our experiments, across a variety of dynamic scenes that feature complex, long-range movements, light changes, and texture changes, our framework has consistently delivered models that are not just visually stunning but also rich in detail and faithful to their sources. We’ve observed reductions in artifacts, more accurate motion capture, and an overall increase in realism, with improvements in texture and lighting representation that significantly elevate the models’ quality. We rigorously tested our model in both synthetic and real-world scenarios, as can be seen in the examples below.
Synthetic scenes
A comparison of BLIRF (Ours), ground truth (GT), and several NeRF implementations on synthetic dynamic scenes.
Real-world scene
A comparison of BLIRF (Ours) and several NeRF implementations on real-world images of a cat in motion.
As we continue to refine our approach and explore its applications, we’re excited about the potential to revolutionize how we interact with digital worlds, making them more immersive, lifelike, and accessible.