Poster
DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features
Letian Wang · Seung Wook Kim · Jiawei Yang · Cunjun Yu · Boris Ivanovic · Steven Waslander · Yue Wang · Sanja Fidler · Marco Pavone · Peter Karkus
We propose DistillNeRF, a self-supervised learning framework that addresses the challenge of understanding 3D environments from limited 2D observations in autonomous driving. Our method is a generalizable feedforward model that predicts rich neural scene representations from sparse, single-frame multi-view camera inputs alone. To learn 3D geometry from sparse unlabeled observations, our key insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) to generate dense depth and virtual camera targets, and to train our model with differentiable rendering. To learn a generally useful 3D representation, we further propose distilling features from pre-trained 2D foundation models, such as CLIP or DINO, thereby eliminating the need for costly 3D human annotations. By combining an offline per-scene optimization stage with a distillation stage that trains a shared encoder, our DistillNeRF predicts rich 3D feature volumes that support various downstream tasks. Extensive experimental results on the NuScenes dataset demonstrate that DistillNeRF significantly outperforms existing comparable methods on scene reconstruction, novel view synthesis, and depth estimation; it also enables competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through the distilled foundation model features. Demos and code will be available on the anonymous project page.
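To make the two training signals described in the abstract concrete, the following is a minimal, hedged sketch of a distillation-stage update: rendering losses against NeRF-generated dense depth and virtual views, plus feature distillation from a frozen 2D foundation model. All module and key names (SceneEncoder-style `encoder`, `renderer`, `foundation_model`, batch keys) are hypothetical placeholders, not the authors' actual implementation.

```python
# Illustrative sketch only; names and interfaces are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def distillation_training_step(encoder, renderer, foundation_model, batch):
    """One training update for a feedforward scene encoder (hypothetical interface)."""
    images, cam_poses = batch["images"], batch["poses"]       # single-frame multi-view camera inputs
    dense_depth = batch["nerf_depth"]                         # dense depth from per-scene optimized NeRFs
    virtual_views = batch["nerf_virtual_rgb"]                 # virtual-camera renderings from those NeRFs

    # 1) Feedforward prediction of a 3D feature volume from sparse multi-view images.
    feature_volume = encoder(images, cam_poses)

    # 2) Differentiable rendering, supervised by NeRF-generated depth and virtual views.
    rgb_pred, depth_pred, feat_pred = renderer(feature_volume, cam_poses)
    loss_rgb = F.l1_loss(rgb_pred, virtual_views)
    loss_depth = F.l1_loss(depth_pred, dense_depth)

    # 3) Feature distillation: match rendered features to a frozen 2D foundation model (e.g., CLIP or DINO).
    with torch.no_grad():
        feat_target = foundation_model(images)
    loss_distill = F.mse_loss(feat_pred, feat_target)

    return loss_rgb + loss_depth + loss_distill
```

In practice the three terms would be weighted, but the sketch is only meant to show how the rendering supervision and the foundation-model distillation combine into a single self-supervised objective.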