Poster
in
Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
Efficient Fine-Tuning of Image-Conditional Diffusion Models for Depth and Surface Normal Estimation
Gonzalo Martin Garcia · Karim Abou Zeid · Christian Schmidt · Daan de Geus · Alexander Hermans · Bastian Leibe
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. We show that the inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configurations while being more than 200× faster. Furthermore, we show that end-to-end finetuning with task-specific losses enables deterministic single-step inference, outperforming previous diffusion-based depth and normal estimation models on common zero-shot benchmarks. This fine-tuning scheme works similarly well on Stable Diffusion directly.