
Oral Session 4B: Diffusion-based Models

West Exhibition Hall C, B3
Thu 12 Dec 3:30 p.m. PST — 4:30 p.m. PST

Thu 12 Dec. 15:30 - 15:50 PST

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Ruiqi Gao · Aleksander Holynski · Philipp Henzler · Arthur Brussee · Ricardo Martin Brualla · Pratul Srinivasan · Jonathan Barron · Ben Poole

Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation.
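
Below is a minimal sketch of the two-stage pipeline the abstract describes: generate consistent novel views at target camera poses, then fit a renderable 3D representation to the combined real and generated views. The function bodies are placeholder stubs, and names like `sample_novel_views` and `reconstruct_scene` are illustrative assumptions, not the CAT3D release.

```python
# Sketch of a CAT3D-style two-stage pipeline (placeholder stubs, not the
# authors' code): novel-view generation followed by 3D reconstruction.
import numpy as np

def sample_novel_views(inputs: list[np.ndarray],
                       target_poses: list[np.ndarray]) -> list[np.ndarray]:
    """Stage 1 stub: a multi-view diffusion model would generate one
    consistent novel view per requested camera pose, conditioned on the
    input images. Here we just return blank frames of matching size."""
    h, w, c = inputs[0].shape
    return [np.zeros((h, w, c), dtype=np.float32) for _ in target_poses]

def reconstruct_scene(images: list[np.ndarray], poses: list[np.ndarray]):
    """Stage 2 stub: fit a renderable 3D representation (e.g., a NeRF)
    to the combined set of real and generated views."""
    return {"num_views": len(images)}  # placeholder "scene"

# Given any number of input images (here, one) and target viewpoints...
input_images = [np.zeros((256, 256, 3), dtype=np.float32)]
input_poses = [np.eye(4)]
target_poses = [np.eye(4) for _ in range(8)]  # 8 novel viewpoints

novel_views = sample_novel_views(input_images, target_poses)
scene = reconstruct_scene(input_images + novel_views,
                          input_poses + target_poses)
```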

Thu 12 Dec. 15:50 - 16:10 PST

Stylus: Automatic Adapter Selection for Diffusion Models

Michael Luo · Justin Wong · Brandon Trabucco · Yanping Huang · Joseph Gonzalez · Zhifeng Chen · Ruslan Salakhutdinov · Ion Stoica

Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high-fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. To generate high-quality images, this paper explores the problem of matching the prompt to a set of relevant adapters, built on recent work that highlights the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then further assembles adapters based on prompts' keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP/FID Pareto efficiency and is twice as preferred, with humans and multimodal models as evaluators, over the base model.
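
A self-contained sketch of the retrieve-then-compose idea in the abstract, not the authors' implementation: Stylus uses learned text embeddings (precomputed in StylusDocs), whereas here a toy bag-of-words cosine similarity stands in so the example runs without a model, and the adapter names and descriptions are invented.

```python
# Toy stand-in for Stylus-style adapter retrieval (not the authors' code).
from collections import Counter
from math import sqrt

adapters = {  # adapter name -> improved description (stage 1 output)
    "oil-paint-style": "thick oil painting brushstrokes on canvas texture",
    "cyberpunk-city": "neon lit futuristic cyberpunk city streets at night",
    "cute-corgi": "corgi dog character with a consistent identity",
}

def embed(text: str) -> Counter:
    """Toy embedding: word-count vector (stand-in for a text encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_adapters(prompt: str, k: int = 2) -> list[str]:
    # Stage 2: retrieve adapters whose descriptions best match the prompt.
    q = embed(prompt)
    ranked = sorted(adapters, key=lambda n: cosine(q, embed(adapters[n])),
                    reverse=True)
    # Stage 3 (composition) would further check each candidate against
    # individual prompt keywords before merging it into the base model.
    return ranked[:k]

print(select_adapters("a corgi walking through a neon cyberpunk city"))
# -> ['cyberpunk-city', 'cute-corgi']
```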

Thu 12 Dec. 16:10 - 16:30 PST

Best Paper Runner-up
Guiding a Diffusion Model with a Bad Version of Itself

Tero Karras · Miika Aittala · Tuomas Kynkäänniemi · Jaakko Lehtinen · Timo Aila · Samuli Laine

The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64×64 and 1.25 for 512×512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
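
The guidance rule itself is a one-line extrapolation; a minimal sketch follows, assuming both networks are denoisers D(x, sigma, cond) and that the stand-in callables below are placeholders. Setting w = 1 recovers the main model, while w > 1 pushes the prediction away from the weaker model's output; classifier-free guidance is the special case where the weak model is the unconditional one.

```python
# Minimal sketch of guiding a diffusion model with a weaker version of
# itself: extrapolate from the weak prediction toward the main one.
import torch

def autoguided_denoise(d_main, d_weak, x, sigma, cond, w: float = 2.0):
    """Guided prediction: D_weak + w * (D_main - D_weak).

    d_main, d_weak: callables (x, sigma, cond) -> denoised prediction,
    where d_weak is a smaller and/or less-trained copy of d_main.
    """
    pred_main = d_main(x, sigma, cond)
    pred_weak = d_weak(x, sigma, cond)
    return pred_weak + w * (pred_main - pred_weak)

# Toy usage with stand-in denoisers (placeholders, not real networks):
d_main = lambda x, sigma, cond: 0.9 * x
d_weak = lambda x, sigma, cond: 0.7 * x
x = torch.randn(1, 3, 64, 64)
out = autoguided_denoise(d_main, d_weak, x, sigma=1.0, cond=None)
```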