Poster
GrounDiT: Grounding Diffusion Transformers via Noisy Image Patch Transplantation
Phillip Lee · Taehoon Yoon · Minhyuk Sung
We propose a novel zero-shot, bounding-box-based spatial grounding technique for text-to-image generation that leverages recent Transformer-based diffusion models. Zero-shot spatial grounding has drawn significant attention in recent literature owing to its diverse applications. Previous methods have attempted to ground generation by manipulating the latent image during the intermediate steps of the reverse diffusion process. A persistent challenge, however, has been generating a noisy latent patch at the matching noise level that fully encodes the target object at the given bounding-box size. For the first time, we demonstrate that a Transformer-based diffusion model can generate such a noisy image patch in a separate branch of the generation process; this patch is then transplanted into the image being generated in the main branch. Leveraging the flexibility of Transformers, we generate both a smaller image and a conventional-sized image simultaneously, allowing them to become semantic clones of each other while remaining realistic. At each denoising step, the noisy image patch is copied into the bounding box of the main image, yielding robust spatial grounding. In experiments on the HRS and DrawBench benchmarks, our method achieves state-of-the-art performance compared to previous zero-shot spatial grounding methods.
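The per-step transplantation described above is easy to picture in code. The following is a minimal sketch of the idea, not the authors' implementation: `denoise_step`, the latent sizes, the box coordinates, and the step count are all illustrative placeholders standing in for a real Transformer-based denoiser with text conditioning.

```python
import torch

def denoise_step(x: torch.Tensor, t: int) -> torch.Tensor:
    """Hypothetical stand-in for one reverse-diffusion step of a
    Transformer-based diffusion model. A real implementation would call
    the model with text conditioning; here we apply a small random update
    so the sketch runs end to end."""
    return x - 0.01 * torch.randn_like(x)

# Latent size and target bounding box (row0, col0, row1, col1) in latent
# coordinates -- all illustrative values, not taken from the paper.
C, H, W = 4, 64, 64
box = (16, 16, 48, 48)

main = torch.randn(C, H, W)                                 # main-branch noisy latent
patch = torch.randn(C, box[2] - box[0], box[3] - box[1])    # separate object branch

for t in reversed(range(50)):        # reverse diffusion, from noisy to clean
    # Denoise both branches at the same timestep, so their noise levels match.
    main = denoise_step(main, t)
    patch = denoise_step(patch, t)

    # Transplant: copy the object branch's noisy patch into the bounding box
    # of the main latent before the next denoising step.
    main[:, box[0]:box[2], box[1]:box[3]] = patch
```

Keeping both branches at the same timestep is what makes the raw copy valid: the transplanted patch carries the noise statistics the main branch expects at its next denoising step.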