

Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning

Prompt Learning Based Adaptor for Enhanced Video Editing with Pretrained Text-to-Image Diffusion Models

Yangfan He · Sida Li · Jianhui Wang


Abstract:

The rapid advancement of text-to-image generation technologies based on diffusion models has produced remarkable results, driving the emergence of video editing applications built upon these pretrained models. To achieve temporal consistency despite independent frame-wise text-to-image generation, existing video editing models either fine-tune temporal layers or propagate temporal features at test time without additional training. While these approaches show promise, the frame independence of text-to-image generation remains a bottleneck to delivering consistent, high-quality video outputs. In this paper, we propose a lightweight adaptor utilizing prompt learning to enhance video editing performance with minimal training cost. Our approach introduces shared prompt tokens to improve editing capabilities and unshared frame-specific tokens to impose consistency constraints across frames. The adaptor integrates seamlessly into existing video editing pipelines, offering significant improvements in temporal coherence and overall video quality, and benefiting a broad spectrum of downstream video editing algorithms.
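To make the design concrete, the following is a minimal sketch (in PyTorch) of what such a prompt-learning adaptor could look like: a small set of shared learnable tokens reused across all frames plus unshared per-frame tokens, prepended to the frozen text encoder's embeddings before they condition the diffusion model's cross-attention. The class name PromptAdaptor, the token counts, and the training setup are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class PromptAdaptor(nn.Module):
    """Hypothetical prompt-learning adaptor: learnable tokens prepended to
    frozen text embeddings that condition a pretrained text-to-image diffusion model."""

    def __init__(self, num_frames: int, num_shared: int = 8,
                 num_frame_specific: int = 4, embed_dim: int = 768):
        super().__init__()
        # Shared tokens: one set reused for every frame (editing capability).
        self.shared_tokens = nn.Parameter(torch.randn(num_shared, embed_dim) * 0.02)
        # Unshared tokens: a distinct set per frame (cross-frame consistency constraint).
        self.frame_tokens = nn.Parameter(
            torch.randn(num_frames, num_frame_specific, embed_dim) * 0.02)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (num_frames, seq_len, embed_dim), the frozen text encoder
        # output repeated once per frame of the video being edited.
        f = text_embeds.shape[0]
        shared = self.shared_tokens.unsqueeze(0).expand(f, -1, -1)
        frame_specific = self.frame_tokens[:f]
        # Prepend learned tokens; only these parameters are trained.
        return torch.cat([shared, frame_specific, text_embeds], dim=1)


# Usage: only the adaptor's tokens receive gradients; the diffusion backbone stays frozen.
adaptor = PromptAdaptor(num_frames=16)
text_embeds = torch.randn(16, 77, 768)   # stand-in for CLIP text encoder output per frame
conditioned = adaptor(text_embeds)       # (16, 77 + 12, 768), fed to cross-attention
optimizer = torch.optim.AdamW(adaptor.parameters(), lr=1e-4)

Because only the prompt tokens are optimized while the pretrained text-to-image backbone stays frozen, the training cost remains small, which is consistent with the lightweight, plug-in nature of the adaptor described in the abstract.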
