We demonstrate the first diffusion-based video editing model running on a smartphone, powered by Qualcomm Technologies’ latest Snapdragon Mobile Platform. Given an input video at 512x384 resolution and a textual prompt specifying the edit, we generate the edited video at 5 frames per second on the smartphone, using full-stack AI optimizations to run on the Qualcomm Hexagon NPU for accelerated and efficient inference.
Our model is built on top of an efficient image generation backbone fine-tuned on editing instructions. The image generation backbone is extended to video by introducing cross-frame attention from key frames, which enforces temporal consistency while keeping memory and computation overhead low. We further increase the frame rate with a novel extension of classifier-free guidance distillation to the multi-modal setting, where the three denoising functions (unconditional, text-conditioned, and frame-conditioned) are all distilled into a single denoising function, reducing the diffusion sampling cost by a factor of 3. Additionally, we extend the adversarial distillation of diffusion models to editing while preserving the guidance scale, which is essential for controlling the editing strength; this novel extension allows us to perform diffusion sampling with a single step. Finally, we rely on a distilled autoencoder to efficiently map between the pixels and latents required by latent diffusion models. For further acceleration, we shrink the model weights from FP32 to INT8 with the post-training quantization technique AdaRound, using the AI Model Efficiency Toolkit (AIMET) from the Qualcomm AI Stack. Our quantization scheme is agnostic to the iteration/denoising stage and uses an INT16 bit-width for activations.
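To make the key-frame mechanism concrete, the sketch below shows one way such a cross-frame attention layer could look: every frame's queries attend to the keys and values of a designated key frame, so the appearance of all frames is tied to a shared reference at little extra cost. This is a minimal illustration, not the paper's implementation; the module name, block placement, and key-frame selection are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class CrossFrameAttention(nn.Module):
    """Illustrative cross-frame attention: queries come from each frame,
    keys/values come from a single key frame shared across the clip."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, key_frame_idx: int = 0) -> torch.Tensor:
        # x: (frames, tokens, dim) latent features of all frames in a clip.
        f, t, d = x.shape
        h = self.num_heads
        q = self.to_q(x)                            # queries from every frame
        kf = x[key_frame_idx : key_frame_idx + 1]   # (1, tokens, dim)
        k = self.to_k(kf).expand(f, t, d)           # keys from the key frame only
        v = self.to_v(kf).expand(f, t, d)           # values from the key frame only

        def split(z):  # (f, t, d) -> (f, heads, t, d // heads)
            return z.reshape(f, t, h, d // h).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(f, t, d)
        return self.to_out(out)
```

Because the key frame's keys and values are computed once and broadcast, the per-frame overhead is essentially one extra attention input rather than full spatio-temporal attention.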
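The multi-modal guidance distillation can be pictured as follows. With unconditional, frame-conditioned, and text-plus-frame-conditioned denoisers combined via two guidance scales (the InstructPix2Pix-style formulation, which matches the three functions named above), a student that takes the guidance scales as extra inputs can be regressed onto the guided teacher output, so inference needs one denoiser call instead of three. All function signatures here are hypothetical placeholders, not the paper's API.

```python
import torch

def teacher_guided_eps(eps_model, z_t, t, c_text, c_frame, s_text, s_frame):
    """Multi-modal classifier-free guidance: three denoiser evaluations
    (unconditional, frame-conditioned, frame+text-conditioned) combined
    with a text guidance scale and a frame guidance scale."""
    eps_uncond = eps_model(z_t, t, text=None,   frame=None)
    eps_frame  = eps_model(z_t, t, text=None,   frame=c_frame)
    eps_full   = eps_model(z_t, t, text=c_text, frame=c_frame)
    return (eps_uncond
            + s_frame * (eps_frame - eps_uncond)
            + s_text  * (eps_full  - eps_frame))

def guidance_distillation_loss(student, eps_model, z_t, t,
                               c_text, c_frame, s_text, s_frame):
    """The student conditions on the guidance scales directly and is
    trained to match the guided teacher, cutting sampling cost 3x."""
    with torch.no_grad():
        target = teacher_guided_eps(eps_model, z_t, t,
                                    c_text, c_frame, s_text, s_frame)
    pred = student(z_t, t, text=c_text, frame=c_frame,
                   s_text=s_text, s_frame=s_frame)
    return torch.mean((pred - target) ** 2)
```

Sampling the guidance scales during distillation is what lets the distilled model still expose editing strength as a knob, which the subsequent adversarial distillation to a single sampling step is designed to preserve.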
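The W8A16 post-training setup maps onto AIMET's documented AdaRound-then-QuantizationSimModel workflow. The sketch below uses a stand-in convolutional model, random calibration data, and an illustrative latent shape purely as placeholders; it shows the shape of the pipeline, not the paper's exact configuration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel
from aimet_torch.adaround.adaround_weight import Adaround, AdaroundParameters

# Stand-in for the denoising network; the real model is the distilled editor.
model = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 4, 3, padding=1)).eval()
dummy_input = torch.randn(1, 4, 48, 64)          # illustrative latent shape
calib_loader = DataLoader(TensorDataset(torch.randn(32, 4, 48, 64)),
                          batch_size=4)

def calibrate(m, _):
    # Forward passes over calibration data so AIMET can observe ranges.
    with torch.no_grad():
        for (x,) in calib_loader:
            m(x)

# AdaRound learns a per-weight rounding decision for INT8 weights.
params = AdaroundParameters(data_loader=calib_loader, num_batches=8)
model = Adaround.apply_adaround(
    model, dummy_input, params, path="./", filename_prefix="adaround",
    default_param_bw=8,
    default_quant_scheme=QuantScheme.post_training_tf_enhanced)

# Simulate INT8-weight / INT16-activation inference and freeze the
# AdaRound weight encodings before computing activation encodings.
sim = QuantizationSimModel(model, dummy_input=dummy_input,
                           quant_scheme=QuantScheme.post_training_tf_enhanced,
                           default_param_bw=8, default_output_bw=16)
sim.set_and_freeze_param_encodings(encoding_path="./adaround.encodings")
sim.compute_encodings(forward_pass_callback=calibrate,
                      forward_pass_callback_args=None)
```

Because the same encodings are reused across denoising iterations, the scheme stays agnostic to the iteration/denoising stage, as stated above.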
"This Proposal is provided for review and evaluation purposes only. Do not redistribute to any third party without the express prior written consent of Qualcomm Technologies, Inc."