In this demo, we show the world’s fastest on-device inference of Stable Diffusion (SD), a large-scale text-to-image generative model with 1.1 billion parameters, on a smartphone powered by Qualcomm Technologies’ latest Snapdragon Mobile Platform. Full-stack AI optimizations that target the Qualcomm Hexagon NPU for accelerated and efficient inference bring the overall latency on the smartphone below 0.6 seconds.
SD poses significant challenges for mobile and edge devices because of its large size (in both model parameters and activations) and its iterative inference. The standard SD-1.5 model has an 860M-parameter UNet with 803 GMACs, a 50M-parameter VAE decoder with 1257 GMACs, and a 123M-parameter text encoder with 6 GMACs. Iterative denoising requires multiple forward passes of the UNet to generate one final image.
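To make the cost of iterative inference concrete, the per-module figures above can be combined into a rough per-image estimate. The sketch below is illustrative only: the function names, the cost model (UNet run once per conditioning per step, text encoder run once per conditioning, VAE decoder run once), and the step count in the usage line are assumptions, not measurements from the demo.

```python
# Per-module figures quoted in the text for SD-1.5.
PARAMS_M = {"unet": 860, "vae_decoder": 50, "text_encoder": 123}  # millions of parameters
GMACS = {"unet": 803, "vae_decoder": 1257, "text_encoder": 6}     # GMACs per forward pass


def total_params_m() -> int:
    """Sum the parameter counts (in millions) of the three modules."""
    return sum(PARAMS_M.values())


def gmacs_per_image(steps: int, passes_per_step: int = 1) -> int:
    """Rough GMAC estimate for one image under an assumed cost model.

    Assumption: the UNet runs `passes_per_step` times at every denoising step
    (2 when conditional and unconditional passes are run separately), the text
    encoder runs once per conditioning, and the VAE decoder runs once.
    """
    text = GMACS["text_encoder"] * passes_per_step
    unet = GMACS["unet"] * steps * passes_per_step
    vae = GMACS["vae_decoder"]
    return text + unet + vae


if __name__ == "__main__":
    print("parameters (M):", total_params_m())
    # Hypothetical setting: 20 denoising steps, 2 UNet passes per step.
    print("GMACs per image:", gmacs_per_image(steps=20, passes_per_step=2))
```

Under this model the UNet term dominates as soon as more than a few steps are used, which is why reducing both the per-pass cost and the number of passes matters.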
To run SD efficiently on device, we develop a well-integrated multi-stage distillation approach, which includes pruning the UNet to reduce its GMACs from 803 to 609, step distillation to reduce the number of inference iterations, and guidance conditioning to combine the conditional and unconditional inference steps. These distillation techniques yield significant compute savings while largely retaining the generation capacity of the original model. We further demonstrate that the proposed approach can be extended from baseline SD to ControlNet, SD-based inpainting models, and 360° panorama generation.
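The three savings multiply, which a short sketch can illustrate. It shows the widely used classifier-free guidance blend, which needs two UNet outputs per step, and a simple count of UNet GMACs. The text does not give the exact form of the guidance-conditioned model or the step counts, so the single-pass assumption and the step values in the usage lines are hypothetical; only the 803 and 609 GMAC figures come from the text.

```python
from typing import List


def cfg_combine(eps_cond: List[float], eps_uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: blend two UNet noise predictions.

    This is the baseline that needs two UNet passes per step; a
    guidance-conditioned model is assumed here to produce the blended
    prediction in a single pass.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def unet_gmacs(steps: int, gmacs_per_pass: int, passes_per_step: int) -> int:
    """Total UNet GMACs for one image."""
    return steps * gmacs_per_pass * passes_per_step


def relative_saving(before: float, after: float) -> float:
    """Fraction of compute removed when going from `before` to `after`."""
    return 1.0 - after / before


if __name__ == "__main__":
    # Pruning alone: 803 -> 609 GMACs per UNet pass (figures from the text).
    print(f"pruning saving per pass: {relative_saving(803, 609):.1%}")
    # Hypothetical step counts, for illustration only.
    baseline = unet_gmacs(steps=20, gmacs_per_pass=803, passes_per_step=2)
    distilled = unet_gmacs(steps=4, gmacs_per_pass=609, passes_per_step=1)
    print(f"combined UNet saving: {relative_saving(baseline, distilled):.1%}")
```

Pruning reduces the cost of each pass, step distillation reduces the number of steps, and guidance conditioning reduces the passes per step, so the total UNet cost scales with the product of the three factors.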
To fit all modules of SD on a mobile device, we shrink the model from FP32 to INT8 with AdaRound, a post-training quantization technique, using the AI Model Efficiency Toolkit (AIMET) from the Qualcomm AI Stack. Our quantization scheme is agnostic to the iteration (denoising stage) and uses an INT16 bit-width for activations. Our distillation, combined with end-to-end software and architecture optimization, yields fast inference in under 0.6 seconds.
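The effect of the chosen bit-widths can be illustrated with a minimal quantization sketch. Note that it uses plain round-to-nearest with a symmetric per-tensor scale, which is a simplification: AdaRound instead learns, per weight, whether to round up or down, and the AIMET API is not shown here. All function names and sample values are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], num_bits: int) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization with round-to-nearest (not AdaRound)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to floating-point values."""
    return [qi * scale for qi in q]


def max_error(values: List[float], num_bits: int) -> float:
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_symmetric(values, num_bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale)))


def weight_megabytes(params_millions: float, bits: int) -> float:
    """Storage for the weights alone, in MB (10^6 bytes)."""
    return params_millions * bits / 8


if __name__ == "__main__":
    sample = [0.5, -1.0, 0.25, 0.0]  # made-up values
    print("8-bit max error: ", max_error(sample, 8))
    print("16-bit max error:", max_error(sample, 16))
    # 860M + 50M + 123M parameters, as listed for SD-1.5.
    print("FP32 weights (MB):", weight_megabytes(1033, 32))
    print("INT8 weights (MB):", weight_megabytes(1033, 8))
```

Going from 32-bit to 8-bit weights cuts weight storage by a factor of four, while the wider 16-bit grid gives activations a much finer resolution than an 8-bit grid would.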