Expo Demonstration
West Exhibition Hall A

As demand grows for deploying fine-tuned, customized generative models on edge devices, fully fine-tuning such models remains challenging due to its high cost and computational intensity. Parameter-Efficient Fine-Tuning (PEFT) provides an effective alternative by minimizing the number of fine-tuned parameters and reducing memory usage. This demo showcases Low-Rank Adaptation (LoRA), an efficient PEFT technique, applied to a Large Vision Model (LVM) on an Android smartphone powered by Qualcomm Technologies’ latest Snapdragon Mobile Platform.
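To make the parameter savings of LoRA concrete, here is a minimal NumPy sketch of the core idea: the pretrained weight stays frozen, and only two low-rank factors are trained. All names and dimensions here are illustrative, not part of the demo's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight of one layer (d_out x d_in); illustrative sizes.
d_out, d_in, rank = 64, 64, 4
W = rng.standard_normal((d_out, d_in))

# LoRA trains only two low-rank factors A and B. B starts at zero, so
# the adapted layer initially matches the base model exactly.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))
alpha = 8.0  # LoRA scaling hyperparameter

def adapted_forward(x):
    """y = W x + (alpha / rank) * B (A x): base output plus low-rank update."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y_base = W @ x
y_lora = adapted_forward(x)

# With B = 0 the adapter is a no-op, so fine-tuning starts from the base model.
assert np.allclose(y_base, y_lora)

# The trainable parameters are a small fraction of the full layer.
full_params = W.size                # 64 * 64 = 4096
lora_params = A.size + B.size       # 4 * 64 + 64 * 4 = 512
print(f"LoRA params: {lora_params} vs full fine-tuning: {full_params}")
```

At realistic model sizes the ratio is far more favorable than in this toy example, since the rank stays small while layer dimensions grow.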

Scientific challenge we tackle

Efficiently running large generative models on a resource-constrained mobile device requires methods that manage compute complexity and memory usage. Further, users need to be able to switch adapters quickly on the device. Given the size and complexity of the generative models and the many on-target optimizations involved, performing this switch quickly while retaining the required on-target performance and accuracy is a challenge.

How we solve it

To fit all modules of Stable Diffusion and the LoRA adapters on a mobile device, they are efficiently quantized using AdaRound, a post-training quantization technique, with the AI Model Efficiency Toolkit (AIMET) from the Qualcomm AI Stack. To run Stable Diffusion with adapters efficiently on device, and to support rapid adapter switching, we statically compile and quantize the model and the adapters once, and implement switching as fast parameter updates to the small fraction of model parameters each adapter touches. This preserves the optimized model execution, enables fast switching, and keeps the adapters' on-device memory footprint low. To retain model accuracy across adapter switches, we also update certain metadata (such as quantization parameters) for a small fraction of the on-device model to best match each adapter's requirements.
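The switching scheme above can be sketched in miniature: each adapter is merged and quantized offline with its own scale, and on device the compiled layer swaps in only the affected quantized tensor and its quantization metadata, never recompiling the graph. This is a simplified NumPy model under assumed names (`quantize`, `CompiledLayer`, the adapter names); it is not the AIMET or Snapdragon runtime API, and the rounding policy is simplified to nearest rather than AdaRound's learned rounding.

```python
import numpy as np

def quantize(w, scale):
    """Symmetric 8-bit per-tensor quantization (nearest rounding; AdaRound
    would instead learn the rounding decisions during calibration)."""
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
d, r = 32, 4
W_base = rng.standard_normal((d, d)).astype(np.float32)

# Offline, per adapter: merge the low-rank update into the base weight,
# pick a scale for the merged tensor, and quantize once.
adapters = {}
for name in ("sketch_style", "photo_style"):   # hypothetical adapter names
    B = (rng.standard_normal((d, r)) * 0.1).astype(np.float32)
    A = (rng.standard_normal((r, d)) * 0.1).astype(np.float32)
    merged = W_base + B @ A
    scale = float(np.abs(merged).max() / 127.0)  # per-adapter quantization metadata
    adapters[name] = {"q_weight": quantize(merged, scale), "scale": scale}

# On device: the statically compiled graph owns one weight buffer and one
# scale register per adapted tensor. Switching adapters rewrites only the
# tensors an adapter touches (a small fraction of the full model) plus
# their quantization metadata -- no recompilation.
class CompiledLayer:
    def __init__(self, q_weight, scale):
        self.q_weight, self.scale = q_weight, scale

    def load_adapter(self, adapter):
        self.q_weight = adapter["q_weight"]   # small parameter update
        self.scale = adapter["scale"]         # quantization-metadata update

    def forward(self, x):
        return dequantize(self.q_weight, self.scale) @ x

layer = CompiledLayer(**adapters["sketch_style"])
x = rng.standard_normal(d).astype(np.float32)
y1 = layer.forward(x)
layer.load_adapter(adapters["photo_style"])
y2 = layer.forward(x)
assert not np.allclose(y1, y2)  # switching adapters changes the output
```

Updating the scale alongside the weights is what the metadata step in the paragraph above refers to: each merged tensor has a different dynamic range, so reusing one adapter's quantization parameters for another would degrade accuracy.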
