In this demo, we show on-device inference of a Large Multi-modal Model (LMM) that interactively answers user questions about high-resolution images on an Android smartphone powered by Qualcomm Technologies’ latest Snapdragon Mobile Platform. The overall latency to process a high-resolution 768×768-pixel image and prefill the KV cache of the LLaMA-3 language model is just 0.2 seconds on the smartphone, which leverages the optimizations of the Qualcomm AI Stack and runs on the Qualcomm Hexagon NPU for accelerated and efficient inference.
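As a rough, shape-level illustration of the prefill step mentioned above, the sketch below shows how the compact visual tokens of a 768×768 image (144 per image, as detailed later) and the text prompt would share a single prefill pass that sizes the per-layer KV cache. This is our own simplification, not the demo's implementation; the layer, head, and embedding dimensions are assumed LLaMA-3-8B-like values and the attention computation itself is omitted.

```python
# Shape-level sketch of the prefill step: visual tokens and the text prompt are
# processed in one forward pass that populates the per-layer KV cache before
# token-by-token decoding starts. All names and dimensions are illustrative
# stand-ins (assumed LLaMA-3-8B-like shapes), not the demo's implementation.
import numpy as np

NUM_VISUAL_TOKENS = 144                          # compact visual tokens per 768x768 image
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128  # assumed LLaMA-3-8B-like configuration

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in vision backbone: ingests the full 768x768 image in a single pass."""
    return np.zeros((NUM_VISUAL_TOKENS, 4096), dtype=np.float16)

def prefill(visual_tokens: np.ndarray, prompt_token_ids: list[int]) -> list:
    """Allocate a KV cache covering every prefilled position (attention omitted here)."""
    seq_len = visual_tokens.shape[0] + len(prompt_token_ids)
    return [
        (np.zeros((NUM_KV_HEADS, seq_len, HEAD_DIM), dtype=np.float16),   # keys
         np.zeros((NUM_KV_HEADS, seq_len, HEAD_DIM), dtype=np.float16))   # values
        for _ in range(NUM_LAYERS)
    ]

image = np.zeros((768, 768, 3), dtype=np.uint8)
prompt_ids = [1, 2, 3]                            # placeholder token ids
kv_cache = prefill(encode_image(image), prompt_ids)
print(f"prefilled {kv_cache[0][0].shape[1]} positions across {len(kv_cache)} layers")
```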
Scientific Challenge that we tackle
Efficiently running a large multimodal model requires concurrently on-boarding multiple deep models, including a LLaMA-3-8B LLM, a vision encoder, and speech-to-text and text-to-speech models. Large model sizes (both model parameters and activations) and high-resolution visual data pose significant challenges to efficient execution and to enabling an interactive conversational experience for the user, with all required computation performed on the edge device itself.
How we solve it
To efficiently run the interactive LMM on a mobile device, we design, develop, train, and on-board an LMM with an efficient visual backbone, streaming Automatic Speech Recognition (ASR), streaming Text-to-Speech (TTS), and a LLaMA-3 language model. Unlike most existing methods, which split a high-resolution image into multiple sub-images, we directly ingest a high-resolution image in a single forward pass. Further, our visual backbone is based on a hierarchical network design that runs efficiently on a mobile device and provides a compact 144 visual tokens per 768×768 image to the LMM. Our on-device LLM handles a context length of up to 4096 tokens, enabling multi-turn conversations over multiple interleaved images. The streaming ASR and TTS further provide a natural speech I/O interface with reduced end-to-end latency.
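To make the flow concrete, below is a minimal sketch of the interactive loop, assuming stand-in components for the vision backbone, ASR, LMM, and TTS. The class and function names are hypothetical; only the 144-visual-token and 4096-token figures come from the system description above.

```python
# Minimal sketch of the interactive multi-turn loop with stand-in components.
# Names and interfaces are assumptions for illustration, not the demo's actual APIs.
from dataclasses import dataclass, field

VISUAL_TOKENS_PER_IMAGE = 144   # one forward pass per 768x768 image
MAX_CONTEXT = 4096              # on-device context budget

@dataclass
class Conversation:
    tokens: list[str] = field(default_factory=list)  # interleaved image + text tokens

    def append(self, new_tokens: list[str]) -> None:
        self.tokens.extend(new_tokens)
        # Keep the rolling context within the on-device budget (simple truncation
        # here; the real system may manage its context differently).
        if len(self.tokens) > MAX_CONTEXT:
            self.tokens = self.tokens[-MAX_CONTEXT:]

def encode_image(image_path: str) -> list[str]:
    """Stand-in hierarchical vision backbone: full image -> 144 compact visual tokens."""
    return [f"<img:{image_path}:{i}>" for i in range(VISUAL_TOKENS_PER_IMAGE)]

def streaming_asr(audio_chunks: list[str]) -> list[str]:
    """Stand-in streaming ASR: emits transcript tokens as audio chunks arrive."""
    return [chunk.lower() for chunk in audio_chunks]

def lmm_generate(conv: Conversation) -> list[str]:
    """Stand-in LMM decode over the prefilled multimodal context."""
    return ["The", "sign", "reads", "Main", "Street."]

def streaming_tts(words):
    """Stand-in streaming TTS: synthesizes speech as words are generated."""
    for word in words:
        yield f"audio({word})"

conv = Conversation()
for image, audio in [("photo1.jpg", ["What", "does", "this", "say?"])]:
    conv.append(encode_image(image))          # 144 visual tokens per image
    conv.append(streaming_asr(audio))         # user question from streaming ASR
    reply = lmm_generate(conv)
    conv.append(reply)                        # retain replies for multi-turn context
    for audio_frame in streaming_tts(reply):  # speak while decoding continues
        pass
```

The truncation policy and the string-token representation are deliberate simplifications; they only illustrate how interleaved images and multi-turn text share the 4096-token context.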