Poster
in
Workshop: ML with New Compute Paradigms
Enabling On-Device Large Language Models with 3D-Stacked Memory
Lita Yang · Kavya Sreedhar · Huichu Liu · Edith Beigne
In this paper, we address the growing need for new memory technologies to enable the deployment of on-device large language models (LLMs) on resource-constrained augmented reality (AR) edge devices. We evaluate the memory power and area savings of 3D-stacked memory (3D-DRAM, 3D-SRAM) relative to conventional 2D memory (LPDDR-DRAM, SRAM). At target inference rates of 5-100 inferences per second, 3D-DRAM consumes the least memory power of all the options we consider, achieving a ∼7-15x reduction in memory power compared with conventional 2D memory across our benchmark suite of on-device LLMs (Distilled GPT-2, GPT-2, BART Base, and BART Large). While 3D-SRAM reduces dynamic memory power, the leakage power of storing such a large model in SRAM becomes prohibitive, which makes 3D-DRAM the better option for on-device LLMs. Moreover, because 3D-DRAM lowers the memory power consumption of on-device LLMs to tens of milliwatts, it enables the deployment of much larger LLMs that previously could not be supported with conventional DRAM and 2D SRAM solutions.
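The dynamic-versus-leakage trade-off that drives the 3D-DRAM conclusion can be sketched with a first-order power model: dynamic power scales with the inference rate and the bytes moved per inference, while static power (SRAM leakage, DRAM refresh) scales with the stored model capacity regardless of activity. The Python sketch below illustrates this structure; it is not the paper's methodology, and every numeric parameter (model size, energy per byte, static power) is a hypothetical placeholder chosen only to show the shape of the comparison.

```python
# Minimal first-order sketch of the memory power trade-off described above.
# All numbers are illustrative placeholders, NOT results from the paper.

def memory_power_w(inference_rate_hz, bytes_per_inference,
                   energy_per_byte_j, static_power_w):
    """Total memory power = dynamic access power + static (leakage/refresh) power."""
    dynamic_w = inference_rate_hz * bytes_per_inference * energy_per_byte_j
    return dynamic_w + static_power_w

# Hypothetical workload: ~500 MB of weights read once per inference.
BYTES_PER_INFERENCE = 500e6

# (energy per byte moved [J/B], static power [W]) -- placeholder values only.
memories = {
    "2D LPDDR-DRAM": (30e-12, 0.010),  # long off-chip I/O dominates access energy
    "3D-DRAM":       (2e-12,  0.005),  # short vertical links cut access energy
    "3D-SRAM":       (1e-12,  0.500),  # lowest access energy, but leakage grows
                                       # with the large capacity an LLM requires
}

for rate in (5, 50, 100):  # target inference rates (inferences/s)
    print(f"--- {rate} inferences/s ---")
    for name, (epb, static) in memories.items():
        p = memory_power_w(rate, BYTES_PER_INFERENCE, epb, static)
        print(f"{name:>14}: {p * 1e3:8.1f} mW")
```

Under these assumed parameters, 3D-SRAM's dynamic savings are swamped by its capacity-proportional leakage, while 3D-DRAM stays lowest across the full 5-100 inferences/s range, mirroring the qualitative argument in the abstract.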