Poster

B'MOJO: Realizations of Hybrid State Space Models with Eidetic and Fading Memory

Luca Zancato · Arjun Seshadri · Yonatan Dukler · Aditya Golatkar · Yantao Shen · Benjamin Bowman · Matthew Trager · Alessandro Achille · Stefano Soatto

Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

We develop a building block for architectures to support transductive inference by allowing memory to grow to a finite but a-priori unknown bound while using finite resources for fast inference. Current architectures use such resources to represent data either eidetically over a finite span ('context' in Transformers), or fading over an infinite span (in State Space Models, or SSMs). Recent hybrid architectures have combined eidetic and fading memory, but with limitations that do not allow the designer or the learning process to seamlessly modulate the two, nor to extend the eidetic memory span. We leverage ideas from Stochastic Realization Theory to develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within an elementary composable module. The overall architecture can be used to implement a hybrid system that can access short-term eidetic memory 'in-context,' permanent structural memory 'in-weights,' fading memory 'in-state,' and long-term eidetic memory 'in-storage' by natively incorporating retrieval from an asynchronously updated long-term memory. We show that Transformers, existing SSMs such as Mamba, and hybrid architectures such as Jamba are special cases of B'MOJO and describe a basic implementation, to be open sourced, that can be stacked and scaled efficiently in hardware. We test B'MOJO on synthetic recall tasks where it outperforms existing SSMs and hybrid models; as a baseline, we test ordinary language modeling tasks where B'MOJO achieves perplexity comparable to similarly sized Transformers and SSMs up to 1.4B parameters, while being up to 10% faster to train. Finally, we test whether models trained inductively on a-priori bounded sequences (up to 8K tokens) can still perform transductive inference on sequences many-fold longer. B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens, four-fold the length of the longest sequences seen during training.
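
The distinction between eidetic and fading memory can be made concrete with a toy sketch. The code below is not the authors' implementation and is not taken from the paper. It is a minimal illustration, assuming a fixed scalar decay for the fading state and plain dot-product attention, of a layer that reads both a verbatim window of recent tokens (eidetic, finite span) and a single decaying summary of everything seen so far (fading, unbounded span). All function and parameter names (`fading_state`, `attend`, `toy_hybrid_layer`, `window`, `decay`) are hypothetical.

```python
import math
from collections import deque


def fading_state(prev_state, x, decay=0.9):
    """Exponential moving average: older inputs fade geometrically."""
    return [decay * s + (1.0 - decay) * xi for s, xi in zip(prev_state, x)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query over memory vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, vec)) for vec in memory]
    weights = softmax(scores)
    return [
        sum(w * vec[i] for w, vec in zip(weights, memory))
        for i in range(len(query))
    ]


def toy_hybrid_layer(tokens, window=4, decay=0.9):
    """Toy hybrid layer (illustrative assumption, not B'MOJO itself).

    Each token attends over the last `window` tokens stored verbatim
    (eidetic memory) plus one decaying summary vector (fading memory).
    """
    state = [0.0] * len(tokens[0])
    recent = deque(maxlen=window)  # oldest token is dropped automatically
    outputs = []
    for x in tokens:
        state = fading_state(state, x, decay)  # update fading memory
        recent.append(x)                       # update eidetic window
        memory = list(recent) + [state]        # both memories are readable
        outputs.append(attend(x, memory))
    return outputs


if __name__ == "__main__":
    seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
    for out in toy_hybrid_layer(seq, window=2, decay=0.8):
        print([round(v, 3) for v in out])
```

In this sketch, setting `window` to the full sequence length makes the eidetic window cover every token, so the layer behaves much like ordinary causal attention with one extra summary slot, while shrinking `window` makes the output depend increasingly on the decaying state. That is only a loose analogy to the special cases named in the abstract, which the paper treats formally.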
