Poster
On Feature Learning in Structured State Space Models
Leena Chennuru Vankadara · Jin Xu · Moritz Haas · Volkan Cevher
Structured state space models (SSMs), such as Mamba, have recently emerged as powerful alternatives to Transformer-based architectures, primarily because they can model long contexts and are efficient at inference. This paper examines the scaling behavior of SSMs, an aspect that has so far remained unexplored, focusing on their ability to learn features as network width approaches infinity. Our findings reveal that established scaling rules such as maximal update parameterization fail to support feature learning, because these models are not representable as Tensor Programs. We further demonstrate that spectral scaling conditions, which are known to induce feature learning in a variety of other architectures, do not carry the same implications for SSMs. Through a detailed analysis of signal propagation in SSMs, both forward and backward, we identify the scaling required for non-trivial feature evolution in the infinite-width regime. Our proposed scaling exhibits behavior akin to maximal update parameterization, including the transfer of hyperparameters from smaller to larger SSMs.
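The kind of diagnostic underlying such claims can be illustrated with a simple coordinate check: instantiate a model at several widths under a candidate width-dependent parameterization, take one optimizer step, and measure how much the hidden features move; under a feature-learning scaling, the per-coordinate update size should stay roughly constant as width grows. The sketch below is purely illustrative and is not the parameterization derived in the paper: the toy layer (TinyDiagonalSSM), the helper feature_update_size, the initialization scales, and the learning rate are all assumptions chosen only to show the shape of the experiment.

# Illustrative coordinate check for feature learning across widths.
# NOT the paper's parameterization: the layer, init scales, and learning
# rate below are assumptions; only the general diagnostic idea is shown.
import torch

torch.manual_seed(0)

class TinyDiagonalSSM(torch.nn.Module):
    """Toy sequence layer with a diagonal linear recurrence (SSM-like)."""
    def __init__(self, d_in, width, d_out):
        super().__init__()
        # Width-dependent initialization scales are illustrative guesses.
        self.w_in = torch.nn.Parameter(torch.randn(d_in, width) / d_in ** 0.5)
        self.log_decay = torch.nn.Parameter(torch.zeros(width))  # per-channel decay
        self.w_out = torch.nn.Parameter(torch.randn(width, d_out) / width)

    def hidden(self, x):
        # x: (batch, seq, d_in) -> hidden features: (batch, seq, width)
        u = x @ self.w_in
        a = torch.sigmoid(self.log_decay)        # decay in (0, 1)
        prev = torch.zeros(u.shape[0], u.shape[2])
        hs = []
        for t in range(u.shape[1]):              # explicit recurrence for clarity
            prev = a * prev + u[:, t]
            hs.append(prev)
        return torch.stack(hs, dim=1)

    def forward(self, x):
        return self.hidden(x) @ self.w_out

def feature_update_size(width, lr=1e-2, d_in=8, d_out=1, batch=16, seq=32):
    """Mean |change| of hidden features after one SGD step at a given width."""
    model = TinyDiagonalSSM(d_in, width, d_out)
    x = torch.randn(batch, seq, d_in)
    y = torch.randn(batch, seq, d_out)
    h_before = model.hidden(x).detach()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():             # one manual SGD step
            p -= lr * p.grad
    h_after = model.hidden(x).detach()
    return (h_after - h_before).abs().mean().item()

for width in [64, 256, 1024, 4096]:
    print(f"width={width:5d}  mean feature update={feature_update_size(width):.3e}")

If the printed update sizes stay of the same order as width increases, the chosen parameterization admits non-trivial feature evolution in this toy setting; if they shrink toward zero or blow up, it does not.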