Poster in Workshop: Machine Learning for Systems
V“Mean”ba: Visual State Space Models only need 1 hidden dimension
Tien-Yu Chi · Hung-Yueh Chiang · Chi-Chih Chang · Ning-Chi Huang · Kai-Chiang Wu
Abstract:
Vision transformers dominate image processing tasks due to their superior performance. However, the quadratic complexity of self-attention limits their scalability and their deployment on resource-constrained devices. State Space Models (SSMs) have emerged as a solution by introducing a linear recurrence mechanism, which reduces the complexity of sequence modeling from quadratic to linear. Recently, SSMs have been extended to high-resolution vision tasks. Nonetheless, the linear recurrence mechanism struggles to fully utilize the matrix multiplication units on modern hardware, resulting in a computational bottleneck. We address this issue by introducing $\textit{VMeanba}$, a training-free compression method that eliminates the channel dimension in SSMs using mean operations. Our key observation is that the output activations of SSM blocks exhibit low variance across channels. $\textit{VMeanba}$ leverages this property by averaging activation maps across the channel dimension, reducing computational overhead without compromising accuracy. Evaluations on image classification and semantic segmentation tasks demonstrate that $\textit{VMeanba}$ achieves up to a 1.12x speedup with less than a 3\% accuracy loss. When combined with 40\% unstructured pruning, the accuracy drop remains under 3\%.
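The mean-reduction idea described above is simple enough to sketch. Below is a minimal, illustrative PyTorch sketch, not the authors' implementation: the function names (`vmeanba_reduce`, `vmeanba_expand`) and the (batch, seq_len, channels) activation layout are assumptions for illustration. It shows how collapsing the channel dimension to a single averaged channel, then broadcasting back, is a faithful approximation precisely when cross-channel variance is low.

```python
import torch

def vmeanba_reduce(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, channels) activations from an SSM block.
    # Collapse the channel dimension to 1 by averaging -- sound only
    # because the cross-channel variance is observed to be low.
    return x.mean(dim=-1, keepdim=True)  # -> (batch, seq_len, 1)

def vmeanba_expand(y: torch.Tensor, channels: int) -> torch.Tensor:
    # Broadcast the single averaged channel back to the original width
    # so downstream layers see the expected shape.
    return y.expand(-1, -1, channels)

# Toy check with synthetic low-variance channels: the averaged channel
# reconstructs the full tensor with small error, and any recurrence that
# is linear in the channel count now runs on 1 channel instead of D.
B, L, D = 2, 16, 64
base = torch.randn(B, L, 1)
x = base + 0.01 * torch.randn(B, L, D)  # channels nearly identical
x_hat = vmeanba_expand(vmeanba_reduce(x), D)
print((x - x_hat).abs().max())  # small reconstruction error
```

The compute saving comes from running the channel-wise recurrence once rather than D times; the end-to-end 1.12x speedup and the sub-3\% accuracy figures are the paper's reported measurements, not properties of this sketch.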