Poster
A Comprehensive Investigation of Sparse Rate Reduction in Transformer-like Models
Yunzhe Hu · Difan Zou · Dong Xu
Deep neural networks have long been criticized for their black-box nature. A recent work proposed an information-theoretic objective function called Sparse Rate Reduction (SRR) and interpreted its unrolled optimization as a Transformer-like model called Coding Rate Reduction Transformer (CRATE), taking a step toward unveiling the inner mechanisms of modern neural architectures. However, that work only considers the simplest implementation, and whether this objective is actually optimized in practice and its causal relationship to generalization remain elusive. To this end, we derive different implementations by analyzing the layer-wise behaviors of CRATE, both theoretically and empirically. To reveal the predictive power of SRR for model generalization, we collect a set of model variants induced by different implementations and hyperparameters and evaluate SRR as a complexity measure based on its correlation with generalization. Surprisingly, we find that SRR has a positive correlation with generalization and outperforms other baseline measures, such as path-norm and sharpness-based ones. Furthermore, we show that model generalization can be improved by using SRR as regularization on various benchmark datasets. We hope this paper can pave the way for using SRR to design principled models and study their generalization ability.
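A minimal sketch (not the authors' code) of the kind of correlation-based evaluation described above: given a collection of trained model variants and their generalization gaps, a complexity measure's predictive power can be scored by rank correlation. The function `sparse_rate_reduction` referenced in the usage comment is a hypothetical placeholder for evaluating the SRR objective on a trained model.

```python
# Sketch: score a complexity measure by its Kendall rank correlation
# with generalization gaps across a set of model variants.
from scipy.stats import kendalltau


def evaluate_measure(models, gen_gaps, measure_fn):
    """Rank-correlate a complexity measure with generalization gaps.

    models     : list of trained model variants
    gen_gaps   : list of generalization gaps (e.g., train minus test accuracy)
    measure_fn : callable mapping a model to a scalar complexity value
    """
    scores = [measure_fn(m) for m in models]
    tau, _ = kendalltau(scores, gen_gaps)  # rank correlation in [-1, 1]
    return tau


# Hypothetical usage (sparse_rate_reduction is a placeholder, not a real API):
# tau_srr = evaluate_measure(model_variants, gaps, sparse_rate_reduction)
```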