Poster in Workshop: Fine-Tuning in Modern Machine Learning: Principles and Scalability
E-Tamba: Efficient Transformer-Mamba Layer Transplantation
Dazhi Peng · Hangrui Cao
With the growing popularity of Transformers and State Space Models (SSMs), hybrid designs such as Jamba and RecurrentGemma have gained significant attention for their ability to combine the long-context processing strengths of Transformers with the low memory demands of SSMs. However, most hybrid models require extensive pre-training, making them inaccessible to researchers with limited resources who want to experiment with different model architectures. To address this challenge, we introduce E-Tamba, a novel method for constructing hybrid models by only fine-tuning pre-trained Transformer and SSM models. Guided by layer-wise importance analysis, E-Tamba-1.1B is built by replacing the non-critical upper Transformer layers of Pythia-1.4B with key layers from Mamba-1.4B. After fine-tuning on only 0.9B tokens, E-Tamba-1.1B achieves strong perplexity and performs well on a range of downstream NLP tasks. It also reduces inference memory by 3× compared to the baseline Pythia-1.4B, while offering better long-context retrieval than Mamba-1.4B.
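To illustrate the layer-transplantation idea described in the abstract, the following is a minimal PyTorch sketch, not the authors' released code: it assumes the Transformer and Mamba backbones expose their blocks as nn.ModuleList objects, and the function name, argument names, and layer indices are hypothetical placeholders.

    import torch.nn as nn

    def transplant_layers(transformer_layers: nn.ModuleList,
                          mamba_layers: nn.ModuleList,
                          replace_from: int,
                          mamba_indices: list[int]) -> nn.ModuleList:
        # Keep the lower Transformer layers (assumed critical by the
        # layer-wise importance analysis) and drop the upper ones.
        hybrid = list(transformer_layers[:replace_from])
        # Splice in the selected Mamba layers in their original order.
        hybrid += [mamba_layers[i] for i in mamba_indices]
        return nn.ModuleList(hybrid)

In this sketch, the resulting hybrid stack would then be wired back into the Transformer's embedding and output head and fine-tuned end to end; the choice of replace_from and mamba_indices stands in for whatever the importance analysis selects.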