Poster in Workshop: Fine-Tuning in Modern Machine Learning: Principles and Scalability
E-Tamba: Efficient Transformer-Mamba Layer Transplantation
Dazhi Peng · Hangrui Cao
With the growing popularity of Transformers and State Space Models (SSMs), hybrid designs such as Jamba and RecurrentGemma have gained significant attention for their ability to combine the long-context processing strengths of Transformers with the low memory demands of SSMs. However, most hybrid models require extensive pre-training, making them inaccessible to researchers with limited resources who want to experiment with different model architectures. To address this challenge, we introduce E-Tamba, a novel method for constructing hybrid models by only fine-tuning pre-trained Transformer and SSM models. Guided by layer-wise importance analysis, E-Tamba-1.1B is built by replacing the non-critical upper Transformer layers of Pythia-1.4B with key layers from Mamba-1.4B. After fine-tuning on only 0.9B tokens, E-Tamba-1.1B achieves strong perplexity and performs well on a range of downstream NLP tasks. It also reduces inference memory by 3× compared to the baseline Pythia-1.4B, while offering better long-context retrieval than Mamba-1.4B.
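To illustrate the layer-transplantation idea described in the abstract, the following is a minimal PyTorch sketch, not the authors' released code: it assumes the Transformer and Mamba backbones expose their blocks as nn.ModuleList objects, and the function name, argument names, and layer indices are hypothetical placeholders.

    import torch.nn as nn

    def transplant_layers(transformer_layers: nn.ModuleList,
                          mamba_layers: nn.ModuleList,
                          replace_from: int,
                          mamba_indices: list[int]) -> nn.ModuleList:
        # Keep the lower Transformer layers (assumed critical by the
        # layer-wise importance analysis) and drop the upper ones.
        hybrid = list(transformer_layers[:replace_from])
        # Splice in the selected Mamba layers in their original order.
        hybrid += [mamba_layers[i] for i in mamba_indices]
        return nn.ModuleList(hybrid)

In this sketch, the resulting hybrid stack would then be wired back into the Transformer's embedding and output head and fine-tuned end to end; the choice of replace_from and mamba_indices stands in for whatever the importance analysis selects.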