

Poster in Workshop: Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III): Towards the Future of Large Language Models and their Emerging Descendants

BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

Nolan Dey · Daria Soboleva · Faisal Al-Khateeb · Bowen Yang · Ribhu Pathria · Hemant Khachane · Shaheer Muhammad · Zhiming (Charles) Chen · Robert Myers · Jacob Robert Steeves · Natalia Vassilieva · Marvin Tom · Joel Hestness


Abstract: We study recent techniques aimed at improving the parameter efficiency and modeling quality of large language models (LLMs). We experiment with recently proposed training approaches, such as overtraining for a large number of tokens per parameter on a high-quality dataset, carefully tuning hyperparameters with maximal update parameterization (µP), and adjusting learning rate and batch size. We also test recent state-of-the-art model features, namely rotary and ALiBi position embeddings and the Swish-gated linear unit (SwiGLU). We find a pretraining recipe that improves over the Cerebras-GPT µP validation loss by 12.7% for the same parameter budget. With this recipe, we train the state-of-the-art 3B parameter foundation model, called the Bittensor Language Model ("BTLM-3B-8K"), which is sized to deploy easily on memory- or compute-constrained devices. Over a broad set of downstream tasks, BTLM beats all other 3B foundation models by 2-5.5%, making it competitive with some 7B parameter models that are 2.5× larger. BTLM-3B-8K is available under an Apache 2.0 license on Hugging Face: https://huggingface.co/cerebras/btlm-3b-8k-base.
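
For readers unfamiliar with the SwiGLU feature named in the abstract, below is a minimal PyTorch sketch of a Swish-gated linear unit feed-forward block. The hidden width and the absence of bias terms are illustrative assumptions, not BTLM's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFeedForward(nn.Module):
    """Swish-gated linear unit (SwiGLU) feed-forward block (illustrative sketch)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (Swish(x W_gate) * (x W_up)) W_down,
        # where Swish(z) = z * sigmoid(z), i.e. SiLU.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```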

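Since the abstract links the released checkpoint on Hugging Face, here is a minimal sketch of loading it with the transformers library. The trust_remote_code flag and the generation settings are assumptions based on typical usage of custom-architecture checkpoints, not instructions from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the checkpoint ships a custom model class, so remote code must be trusted.
model_name = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate a short continuation from a prompt.
inputs = tokenizer("BTLM-3B-8K is a 3B parameter language model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```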