

Poster

Enhancing Large Language Models through Adaptive Tokenizers

Mengyu Zheng · Hanting Chen · Tianyu Guo · Chong Zhu · Binfan Zheng · Chang Xu · Yunhe Wang

Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Tokenizers serve as the interface between models and linguistic data, and they substantially influence the efficacy and precision of large language models (LLMs). Traditional tokenization methods often rely on static frequency-based statistics and are not inherently synchronized with LLM architectures, which may limit model performance. In this study, we propose a simple but effective method to learn a tokenizer specifically engineered for seamless integration with LLMs. Starting from a broad initial vocabulary, we refine the tokenizer by monitoring changes in the model’s perplexity during training, which allows us to select a tokenizer that is closely aligned with the model’s evolving dynamics. Through iterative refinement, we develop an optimized tokenizer. Our empirical evaluations demonstrate that this adaptive approach significantly improves accuracy over conventional methods while maintaining comparable vocabulary sizes, affirming its potential to improve LLM functionality.
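The refinement loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify the scoring rule, so the sketch assumes a leave-one-out criterion (drop the token whose removal raises perplexity the least), replaces the LLM with a unigram model that is refit for every candidate vocabulary, and measures perplexity per character so that different tokenizations stay comparable. The names `segment`, `char_perplexity`, and `refine` are hypothetical.

```python
import math
from collections import Counter


def segment(word, vocab):
    """Greedy longest-match segmentation; single characters are always allowed."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if j == i + 1 or word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def char_perplexity(corpus, vocab):
    """Per-character perplexity under a unigram model (assumed stand-in for the LLM)."""
    tokens = [t for w in corpus for t in segment(w, vocab)]
    counts = Counter(tokens)
    total = len(tokens)
    nll = -sum(c * math.log(c / total) for c in counts.values())
    n_chars = sum(len(w) for w in corpus)
    return math.exp(nll / n_chars)


def refine(corpus, vocab, target_size):
    """Iteratively prune the vocabulary down to target_size entries."""
    vocab = set(vocab)
    while len(vocab) > target_size:
        # Only multi-character tokens are removable; characters are the fallback.
        candidates = sorted(t for t in vocab if len(t) > 1)
        if not candidates:
            break
        # Assumed criterion: remove the token whose absence hurts perplexity least.
        least_useful = min(
            candidates, key=lambda t: char_perplexity(corpus, vocab - {t})
        )
        vocab.remove(least_useful)
    return vocab


if __name__ == "__main__":
    corpus = ["lower", "lowest", "newer", "newest", "low", "new"]
    initial = {"low", "new", "er", "est", "lo", "ow", "we", "xyz"}
    final = refine(corpus, initial, target_size=4)
    print(sorted(final), char_perplexity(corpus, final))
```

In a real setting the unigram model would be replaced by the LLM being trained, and the leave-one-out evaluation, which costs one perplexity computation per candidate token, would need a cheaper per-token score.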
