Poster
Enhancing Large Language Models through Adaptive Tokenizers
Mengyu Zheng · Hanting Chen · Tianyu Guo · Chong Zhu · Binfan Zheng · Chang Xu · Yunhe Wang
Tokenizers serve as crucial interfaces between models and linguistic data, substantially influencing the efficacy and precision of large language models (LLMs). Traditional tokenization methods often rely on static frequency-based statistics and are not inherently synchronized with LLM architectures, which may limit model performance. In this study, we propose a simple but effective method to learn a tokenizer specifically engineered for seamless integration with LLMs. Starting with a broad initial lexicon, we refine our tokenizer by monitoring changes in the model’s perplexity during training, allowing for the selection of a tokenizer that is closely aligned with the model’s evolving dynamics. Through iterative refinement, we develop an optimized tokenizer. Our empirical evaluations demonstrate that this adaptive approach significantly enhances accuracy compared to conventional methods while maintaining comparable vocabulary sizes, affirming its potential to improve LLM functionality.
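The iterative refinement described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how perplexity changes are converted into per-token decisions, so everything below is an assumption made for illustration: the `score_fn` callback is a hypothetical stand-in for whatever model-derived signal is used, and the names `prune_step`, `adapt_vocab`, and `shrink` are invented here and are not taken from the paper.

```python
from typing import Callable, Dict, Iterable, Set


def prune_step(vocab: Set[str], scores: Dict[str, float], n_remove: int) -> Set[str]:
    """Drop the n_remove lowest-scoring multi-character tokens."""
    # Assumption: single characters are always kept so any text stays encodable.
    candidates = sorted(
        (t for t in vocab if len(t) > 1),
        key=lambda t: (scores.get(t, 0.0), t),  # ties broken alphabetically
    )
    return vocab - set(candidates[:n_remove])


def adapt_vocab(
    initial_vocab: Iterable[str],
    score_fn: Callable[[Set[str]], Dict[str, float]],
    target_size: int,
    shrink: float = 0.5,
) -> Set[str]:
    """Shrink a broad initial vocabulary toward target_size in several rounds.

    score_fn is hypothetical: in a real setup it would train or evaluate the
    model with the current vocabulary and return a per-token score derived
    from the observed loss or perplexity (higher means more useful).
    """
    vocab = set(initial_vocab)
    while len(vocab) > target_size:
        scores = score_fn(vocab)  # re-score each round, as the model evolves
        excess = len(vocab) - target_size
        n_remove = max(1, int(excess * shrink))
        pruned = prune_step(vocab, scores, n_remove)
        if pruned == vocab:  # nothing left to remove except single characters
            break
        vocab = pruned
    return vocab
```

Re-scoring inside the loop, rather than ranking tokens once, is what lets the selection follow the model as it trains; a one-shot ranking would reduce to a static criterion of the kind the abstract contrasts against.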