Poster in Workshop: 5th Workshop on Self-Supervised Learning: Theory and Practice
For Perception Tasks: The Cost of LLM Pretraining by Next-Token Prediction Outweigh its Benefits
Randall Balestriero · Hai Huang
We question the usefulness of next-token prediction pretraining for the ability of Large Language Models (LLMs) to solve perception tasks, e.g., sentiment analysis, spam detection, or toxicity detection. While companies spend tremendous resources to increase the scale of their pretraining, and users flock to pretrained LLMs to solve their downstream tasks, it remains unclear how beneficial pretraining really is at making LLMs apt to solve perception tasks. In fact, we provide empirical evidence that training from scratch, i.e., starting from a randomly initialized LLM, closely competes with, and sometimes exceeds, the performance of a LoRA fine-tuned pretrained LLM. These findings expose some first limitations of next-token prediction as a universal pretraining strategy, as its benefits in terms of final performance are often minimal. A surprising takeaway also concerns the ability to train LLMs with billions of parameters on very small datasets of a few thousand samples while still producing highly accurate predictions on unseen data, i.e., the implicit bias of the architecture seems to far exceed what was commonly known in the community. We hope that our findings will serve as motivation to consider LLM training from random weights as a viable solution even in small data regimes.