Poster
in
Workshop: Statistical Frontiers in LLMs and Foundation Models
MisMo: More is More in Alignment
Benjamin Feuer · Micah Goldblum · Teresa Datta · Raz Besaleli · Samuel Dooley · Max Cembalest · John Dickerson
Keywords: [ llms ] [ alignment ] [ bias ] [ benchmark ]
Since the release of ChatGPT in November 2022, interest in post-training has exploded, and an avalanche of new methods has been introduced. In this work, we ask a simple question: does progress on alignment with human values translate to other, more concrete metrics? While the foundational works in the area, which predate ChatGPT, included extensive evaluations, more recent works often evaluate exclusively on LLM-judge benchmarks such as MT-Bench, AlpacaEval, and Arena-Hard-Auto. We provide new evidence that LLM judges have powerful implicit biases, prioritizing style over factuality and safety. To better gauge progress on the alignment problem, we introduce MisMo-Bench, a new meta-benchmark, and conduct the largest meta-analysis of post-training methods to date. We show that the SFT stage of post-training has a far greater impact on downstream metrics than preference optimization, with data scaling and prompt diversity as the driving factors.