Poster
MixEval: Fast and Dynamic Human Preference Approximation with LLM Benchmark Mixtures
Jinjie Ni · Fuzhao Xue · Xiang Yue · Yuntian Deng · Mahir Shah · Kabir Jain · Graham Neubig · Yang You
Evaluating large language models (LLMs) in real-world scenarios is challenging. Traditional human-annotated benchmarks fail to capture the diversity and subtlety of real-world queries, while LLM-as-judge benchmarks suffer from preference biases and limited query quantity. Both approaches also become contaminated over time due to their static nature. Crowdsourcing human preferences, such as in the widely-used Chatbot Arena, provides valuable insights but is costly and time-consuming. To address these challenges, we propose MixEval, which bridges the gap between real-world human queries and efficient, reproducible evaluation by mining user queries from the web and matching them with similar queries from existing benchmarks. MixEval is highly aligned with Chatbot Arena, achieving a 0.96 Spearman correlation, significantly exceeding that of existing singular benchmarks. Moreover, MixEval runs locally and quickly (1/15 of the time required for MMLU), eliminating the need for slow and costly human preference data collection. Its data points can be refreshed within 1 minute, with a score standard deviation of merely 0.36 (on a 0-100 scale) across versions, effectively mitigating the benchmark contamination issue. We comprehensively analyze MixEval and other popular LLM benchmarks to deepen the community's understanding of LLM evaluation and guide future research directions. We will release and periodically update the benchmark data and related code.
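The core idea above is to pair web-mined user queries with similar queries from existing benchmarks so that graded, ground-truth items stand in for real-world traffic. The sketch below illustrates one plausible way to do that matching with embedding similarity; the embedding model, the example queries, and the nearest-neighbor matching step are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch: match web-mined queries to their most similar
# existing-benchmark queries via embedding similarity.
# Assumptions: the embedder, field names, and data below are hypothetical.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Hypothetical web-mined user queries.
web_queries = [
    "How do I fix a leaking kitchen faucet?",
    "What's the difference between TCP and UDP?",
]

# Hypothetical pool of ground-truth-annotated benchmark queries.
benchmark_pool = [
    {"query": "Which protocol is connection-oriented, TCP or UDP?", "benchmark": "MMLU"},
    {"query": "Describe the steps to repair a dripping tap.", "benchmark": "TriviaQA"},
    {"query": "What is the boiling point of water at sea level?", "benchmark": "TriviaQA"},
]

# Any sentence embedder works for this sketch; the specific model is an assumption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
web_emb = encoder.encode(web_queries, normalize_embeddings=True)
pool_emb = encoder.encode([item["query"] for item in benchmark_pool],
                          normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized embeddings.
similarity = web_emb @ pool_emb.T       # shape: (n_web, n_pool)
best_match = similarity.argmax(axis=1)  # nearest benchmark query per web query

# The resulting mixture keeps the matched benchmark items (and their
# ground-truth answers), so models can be scored locally and reproducibly.
mixture = [benchmark_pool[j] for j in best_match]
for web_q, item in zip(web_queries, mixture):
    print(f"web query: {web_q!r}\n  -> matched {item['benchmark']} item: {item['query']!r}\n")
```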