Poster in Workshop: Safe Generative AI
Large Language Model Benchmarks Do Not Test Reliability
Joshua Vendrow · Edward Vendrow · Sara Beery · Aleksander Madry
When deploying large language models (LLMs) in real-world applications, it is important to ensure that these models are not only capable, but also reliable: consistently accurate on a task such that we can be confident in their outputs. We contend that while many benchmarks have been created to track models' growing capabilities, we lack benchmarks that can properly assess whether models can reliably perform any of their capabilities, even simple ones. We suggest that this gap in benchmarking stems from the practice of discarding benchmarks once performance on them is "saturated," with remaining errors dismissed as mistakes in the benchmarks themselves; consequently, genuine model failures may remain lost in the noise, hiding unreliable behavior. To better evaluate the reliability of LLMs, we propose the construction of platinum benchmarks: benchmarks on which a reliable and performant model should achieve 100% accuracy, making perfect performance the criterion for success. As an initial attempt at constructing such a benchmark, we carefully clean examples from 10 existing benchmarks to minimize label errors and ambiguity, turning their ground-truth labels from "gold" to "platinum." Evaluating current frontier LLMs on our benchmark, we find that these models indeed still exhibit failures on simple tasks, failures that existing benchmarks are too noisy to catch.
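To make the evaluation criterion concrete, here is a minimal sketch of how a platinum-style evaluation differs from standard accuracy reporting: any disagreement with a verified label counts as a genuine failure, and a model passes only at zero errors. The function and argument names are hypothetical placeholders for illustration, not the authors' released code or data format.

```python
from typing import Callable, Iterable, Tuple


def evaluate_platinum(
    examples: Iterable[Tuple[str, str]],   # (prompt, verified "platinum" label) pairs
    model: Callable[[str], str],           # maps a prompt to the model's answer
) -> dict:
    """Count every disagreement with a verified label as a failure.

    Unlike averaged accuracy, the headline result is the raw error count,
    and "passed" is True only with perfect performance.
    """
    failures = []
    total = 0
    for prompt, label in examples:
        total += 1
        prediction = model(prompt).strip()
        if prediction != label.strip():
            failures.append((prompt, label, prediction))
    return {
        "total": total,
        "errors": len(failures),
        "passed": len(failures) == 0,  # 100% accuracy is the criterion for success
        "failures": failures,          # kept for inspection, since each one matters
    }
```

The design choice this sketch illustrates is that residual errors are surfaced individually for inspection rather than averaged away, which is what allows a cleaned benchmark to distinguish true model failures from label noise.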