

In-person presentation in Workshop: Attributing Model Behavior at Scale (ATTRIB)

Evaluation Beyond Task Performance (Milad Nasr)


Abstract:

As we increasingly release and productionize machine learning models, we focus primarily on their performance on a suite of downstream benchmark tasks. However, improved performance on these benchmarks does not equate to improvement across the board. In this talk, we discuss evaluations that live on an entirely separate axis. In particular, we show that as models get larger, their outputs contain more memorized training examples. These issues are not random artifacts: they are not solved by simply scaling models further, nor are they easily prevented in production models.
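
To make the memorization claim concrete, the sketch below illustrates one common way to test whether a model reproduces a training example verbatim: prompt it with a prefix of the example and check whether greedy decoding regenerates the true continuation. This is a minimal illustration, not the evaluation used in the talk; the model name, prefix/suffix lengths, and the `is_memorized` helper are assumptions introduced for this example.

```python
# Minimal sketch of a verbatim-memorization check: prompt a causal LM with
# the first PREFIX_TOKENS of a training example and test whether greedy
# decoding reproduces the next SUFFIX_TOKENS exactly.
# MODEL_NAME, PREFIX_TOKENS, and SUFFIX_TOKENS are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # assumption: any causal LM checkpoint
PREFIX_TOKENS = 50       # assumption: prompt length in tokens
SUFFIX_TOKENS = 50       # assumption: continuation length to compare

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def is_memorized(example_text: str) -> bool:
    """Return True if the model reproduces the example's suffix verbatim."""
    ids = tok(example_text, return_tensors="pt").input_ids[0]
    if ids.shape[0] < PREFIX_TOKENS + SUFFIX_TOKENS:
        return False  # example too short to run the check
    prefix = ids[:PREFIX_TOKENS].unsqueeze(0)
    true_suffix = ids[PREFIX_TOKENS:PREFIX_TOKENS + SUFFIX_TOKENS]
    with torch.no_grad():
        out = model.generate(
            prefix,
            max_new_tokens=SUFFIX_TOKENS,
            do_sample=False,                 # greedy decoding
            pad_token_id=tok.eos_token_id,
        )
    generated_suffix = out[0, PREFIX_TOKENS:PREFIX_TOKENS + SUFFIX_TOKENS]
    return torch.equal(generated_suffix, true_suffix)

# Usage: estimate a memorization rate over a sample of training documents.
# training_sample = [...]  # a list of strings drawn from the training set
# rate = sum(is_memorized(t) for t in training_sample) / len(training_sample)
```

Running such a check across model sizes is one way to observe the trend described above: the fraction of examples reproduced verbatim tends to grow with model scale.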
