Lightning Talk in Workshop: Data Centric AI
Towards Systematic Evaluation in Machine Learning through Automated Stress Test Creation
Many machine learning (ML) models that perform well on canonical benchmarks are nonetheless brittle. This has led to a broad assortment of alternative benchmarks for ML evaluation, each relying on its own distinct process of generation, selection, or curation. In this work, we look towards organizing principles for a systematic approach to measuring model performance. We introduce a framework that unifies the literature on stress testing and discuss how specific criteria shape which samples are included in or excluded from a test. As a concrete example of this framework, we present NOOCh: a suite of scalably generated, naturally-occurring stress tests, and show how varying testing criteria can be used to probe specific failure modes. Experimentally, we explore the tradeoffs between various learning approaches on these tests and demonstrate how test design choices can yield varying conclusions.
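To make criterion-driven test construction concrete, the sketch below shows one way an inclusion criterion could be applied to an evaluation set to produce a stress test. This is a minimal illustration, not the construction used for NOOCh: the `Sample` structure, the context tags, and the two example criteria are hypothetical assumptions.

```python
# Illustrative sketch (assumed, not the paper's implementation): building a
# stress test by filtering an evaluation set with an explicit inclusion criterion.
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Sample:
    label: int                  # ground-truth label for the target concept
    context_tags: Set[str]      # hypothetical side information about co-occurring context


# A criterion maps a sample to True if it should be included in the stress test.
Criterion = Callable[[Sample], bool]


def build_stress_test(eval_set: List[Sample], criterion: Criterion) -> List[Sample]:
    """Select the subset of the evaluation set that satisfies the criterion."""
    return [s for s in eval_set if criterion(s)]


# Two hypothetical criteria, each probing a different failure mode:
# 1) positives whose usual co-occurring context is absent,
def missing_context(s: Sample) -> bool:
    return s.label == 1 and "water" not in s.context_tags


# 2) negatives whose context could suggest a spurious positive.
def misleading_context(s: Sample) -> bool:
    return s.label == 0 and "water" in s.context_tags


# Varying the criterion yields different tests from the same data, e.g.:
#   hard_positives = build_stress_test(eval_set, missing_context)
#   hard_negatives = build_stress_test(eval_set, misleading_context)
```

The design point this sketch is meant to convey is that the test is determined by the choice of criterion as much as by the underlying data, so different criteria applied to the same evaluation set can lead to different conclusions about a model.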