Poster
No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery
Alexander Rutherford · Michael Beukman · Timon Willi · Bruno Lacerda · Nick Hawes · Jakob Foerster
Which data or environments to use for training in order to improve downstream performance is a longstanding and highly topical question. In particular, Unsupervised Environment Design (UED) methods have gained recent attention because their adaptive curricula enable agents to be robust to in- and out-of-distribution tasks. We ask to what extent these methods are themselves robust when applied to a novel setting, closely inspired by a real-world problem. Surprisingly, we find that none of the state-of-the-art UED methods improve upon the naive baseline of Domain Randomisation (DR). Our analysis shows that this is due to their underlying scoring functions failing to predict intuitive measures of learnability, i.e., they fail to identify the settings that the agent sometimes solves, but not always. This is further supported by our finding that using true regret as a scoring function yields the most robust performance in settings where we can compute it. Based on this, we instead directly upsample levels with high learnability and find that this simple and intuitive approach outperforms not just UED methods but also DR on our domain and the standard UED goal-oriented domain of MiniGrid. We tried our best to make current UED methods work for our setting before going back to basics and developing our new scoring function, which took only a few days and worked out of the box. We hope the lessons we learned will help others spend their time more wisely. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We will open-source all our code and present visualisations of final policies here: https://anonymous.4open.science/r/sampling-for-learnability-3594.
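As a rough illustration of the idea of upsampling levels by learnability, the sketch below scores each level by how often the agent sometimes, but not always, solves it and samples training levels in proportion to that score. This is a minimal sketch, not the authors' implementation: the p * (1 - p) scoring, the function names, and the uniform fallback are assumptions made for illustration; the exact scoring and sampling scheme used in the paper may differ.

```python
import numpy as np

def learnability_scores(success_rates: np.ndarray) -> np.ndarray:
    """Assumed learnability score p * (1 - p): zero for levels the agent
    always solves or never solves, highest for levels solved about half the time."""
    return success_rates * (1.0 - success_rates)

def sample_levels(success_rates: np.ndarray, num_samples: int,
                  rng: np.random.Generator) -> np.ndarray:
    """Upsample level indices in proportion to their learnability score."""
    scores = learnability_scores(success_rates)
    if scores.sum() == 0.0:
        # No informative levels yet: fall back to uniform sampling (i.e. DR).
        probs = np.full(len(scores), 1.0 / len(scores))
    else:
        probs = scores / scores.sum()
    return rng.choice(len(scores), size=num_samples, p=probs)

# Hypothetical per-level success rates estimated from recent rollouts.
rng = np.random.default_rng(0)
success_rates = np.array([0.0, 0.5, 0.9, 1.0])
print(sample_levels(success_rates, num_samples=10, rng=rng))
```

In this sketch, the level with a 0.5 success rate is sampled most often, while levels that are already mastered or currently unsolvable are rarely (or never) selected.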