

Poster in Workshop on Machine Learning Safety

Adversarial Policies Beat Professional-Level Go AIs

Tony Wang · Adam Gleave · Nora Belrose · Tom Tseng · Michael Dennis · Yawen Duan · Viktor Pogrebniak · Joseph Miller · Sergey Levine · Stuart Russell


Abstract:

We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win rate against KataGo without search, and a >50% win rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo; in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes, and that AI systems which are normally superhuman may still be less robust than humans. Example games are available at https://goattack.alignmentfund.org/.
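The core idea in the abstract is training a learning adversary while the victim's parameters stay fixed. The sketch below is a toy illustration of that frozen-victim setup only, not the paper's actual method or codebase: the matching-pennies game, the REINFORCE update, and all variable names are hypothetical choices standing in for the real KataGo environment and training pipeline.

```python
import numpy as np

# Toy zero-sum game (matching pennies) illustrating frozen-victim
# adversarial training: only the adversary's parameters are updated,
# while the victim policy is never touched.

rng = np.random.default_rng(0)

# Frozen victim: a fixed stochastic policy over two actions.
victim_probs = np.array([0.7, 0.3])  # slightly exploitable, never updated

# Adversary: a single logit vector trained with REINFORCE.
adv_logits = np.zeros(2)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def play_episode():
    """One round: the adversary wins (+1) iff its action mismatches the victim's."""
    adv_probs = softmax(adv_logits)
    a_adv = rng.choice(2, p=adv_probs)
    a_vic = rng.choice(2, p=victim_probs)
    reward = 1.0 if a_adv != a_vic else -1.0
    return a_adv, reward

lr = 0.1
for step in range(2000):
    a_adv, reward = play_episode()
    adv_probs = softmax(adv_logits)
    # REINFORCE for a softmax policy: grad of log pi(a) wrt logits = onehot(a) - probs
    grad_log_pi = -adv_probs
    grad_log_pi[a_adv] += 1.0
    adv_logits += lr * reward * grad_log_pi  # victim stays frozen throughout

# The adversary concentrates on the action that mismatches the victim's
# most likely move, exploiting the fixed policy rather than "playing well".
print("adversary policy:", softmax(adv_logits))
```

The design point this mirrors is that the adversary needs no general skill: it only needs a policy that exploits the specific fixed opponent, which is why the paper's adversary can beat KataGo while losing to human amateurs.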
