Poster in Workshop: Red Teaming GenAI: What Can We Learn from Adversaries?
Algorithmic Oversight for Deceptive Reasoning
Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
Keywords: [ alignment ] [ oversight ] [ deceptive reasoning ] [ LLMs ]
This paper investigates the oversight problem in which a large language model (LLM) produces output that may contain deliberate adversarial errors and an oversight LLM/agent aims to detect them. We study this question in the context of mathematical reasoning. Our study proceeds in two primary steps. First, we develop attack strategies aimed at inducing deliberate reasoning errors that can deceive the oversight agent. Here, we find that even strong models can be deceived, which highlights the need for defense mechanisms. Second, we propose a set of defense mechanisms that protect against these attacks by augmenting oversight capabilities. Through these, we find that structured prompting, fine-tuning, and greybox access can noticeably improve detection accuracy. Specifically, we introduce ProbShift, a novel algorithm that uses the token probabilities of the generated text for detection. We find that ProbShift can outperform GPT-3.5 and can be further boosted with LLM-based oversight. Overall, this work demonstrates the feasibility and importance of developing algorithmic oversight mechanisms for LLMs, with an emphasis on complex tasks requiring logical/mathematical reasoning.
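The abstract describes ProbShift only as a detector based on the token probabilities of the generated text; its actual algorithm is not given here. Below is a minimal illustrative sketch of one such greybox detector, assuming access to per-token log-probabilities of a candidate solution. The sliding-window statistic, window size, and z-score threshold are assumptions made for illustration, not the paper's method.

```python
# Illustrative token-probability detector in the spirit of ProbShift (not the
# paper's algorithm): flag a solution if it contains a contiguous span of
# unusually low-probability tokens, on the intuition that an injected
# reasoning error looks "surprising" to the scoring model.

import numpy as np


def probshift_flag(token_logprobs, window=8, z_threshold=-2.0):
    """Return True if any sliding window of tokens is anomalously unlikely.

    token_logprobs: per-token log-probabilities of the candidate solution,
    obtained via greybox access to the generating (or a proxy) model.
    window, z_threshold: illustrative hyperparameters (assumptions).
    """
    lp = np.asarray(token_logprobs, dtype=float)
    if lp.size < window:
        return False
    mu, sigma = lp.mean(), lp.std() + 1e-8
    # Mean log-probability of every contiguous window of `window` tokens.
    window_means = np.convolve(lp, np.ones(window) / window, mode="valid")
    # Standardize each window against whole-solution statistics.
    z_scores = (window_means - mu) / (sigma / np.sqrt(window))
    return bool(z_scores.min() < z_threshold)


if __name__ == "__main__":
    # Toy example: a solution whose middle span drops sharply in probability.
    rng = np.random.default_rng(0)
    clean = rng.normal(-1.5, 0.3, size=60)
    tampered = clean.copy()
    tampered[25:35] -= 2.0  # simulate a low-probability (suspicious) span
    print(probshift_flag(clean))     # expected: False
    print(probshift_flag(tampered))  # expected: True
```

In practice the abstract suggests such a probability-based signal can be combined with LLM-based oversight (e.g., using the flag as an additional input to the oversight agent), which is where the reported boost over GPT-3.5 comes from.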