Poster in Workshop: Towards Safe & Trustworthy Agents
Algorithmic Oversight for Deceptive Reasoning
Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
This paper examines the detection of adversarial or deliberate reasoning errors in large language models (LLMs) within the context of mathematical reasoning. Our research proceeds in two steps. First, we develop strategies to induce reasoning errors in LLMs, revealing vulnerabilities even in strong models. Second, we introduce defense mechanisms, including structured prompting, fine-tuning, and greybox access, which significantly improve detection accuracy. We also present ProbShift, a novel algorithm that uses token probabilities to improve deceptive reasoning detection, outperforming GPT-3.5 when coupled with LLM-based oversight. Our findings underscore the importance and effectiveness of algorithmic oversight mechanisms for LLMs in complex reasoning tasks.
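The abstract does not describe how ProbShift operates beyond its use of token probabilities. A minimal sketch of one plausible reading follows, assuming the detector scores each reasoning step by its mean token log-probability under a scoring model and flags steps whose probability shifts sharply downward relative to earlier steps; the function name `probshift_flags`, the threshold, and this interpretation are illustrative assumptions, not the paper's actual algorithm.

```python
from typing import List


def probshift_flags(step_token_logprobs: List[List[float]],
                    threshold: float = 1.0) -> List[bool]:
    """Flag reasoning steps whose mean token log-probability drops sharply
    below the running average of earlier steps.

    Hypothetical illustration of a token-probability-shift detector; the
    actual ProbShift algorithm may differ.
    """
    flags: List[bool] = []
    history: List[float] = []  # mean log-probs of previously seen steps
    for logprobs in step_token_logprobs:
        mean_lp = sum(logprobs) / max(len(logprobs), 1)
        if history:
            baseline = sum(history) / len(history)
            # A large downward shift in probability marks a suspicious step.
            flags.append(baseline - mean_lp > threshold)
        else:
            flags.append(False)  # no baseline yet for the first step
        history.append(mean_lp)
    return flags


# Example with made-up log-probabilities: the third step is far less
# probable under the scoring model and gets flagged.
steps = [[-0.4, -0.6, -0.5], [-0.5, -0.7], [-3.2, -2.9, -3.5]]
print(probshift_flags(steps))  # -> [False, False, True]
```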