Poster in Workshop: Foundation Model Interventions
Algorithmic Oversight for Deceptive Reasoning
Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
Keywords: alignment, oversight, deceptive reasoning, LLMs
This paper examines the detection of adversarial or deliberate reasoning errors in large language models (LLMs) within the context of mathematical reasoning. Our research proceeds in two steps. First, we develop strategies to induce reasoning errors in LLMs, revealing vulnerabilities even in strong models. Second, we introduce defense mechanisms, including structured prompting, fine-tuning, and greybox access, that significantly improve detection accuracy. We also present ProbShift, a novel algorithm that uses token probabilities to detect deceptive reasoning and, when coupled with LLM-based oversight, outperforms GPT-3.5. Our findings underscore the importance and effectiveness of algorithmic oversight mechanisms for LLMs on complex reasoning tasks.
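The abstract does not spell out how ProbShift works beyond its use of token probabilities under greybox access. The sketch below is only an illustration of that general idea, not the paper's algorithm: it scores each reasoning step by its mean token log-probability under a scoring model and flags steps whose score drops sharply relative to the preceding steps. The function name, threshold, and example data are all hypothetical.

```python
# Illustrative sketch (NOT the paper's ProbShift implementation): flag reasoning
# steps whose mean token log-probability, as scored by a greybox model, shifts
# sharply downward relative to earlier steps. Names and thresholds are hypothetical.
from statistics import mean, pstdev

def flag_probability_shifts(step_token_logprobs, z_thresh=2.0, min_history=2):
    """step_token_logprobs: one list of token log-probs per reasoning step.
    Returns indices of steps whose score is an unusually low outlier."""
    step_scores = [mean(tokens) for tokens in step_token_logprobs]
    flagged = []
    for i, score in enumerate(step_scores):
        if i < min_history:
            continue  # need a few steps to estimate a baseline
        history = step_scores[:i]
        mu, sd = mean(history), max(pstdev(history), 1e-6)
        if (score - mu) / sd < -z_thresh:  # sharp downward probability shift
            flagged.append(i)
    return flagged

# Example: the third step is far less likely under the scoring model.
steps = [[-0.8, -1.1, -0.9], [-1.0, -0.7, -1.2], [-4.5, -5.1, -3.9], [-0.9, -1.0]]
print(flag_probability_shifts(steps))  # -> [2]
```

In practice, the flagged steps could then be handed to an LLM-based overseer for a final judgment, which matches the abstract's description of coupling token-probability signals with LLM oversight.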