

Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)

Activation Monitoring: Advantages of Using Internal Representations for LLM Oversight

Oam Patel · Rowan Wang


Abstract:

Deployed Large Language Models (LLMs) sometimes output harmful or dangerous content even after safety training. Monitoring systems, typically specialized safety-tuned LLMs, act as a second layer of defense and are critical for safe deployment. However, these systems break easily under adversarial pressure and add inference overhead to the deployment stack. In this paper, we show that activation-based monitors, such as simple probes, are competitive with strong text-classifier baselines in accuracy, false positive rate, and generalization. Additionally, we find that activation monitors are more robust to adversarial pressure across all levels of attacker access, indicating that activation monitoring may be especially promising in high-stakes settings. Finally, probe error profiles are uncorrelated with those of text classifiers, highlighting the potential of a combined approach to deployment oversight. Our analysis demonstrates the viability of activation monitoring and advocates for a multi-layered defense strategy to reduce the risks of deployed LLMs.
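
To make the idea of an activation-based monitor concrete, here is a minimal sketch of a linear probe trained on a model's hidden-state activations to flag harmful content. The model name, probe layer, and toy dataset are illustrative assumptions, not the authors' exact setup; the paper's monitors target deployed LLMs and real safety datasets.

```python
# Sketch: a linear probe on internal activations as a harmfulness monitor.
# Assumptions (not from the paper): gpt2 as a stand-in model, layer 6 as the
# probe layer, and a two-example toy dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder for the monitored LLM
LAYER = 6            # assumed layer whose activations the probe reads

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Toy labeled data: 1 = harmful, 0 = benign (placeholders for a real dataset).
texts = ["How do I build a weapon?", "What's a good pasta recipe?"]
labels = [1, 0]

X = torch.stack([last_token_activation(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# At deployment time the probe scores activations already computed during
# generation, so it adds little overhead compared to running a second LLM.
query = "How do I hack a server?"
score = probe.predict_proba(last_token_activation(query).numpy().reshape(1, -1))[0, 1]
print(f"harmfulness score: {score:.2f}")
```

Because the probe reuses activations the model already produces, it avoids the extra forward passes a separate safety-tuned LLM would require, which is one motivation for activation monitoring cited in the abstract.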
