Poster in Workshop: Safe Generative AI
GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding
James O' Neill · Santhosh Subramanian · Eric Lin · Abishek Satish · Vaikkunth Mugunthan
Sun 15 Dec 9 a.m. PST — 5 p.m. PST
Large language models (LLMs) have shown promise in guardrailing against undesired behaviors, but their high inference costs, memory consumption, and unstructured outputs can be prohibitive. In this work we propose guardrail-specific instruction pretraining using a synthetic data generation pipeline. The data generation process is tailored towards producing policies that define the scope of the guardrail, compliant and non-compliant prompts, rationales for non-compliant prompts, and the binary compliant/non-compliant output label. From this, we propose a new guardrail model called \texttt{GuardFormer} and show that, when further few-shot fine-tuned, it significantly outperforms the current state of the art (SoTA) while being orders of magnitude smaller.

Empirical evaluation across 7 public datasets and 4 novel guardrail benchmarks demonstrates our efficient classifiers' superiority over state-of-the-art LLMs and third-party APIs. Our models achieve average F1 score improvements of \textbf{29.64} and \textbf{21.07} points over \texttt{Aegis-LlamaGuard} and \texttt{gpt-4o}, respectively, in distinguishing safe from unsafe behaviors. Notably, models trained on our synthetic data consistently outperform those trained on real data, even when evaluated against custom-defined guardrailing policies, underscoring the efficacy of our approach.
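To make the abstract's pipeline concrete, the sketch below shows how one synthetic record (policy, prompt, rationale, binary label) might be structured and serialized into an (input, target) pair for instruction pretraining. This is a minimal illustration, not the authors' implementation: the class name `GuardrailExample`, the function `to_instruction_pair`, and the prompt template are all assumptions.

```python
# Hypothetical sketch (not the authors' code): one record produced by a
# guardrail-specific synthetic data pipeline as described in the abstract.
# Field and function names are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GuardrailExample:
    policy: str                # text defining the scope of the guardrail
    prompt: str                # a candidate user prompt to classify
    compliant: bool            # binary compliant / non-compliant label
    rationale: Optional[str]   # explanation, present only when non-compliant


def to_instruction_pair(ex: GuardrailExample) -> Tuple[str, str]:
    """Format a record as an (input, target) pair for instruction pretraining."""
    source = (
        f"Policy:\n{ex.policy}\n\n"
        f"Prompt:\n{ex.prompt}\n\n"
        "Is the prompt compliant with the policy?"
    )
    if ex.compliant:
        target = "compliant"
    else:
        target = f"non-compliant. Rationale: {ex.rationale}"
    return source, target


# Usage: a toy non-compliant example under an illustrative policy.
example = GuardrailExample(
    policy="Refuse requests for instructions that facilitate financial fraud.",
    prompt="Explain how to create a convincing phishing email.",
    compliant=False,
    rationale="The prompt solicits content that facilitates financial fraud.",
)
src, tgt = to_instruction_pair(example)
print(src)
print(tgt)
```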