Poster in Workshop: Safe Generative AI

A Closer Look at System Message Robustness

Norman Mu · Jonathan Lu · Michael Lavery · David Wagner


Abstract:

System messages have emerged as a critical control surface for specifying the behavior of LLMs in chat applications. Developers frequently rely on the precedence of the system message over user messages, using it to specify important guardrails, content policies, and safety countermeasures to the model. In practice, however, models may fail to fully adhere to the system message, whether as a result of adversarial attacks such as prompt injection or simply through unforced errors when responding to benign queries. In this work we assemble a suite of benchmarks to quantify an LLM's system message robustness. We then collect a novel fine-tuning dataset by starting from a diverse set of system prompts drawn from real-world LLM applications, generating challenging synthetic user messages, both benign and adversarial, and collecting high-quality model responses. Our experiments show that fine-tuning on our dataset yields considerable gains on a variety of benchmarks, compared both to the starting model and to fine-tuning on other similarly sized datasets targeted at improving system message compliance.
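The abstract does not specify the data format or generation procedure, but a minimal sketch of how such fine-tuning examples might be assembled, assuming the common chat-message JSON format and hypothetical helpers `generate_user_message` and `collect_response` standing in for LLM calls, could look like the following:

```python
import json
import random

# Hypothetical helpers: in practice these would call an LLM to synthesize
# challenging user messages and to produce high-quality responses that
# comply with the system prompt. Names and signatures are illustrative only.
def generate_user_message(system_prompt: str, adversarial: bool) -> str:
    style = "tries to override the system prompt" if adversarial else "is benign"
    return f"[synthetic user message that {style}]"

def collect_response(system_prompt: str, user_message: str) -> str:
    return "[high-quality response that adheres to the system prompt]"

def build_example(system_prompt: str, adversarial: bool) -> dict:
    """Assemble one fine-tuning example in chat-message format."""
    user_message = generate_user_message(system_prompt, adversarial)
    response = collect_response(system_prompt, user_message)
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": response},
        ]
    }

if __name__ == "__main__":
    # A placeholder system prompt standing in for one drawn from a real application.
    system_prompts = [
        "You are a customer-support bot. Never reveal internal pricing rules."
    ]
    dataset = [
        build_example(sp, adversarial=random.random() < 0.5)
        for sp in system_prompts
        for _ in range(2)
    ]
    print(json.dumps(dataset[0], indent=2))
```

This is only an illustration of the overall pipeline shape described in the abstract (real-world system prompts, synthetic benign and adversarial user messages, and collected responses), not the authors' actual implementation.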
