Poster
Rule Based Rewards for Fine-Grained Safety Behavior in LLMs
Tong Mu · Alec Helyar · Johannes Heidecke · Joshua Achiam · Andrea Vallone · Ian Kivlichan · Molly Lin · Alex Beutel · John Schulman · Lilian Weng
Fine-tuning large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, if instructions are underspecified, annotators may have to rely on personal biases, producing unintended behaviors. This is especially the case for safety, where desired model responses are complex, requiring nuance on whether and how to respond to requests. In these cases, without precise instructions to annotators, the model may become overly cautious, or it may respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a need to revise existing safety instructions or add new ones. This requires relabeling or collecting new data, which is often expensive and time-consuming. We propose a novel preference modeling approach that, with minimal human data, enables fine-grained behavior control and rapid updating through Rule-Based Rewards (RBRs). An RBR is a collection of rules for desired or undesired behaviors (e.g., "refusals should not be judgmental"). We use an RBR with rules scored by an LLM grader as additional reward signals on top of our human-preference reward model (RM) during reinforcement learning training. We show that RBRs are an effective training method, resulting in safety performance comparable to a human-feedback baseline while reducing over-refusals. Additionally, we demonstrate that our RBRs can correct safety behavior across RMs with varying tendencies towards unsafe or overcautious behavior. We also discuss the impact of different design considerations, such as grader model size and optimal integration of RBRs with the RM. Overall, we show RBRs are a fast, adaptive, and scalable method for training safety behavior in LLMs.
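The idea of adding rule scores on top of the RM reward can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how rule scores are combined, so a plain weighted sum is assumed here, and the `Rule` class, the rule names, the weights, and the keyword-based graders (standing in for an LLM grader) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Rule:
    """One rule of a hypothetical RBR (names and weights are illustrative)."""
    name: str
    weight: float  # assumed: positive for desired, negative for undesired behavior
    grade: Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]


# Toy keyword graders standing in for an LLM grader (assumption for this sketch).
def is_judgmental(prompt: str, response: str) -> float:
    return 1.0 if "you should be ashamed" in response.lower() else 0.0


def contains_apology(prompt: str, response: str) -> float:
    return 1.0 if "sorry" in response.lower() else 0.0


RULES = [
    Rule("refusal_is_judgmental", -1.0, is_judgmental),
    Rule("refusal_contains_apology", 0.5, contains_apology),
]


def rbr_score(rules: Sequence[Rule], prompt: str, response: str) -> float:
    """Assumed combination: weighted sum of per-rule grader scores."""
    return sum(rule.weight * rule.grade(prompt, response) for rule in rules)


def total_reward(rm_score: float, rules: Sequence[Rule],
                 prompt: str, response: str) -> float:
    """RM reward plus the rule-based term, used as the RL training signal."""
    return rm_score + rbr_score(rules, prompt, response)
```

In this sketch, changing a safety policy means editing `RULES` or its weights rather than relabeling preference data, which is the kind of rapid updating the abstract describes.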