Poster in Workshop: Pluralistic Alignment Workshop
Chain of Alignment
Andrew Konya · Aviv Ovadya · K. J. Kevin Feng · Quan Ze Chen · Lisa Schirch · Colin Irwin · Amy Zhang
Abstract:
We introduce a method to measure the alignment between public will and language model (LM) behavior that can be applied to fine-tuning, online oversight, and pre-release safety checks. Our "chain of alignment" (CoA) approach produces a rule-based reward (RBR) by creating model behavior rules aligned to normative objectives, which are in turn aligned to public will. This factoring enables a nonexpert public to directly specify their will through the normative objectives, while expert intelligence is used to determine the rules that cause a model to best achieve those objectives. We validate our approach by applying it to three different domains of LM prompts related to mental health. We demonstrate a public input process built on collective dialogues and bridging-based ranking that reliably produces normative objectives supported by at least $96\% \pm 2\%$ of the US public. We then show that rules developed by mental health experts to achieve those objectives enable RBRs that evaluate an LM response's alignment with the objectives similarly to experts (Pearson's $r=0.841$, $AUC=0.964$). By measuring alignment with objectives that have near-unanimous public support, our RBRs thus provide an approximate measure of alignment between an LM response and public will.
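As a rough illustration of the evaluation described in the abstract, the sketch below scores responses with a toy rule-based reward and compares those scores with expert judgments using Pearson's r and AUC. Everything here is an assumption made for illustration: the function names (`rbr_score`, `pearson_r`, `roc_auc`), the weighted-fraction form of the reward, the rule names, and the numbers are hypothetical and are not taken from the paper.

```python
from math import sqrt


def rbr_score(rule_satisfied, rule_weights):
    """Toy rule-based reward: weighted fraction of rules a response satisfies.

    ASSUMPTION: the paper's RBR may combine rule judgments differently.
    """
    total = sum(rule_weights.values())
    met = sum(w for name, w in rule_weights.items() if rule_satisfied[name])
    return met / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical rule judgments for one response (made-up rule names and weights).
example_reward = rbr_score(
    {"acknowledges_feelings": True, "avoids_diagnosis": False},
    {"acknowledges_feelings": 3.0, "avoids_diagnosis": 1.0},
)

# Made-up scores for four responses: RBR output vs. expert ratings.
rbr_scores = [0.9, 0.2, 0.7, 0.4]
expert_scores = [0.8, 0.1, 0.9, 0.3]
expert_aligned = [True, False, True, False]  # binarized expert judgment

r = pearson_r(rbr_scores, expert_scores)
auc = roc_auc(rbr_scores, expert_aligned)
```

Pearson's r captures how closely the RBR scores track the experts' graded ratings, while AUC captures how well the RBR separates responses that experts consider aligned from those they do not.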