

Poster
in
Workshop: Foundation Model Interventions

Is Free Self-Alignment Possible?

Dyah Adila · Changho Shin · Yijing Zhang · Frederic Sala

Keywords: [ representation engineering ] [ self-alignment ]


Abstract:

Aligning pretrained language models (LMs) is a complex and resource-intensive process, often requiring access to large amounts of ground-truth preference data and substantial compute. Are these costs necessary? That is, is it possible to align using only inherent model knowledge and without additional training? We tackle this challenge with AlignEZ, a novel approach that uses (1) self-generated preference data and (2) representation editing to provide nearly cost-free alignment. During inference, AlignEZ modifies LM representations to reduce undesirable and boost desirable components using subspaces identified via self-generated preference pairs. Our experiments reveal that this nearly cost-free procedure significantly narrows the gap between base pretrained and tuned models by an average of 31.6%, observed across six datasets and three model architectures. Additionally, we explore the potential of using AlignEZ as a means of expediting more expensive alignment procedures. Our experiments show that AlignEZ improves DPO models tuned using only a small subset of ground-truth preference data.
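The abstract describes inference-time representation editing driven by self-generated preference pairs. The sketch below illustrates the general idea under stated assumptions; the direction-estimation method (top principal direction), the function names, and the alpha/beta scaling knobs are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of inference-time representation editing in the spirit of
# AlignEZ. All names (preference_directions, edit_hidden, alpha, beta) are
# hypothetical; the paper's exact subspace construction may differ.
import torch

def preference_directions(pref_hidden, dispref_hidden):
    """Estimate desirable/undesirable directions from hidden states of
    self-generated preferred vs. dispreferred responses.

    pref_hidden, dispref_hidden: (num_pairs, hidden_dim) tensors of pooled
    hidden states taken at a chosen transformer layer.
    """
    def top_direction(x):
        # Use the first principal direction of the centered states as a
        # one-dimensional stand-in for the identified subspace.
        x = x - x.mean(dim=0, keepdim=True)
        _, _, vh = torch.linalg.svd(x, full_matrices=False)
        return vh[0] / vh[0].norm()

    return top_direction(pref_hidden), top_direction(dispref_hidden)

def edit_hidden(h, v_good, v_bad, alpha=1.0, beta=1.0):
    """Edit hidden states h of shape (..., hidden_dim): suppress the
    component along the undesirable direction and boost the component
    along the desirable one."""
    proj_bad = (h @ v_bad).unsqueeze(-1) * v_bad     # undesirable component
    proj_good = (h @ v_good).unsqueeze(-1) * v_good  # desirable component
    return h - beta * proj_bad + alpha * proj_good
```

In practice, an edit like `edit_hidden` would be attached as a forward hook on one or more transformer layers so that every generated token passes through the modified representation, which is what makes the procedure training-free.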
