Poster
in
Workshop: Workshop on Responsibly Building Next Generation of Multimodal Foundation Models
Aligning to What? Limits to RLHF Based Alignment
Logan Barnhart · Reza Akbarian Bafghi · Maziar Raissi · Stephen Becker
Keywords: [ alignment ] [ model safety ] [ reinforcement learning from human feedback ] [ red teaming ] [ rlhf ]
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, particularly focusing on biases against African Americans. We applied various RLHF techniques (DPO, ORPO, RLOO) to Llama 3 8B and evaluated the resulting models using matched-guise probing and explicit bias testing.Our findings suggest that RLHF may not effectively align LLMs as intended. In most cases, RLHF either worsened both covert and overt biases or left them relatively unchanged compared to the base model. These results indicate that current RLHF techniques fail to address underlying biases introduced during pretraining, particularly for ambiguous objectives like harmlessness. Our study highlights the need for the development of improved techniques to ensure genuine alignment of LLMs with abstract alignment goals.