Spotlight Poster
Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets
Ike Obi · Rohan Pant · Srishti Shekhar Agrawal · Maham Ghazanfar · Aaron Basiletti
LLMs are increasingly being fine-tuned using RLHF datasets to align them with human preferences and values. However, little research has investigated which specific human values are being operationalized through these datasets. In this paper, we introduce an approach for auditing RLHF datasets to examine the human values and ethical paradigms embedded within them. Our approach involves two phases. During the first phase, we developed a taxonomy of human values through a systematic review of prior works from philosophy, axiology, and ethics and then used this taxonomy to manually annotate a section of the RLHF preferences to gain foundational insight into the kinds of human values and ethical paradigms embedded within the RLHF dataset. During the second phase, we employed the labels generated from the first phase as ground truth labels to train transformer-based models and, using that approach, conducted an automated human values audit of more than 100K RLHF preferences that we contribute via this study. Overall, through these approaches, we discovered the most dominant human values and ethical orientations within the RLHF preference dataset. These findings have significant implications for developing LLMs and AI systems that align with societal values and norms.
Live content is unavailable. Log in and register to view live content