The rapid progress of Large Language Models (LLMs) poses potential risks, such as the generation of unethical content. Assessing the values embedded in LLMs' responses can help expose their misalignment, but doing so relies on reference-free value evaluators, e.g., fine-tuned LLMs or closed-source models like GPT-4. Nevertheless, open-ended value evaluation raises two key challenges: the evaluator should adapt to changing human value definitions with minimal annotation, rather than defaulting to its own biases (adaptability), and it should remain robust across varying value expressions and scenarios (generalizability). To handle these challenges, we introduce CLAVE, a novel framework that integrates two complementary LLMs: a large model that extracts high-level value concepts from diverse responses, leveraging its extensive knowledge and generalizability, and a small model fine-tuned on these concepts to adapt to human value annotations. This dual-model design enables adaptation to any value system with fewer than 100 human-labeled samples per value type. We also present ValEval, a comprehensive dataset comprising 13k+ (text, value, label) tuples across diverse domains and covering three major value systems. We benchmark 15+ popular LLM evaluators and analyze their strengths and weaknesses in depth. Our findings reveal that CLAVE, by combining a large prompt-based model with a small fine-tuned one, strikes a superior balance between adaptability and generalizability in value evaluation.
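The two-stage design can be summarized as a concept-extraction step followed by a classification step. The sketch below is a minimal illustration of that pipeline only; the function names (`evaluate_value`, `large_llm`, `small_classifier`), the prompt template, and the stub backends are all assumptions for exposition, not the authors' actual implementation.

```python
# Minimal sketch of the dual-model evaluation pipeline described above.
# All names and the prompt template are illustrative assumptions.
from typing import Callable

CONCEPT_PROMPT = (
    "Given the value '{value}', list the high-level value concepts "
    "expressed in the following response:\n{response}\nConcepts:"
)

def evaluate_value(
    response: str,
    value: str,
    large_llm: Callable[[str], str],         # prompt-based concept extractor
    small_classifier: Callable[[str], int],  # fine-tuned on concept/label pairs
) -> int:
    """Return a value-alignment label (e.g., 1 = aligned, 0 = misaligned)."""
    # Stage 1: the large model abstracts the free-form response into
    # high-level value concepts, smoothing over surface-level variation
    # in how the value is expressed (generalizability).
    concepts = large_llm(CONCEPT_PROMPT.format(value=value, response=response))
    # Stage 2: the small model, fine-tuned on a handful of human-labeled
    # concept/label pairs, maps the concepts to a label for the target
    # value system (adaptability).
    return small_classifier(concepts)

if __name__ == "__main__":
    # Stub backends so the sketch runs end to end without model weights.
    demo_large = lambda prompt: "fairness; respect for autonomy"
    demo_small = lambda concepts: int("fairness" in concepts)
    print(evaluate_value("Everyone deserves an equal chance.", "fairness",
                         demo_large, demo_small))
```

The point of the split is that the expensive, general model never needs retraining when value definitions change; only the small classifier is re-fit on the new annotations.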