

Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)

ReFeR: A Hierarchical Framework of Models as Evaluative and Reasoning Agents

Yaswanth Narsupalli · Abhranil Chandra · Sreevatsa Muppirala · Manish Gupta · Pawan Goyal

Keywords: [ Reasoning ] [ Large Language Models ] [ Evaluations of LMs ]


Abstract: Assessing the quality of Natural Language Generation (NLG) outputs, such as those produced by large language models (LLMs), poses significant challenges: human evaluation does not scale, and traditional automatic metrics correlate poorly with human judgment. In this study, we propose Review-Feedback-Reason (ReFeR), a novel evaluation framework for NLG that uses LLM agents. The proposed framework improves the accuracy of NLG evaluation, surpassing previous benchmarks by $\sim$20\%. Moreover, the feedback collected by our framework is leveraged to instruction-fine-tune smaller models such as Mistral-7B, yielding better correlation with human evaluations and performance nearly on par with GPT-3.5. We highlight an ancillary benefit of our methodology by applying it to reasoning benchmarks, where it outperforms most state-of-the-art methods and beats GPT-3.5 Turbo by $\sim$11.67\% and GPT-4 by $\sim$1\% on average.
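
The abstract does not spell out the evaluation pipeline, but the framework name (Review-Feedback-Reason) and the hierarchical multi-agent framing suggest a loop in which several LLM agents review a candidate output and a higher-level agent reasons over their collected feedback. The sketch below illustrates one plausible reading of that loop; the agent roles, prompts, scoring scale, and the `call_llm` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a hierarchical "review -> feedback -> reason" evaluation loop,
# loosely following the ReFeR idea described in the abstract. Agent roles, prompts,
# and the `call_llm` helper are assumptions for illustration only.
from dataclasses import dataclass
from statistics import mean


def call_llm(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model` and return its text reply."""
    raise NotImplementedError("Plug in your own LLM client here.")


@dataclass
class Review:
    score: float   # numeric quality rating from one reviewer agent
    feedback: str  # free-text critique passed up to the reasoning agent


def peer_review(candidate: str, reviewers: list[str]) -> list[Review]:
    """Stage 1: several reviewer agents independently rate the NLG output."""
    reviews = []
    for model in reviewers:
        reply = call_llm(
            model,
            f"Rate the following text from 1-10 on the first line, "
            f"then briefly explain your rating:\n{candidate}",
        )
        score_line, _, feedback = reply.partition("\n")
        reviews.append(Review(score=float(score_line.strip()), feedback=feedback.strip()))
    return reviews


def reason_over_reviews(candidate: str, reviews: list[Review], judge: str) -> float:
    """Stage 2: a higher-level agent reasons over the peer feedback for a final score."""
    feedback_block = "\n".join(f"- score {r.score}: {r.feedback}" for r in reviews)
    reply = call_llm(
        judge,
        "Given these peer reviews, reply with a final 1-10 score only.\n"
        f"Text:\n{candidate}\n\nReviews:\n{feedback_block}",
    )
    try:
        return float(reply.strip())
    except ValueError:
        # Fall back to the average peer score if the judge's reply is not numeric.
        return mean(r.score for r in reviews)
```

In this reading, the per-reviewer feedback produced in stage 1 is also the text that could be reused as instruction-tuning data for a smaller model, as the abstract describes for Mistral-7B.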
