Poster in Workshop: Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI

Motivations for Reframing Large Language Model Benchmarking for Legal Applications

Riya Ranjan · Megan Ma

Keywords: [ preference rating ] [ heterogeneity ] [ qualitative evaluation ] [ domain-specific benchmarking ]


Abstract:

With the continued release of increasingly performant large language models, benchmarking LLMs remains a critical area of research. Informative benchmarks in domain-specific areas, however, are limited. In particular, benchmarks of LLMs for legal applications are insufficient: they are often confined to a narrow set of tasks that do not imitate true legal workflows, or they are difficult to replicate because of a lack of transparency about how criteria are determined and outputs are scored. We propose a new framework for benchmarking legal LLMs built around a signal that accurately reflects real legal workflows: lawyer preference. We argue that benchmarking for preference can capture nuances in how legal practitioners evaluate their own work, and thus provides a more suitable measure of the quality of LLMs for legal work.
