Evaluating large language model (LLM) outputs requires users to make critical judgments about which outputs are best across various configurations. This process is costly and time-consuming given the large amounts of data involved. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, detect harms and risks, or assist human evaluators with detailed assessments. Effective front-end tools are therefore critical to support this process. EvalAssist abstracts the LLM-as-a-judge evaluation process into a library of parameterizable evaluators (the criterion being the parameter), allowing the user to focus on criteria definition. EvalAssist consists of a web-based user experience, an API, and a Python toolkit, and is based on the UNITXT open-source library. The user interface provides users with a convenient way of iteratively testing and refining LLM-as-a-judge criteria, and supports both direct (rubric-based) and pairwise assessment, the two most prevalent paradigms of LLM-as-a-judge evaluation. In our demo, we will showcase different types of evaluator LLMs for general-purpose evaluation, as well as the latest Granite Guardian model (released in October 2024) for evaluating harms and risks.
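
To illustrate the parameterizable-evaluator abstraction described above, the following is a minimal Python sketch of how a criterion could serve as the single parameter of direct and pairwise evaluators. The class and function names (Criteria, DirectAssessmentEvaluator, PairwiseComparisonEvaluator, dummy_judge) are illustrative assumptions and do not reflect the actual EvalAssist or UNITXT API.

```python
# Illustrative sketch only: all names here are hypothetical, not the
# actual EvalAssist / UNITXT API.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Criteria:
    """A user-defined criterion: the single parameter of an evaluator."""
    name: str
    description: str
    options: List[str] = field(default_factory=lambda: ["Yes", "No"])


@dataclass
class DirectAssessmentEvaluator:
    """Direct (rubric-based) assessment: judge one output against a criterion."""
    criteria: Criteria
    judge: Callable[[str], str]  # wraps a call to the evaluator LLM

    def evaluate(self, context: str, response: str) -> str:
        prompt = (
            f"Criterion: {self.criteria.name} - {self.criteria.description}\n"
            f"Options: {', '.join(self.criteria.options)}\n"
            f"Context: {context}\nResponse: {response}\n"
            "Select the option that best describes the response."
        )
        return self.judge(prompt)


@dataclass
class PairwiseComparisonEvaluator:
    """Pairwise assessment: judge which of two outputs better satisfies a criterion."""
    criteria: Criteria
    judge: Callable[[str], str]

    def evaluate(self, context: str, response_a: str, response_b: str) -> str:
        prompt = (
            f"Criterion: {self.criteria.name} - {self.criteria.description}\n"
            f"Context: {context}\nResponse A: {response_a}\nResponse B: {response_b}\n"
            "Answer 'A' or 'B' for the response that better satisfies the criterion."
        )
        return self.judge(prompt)


if __name__ == "__main__":
    # Stand-in judge; in practice this would call an evaluator LLM
    # (e.g., a general-purpose judge, or Granite Guardian for harms and risks).
    def dummy_judge(prompt: str) -> str:
        return "Yes"

    conciseness = Criteria(
        name="conciseness",
        description="The response conveys the answer without unnecessary detail.",
    )
    evaluator = DirectAssessmentEvaluator(criteria=conciseness, judge=dummy_judge)
    print(evaluator.evaluate(context="What is 2+2?", response="4"))
```

In this sketch, only the criterion changes between evaluation tasks; the evaluator logic and the choice of judge model are reused, which is the property that lets users focus on criteria definition and iterate on criteria in the user interface.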