NeurIPS Poster in Workshop: Statistical Frontiers in LLMs and Foundation Models

A STEP TOWARDS MIXTURE OF GRADER: STATISTICAL ANALYSIS OF EXISTING AUTOMATIC EVALUATION METRICS

Yun Joon Soh · Jishen Zhao


Sat 14 Dec, 12:00 p.m. to 12:45 p.m. PST

Abstract:

The explosion of open-source models and Question-Answering (QA) datasets underscores the importance of automated QA evaluation. We studied the statistics of existing evaluation metrics to better understand their limitations. By measuring the correlation coefficient of each evaluation metric with respect to a human-like evaluation score, we observed the following: (1) existing metrics correlate highly with one another with respect to the question type (e.g., single word, single phrase, etc.), and (2) no single metric adequately estimates the human-like evaluation. As a potential solution, we discuss how a Mixture Of Grader could improve the quality of automatic QA evaluators.
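
The kind of analysis the abstract describes could be sketched roughly as follows. This is only an illustration, not the authors' code: it assumes Pearson correlation (the abstract does not say which coefficient was used), and the record layout, the metric names `exact_match` and `f1`, and every score value are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant scores
    return cov / sqrt(var_x * var_y)


def correlations_by_question_type(records, metric_names):
    """Correlate each metric with the human-like score, per question type.

    `records` is a list of dicts with the hypothetical keys
    'question_type', 'human', and one key per metric name.
    """
    grouped = {}
    for record in records:
        grouped.setdefault(record["question_type"], []).append(record)

    result = {}
    for question_type, rows in grouped.items():
        human = [row["human"] for row in rows]
        result[question_type] = {
            name: pearson([row[name] for row in rows], human)
            for name in metric_names
        }
    return result


# Made-up values for illustration only; these are not results from the paper.
toy_records = [
    {"question_type": "single word", "human": 1.0, "exact_match": 1.0, "f1": 1.0},
    {"question_type": "single word", "human": 0.0, "exact_match": 0.0, "f1": 0.2},
    {"question_type": "single word", "human": 1.0, "exact_match": 0.0, "f1": 0.7},
    {"question_type": "single phrase", "human": 0.8, "exact_match": 0.0, "f1": 0.6},
    {"question_type": "single phrase", "human": 0.2, "exact_match": 0.0, "f1": 0.5},
    {"question_type": "single phrase", "human": 1.0, "exact_match": 1.0, "f1": 1.0},
]

table = correlations_by_question_type(toy_records, ["exact_match", "f1"])
for question_type, scores in table.items():
    print(question_type, scores)
```

Grouping by question type before correlating reflects the abstract's first observation; a per-type table like this is one way to see whether any single metric tracks the human-like score across all types.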
