Skip to yearly menu bar Skip to main content


Poster

StackEval: Benchmarking LLMs in Coding Assistance

Zulkuf Genc · Nidhish Shah · Dogu Araci

West Ballroom A-D #5109
[ ] [ Project Page ]
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

We present a comprehensive set of benchmarks to evaluate the performance of Large Language Models (LLMs) in coding assistance tasks, covering code writing, debugging, code review, and answering conceptual questions. Our main contribution includes three curated benchmarks: a coding assistance (StackEval) benchmark with 925 Stack Overflow questions, a recent coding assistance (StackEval-Recent) benchmark with 300 questions from the most recent Stack Overflow content, and an LLM-as-a-Judge benchmark featuring 136 LLM-generated answers validated by domain experts. These benchmarks offer novel insights into the capabilities and limitations of LLMs, particularly in handling new and emerging content. To ensure reproducibility and ongoing relevance, we publicly share our datasets and evaluation code, with plans to update the recent dataset biannually. Our findings underscore the potential of these benchmarks to advance LLM development and application in coding assistance.

Chat is not available.