NeurIPS Poster EffiBench: Benchmarking the Efficiency of Automatically Generated Code

Poster

EffiBench: Benchmarking the Efficiency of Automatically Generated Code

Dong HUANG · Yuhao QING · Weiyi Shang · Heming Cui · Jie Zhang

West Ballroom A-D #7000

[ Abstract ]

[ Paper]

Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, a vital aspect that plays a pivotal role in greencomputing and sustainability efforts — the efficiency of the generated code — has often been neglected. This paper presents Effibench, a benchmark with 1,000 efficiency-critical coding problems to assess the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution, which obtains the SOTA efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of the code generated by LLMs is generally worse than the efficiency of human-written canonical solutions. For example, GPT-4 generated code has an average \textbf{3.12} times execution time that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4 code are \textbf{13.89} and \textbf{43.92} times that of the canonical solutions. The source code of EffiBench is released on https://github.com/huangd1999/EffiBench. We also provide the LeaderBoard in https://huggingface.co/spaces/EffiBench/effibench-leaderboard.

Chat is not available.