Poster
in
Workshop: Generative AI for Education (GAIED): Advances, Opportunities, and Challenges
Paper 46: Improving the Coverage of GPT for Automated Feedback on High School Programming Assignments
Shubham Sahai · Umair Ahmed · Ben Leong
Keywords: [ hints ] [ CS1 ] [ GPT ] [ programming ] [ feedback ] [ assignment ] [ LLM ] [ repair ]
Feedback on incorrect code is important for novice learners of programming. Automated Program Repair (APR) tools have previously been applied to generate feedback for mistakes made in introductory programming classes. Large Language Models (LLMs) have emerged as an attractive alternative for automatic feedback generation, since they have been shown to excel at generating both human-readable text and code. In this paper, we compare the effectiveness of LLMs with APR techniques for code repair and feedback generation in the context of high school Python programming assignments, evaluating both on a diverse dataset of 366 incorrect submissions to 69 problems of varying complexity from a public high school. We show that LLMs are more effective at generating repairs than APR techniques when provided with a good evaluation oracle. While state-of-the-art GPT models are able to generate feedback for buggy code most of the time, direct invocation of such LLMs still suffers from some shortcomings. In particular, GPT-4 can fail to detect up to 11% of the bugs, gives invalid feedback around 7% of the time, and hallucinates about 4% of the time. We show that a new architecture that invokes GPT in a conversational interactive loop can improve the repair coverage of GPT-3.5T from 64.8% to 74.9%, on par with the performance of the state-of-the-art GPT-4. With the same methodology, the coverage of GPT-4 can be further improved from 74.9% to 88.5% within 5 iterations.
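
As a rough illustration of the conversational interactive loop described in the abstract, the Python sketch below shows one plausible shape for such an architecture: ask the model for a repair, validate the candidate against an evaluation oracle (e.g. an instructor test suite), and feed concrete failures back into the conversation for up to 5 iterations. The names `llm_complete`, `run_tests`, and the failure-record fields are assumptions for illustration, not the authors' implementation.

```python
# Sketch of a conversational repair loop with an evaluation oracle.
# `llm_complete` and `run_tests` are hypothetical callables supplied by the caller.

MAX_ITERATIONS = 5  # the abstract reports coverage gains within 5 iterations


def repair_with_feedback_loop(problem_statement, buggy_code, test_cases,
                              llm_complete, run_tests):
    """Iteratively ask the LLM for a repair, validate it against the test
    oracle, and feed failing cases back into the conversation."""
    conversation = [
        {"role": "system",
         "content": "You fix buggy Python solutions to programming assignments."},
        {"role": "user",
         "content": f"Problem:\n{problem_statement}\n\n"
                    f"Buggy code:\n{buggy_code}\n\n"
                    "Return a corrected program."},
    ]
    for _ in range(MAX_ITERATIONS):
        candidate = llm_complete(conversation)       # model proposes a repair
        failures = run_tests(candidate, test_cases)  # evaluation oracle
        if not failures:
            return candidate                         # all tests pass: repair accepted
        # Continue the conversation with the concrete failing cases.
        conversation.append({"role": "assistant", "content": candidate})
        conversation.append({
            "role": "user",
            "content": "The repair still fails these test cases:\n"
                       + "\n".join(f"input={f.input!r} expected={f.expected!r} "
                                   f"got={f.actual!r}" for f in failures)
                       + "\nPlease revise the program.",
        })
    return None  # no valid repair found within the iteration budget
```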