Poster in Workshop: Workshop on Behavioral Machine Learning
Comparing Human and LLM Ratings of Music-Recommendation Quality with User Context
Sherol Chen · Yuri Vasilevski · Andrew Lampinen · Amnah Ahmad · Ndaba Ndebele · Sally Goldman · Michael Mozer · Jie Ren
When developing new recommendation algorithms, it is common to run user studies to obtain initial evaluation metrics. We explore using Large Language Models (LLMs) as a proxy for human ratings (an AutoRater) in this setting. In particular, we examine how effectively an LLM with basic Chain-of-Thought (CoT) prompting can predict human ratings, and how well it can leverage knowledge of a user's music likes and dislikes. We ran a study in which paid users provided queries, rated the resulting playlists, and reported their music likes and dislikes. We compare the ratings from the AutoRater system to those of a human rater baseline: a group of participants asked to perform the same task of rating the playlist recommendations, either with or without user context. We found the AutoRater to be as effective as, if not more effective than, human raters, and that providing user context leads to higher correlations for both. The correlations also increase with the size of the rater pool. These results were statistically reliable. Although numerically higher correlations are obtained with 8 songs than with 4 songs, the difference is not statistically reliable. Interestingly, the interaction between rater type and number of songs is statistically reliable, reflecting that human raters perform as well as or better than the AutoRater for 4-song playlists, whereas the pattern flips for 8-song playlists, where the AutoRater can use broad knowledge of music to make more nuanced ratings.
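To make the setup concrete, below is a minimal sketch of how a CoT AutoRater with optional user context, and the comparison of its ratings against users' own ratings, might be wired up. It is not the authors' implementation: the 1-5 rating scale, the `call_llm` stub, the prompt wording, and the use of Pearson correlation are all illustrative assumptions, not details taken from the poster.

```python
# Illustrative sketch only: a chain-of-thought AutoRater prompt with optional
# user context, and a correlation check against the users' own ratings.
# All specifics (rating scale, prompt text, correlation choice) are assumptions.
from typing import Optional
import re
import numpy as np


def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call; replace with your model's API."""
    raise NotImplementedError("Wire up your LLM client here.")


def build_prompt(query: str, playlist: list[str],
                 likes: Optional[list[str]] = None,
                 dislikes: Optional[list[str]] = None) -> str:
    """Compose a chain-of-thought rating prompt, optionally with user context."""
    lines = [f'A user asked for a playlist with the query: "{query}".',
             "Recommended playlist:"]
    lines += [f"  {i + 1}. {song}" for i, song in enumerate(playlist)]
    if likes or dislikes:
        lines.append("User context:")
        if likes:
            lines.append("  Likes: " + ", ".join(likes))
        if dislikes:
            lines.append("  Dislikes: " + ", ".join(dislikes))
    lines.append("Think step by step about how well the playlist satisfies the "
                 "query (and the user's tastes, if given), then answer with "
                 "'Rating: <1-5>'.")
    return "\n".join(lines)


def parse_rating(response: str) -> Optional[int]:
    """Extract the final 1-5 rating from the model's response."""
    match = re.search(r"Rating:\s*([1-5])", response)
    return int(match.group(1)) if match else None


def autorater_score(query: str, playlist: list[str],
                    likes: Optional[list[str]] = None,
                    dislikes: Optional[list[str]] = None) -> Optional[int]:
    """Rate one playlist with the AutoRater, with or without user context."""
    return parse_rating(call_llm(build_prompt(query, playlist, likes, dislikes)))


def rating_correlation(autorater_ratings: list[float],
                       user_ratings: list[float]) -> float:
    """Pearson correlation between AutoRater ratings and the users' own ratings."""
    return float(np.corrcoef(autorater_ratings, user_ratings)[0, 1])
```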