Poster
in
Workshop: Workshop on Open-World Agents: Synnergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)
Fine-Tuning Web Agents: It Works, But It's Trickier Than You Think
Massimo Caccia · Megh Thakkar · Léo Boisvert · Thibault de Chezelles · Alexandre Piche · Nicolas Chapados · Alexandre Drouin · Maxime Gasse · Alexandre Lacoste
Keywords: [ LLM ] [ fine-tuning ] [ web agents ] [ behavior cloning ]
Recent advancements in large language models (LLMs) have sparked interest in developing autonomous web agents capable of performing digital tasks through web interfaces in a human-like manner. However, even the strongest closed-source models often struggle to achieve robust results on several benchmarks, while a notable performance gap exists between them and open-source counterparts. This study investigates the potential of fine-tuning to enhance the performance of a smaller, lower-performing but cost-efficient LLM by leveraging successful traces from stronger LLMs, referred to as experts. We outline a comprehensive pipeline for data collection, filtering, and supervised fine-tuning and explore various behavior cloning parameters. Our experiments provide key insights into the challenges of fine-tuning LLMs into web agents on benchmarks like MiniWoB and WorkArena. Notably, we find that the fine-tuned agents' ability to predict expert trajectories does not consistently lead to improved downstream task performance. This raises issues such as off-policy bias and the loss of reasoning abilities during fine-tuning. We discuss potential solutions to these challenges and make both the codebase and a dataset of 140M tokens open-source for the community to build upon.