
Workshop: Workshop on Open-World Agents: Synergizing Reasoning and Decision-Making in Open-World Environments (OWA-2024)

Policy optimization to align the validity, coherence and efficiency of reasoning agents in multi-turn dialogues

Jeremy Curuksu

Keywords: [ multi-turn dialogues ] [ generative retrieval ] [ reasoning-acting agents ] [ large language models ] [ policy optimization ] [ fine-tuning ]


Reinforcement learning from human preferences can fine-tune language models for helpfulness and safety, but it does not directly address the fidelity and efficiency of reasoning agents in multi-turn dialogues. I propose a method to improve the validity, coherence and efficiency of reasoning agents by defining a reward model as a mapping between predefined queries and tools, which can be applied to any custom orchestration environment. The reward model is used for policy optimization to fine-tune the clarification fallback behavior, helping the agent learn when to ask for clarification in multi-turn dialogues. The method is demonstrated in several orchestration environments: after fine-tuning with either proximal policy optimization or verbal reinforcement, the new policy systematically identifies the correct intents and tools in fewer than 2 steps in over 99% of all sampled dialogues.
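
To make the idea of a reward model defined as a query-to-tool mapping more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation behind the results above: the example queries, the tool names, the `ask_clarification` action, the `step_penalty` value and the reward magnitudes are all hypothetical. The sketch rewards a dialogue whose final action is the tool mapped to the query, and subtracts a small cost for every extra turn, so that clarification is only worthwhile when it leads to the correct tool.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping from predefined queries to the tool expected to serve them.
QUERY_TO_TOOL: Dict[str, str] = {
    "what is the weather in Paris?": "weather_api",
    "book a table for two": "reservation_api",
}

# Hypothetical name for the clarification fallback action.
CLARIFY = "ask_clarification"


@dataclass
class Turn:
    """One agent step: either a tool name or the clarification action."""
    action: str


def reward(query: str, turns: List[Turn], step_penalty: float = 0.1) -> float:
    """Score a dialogue against the query-to-tool mapping.

    +1 if the final action is the mapped tool, -1 otherwise,
    minus step_penalty for each turn beyond the first.
    Queries outside the mapping, or empty dialogues, score 0.
    """
    expected = QUERY_TO_TOOL.get(query)
    if expected is None or not turns:
        return 0.0
    base = 1.0 if turns[-1].action == expected else -1.0
    return base - step_penalty * (len(turns) - 1)
```

A scalar of this kind could then serve as the reward signal for a policy optimization method such as proximal policy optimization. For instance, `reward("book a table for two", [Turn(CLARIFY), Turn("reservation_api")])` scores slightly lower than resolving the same query in a single step, which is the pressure toward short dialogues that the abstract describes.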
