Agentic-o1: Teaching AI Agents to Search with
Reflective-MCTS and Exploratory Learning

Columbia University, NY; Microsoft Research, Redmond

Compute scaling of GPT-4o with R-MCTS (left, training time) and of fine-tuned GPT-4o (right, test time). The left panel is evaluated on all 234 Classifieds tasks in VisualWebArena using our GPT-4o-based R-MCTS agent; the right panel is evaluated on 169 unseen Classifieds tasks using our fine-tuned GPT-4o without any search algorithm.

Introduction

We present R-MCTS and Exploratory Learning for building o1-like models for agentic applications. Our R-MCTS agent extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Exploratory Learning is a novel learning strategy that trains the model to explore the environment, evaluate states, and backtrack to viable ones when it detects that the current state cannot lead to success. Our GPT-4o-powered R-MCTS agent achieves state-of-the-art (SOTA) performance on VisualWebArena. Notably, R-MCTS and Exploratory Learning demonstrate compute scaling properties at both training and test time.

R-MCTS Agent

We propose Reflective Monte Carlo Tree Search (R-MCTS), an extension of classic MCTS that improves the agent's decision-making process on the fly by incorporating reflection over its past task executions and state evaluation via multi-agent debate. Specifically, R-MCTS iteratively improves the search process by 1) using contrastive reflection to identify past mistakes and successes, and 2) updating the agent's policy and value functions in-context to improve future task executions during inference. This improvement step helps R-MCTS avoid repeating the same mistakes and explore more promising actions when faced with similar situations in the future. We illustrate this inference-improvement loop below.



Overview of an R-MCTS Agent. We omit value function reflection for brevity.
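
To make the loop above concrete, here is a minimal sketch of how retrieved reflections might be injected into the policy prompt during search, and how new contrastive reflections are stored after a task. All names here (`ReflectionMemory`, `propose_actions`, the prompt strings, the `llm` wrapper) are hypothetical illustrations under assumed interfaces, not the paper's actual implementation.

```python
# Hypothetical sketch of the reflection side of R-MCTS: retrieve past lessons
# for in-context improvement during search, then store new ones after the task.
# `llm(prompt) -> str` is an assumed chat-completion wrapper.

class ReflectionMemory:
    """Stores reflections keyed by the state they were learned from."""
    def __init__(self):
        self.entries = []                          # list of (state, reflection)

    def add(self, state, reflection):
        self.entries.append((state, reflection))

    def retrieve(self, state, k=2):
        # Simplification: return the k most recent reflections. In practice,
        # retrieval would select reflections from *similar* past states
        # (e.g. via embedding nearest neighbors).
        return [r for _, r in self.entries[-k:]]

def propose_actions(llm, task, state, reflections):
    """Policy call: reflections are prepended to the prompt, so the agent can
    avoid past mistakes and reuse successes without any weight updates."""
    prompt = ("Lessons from similar past states:\n" + "\n".join(reflections) +
              f"\n\nTask: {task}\nCurrent state: {state}\n"
              "Propose the next candidate actions.")
    return llm(prompt)

def reflect_after_task(llm, task, contrast_pairs, memory):
    """Contrastive reflection: given (state, successful branch, failed branch)
    triples mined from the finished search tree, store one-sentence lessons."""
    for state, good, bad in contrast_pairs:
        lesson = llm(f"Task: {task}\nAt state: {state}\n"
                     f"This worked: {good}\nThis failed: {bad}\n"
                     "State the lesson in one sentence.")
        memory.add(state, lesson)
```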


In our experiments (see table below), we find that:

  • R-MCTS sets a new state-of-the-art across all VisualWebArena environments, achieving 6%-30% improvements over the previous best-performing method, Search Agent.
  • MCTS outperforms other search methods by effectively balancing exploration and exploitation via UCT (see the sketch after this list), while the other methods rely primarily on value-based guidance.
  • R-MCTS improves over MCTS by an average of 2.2 points using a single-agent value function.
  • Multi-agent debate integration boosts performance, adding another 1.6 point improvement on average for R-MCTS.
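
For reference, the UCT rule mentioned above scores each child by its average value (exploitation) plus a bonus that shrinks with visit count (exploration). A minimal, generic implementation of the selection step (not the paper's code) looks like this:

```python
import math

def uct_select(children, parent_visits, c=1.0):
    """Pick the child maximizing Q + c * sqrt(ln(N_parent) / N_child).
    `children` are objects with `.value_sum` and `.visits`; `parent_visits` >= 1."""
    def score(child):
        if child.visits == 0:
            return float("inf")            # always try unvisited children first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(parent_visits) / child.visits)
        return exploit + explore
    return max(children, key=score)
```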


Method        Search      Value  Classifieds      Reddit           Shopping
                                 Tokens  Success  Tokens  Success  Tokens  Success
ReACT         -           SA     1x      28.6%    1x      13.8%    1x      23.2%
ToT           BFS         SA     3.2x    30.7%    4.7x    19.5%    4.3x    29.2%
ToT           DFS         SA     3.1x    30.3%    4.2x    16.7%    4.0x    28.3%
Search Agent  Best-First  SA     4.2x    33.8%    5.4x    21.9%    5.1x    30.3%
MCTS          MCTS        SA     7.2x    37.6%    9.5x    23.8%    8.9x    29.4%
R-MCTS        MCTS        SA     7.3x    40.2%    7.3x    25.2%    7.6x    31.9%
R-MCTS        MCTS        MAD    7.4x    41.0%    9.7x    28.7%    10.1x   32.3%

Comparison of each agent's token consumption and performance on VisualWebArena. Token counts are the average per task, reported relative to the ReACT baseline (1x). SA stands for single-agent; MAD stands for multi-agent debate.
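
The MAD rows above use multi-agent debate for state evaluation. One way such a value function could be implemented is sketched below: evaluators argue for opposing assessments of the current state and a judge call aggregates the debate into a scalar value. The prompts and the `llm(system, user) -> str` wrapper are assumptions for illustration, not the paper's exact setup.

```python
def debate_value(llm, task, state, n_rounds=2):
    """Estimate V(state) in [0, 1] via multi-agent debate (hypothetical prompts)."""
    transcript = []
    for _ in range(n_rounds):
        pro = llm("You argue the agent is on track to finish the task.",
                  f"Task: {task}\nState: {state}\nDebate so far: {transcript}")
        con = llm("You argue the agent is NOT on track to finish the task.",
                  f"Task: {task}\nState: {state}\nDebate so far: {transcript}")
        transcript += [("pro", pro), ("con", con)]
    verdict = llm("You are the judge. Output only a number between 0 and 1 "
                  "estimating the probability of task success from this state.",
                  f"Task: {task}\nState: {state}\nDebate: {transcript}")
    try:
        return max(0.0, min(1.0, float(verdict.strip())))
    except ValueError:
        return 0.5   # fall back to a neutral value if the judge output is malformed
```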

Exploratory Learning

We also explore how to teach agents to search and enable test-time compute scaling without relying on MCTS. To accomplish this, we introduce Exploratory Learning, which trains the agent to explore the environment, evaluate states, and backtrack to viable states when it reaches unpromising ones. In our experiments on VisualWebArena, we find that:

  • Exploratory Learning boosts performance: GPT-4o fine-tuned with Exploratory Learning improves performance even without search, similar to the effect of scaling test-time compute with MCTS (see demos (c) and (d) below).
  • Exploratory Learning enables test-time compute scaling: the fine-tuned GPT-4o performs better when allowed more actions per task, improving decision-making and task completion.
  • Exploratory Learning improves generalization to unseen tasks: the fine-tuned GPT-4o performs better on unseen tasks than both the untrained model and a model trained with imitation learning on the best actions.


Given a task and a trajectory from R-MCTS, (traditional) imitation learning removes the intermediate search trees and directly trains GPT-4o on the final executed actions; Exploratory Learning instead flattens the tree traversal into a single trajectory and trains GPT-4o to explore, evaluate, and backtrack.
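
As an illustration of the difference, the sketch below flattens a toy search-tree traversal into one training trajectory with explicit evaluation and backtrack steps. The input tuple format and the action strings are hypothetical stand-ins, not the paper's actual trajectory format.

```python
# Illustrative sketch (hypothetical data format): flatten a search-tree
# traversal into one training trajectory with explicit evaluate/backtrack steps.

def flatten_for_exploratory_learning(traversal):
    """`traversal` lists states in the order the search visited them, as
    (observation, action, value_estimate, is_backtrack) tuples. Imitation
    learning would keep only the final executed actions; here every visited
    state becomes a training step."""
    steps = []
    for observation, action, value, is_backtrack in traversal:
        steps.append({"observation": observation,
                      "thought": f"Estimated value of this state: {value:.2f}"})
        if is_backtrack:
            # Teach the model to explicitly abandon unpromising states and
            # return to a viable earlier state, instead of hiding the detour.
            steps.append({"action": "go back to the previous promising state"})
        else:
            steps.append({"action": action})
    return steps
```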

Demo

BibTeX

@misc{yu2024improvingautonomousaiagents,
    title={Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning}, 
    author={Xiao Yu and Baolin Peng and Vineeth Vajipey and Hao Cheng and Michel Galley and Jianfeng Gao and Zhou Yu},
    year={2024},
    eprint={2410.02052},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2410.02052}, 
}