Language Technologies Institute, Carnegie Mellon University
Agentic search leverages LLMs to solve complex user information needs by executing a multi-step process of planning, searching, and synthesizing information to provide answers. This paradigm introduces unique challenges for LLMs' agentic reasoning capabilities when interacting with search systems. In this paper, we design a reasoning-driven LLM-based pipeline to study effective behaviors in agentic search. Using this pipeline, we analyze successful agentic search trajectories and extract four beneficial reasoning behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. We then propose a new behavior priming technique to train more effective agentic search models. It synthesizes agentic search trajectories exhibiting the target behaviors and integrates them into the agentic search model through supervised finetuning (SFT), followed by standard reinforcement learning (RL). Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct across three web benchmarks and seven multi-hop QA benchmarks demonstrate that behavior priming 1) yields significant performance gains compared to training with direct RL, and 2) outperforms other SFT-then-RL baselines, such as SFT on randomly selected trajectories or on trajectories selected merely for correct outcomes. We also demonstrate that the reasoning behaviors, rather than the correctness of the final answer, are the critical factor for achieving strong performance in RL: SFT on trajectories with reasoning behaviors but incorrect answers leads to performance comparable to SFT on trajectories with both reasoning behaviors and correct answers. Our analysis further reveals that the introduced reasoning behaviors endow models with more effective exploration (higher pass@k and entropy) and test-time scaling (longer trajectories) capabilities, providing a strong foundation for RL.
- **Information Verification**: Validating results across multiple sources and cross-checking information to ensure accuracy.
- **Authority Evaluation**: Assessing source reliability and resolving conflicts by weighting reputable sources.
- **Adaptive Search**: Modifying search strategies dynamically based on intermediate results and feedback.
- **Error Recovery**: Detecting and correcting mistakes through strategic pivots and backtracking.
*We select all SFT trajectories from the same trajectory corpus generated by Gemini-2.5-Flash.
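To make the behavior priming recipe above concrete, here is a minimal sketch of the data selection step, written as illustrative Python. The JSONL trajectory format, the `llm_judge` callable, and the two-behavior keep threshold are our own assumptions for illustration; the paper's actual trajectory synthesis and training code may differ.

```python
"""Minimal sketch of behavior-priming data selection, assuming a trajectory
corpus stored as JSONL and an external LLM judge. The judge call and the
SFT/RL entry points are hypothetical stand-ins, not the paper's code."""
import json
from typing import Callable

TARGET_BEHAVIORS = [
    "Information Verification",  # cross-checking results across sources
    "Authority Evaluation",      # weighting reputable sources on conflict
    "Adaptive Search",           # revising queries based on feedback
    "Error Recovery",            # detecting mistakes and backtracking
]

def exhibits_behaviors(trajectory: str, llm_judge: Callable[[str], str]) -> bool:
    """Ask an LLM judge which target behaviors a trajectory exhibits.

    `llm_judge` is assumed to take a prompt and return the judge's text
    response; we keep a trajectory if it shows at least two of the four
    behaviors (the exact threshold is an assumption).
    """
    prompt = (
        "List which of the following reasoning behaviors appear in the "
        f"agentic search trajectory below: {', '.join(TARGET_BEHAVIORS)}.\n\n"
        f"Trajectory:\n{trajectory}"
    )
    verdict = llm_judge(prompt)
    hits = sum(behavior in verdict for behavior in TARGET_BEHAVIORS)
    return hits >= 2

def build_priming_set(corpus_path: str, out_path: str,
                      llm_judge: Callable[[str], str]) -> int:
    """Filter a JSONL trajectory corpus down to behavior-rich examples.

    Note that correctness of the final answer is NOT a filtering criterion;
    per the paper's finding, the behaviors themselves are what matter.
    """
    kept = 0
    with open(corpus_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            record = json.loads(line)  # expects a "trajectory" field
            if exhibits_behaviors(record["trajectory"], llm_judge):
                fout.write(json.dumps(record) + "\n")
                kept += 1
    return kept

# The selected file is then used for SFT (behavior priming), followed by
# standard RL on the agentic search task with a training stack of your choice.
```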
Results on three web agent benchmarks (after RL training). An asterisk (*) marks a model for which we identified evaluation hacking; details are provided in the paper's Appendix.
| Method | Base Model | GAIA L1 | GAIA L2 | GAIA L3 | GAIA Avg. | WebWalkerQA | HLE | Overall |
|---|---|---|---|---|---|---|---|---|
| Direct RL (No SFT) | Qwen3-1.7B | 15.4 | 11.5 | 0.0 | 11.7 | 26.1 | 3.9 | 13.9 |
| SFT (Random) + RL | Qwen3-1.7B | 18.0 | 11.5 | 16.7 | 14.6 | 33.2 | 7.4 | 18.4 |
| SFT (Correct) + RL | Qwen3-1.7B | 23.1 | 17.3 | 0.0 | 17.5 | 36.8 | 5.8 | 20.0 |
| Behavior Prime + RL | Qwen3-1.7B | 28.2 | 21.2 | 0.0 | 21.4 | 37.2 | 7.8 | 22.3 |
| Direct RL (No SFT) | Llama3.2-3B-Instruct | 12.8 | 11.5 | 8.3 | 11.7 | 24.7 | 7.0 | 14.5 |
| SFT (Correct) + RL | Llama3.2-3B-Instruct* | 12.8 | 9.6 | 8.3 | 10.8 | 18.1 | 4.6 | 11.2 |
| SFT (Random) + RL | Llama3.2-3B-Instruct | 15.4 | 21.5 | 16.7 | 18.4 | 26.5 | 6.6 | 17.2 |
| Behavior Prime + RL | Llama3.2-3B-Instruct | 25.6 | 13.4 | 16.7 | 18.4 | 33.8 | 7.5 | 19.9 |
Results on seven multi-hop question answering benchmarks (after RL training):
| Method | Base Model | 2Wiki | Bamboogle | HotpotQA | MuSiQue | NQ | PopQA | TriviaQA | Overall |
|---|---|---|---|---|---|---|---|---|---|
| **Other Work Baselines** | | | | | | | | | |
| Search-R1-base | Qwen2.5-7B-Base | 47.9 | 57.6 | 63.0 | 27.5 | 60.0 | 47.0 | 76.2 | 54.2 |
| Search-R1-instruct | Qwen2.5-7B-Instruct | 48.8 | 47.2 | 52.5 | 28.3 | 49.6 | 44.5 | 49.2 | 45.7 |
| R1-Searcher | Qwen2.5-7B-Base | 65.8 | 65.6 | 53.1 | 25.6 | 52.3 | 43.4 | 79.1 | 55.7 |
| DeepResearcher | Qwen2.5-7B-Instruct | 66.6 | 72.8 | 64.3 | 29.3 | 61.9 | 52.7 | 85.0 | 61.8 |
| **Our Methods** | | | | | | | | | |
| Direct RL (No SFT) | Qwen3-1.7B | 66.8 | 64.0 | 61.7 | 25.0 | 72.7 | 53.5 | 87.9 | 61.7 |
| SFT (Random) + RL | Qwen3-1.7B | 65.6 | 74.4 | 64.8 | 26.9 | 74.0 | 56.1 | 85.7 | 63.9 |
| SFT (Correct) + RL | Qwen3-1.7B | 70.3 | 68.8 | 65.0 | 29.7 | 72.7 | 54.7 | 86.3 | 63.9 |
| Behavior Prime + RL | Qwen3-1.7B | 73.8 | 70.4 | 67.0 | 30.7 | 73.6 | 56.6 | 86.9 | 65.5 |
| Direct RL (No SFT) | Llama3.2-3B-Instruct | 74.0 | 72.0 | 65.8 | 31.5 | 74.8 | 56.4 | 90.4 | 66.4 |
| SFT (Random) + RL | Llama3.2-3B-Instruct | 77.7 | 76.0 | 71.3 | 32.0 | 78.9 | 55.7 | 89.3 | 68.7 |
| SFT (Correct) + RL | Llama3.2-3B-Instruct | 69.9 | 72.0 | 69.9 | 30.5 | 77.3 | 60.0 | 89.5 | 67.0 |
| Behavior Prime + RL | Llama3.2-3B-Instruct | 75.6 | 80.0 | 69.5 | 38.1 | 82.0 | 60.4 | 90.6 | 70.9 |
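For readers interested in the exploration analysis mentioned in the abstract, pass@k is commonly computed with the standard unbiased estimator. The sketch below is our own illustration of that estimator, not code from the paper, and the sample counts in the example are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n independent generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 sampled trajectories per question, 5 correct, estimate pass@4.
print(round(pass_at_k(n=16, c=5, k=4), 3))
```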
If you find this work helpful, please cite: