Beneficial Reasoning Behaviors in Agentic Search and Effective Post-Training to Obtain Them

Jiahe Jin, Abhijay Paladugu, Chenyan Xiong

Language Technologies Institute, Carnegie Mellon University

Abstract

Agentic search leverages LLMs to solve complex user information needs by executing a multi-step process of planning, searching, and synthesizing information to provide answers. This paradigm introduces unique challenges for LLMs' agentic reasoning capabilities when interacting with search systems. In this paper, we design a reasoning-driven LLM-based pipeline to study effective behaviors in agentic search. Using this pipeline, we analyze successful agentic search trajectories and extract four beneficial reasoning behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. We then propose a new behavior priming technique to train more effective agentic search models. It synthesizes agentic search trajectories that exhibit the target behaviors and integrates them into the agentic search model through supervised finetuning (SFT), followed by standard reinforcement learning (RL). Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct across three web benchmarks and seven multi-hop QA benchmarks demonstrate that behavior priming 1) yields significant performance gains compared to training with direct RL, and 2) outperforms other SFT-then-RL baselines, such as those that apply SFT to randomly selected trajectories or to trajectories with merely correct outcomes. We also demonstrate that the reasoning behaviors, rather than the correctness of the final answer, are the critical factor for achieving strong performance in RL: SFT on trajectories that exhibit the reasoning behaviors but end in incorrect answers yields performance comparable to SFT on trajectories with both the reasoning behaviors and correct answers. Our analysis further reveals that the introduced reasoning behaviors endow models with more effective exploration (higher pass@k and entropy) and test-time scaling (longer trajectories) capabilities, providing a strong foundation for RL.

Four Beneficial Reasoning Behaviors

Information Verification

Validating results across multiple sources and cross-checking information to ensure accuracy.

Example: "My task is clear: verify if the quoted text exactly matches Greetham's article. Accuracy is paramount; I'll use 'uncoupled', 'authors', 'mis-transmission', and 'veil' to zero in on the relevant section."

Authority Evaluation

Assessing source reliability and resolving conflicts by weighting reputable sources.

Example: "I'm aiming for the USGS's own reports or databases, like the 'Nonindigenous Aquatic Species' page, to get the most reliable data."

Adaptive Search

Modifying search strategies dynamically based on intermediate results and feedback.

Example: "The search engine may not have indexed the quote perfectly, or the user's quote may differ slightly. I'll refine my strategy."

Error Recovery

Detecting and correcting mistakes through strategic pivots and backtracking.

Example: "My initial search didn't deliver the goods... I need to get more precise."

Method Overview

Behavior Identification

  1. Trajectory Generation: Generate diverse agentic search trajectories with different LLMs, and construct pairs of successful and failed trajectories from different LLMs on the same tasks.
  2. Trajectory Analysis: Use an LLM-based pipeline to analyze the successful/failed trajectory pairs → extract beneficial behaviors from the analysis → merge and consolidate common behavior patterns (a minimal sketch follows this list).
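A minimal sketch of this contrast-and-consolidate analysis, under our own assumptions: call_llm is a placeholder for any chat-completion client, and the prompts and function names are illustrative rather than the paper's actual pipeline.

import json


def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API; swap in your own client.
    raise NotImplementedError


def contrast_pair(task: str, success: str, failure: str) -> list[str]:
    # Ask an LLM analyst what the successful trajectory did that the failed one did not.
    prompt = (
        "Task:\n" + task + "\n\n"
        "Successful trajectory:\n" + success + "\n\n"
        "Failed trajectory:\n" + failure + "\n\n"
        "Return a JSON array of short phrases naming the reasoning behaviors "
        "that plausibly explain why the first trajectory succeeded and the second failed."
    )
    return json.loads(call_llm(prompt))  # assumes the model returns valid JSON


def consolidate(behavior_mentions: list[str]) -> list[str]:
    # Merge near-duplicate behavior descriptions into a small set of recurring patterns.
    prompt = (
        "Cluster the following behavior descriptions into a few distinct, "
        "recurring patterns and name each one (one name per line):\n"
        + "\n".join(behavior_mentions)
    )
    return call_llm(prompt).splitlines()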

Behavior Priming

  1. Trajectory Corpus Generation: Generate a large corpus of agentic search trajectories using a powerful model (Gemini-2.5-Flash).
  2. Behavior Analysis: Use an LLM-based pipeline to analyze the presence of beneficial behaviors in each trajectory.
  3. SFT with Behavior Priming: Fine-tune models on trajectories exhibiting the four identified behaviors.
  4. RL Training: Apply standard reinforcement learning (GRPO) to train behavior-primed models (a minimal sketch of the data-selection and training flow follows this list).
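The sketch below illustrates this flow under our assumptions: each trajectory is a dict with "messages" (the agentic search rollout) and "behaviors" (labels from the behavior analysis step); the exact selection criterion, dataset size, and SFT/GRPO recipes are assumptions for illustration, not the paper's released configuration.

import json
import random

TARGET_BEHAVIORS = {"information_verification", "authority_evaluation",
                    "adaptive_search", "error_recovery"}


def select_for_priming(corpus: list[dict]) -> list[dict]:
    # Keep trajectories that exhibit at least one target behavior (assumed criterion).
    return [t for t in corpus if TARGET_BEHAVIORS & set(t.get("behaviors", []))]


def write_sft_file(trajectories: list[dict], path: str, max_examples: int = 10_000) -> None:
    # Dump the selected trajectories as JSONL for a standard SFT trainer.
    random.shuffle(trajectories)
    with open(path, "w") as f:
        for t in trajectories[:max_examples]:
            f.write(json.dumps({"messages": t["messages"]}) + "\n")


# After SFT on the file above (behavior priming), continue with standard
# GRPO-style RL on the agentic search task, rewarding final-answer correctness,
# to obtain the "Behavior Prime + RL" models.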

Main Results

Baseline Approaches

  • Direct RL (No SFT): Train the model with direct RL without any SFT.
  • SFT (Random) + RL: Train the model with SFT on randomly selected trajectories and then with RL.
  • SFT (Correct) + RL: Train the model with SFT on trajectories with correct answers and then with RL.

*All SFT trajectories are selected from the same trajectory corpus generated by Gemini-2.5-Flash. A sketch contrasting the three selection strategies follows.
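To make the comparison concrete, the baselines differ only in how SFT data is drawn from the shared corpus. This sketch reuses the assumed trajectory format from above; the field names "correct" and "behaviors" are ours, not the paper's.

import random


def select_random(corpus: list[dict], n: int) -> list[dict]:
    # SFT (Random): any n trajectories from the shared corpus.
    return random.sample(corpus, min(n, len(corpus)))


def select_correct(corpus: list[dict], n: int) -> list[dict]:
    # SFT (Correct): only trajectories whose final answer was judged correct.
    correct = [t for t in corpus if t["correct"]]
    return random.sample(correct, min(n, len(correct)))


def select_behavior(corpus: list[dict], n: int) -> list[dict]:
    # Behavior priming: only trajectories exhibiting the target reasoning behaviors.
    primed = [t for t in corpus if t["behaviors"]]
    return random.sample(primed, min(n, len(primed)))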

Performance on Web Agent Benchmarks

Results on three web agent benchmarks (after RL training):

Method | Base Model | GAIA (Level 1) | GAIA (Level 2) | GAIA (Level 3) | GAIA (Avg.) | WebWalkerQA | HLE | Overall
Direct RL (No SFT) | Qwen3-1.7B | 15.4 | 11.5 | 0.0 | 11.7 | 26.1 | 3.9 | 13.9
SFT (Random) + RL | Qwen3-1.7B | 18.0 | 11.5 | 16.7 | 14.6 | 33.2 | 7.4 | 18.4
SFT (Correct) + RL | Qwen3-1.7B | 23.1 | 17.3 | 0.0 | 17.5 | 36.8 | 5.8 | 20.0
Behavior Prime + RL | Qwen3-1.7B | 28.2 | 21.2 | 0.0 | 21.4 | 37.2 | 7.8 | 22.3
Direct RL (No SFT) | Llama3.2-3B-Instruct | 12.8 | 11.5 | 8.3 | 11.7 | 24.7 | 7.0 | 14.5
SFT (Correct) + RL | Llama3.2-3B-Instruct* | 12.8 | 9.6 | 8.3 | 10.8 | 18.1 | 4.6 | 11.2
SFT (Random) + RL | Llama3.2-3B-Instruct | 15.4 | 21.5 | 16.7 | 18.4 | 26.5 | 6.6 | 17.2
Behavior Prime + RL | Llama3.2-3B-Instruct | 25.6 | 13.4 | 16.7 | 18.4 | 33.8 | 7.5 | 19.9

*We identified evaluation hacking in this model. Details are provided in the paper's Appendix.

Performance on Multi-hop QA Benchmarks

Results on seven multi-hop question answering benchmarks (after RL training):

Method | Base Model | 2Wiki | Bamboogle | HotpotQA | MuSiQue | NQ | PopQA | TriviaQA | Overall

Other Work Baselines
Search-R1-base | Qwen2.5-7B-Base | 47.9 | 57.6 | 63.0 | 27.5 | 60.0 | 47.0 | 76.2 | 54.2
Search-R1-instruct | Qwen2.5-7B-Instruct | 48.8 | 47.2 | 52.5 | 28.3 | 49.6 | 44.5 | 49.2 | 45.7
R1-Searcher | Qwen2.5-7B-Base | 65.8 | 65.6 | 53.1 | 25.6 | 52.3 | 43.4 | 79.1 | 55.7
DeepResearcher | Qwen2.5-7B-Instruct | 66.6 | 72.8 | 64.3 | 29.3 | 61.9 | 52.7 | 85.0 | 61.8

Our Methods
Direct RL (No SFT) | Qwen3-1.7B | 66.8 | 64.0 | 61.7 | 25.0 | 72.7 | 53.5 | 87.9 | 61.7
SFT (Random) + RL | Qwen3-1.7B | 65.6 | 74.4 | 64.8 | 26.9 | 74.0 | 56.1 | 85.7 | 63.9
SFT (Correct) + RL | Qwen3-1.7B | 70.3 | 68.8 | 65.0 | 29.7 | 72.7 | 54.7 | 86.3 | 63.9
Behavior Prime + RL | Qwen3-1.7B | 73.8 | 70.4 | 67.0 | 30.7 | 73.6 | 56.6 | 86.9 | 65.5
Direct RL (No SFT) | Llama3.2-3B-Instruct | 74.0 | 72.0 | 65.8 | 31.5 | 74.8 | 56.4 | 90.4 | 66.4
SFT (Random) + RL | Llama3.2-3B-Instruct | 77.7 | 76.0 | 71.3 | 32.0 | 78.9 | 55.7 | 89.3 | 68.7
SFT (Correct) + RL | Llama3.2-3B-Instruct | 69.9 | 72.0 | 69.9 | 30.5 | 77.3 | 60.0 | 89.5 | 67.0
Behavior Prime + RL | Llama3.2-3B-Instruct | 75.6 | 80.0 | 69.5 | 38.1 | 82.0 | 60.4 | 90.6 | 70.9

Citation

If you find this work helpful, please cite:

@article{jin2025beneficial,
  title   = {Beneficial Reasoning Behaviors in Agentic Search and Effective Post-Training to Obtain Them},
  author  = {Jiahe Jin and Abhijay Paladugu and Chenyan Xiong},
  year    = {2025},
  journal = {arXiv preprint arXiv:2510.06534},
  url     = {https://arxiv.org/abs/2510.06534}
}