Leaderboard & Analysis · AgentWebBench

Leaderboard

Main results

Seven LLMs × four coordination settings across all tasks. Pick a task, filter by strategy or model, and sort any column. The per-column leader is marked ● best; runner-up ● 2nd. For Deep Research, KPC is lower-is-better.

Finding 1

Overall performance is modest. AgentWebBench is challenging under decentralized access: the best agent only slightly beats classic IR on web search, and web recommendation stays near-zero.

Finding 2

Website selection is a bottleneck. Tool_P often beats Tool_E on web search, since LLM reasoning picks more relevant sites than embedding similarity.

Finding 3

Content agents are less stable on retrieval-heavy tasks. Tool_P usually edges out Multi-Agent on web search; the gap fades on generative tasks.

Finding 4

Multi-Agent is promising but task-dependent. It scores lower on web search and deep research, but the gap shrinks for stronger models, and it beats Classical on QA.

Finding 5

Model scale helps consistently. Within Qwen3, performance rises with scale, especially on coordination-intensive QA.

Analysis

Understanding the paradigm beyond accuracy

Four lenses on how the Agentic Web behaves as a multi-agent ecosystem.

Systemic impact

Decentralized access concentrates web traffic

In the Classical setting, citations spread across many sources, from academic repositories to community forums. Under multi-agent coordination, citations concentrate on a small set of domains (e.g., wikipedia.org, sciencedirect.com).

Decentralized access acts as a strong filter: better planning repeatedly selects the most useful sources and ignores the rest. This improves reliability but reduces source diversity, and may make it harder for most content providers to be discovered.

Citation frequency of the 20 most-cited sources on deep research for Qwen3-4B and Gemini-3, Classical (yellow) vs. Multi-Agent (blue). Multi-Agent concentrates on a few high-coverage knowledge bases.
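One way to quantify the concentration described above is to measure how much of the citation mass falls on the top few domains, and how evenly citations are spread overall. The sketch below is illustrative only: the function names (`domain_of`, `top_k_share`, `normalized_entropy`) and the choice of metrics are assumptions, not the benchmark's own analysis code, and the URLs in the usage note are toy data, not benchmark results.

```python
from collections import Counter
from math import log
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def top_k_share(cited_urls, k=3):
    """Fraction of all citations that go to the k most-cited domains."""
    counts = Counter(domain_of(u) for u in cited_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


def normalized_entropy(cited_urls):
    """Shannon entropy of the domain distribution, scaled to [0, 1].

    1.0 means citations are spread evenly across domains; values near 0
    mean they concentrate on very few domains.
    """
    counts = Counter(domain_of(u) for u in cited_urls)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return entropy / log(len(counts))
```

For example, with toy data, `top_k_share(["https://www.wikipedia.org/x"] * 3 + ["https://a.org/1"], k=1)` gives 0.75. A rising top-k share together with a falling normalized entropy would indicate the kind of traffic concentration discussed here.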

Optimization

Test-time scaling narrows the gap

Thinking mode consistently improves performance at both model scales. Letting agents simulate and verify action sequences before execution yields more deliberate planning, which is crucial when an agent must choose which content agents to query and when to stop.

It also improves interaction reliability, reducing malformed requests: explicit reasoning acts as an internal check on protocol adherence.

1.0 → 26.8: N@3 on web search, Qwen3-4B (without → with thinking)

3.8 → 15.9: QA accuracy, Qwen3-4B (without → with thinking)

Effect of test-time scaling, with vs. without thinking mode: response length (tokens), task score (N@3 / accuracy), and request validity (% of successfully parsed requests).
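Request validity, defined above as the percentage of successfully parsed requests, can be computed along the following lines. This is a minimal sketch under stated assumptions: it supposes requests are JSON objects, and the required fields `agent` and `query` are a hypothetical schema, since the page does not specify the actual request protocol.

```python
import json

# Hypothetical schema: the real protocol's fields are not given in the source.
REQUIRED_FIELDS = ("agent", "query")


def is_valid_request(raw: str) -> bool:
    """True if `raw` parses as a JSON object containing all required fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(f in payload for f in REQUIRED_FIELDS)


def request_validity(raw_requests) -> float:
    """Percentage of requests that parse successfully (0.0 for no requests)."""
    raw_requests = list(raw_requests)
    if not raw_requests:
        return 0.0
    valid = sum(is_valid_request(r) for r in raw_requests)
    return 100.0 * valid / len(raw_requests)
```

Comparing this figure for the same model with and without thinking mode would show whether explicit reasoning reduces malformed requests.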

Efficiency

Coordination needs enough interactions

Efficiency is not about minimizing interactions. Gemini-3 contacts more agents and issues more requests than Qwen3-4B, trading lower validity for broader coverage, and wins on most tasks. Yet on web recommendation, Qwen3-4B interacts more without better results: quantity alone fails when intent inference is unstable. This motivates adaptive interaction budgeting and planning-aware coordination.

Interaction efficiency across four tasks (violin plots): turns, contacted agents, requests, and request validity for the weakest model (Qwen3-4B) vs. the strongest (Gemini-3).

Failure modes

Both sides need to improve

Failures split differently by model and task. On web search, Qwen models fail more on the user-agent side, while DeepSeek, GPT-5, and Gemini-3 fail more on the content-agent side. On QA, failures are mostly attributed to the user agent. Even when content agents return correct evidence, the user agent can answer at the wrong granularity.

Failure on Question Answering · Gemini-3

Query: In Plato's analogy of the sun, what element does he compare to the Form of the Good in enabling knowledge?

Ground truth: Light

Evidence (Wikipedia & Britannica): “The Sun provides light, which allows the eye to see… the Form of the Good provides truth and reality, allowing the soul to understand.”

User-agent answer: The Sun ✗

Correct evidence retrieved, but the agent picks the entity being compared instead of the enabling element, an answer-synthesis error.

Failure attribution (%): user agent vs. content agents
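A failure-attribution breakdown of this kind can be derived from per-failure labels. The sketch below assumes each failed episode has already been labeled with the responsible side; the label strings `user_agent` and `content_agent` and the function name are hypothetical, and how the benchmark actually assigns blame is not described here.

```python
from collections import Counter


def failure_attribution(labels):
    """Map each responsible side to its share of failures, in percent.

    `labels` holds one entry per failed episode, e.g. "user_agent" or
    "content_agent" (hypothetical label names).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {side: round(100.0 * n / total, 1) for side, n in counts.items()}
```

Under this labeling, the Plato example above would count toward `user_agent`: the evidence was correct, and the error arose during answer synthesis.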