
Tencent Hunyuan and Tsinghua Researchers Find AI Search Agents Fail at Asking Right Questions

A new benchmark from Tencent Hunyuan and Tsinghua University shows AI search agents struggle with ambiguous queries, achieving end-to-end accuracy below 50% in most cases.


AI search agents rarely fail at multi-step research tasks because of the search itself. Their real problem is failing to ask the user for clarification when queries are ambiguous. That's the finding of a new benchmark from a team at Tencent Hunyuan and Tsinghua University. Repeated searching often performs worse than just guessing.

With DiscoBench, the researchers built a test framework that checks whether language models can spot ambiguity on their own during deep search chains, ask targeted follow-up questions, and correct their research path. Previous benchmarks like GAIA or BrowseComp assume user queries are complete and unambiguous. But real-world queries are often vague, incomplete, or flat-out wrong.

In long reasoning chains, every unresolved ambiguity compounds and steers the agent down the wrong path. If the model picks the wrong entity at an early node, it keeps searching with clean syntax but misses the actual target entirely. When a search agent guesses instead of clarifying ambiguities, the error cascades through the entire reasoning chain and produces a wrong final answer.
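A rough back-of-the-envelope calculation illustrates why guessing compounds so badly. The sketch below assumes, purely for illustration, that an agent guesses uniformly and independently among a fixed number of candidate readings at each ambiguous point. The function name `blind_guess_success` and the candidate counts are hypothetical and do not come from the benchmark.

```python
from math import prod


def blind_guess_success(candidates_per_point: list[int]) -> float:
    """Chance of guessing correctly at every ambiguous point in a chain.

    Assumes independent, uniform guesses (an illustrative simplification,
    not something the benchmark measures).
    """
    if any(n < 1 for n in candidates_per_point):
        raise ValueError("each ambiguous point needs at least one candidate")
    return prod(1.0 / n for n in candidates_per_point)


# Hypothetical chains in which every ambiguous point has two plausible readings.
for points in (1, 2, 3):
    print(points, blind_guess_success([2] * points))
```

Under these assumptions the chance of reaching the right answer by guessing halves with each additional ambiguous point, which is the cascade the researchers describe.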

DiscoBench contains 211 tasks with a total of 463 ambiguous points across eleven knowledge domains, including video games, sports, music, film, science, and politics. Each task is split into multiple checkpoints. At each checkpoint, the agent can choose between three actions: keep searching, ask the user for clarification, or give an answer. The framework checks whether the search is unambiguous at each checkpoint and evaluates agents across four metric groups, from task success to cost efficiency.

The researchers define four types of ambiguity. A description might match multiple entities, apply to different time periods or versions, allow for multiple valid ranking or evaluation criteria, or contain an outright factual error. The dataset is mostly written in Chinese to reflect typical search patterns on the Chinese-language web.

When the agent asks a useful follow-up question, an LLM-based user simulator releases a predefined clue that helps narrow the search. All search queries run through the agent search engine Tavily, and Gemini 3 Flash serves as the simulator. The pipeline first builds clean multi-hop questions in phase one, then injects targeted ambiguities and distinguishing clues in phase two.
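The checkpoint procedure described above can be pictured as a simple loop. What follows is a minimal toy sketch, not the researchers' implementation: the names `Action`, `Checkpoint`, `EpisodeStats`, and `run_episode` are hypothetical, the real search engine and LLM-based simulator are replaced by plain data, and the rule that a question at an ambiguous checkpoint always releases its clue is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    SEARCH = "search"   # keep searching
    ASK = "ask"         # ask the user for clarification
    ANSWER = "answer"   # give a final answer


@dataclass
class Checkpoint:
    ambiguous: bool
    clue: Optional[str] = None  # predefined clue the simulated user can release


@dataclass
class EpisodeStats:
    searches: int = 0
    questions: int = 0
    clues: list[str] = field(default_factory=list)
    unresolved: int = 0  # ambiguous checkpoints passed without a clue


# A policy sees the checkpoint index and the clues gathered so far.
Policy = Callable[[int, list[str]], Action]


def run_episode(checkpoints: list[Checkpoint], policy: Policy) -> EpisodeStats:
    stats = EpisodeStats()
    for index, checkpoint in enumerate(checkpoints):
        action = policy(index, stats.clues)
        if action is Action.ANSWER:
            break  # the agent stops early; later checkpoints are not visited
        resolved = False
        if action is Action.ASK:
            stats.questions += 1
            # Toy stand-in for the user simulator: release the clue if one exists.
            if checkpoint.ambiguous and checkpoint.clue is not None:
                stats.clues.append(checkpoint.clue)
                resolved = True
        if checkpoint.ambiguous and not resolved:
            stats.unresolved += 1  # the agent moved on by guessing
        stats.searches += 1  # every non-final step continues the search
    return stats


# Hypothetical three-checkpoint task with two injected ambiguities.
task = [
    Checkpoint(ambiguous=False),
    Checkpoint(ambiguous=True, clue="the 2019 remake"),
    Checkpoint(ambiguous=True, clue="ranked by sales"),
]
guesser = run_episode(task, lambda index, clues: Action.SEARCH)
asker = run_episode(task, lambda index, clues: Action.ASK)
print(guesser.unresolved, asker.unresolved, asker.questions)
```

In this toy setup the agent that never asks leaves both ambiguities unresolved, while the agent that always asks resolves both but spends one unnecessary question on the unambiguous checkpoint. That trade-off between task success and cost is what the metric groups are meant to capture.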

The team tested eleven models released in the past six months: Claude Opus 4.7, GPT 5.4, Gemini 3.1 Pro Preview, Doubao Seed 2.0 Pro, DeepSeek V4 Pro, Kimi K2.6, GLM 5.1, Qwen3.6 Max, MiniMax M2.7, MiMo v2.5 Pro, and Hunyuan 3.0 Preview. Without an explicit hint about possible ambiguity, Doubao Seed 2.0 Pro hit the highest end-to-end accuracy at 43.1 percent. Gemini 3.1 Pro followed at 40.8 percent, and Claude Opus 4.7 at 39.8 percent. Weaker models like MiniMax M2.7 and Qwen3.6 Max managed only 16.1 and 12.3 percent, respectively. More search calls don't lead to better accuracy. Claude Opus 4.7 searches frequently but still trails Gemini 3.1 Pro and Seed 2.0 Pro.

Key points

AI search agents rarely fail at multi-step research tasks due to the search itself, but struggle with ambiguous queries.
A new benchmark from Tencent Hunyuan and Tsinghua University shows AI search agents achieve end-to-end accuracy below 50% in most cases.
DiscoBench contains 211 tasks with 463 ambiguous points across eleven knowledge domains, including video games, sports, music, film, science, and politics.
The dataset is mostly written in Chinese to reflect typical search patterns on the Chinese-language web.
The team tested eleven models released in the past six months, including Doubao Seed 2.0 Pro, Gemini 3.1 Pro, and Claude Opus 4.7.
Doubao Seed 2.0 Pro achieved the highest end-to-end accuracy at 43.1 percent, followed by Gemini 3.1 Pro at 40.8 percent.
More search calls don't lead to better accuracy, as seen with Claude Opus 4.7, which searches frequently but still trails other models.

Source: The Decoder

WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.
