AI search agents rarely fail at multi-step research tasks because of the search itself. Their real problem is that they don't ask the user for clarification when a query is ambiguous. That's the finding of a new benchmark from a team at Tencent Hunyuan and Tsinghua University. According to the results, repeated searching often performs worse than just guessing.

With DiscoBench, the researchers built a test framework that checks whether language models can spot ambiguity on their own during deep search chains, ask targeted follow-up questions, and correct their research path. Previous benchmarks like GAIA or BrowseComp assume user queries are complete and unambiguous. But real-world queries are often vague, incomplete, or flat-out wrong. In long reasoning chains, every unresolved ambiguity compounds. If the model picks the wrong entity at an early node and guesses instead of clarifying, its later searches stay syntactically clean but pursue the wrong target, so the error cascades through the rest of the chain and produces a wrong final answer.

DiscoBench contains 211 tasks with a total of 463 ambiguous points across eleven knowledge domains, including video games, sports, music, film, science, and politics. The dataset is mostly written in Chinese to reflect typical search patterns on the Chinese-language web. The researchers define four types of ambiguity: a description might match multiple entities, apply to different time periods or versions, allow for multiple valid ranking or evaluation criteria, or contain an outright factual error. The construction pipeline first builds clean multi-hop questions in phase one, then injects targeted ambiguities and distinguishing clues in phase two.

Each task is split into multiple checkpoints. At each checkpoint, the agent can choose between three actions: keep searching, ask the user for clarification, or give an answer. The framework checks whether the search is unambiguous at each checkpoint and evaluates agents across four metric groups, from task success to cost efficiency. When the agent asks a useful follow-up question, an LLM-based user simulator releases a predefined clue that helps narrow the search. All search queries run through Tavily, a search engine built for agents, and Gemini 3 Flash serves as the user simulator.
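To make the checkpoint mechanics concrete, here is a minimal Python sketch of the loop described above. It is an illustration under stated assumptions, not the benchmark's actual code: every name (`Checkpoint`, `run_episode`, `simulate_user`, the two policies) and the example clues are made up, the LLM-based user simulator is replaced by a simple lookup, and a question counts as "useful" whenever it is asked at a checkpoint with an open ambiguity. The clarifying policy also peeks at the ground truth, which a real agent cannot do.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Action(Enum):
    SEARCH = "search"   # keep searching and move to the next checkpoint
    ASK = "ask"         # ask the user for clarification
    ANSWER = "answer"   # commit to a final answer


@dataclass
class Checkpoint:
    # Hypothetical fields: whether an ambiguity is still open here,
    # and the predefined clue the simulated user can release.
    ambiguous: bool
    clue: str = ""


@dataclass
class EpisodeLog:
    searches: int = 0
    questions: int = 0
    clues: List[str] = field(default_factory=list)
    answered_at: Optional[int] = None


def simulate_user(cp: Checkpoint) -> Optional[str]:
    """Stand-in for the LLM user simulator: release the clue only if an ambiguity is open."""
    return cp.clue if cp.ambiguous else None


Policy = Callable[[Checkpoint, bool], Action]


def run_episode(checkpoints: List[Checkpoint], policy: Policy, max_steps: int = 20) -> EpisodeLog:
    log = EpisodeLog()
    i = 0
    for _ in range(max_steps):  # guard against policies that never answer
        if i >= len(checkpoints):
            break
        cp = checkpoints[i]
        action = policy(cp, i == len(checkpoints) - 1)
        if action is Action.SEARCH:
            log.searches += 1
            i += 1
        elif action is Action.ASK:
            log.questions += 1
            clue = simulate_user(cp)
            if clue is not None:
                log.clues.append(clue)
                cp.ambiguous = False  # the clue resolves this ambiguity
        else:
            log.answered_at = i
            break
    return log


def clarifying_policy(cp: Checkpoint, is_last: bool) -> Action:
    # Oracle for illustration only: it reads the ground-truth ambiguity flag.
    if cp.ambiguous:
        return Action.ASK
    return Action.ANSWER if is_last else Action.SEARCH


def guessing_policy(cp: Checkpoint, is_last: bool) -> Action:
    # Never asks; searches through every checkpoint and then answers.
    return Action.ANSWER if is_last else Action.SEARCH


def make_task() -> List[Checkpoint]:
    # Made-up three-checkpoint task with two injected ambiguities.
    return [
        Checkpoint(True, "the 2019 remake, not the original"),
        Checkpoint(False),
        Checkpoint(True, "ranked by revenue"),
    ]


def unresolved(checkpoints: List[Checkpoint]) -> int:
    return sum(cp.ambiguous for cp in checkpoints)


if __name__ == "__main__":
    for policy in (clarifying_policy, guessing_policy):
        task = make_task()
        log = run_episode(task, policy)
        print(policy.__name__, log, "unresolved:", unresolved(task))
```

In this toy setup both policies issue the same number of searches, but only the clarifying policy reaches its answer with no open ambiguities. The guessing policy answers with both ambiguities unresolved, which is the situation the benchmark is designed to expose.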

The team tested eleven models released in the past six months: Claude Opus 4.7, GPT 5.4, Gemini 3.1 Pro Preview, Doubao Seed 2.0 Pro, DeepSeek V4 Pro, Kimi K2.6, GLM 5.1, Qwen3.6 Max, MiniMax M2.7, MiMo v2.5 Pro, and Hunyuan 3.0 Preview. Without an explicit hint about possible ambiguity, Doubao Seed 2.0 Pro hit the highest end-to-end accuracy at 43.1 percent, followed by Gemini 3.1 Pro at 40.8 percent and Claude Opus 4.7 at 39.8 percent. Weaker models like MiniMax M2.7 and Qwen3.6 Max managed only 16.1 and 12.3 percent, respectively. More search calls don't lead to better accuracy: Claude Opus 4.7 searches frequently but still trails Gemini 3.1 Pro and Doubao Seed 2.0 Pro.

Source: The Decoder