A new benchmark called SWE-Explore has exposed a critical weakness in AI coding agents: they often find the correct file but fail to pinpoint the exact lines of code that matter. The study, led by researchers from Shanghai Jiao Tong University, notes that this weakness stays hidden because traditional benchmarks measure only whether a bug gets fixed, not how the agent arrived at its solution. Source: thedecoder
The benchmark evaluates the first phase of the coding process: an agent receives a bug description and a software project, then returns a ranked list of code sections it considers relevant. The researchers drew on 848 problems, each with at least two successful solution attempts from models including GPT-5.4 and Gemini 3 Pro. By analyzing these runs, they identified which files and lines the models actually examined before fixing the bug. Passages that multiple solution paths converged on were treated as strong signals of useful context. Source: thedecoder
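The convergence idea can be illustrated with a minimal sketch. It assumes each successful run is reduced to a set of (file, line) pairs the model examined, and that a line counts as a strong signal once at least two runs touched it. The function name `consensus_lines`, the `min_runs` threshold, and the sample paths are hypothetical; the article does not describe the researchers' actual procedure in this detail.

```python
from collections import Counter


def consensus_lines(runs, min_runs=2):
    """Return the (file, line) pairs examined in at least `min_runs` runs.

    `runs` is a list of sets; each set holds the (file, line) pairs one
    successful solution attempt looked at. This data shape is an assumption.
    """
    counts = Counter()
    for examined in runs:
        # set() guards against a run listing the same location twice
        counts.update(set(examined))
    return {loc for loc, n in counts.items() if n >= min_runs}


# Two hypothetical successful runs on the same bug
runs = [
    {("app/db.py", 10), ("app/db.py", 11), ("app/util.py", 3)},
    {("app/db.py", 10), ("app/db.py", 11), ("tests/test_db.py", 40)},
]

print(sorted(consensus_lines(runs)))
# [('app/db.py', 10), ('app/db.py', 11)]
```

Only the two lines that both runs examined survive; locations seen in a single run are dropped as weaker evidence.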
The dataset spans 203 open-source projects across ten programming languages, with Python accounting for 547 of the 848 tasks. The researchers found that keyword search barely beats chance, while AI agents that search step by step perform better. At the line level, however, accuracy drops sharply: agents cover only 14 to 19% of the lines that actually matter. So even when an agent lands in the right file, it usually misses most of the lines needed to fix the bug. Source: thedecoder
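The gap between file-level and line-level results is easier to see with a small worked example. The sketch below assumes an agent's answer is a list of (file, start_line, end_line) sections and that the relevant context is a set of (file, line) pairs. The names `file_hit` and `line_coverage` and the sample data are hypothetical, and the benchmark's real scoring may differ.

```python
def file_hit(predicted, relevant):
    """True if any predicted section is in a file that holds relevant lines."""
    predicted_files = {f for f, _, _ in predicted}
    relevant_files = {f for f, _ in relevant}
    return bool(predicted_files & relevant_files)


def line_coverage(predicted, relevant):
    """Fraction of relevant (file, line) pairs that fall inside a predicted section."""
    if not relevant:
        return 0.0
    covered = set()
    for f, start, end in predicted:
        # Line ranges are treated as inclusive on both ends
        covered.update((f, line) for line in range(start, end + 1))
    return len(covered & relevant) / len(relevant)


# Hypothetical task: ten relevant lines, 100 through 109, in one file
relevant = {("app/db.py", n) for n in range(100, 110)}

# Hypothetical agent answer: right file, mostly wrong lines
predicted = [("app/db.py", 1, 20), ("app/db.py", 108, 120)]

print(file_hit(predicted, relevant))       # True
print(line_coverage(predicted, relevant))  # 0.2
```

Here the agent scores a file-level hit yet covers only two of the ten relevant lines, which mirrors the pattern the article describes.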