SWE-Explore Benchmark Reveals AI Coding Agents Miss Key Code Lines

A new benchmark shows AI coding agents often find the right file but miss crucial lines, covering only 14 to 19% of the lines that matter in tests.

A new benchmark called SWE-Explore has exposed a critical weakness in AI coding agents: they often identify the correct file but fail to locate the exact lines of code that matter. The study was led by researchers from Shanghai Jiao Tong University. The weakness has stayed hidden because traditional benchmarks measure only whether a bug is fixed, not how the agent arrived at its solution. Source: The Decoder

The benchmark evaluates the first phase of the coding process: an agent receives a bug description and a software project, then returns a ranked list of the code sections it considers relevant. To build the reference answers, the researchers drew on 848 problems, each with at least two successful solution attempts from models such as GPT-5.4 and Gemini 3 Pro. By analyzing these runs, they identified which files and lines the models actually examined before fixing the bug. Passages that multiple solution paths converged on were treated as strong signals of useful context. Source: The Decoder
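To make the idea of converging solution paths concrete, here is a minimal Python sketch. It is an illustration, not the researchers' method: the function name `consensus_lines`, the `(file, line)` representation, the sample paths, and the two-run threshold are all assumptions made for this example.

```python
from collections import Counter


def consensus_lines(runs, min_runs=2):
    """Return the (file, line) pairs that appear in at least `min_runs` runs.

    `runs` is a list of collections, one per successful solution attempt,
    each holding the (file, line) pairs that attempt looked at.
    """
    counts = Counter()
    for viewed in runs:
        # set() ensures a run counts at most once per line
        counts.update(set(viewed))
    return {loc for loc, n in counts.items() if n >= min_runs}


# Hypothetical data: two successful runs on the same bug.
runs = [
    {("app/db.py", 10), ("app/db.py", 11), ("app/util.py", 3)},
    {("app/db.py", 10), ("app/db.py", 11), ("app/cli.py", 40)},
]

relevant = consensus_lines(runs)
print(sorted(relevant))  # [('app/db.py', 10), ('app/db.py', 11)]
```

Only the two lines both runs examined survive. Lines that a single run happened to open are dropped as likely noise.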

The dataset includes 203 open-source projects across ten programming languages, with Python making up 547 of the 848 tasks (about 65%). Researchers found that keyword search barely beats chance, while AI agents perform better by searching step by step. At the line level, however, results drop sharply: agents cover only 14 to 19% of the lines that actually matter. In other words, an agent can land in the right file and still miss the precise lines needed to fix the bug. Source: The Decoder
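The gap between file-level and line-level results is easier to see with a small calculation. The Python sketch below is hypothetical: `file_hit` and `line_coverage` are illustrative names, the file path and line numbers are made up, and the benchmark's actual scoring may differ.

```python
def file_hit(predicted, relevant):
    """True if the agent returned at least one file that holds relevant lines."""
    predicted_files = {path for path, _ in predicted}
    relevant_files = {path for path, _ in relevant}
    return bool(predicted_files & relevant_files)


def line_coverage(predicted, relevant):
    """Fraction of relevant (file, line) pairs that the agent returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(predicted) & relevant) / len(relevant)


# Hypothetical case: the lines that matter are 10-19 of app/db.py,
# but the agent returns lines 1-11 of the same file.
relevant = {("app/db.py", n) for n in range(10, 20)}
predicted = [("app/db.py", n) for n in range(1, 12)]

print(file_hit(predicted, relevant))               # True
print(f"{line_coverage(predicted, relevant):.0%}")  # 20%
```

In this made-up case the agent gets full credit for finding the file while covering only two of the ten lines that matter, which is the kind of gap a pass/fail benchmark would not reveal.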

Key points

SWE-Explore benchmark shows AI coding agents often find the right file but miss crucial lines.
At the line level, AI coding agents cover only 14 to 19% of the lines that matter.
Researchers drew on 848 problems, each with at least two successful solution attempts from models such as GPT-5.4.
The dataset includes 203 open-source projects across ten programming languages, with Python making up 547 of the 848 tasks.
Keyword search barely beats chance, while AI agents perform better by searching step by step.

Source: The Decoder

WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.