AMD has introduced AgentKernelArena, an open-source benchmarking framework for evaluating AI coding agents on GPU kernel optimization tasks. It lets developers test agents such as Cursor Agent, Claude Code, and OpenAI Codex on real-world GPU optimization challenges. A unified automated scoring system evaluates each submission on compilation, correctness, and speedup over a baseline. The framework supports a range of agents and models, enabling fair and reproducible comparisons across configurations. Source: AMD
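
To make the three scoring stages concrete, here is a minimal Python sketch of a gated evaluation: a kernel must compile, then pass a correctness check, and only then is its speedup over the baseline reported. The `KernelResult` type, the `evaluate` function, and the field names are hypothetical and chosen for illustration; they are not AgentKernelArena's actual API, and how the framework combines these stages into a final score is not described here.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    """Hypothetical record of one agent submission (illustrative only)."""
    compiled: bool        # did the optimized kernel build?
    correct: bool         # did its output match the reference?
    baseline_ms: float    # runtime of the unoptimized baseline kernel
    optimized_ms: float   # runtime of the agent's kernel


def evaluate(result: KernelResult) -> dict:
    """Gate on compilation, then correctness, then report speedup."""
    if not result.compiled:
        return {"compiled": False, "correct": False, "speedup": 0.0}
    if not result.correct:
        return {"compiled": True, "correct": False, "speedup": 0.0}
    if result.optimized_ms <= 0:
        raise ValueError("optimized_ms must be positive")
    # Speedup is the baseline runtime divided by the optimized runtime.
    speedup = result.baseline_ms / result.optimized_ms
    return {"compiled": True, "correct": True, "speedup": speedup}
```

In this sketch a kernel that is faster but produces wrong output reports a speedup of zero, which reflects the idea that speed only counts once correctness is established.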

AgentKernelArena includes 214 tasks across four categories, covering kernel optimization, translation, and repository-scale work. In an evaluation on a 44-task subset run on the MI300X GPU, GEAKv3, AMD’s in-house agent, achieved average speedups of 9.04× on HIP2HIP tasks, 2.75× on Triton2Triton tasks, and 1.20× on repository-scale tasks. Claude Code and Cursor also performed well, with speedups of 6.08× and 5.03× respectively on HIP2HIP tasks. Source: AMD

The framework is designed to ensure fair and reproducible benchmarking by isolating agents in controlled environments and using a centralized evaluator for consistent scoring. It supports timestamped runs and resumable evaluations, allowing users to track changes in agent performance over time. AgentKernelArena also enables A/B testing of different agent configurations, tools, and prompt variations. Source: AMD
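
One way timestamped, resumable runs can work is sketched below in Python: each run gets its own timestamped directory, each task writes one result file, and a rerun skips any task whose result file already exists. The directory layout and the names `new_run_dir`, `run_tasks`, and `evaluate_task` are assumptions made for illustration, not AgentKernelArena's actual implementation.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable, List


def new_run_dir(root: Path) -> Path:
    """Create a run directory named after the current UTC time."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = root / f"run_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def run_tasks(
    run_dir: Path,
    tasks: Iterable[str],
    evaluate_task: Callable[[str], dict],
) -> List[str]:
    """Evaluate each task once; return the tasks executed in this call."""
    executed = []
    for task in tasks:
        result_file = run_dir / f"{task}.json"
        if result_file.exists():
            # A result from an earlier, interrupted run: skip it.
            continue
        result_file.write_text(json.dumps(evaluate_task(task)))
        executed.append(task)
    return executed
```

Calling `run_tasks` again on the same directory after an interruption picks up only the unfinished tasks, and keeping one directory per timestamp leaves earlier runs available for comparison.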