
ITBench-AA Reveals Frontier Models Lag in Agentic SRE Tasks

IBM and Artificial Analysis launched ITBench-AA, a benchmark on which every frontier model scores below 50% on Site Reliability Engineering tasks, with the top performer at 47%.


Artificial Analysis and IBM Software Innovation Lab have introduced ITBench-AA, the first benchmark assessing large language models on agentic enterprise IT tasks, with a specific focus on Site Reliability Engineering (SRE). The benchmark evaluates a model's ability to diagnose Kubernetes incidents by analyzing logs, tracing dependencies, and identifying root-cause entities across complex infrastructure.

According to the report, all frontier models scored below 50%. The highest performer, Claude Opus 4.7 (Adaptive Reasoning, Max Effort), achieved 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. The SRE set comprises 59 tasks, 40 public and 19 brand new. Each task provides a Kubernetes incident snapshot with alerts, events, traces, metrics, logs, and application topology.

Models must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident. The benchmark uses a scoring system based on average precision at full recall, penalizing models that miss any ground-truth root causes. The evaluation was conducted using an open-source Stirrup reference harness, with a 100-turn cap per task and 3 repeats per task.
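The report does not spell out the scoring formula, so the following is only a minimal sketch of one plausible reading of "average precision at full recall": a run earns its precision only if it names every ground-truth root cause, and earns zero otherwise, with the result averaged over the repeats of a task. The function names and the entity identifiers are hypothetical, chosen for illustration, and are not taken from the ITBench-AA harness.

```python
from statistics import mean


def precision_at_full_recall(predicted, truth):
    """Assumed rule: score is precision if every ground-truth entity
    was predicted (recall == 1), and 0.0 otherwise."""
    predicted, truth = set(predicted), set(truth)
    if not truth or not truth <= predicted:
        return 0.0  # at least one root cause was missed
    # All truth entities are in `predicted`, so true positives == len(truth).
    return len(truth) / len(predicted)


def task_score(runs, truth):
    """Average the per-run score over the repeats of a single task."""
    return mean(precision_at_full_recall(run, truth) for run in runs)


# Hypothetical incident with one ground-truth root cause and three repeats.
truth = {"deployment/checkout"}
runs = [
    {"deployment/checkout"},                # exact answer
    {"deployment/checkout", "pod/cart-0"},  # correct plus one false positive
    {"pod/cart-0"},                         # root cause missed
]
print(task_score(runs, truth))
```

Under this reading, listing extra entities to be safe lowers precision, while omitting a true root cause zeroes the run, which is consistent with the article's statement that missed root causes are penalized.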

Source: Hugging Face

Key points

Artificial Analysis and IBM launched ITBench-AA, the first benchmark evaluating models on agentic enterprise IT tasks.
All frontier models scored below 50% on ITBench-AA’s SRE tasks, with the highest performer at 47%.
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%.
ITBench-AA includes 59 SRE tasks, 40 public and 19 brand new, each providing a Kubernetes incident snapshot with alerts, events, traces, metrics, logs, and application topology.
The benchmark uses a scoring system based on average precision at full recall, penalizing models that miss any ground-truth root causes.
Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives.


WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.
