Artificial Analysis and IBM Software Innovation Lab have introduced ITBench-AA, the first benchmark assessing large language models on agentic enterprise IT tasks, with a focus on Site Reliability Engineering (SRE). The benchmark evaluates a model's ability to diagnose Kubernetes incidents by analyzing logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. According to the report, all frontier models scored below 50%. The highest performer, Claude Opus 4.7 (Adaptive Reasoning, Max Effort), achieved 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. The SRE set comprises 59 tasks (40 public and 19 new), each providing a Kubernetes incident snapshot with alerts, events, traces, metrics, logs, and application topology. For each task, the model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident. Scoring is based on average precision at full recall, which penalizes models that miss any ground-truth root cause. The evaluation was run with the open-source Stirrup reference harness, with a cap of 100 turns per task and 3 repeats per task. *Source: [huggingface](https://huggingface.co/blog/ibm-research/itbench-aa)*
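
To make the scoring idea concrete, here is a minimal Python sketch of one plausible reading of "precision at full recall". It is an assumption, not the benchmark's published implementation. In this reading, the model returns a ranked list of entities, the score is the precision at the first point where every ground-truth root cause has been named, and a run that never reaches full recall scores zero. The function names, the averaging over runs, and the entity strings are all hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_full_recall(ranked: Sequence[str], truth: Set[str]) -> float:
    """Precision at the first rank where all ground-truth entities are found.

    Assumed interpretation: returns 0.0 if any ground-truth entity is missing.
    """
    if not truth:
        raise ValueError("ground truth must contain at least one entity")
    seen: Set[str] = set()
    found: Set[str] = set()
    for entity in ranked:
        if entity in seen:
            continue  # ignore duplicate predictions
        seen.add(entity)
        if entity in truth:
            found.add(entity)
        if found == truth:
            # Full recall reached: fraction of distinct predictions that were correct.
            return len(truth) / len(seen)
    return 0.0  # at least one root cause was never named


def mean_score(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the per-run score, e.g. over repeated runs of several tasks."""
    scores = [precision_at_full_recall(ranked, truth) for ranked, truth in runs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical entity names, for illustration only.
    truth = {"Deployment/checkout", "ConfigMap/payment-flags"}
    prediction = ["Deployment/checkout", "Service/frontend", "ConfigMap/payment-flags"]
    print(precision_at_full_recall(prediction, truth))
```

Under this reading, extra guesses lower the score and any omission zeroes it, which matches the report's statement that missing a ground-truth root cause is penalized.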