
UK AI Security Institute Finds Benchmarks Underestimate AI Agent Capabilities

A study by the UK's AI Security Institute shows that standard benchmarks fail to capture AI agents' full potential: success rates rise by up to 25% when agents are given more computing time.


The UK's AI Security Institute (AISI) has found that current benchmarking methods for AI agents systematically underestimate their capabilities when computing budgets are limited. According to the study, success rates increase by up to 25% when models are given more computing time, particularly in cybersecurity and software development tasks. Existing evaluations may therefore understate what these systems can achieve with a larger budget.

The research tested frontier models across seven benchmarks with varying compute budgets and found that agent performance follows a curve that rises with test-time compute. If the budget is cut off while the curve is still climbing, the measured score is a lower bound on the model's capability, not its ceiling. The study also highlights that newer models benefit disproportionately from larger computing budgets, improving along dimensions such as reach, reliability, and efficiency.
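The effect of cutting a budget short can be illustrated with a small sketch. The token counts below are hypothetical and are not taken from the AISI study; the function name and data layout are assumptions made for illustration only. The sketch computes the share of tasks solved within a given token budget and shows that the share can only stay flat or rise as the budget grows.

```python
def success_rate_at_budget(tokens_to_solve, budget):
    """Fraction of tasks solved using at most `budget` tokens.

    `tokens_to_solve` lists, per task, the tokens spent when the agent
    first succeeded, or None if it never succeeded at any budget tried.
    """
    solved = sum(1 for t in tokens_to_solve if t is not None and t <= budget)
    return solved / len(tokens_to_solve)


# Hypothetical per-task results (NOT data from the study).
runs = [200_000, 1_500_000, 4_000_000, 12_000_000, 35_000_000, None]

# Evaluate the same runs under three hypothetical budgets.
budgets = [1_000_000, 5_000_000, 50_000_000]
curve = [success_rate_at_budget(runs, b) for b in budgets]

for b, rate in zip(budgets, curve):
    print(f"budget {b:>11,} tokens -> success rate {rate:.2f}")
```

In this toy data, the score reported at the smallest budget is well below the score at the largest one, even though the underlying agent is the same.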

The findings also show that the number of tokens an agent needs scales with the time a human expert would need to complete the same task. For instance, a one-hour task costs an agent millions of tokens, while a one-week task costs billions. Fixed evaluation budgets may therefore cut off the longest and hardest tasks, leading to misleading results. As an example, the study points to the cyber task 'The Last Ones,' which takes a human expert about 20 hours and which no tested model could solve with fewer than 30 million tokens.
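A rough sketch of how a fixed budget excludes long tasks follows. The 30-million-token figure for 'The Last Ones' is the threshold reported above; the other token values are order-of-magnitude placeholders standing in for "thousands", "millions", and "billions", and the 10-million-token budget is a hypothetical choice, not one used in the study.

```python
# Approximate tokens needed per task. Only the 'The Last Ones' figure comes
# from the article; the rest are illustrative order-of-magnitude values.
TOKEN_NEEDS = {
    "one-minute task": 1_000,
    "one-hour task": 1_000_000,
    "The Last Ones (about 20 human-hours)": 30_000_000,
    "one-week task": 1_000_000_000,
}


def tasks_cut_off(token_needs, budget):
    """Return the tasks whose token need exceeds the evaluation budget."""
    return [name for name, need in token_needs.items() if need > budget]


# Hypothetical fixed evaluation budget of 10 million tokens per task.
excluded = tasks_cut_off(TOKEN_NEEDS, budget=10_000_000)
print(excluded)
```

Under this assumed budget, the two longest tasks can never be scored as solved, regardless of how capable the agent is.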

Source: The Decoder

Key points

A study by the UK's AI Security Institute reveals that common benchmarks systematically underestimate the capabilities of AI agents when computing budgets are limited.
When models are given more computing time, their success rates increase by up to 25 percent, with particularly notable gains in cybersecurity and software development tasks.
The findings also show that the number of tokens AI models require scales with how long a human expert would need to complete the same task.
Newer models benefit more than older ones from larger computing budgets.
A one-minute task costs the agent thousands of tokens. A one-hour task costs millions. A one-week task costs billions.
The estimated doubling rate is partly a product of the evaluation budget you pick, not a fixed property of frontier progress.
Test a model with too small a budget, and you get a score that skews decisions about deployment, economic value, and risk.


WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.
