The UK's AI Security Institute (AISI) reports that current benchmarking methods for AI agents systematically underestimate their capabilities when compute budgets are limited. According to the study, success rates rise by up to 25% when models are given more computing time, particularly on cybersecurity and software development tasks. Existing evaluations may therefore not fully reflect what these systems can achieve under more generous conditions.

The research, which tested frontier models across seven benchmarks with varying compute budgets, found that agent performance follows a curve that rises with test-time compute. If the budget is cut off while the curve is still climbing, the measured score is a lower bound on the model's capabilities, not an estimate of its ceiling. The study also finds that newer models benefit disproportionately from larger compute budgets, improving across several dimensions, including reach, reliability, and efficiency.
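To make the lower-bound argument concrete, here is a minimal Python sketch. It assumes each run is recorded as the number of tokens it used before solving the task, or None if it never solved it. The function name `success_rate_at_budget` and the run data are hypothetical, invented for illustration, and are not taken from the AISI study.

```python
from typing import Optional, Sequence


def success_rate_at_budget(
    tokens_to_solve: Sequence[Optional[int]], budget: int
) -> float:
    """Fraction of runs that would have succeeded within `budget` tokens.

    Each entry is the number of tokens a run used before solving the task,
    or None if the run never solved it at any budget that was tried.
    """
    if not tokens_to_solve:
        raise ValueError("need at least one run")
    solved = sum(1 for t in tokens_to_solve if t is not None and t <= budget)
    return solved / len(tokens_to_solve)


# Hypothetical runs, invented for illustration (not data from the study).
runs = [200_000, 900_000, 3_000_000, 12_000_000, None]

for budget in (1_000_000, 5_000_000, 20_000_000):
    print(budget, success_rate_at_budget(runs, budget))
```

Because a run that succeeds within a small budget also succeeds within any larger one, the measured rate can only stay flat or rise as the budget grows. A score taken at a truncated budget is therefore a floor on what the agent can do.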

The findings also show that the number of tokens an agent needs scales with the time a human expert would need to complete the same task. For instance, a one-hour task costs an agent millions of tokens, while a one-week task requires billions. Fixed evaluation budgets can therefore cut off the longest and hardest tasks and produce misleading results. As an example, the study points to the cyber task 'The Last Ones,' which takes a human expert about 20 hours: no tested model solved it with fewer than 30 million tokens.
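The cut-off effect can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's methodology: the helper `tasks_cut_off` is hypothetical, the 30 million token floor for 'The Last Ones' comes from the article, and the second task and the 10 million token budget are invented.

```python
from typing import Dict, List


def tasks_cut_off(min_tokens_by_task: Dict[str, int], budget: int) -> List[str]:
    """Names of tasks whose minimum observed solve cost exceeds the budget."""
    return sorted(
        name for name, need in min_tokens_by_task.items() if need > budget
    )


min_tokens = {
    # Floor reported in the article: no tested model solved it below this.
    "The Last Ones": 30_000_000,
    # Invented entry, for contrast only.
    "short task (hypothetical)": 2_000_000,
}

# Hypothetical fixed evaluation budget of 10 million tokens per task.
print(tasks_cut_off(min_tokens, budget=10_000_000))
```

Under such a budget, every attempt at the longer task is recorded as a failure regardless of what the model could do with more tokens, which is the distortion the study describes.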

Source: thedecoder