A new benchmark called AA-Briefcase has exposed how poorly AI models handle real-world knowledge work. The benchmark tests models on multi-week projects using thousands of fragmented source files, such as Slack threads, emails, and meeting transcripts. Even the best model, Anthropic's Claude Fable 5, fully solves only 3 percent of tasks.
The benchmark shows that the kind of failure depends on the model's strength. Weaker models struggle with basic execution, often missing relevant files or producing unusable results. Stronger models fail more subtly: they meet the obvious requirements but miss crucial details that can only be found by cross-referencing multiple sources. This suggests that AI still cannot reliably synthesize information scattered across many documents.
The benchmark also highlights a significant cost disparity, with per-task costs differing by a factor of more than 800. The cheapest model, DeepSeek V4 Flash, costs about $0.04 per task, while the most expensive, Claude Fable 5, exceeds $31 per task. These findings underscore the challenges AI faces in handling complex, real-world knowledge work.
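As a quick sanity check on the cost spread, the rounded figures quoted above can be divided directly. This is only a sketch: the two dollar values are the rounded numbers from the text, not exact benchmark data, and the helper name `cost_ratio` is made up for illustration.

```python
# Rounded per-task costs as quoted in the text (USD); exact figures are not given.
CHEAPEST_USD = 0.04   # DeepSeek V4 Flash, "about $0.04"
PRICIEST_USD = 31.0   # Claude Fable 5, "exceeds $31", so this is a lower bound


def cost_ratio(high: float, low: float) -> float:
    """Return how many times more expensive `high` is than `low`."""
    return high / low


ratio = cost_ratio(PRICIEST_USD, CHEAPEST_USD)
print(f"ratio from rounded figures: {ratio:.0f}x")
```

The rounded inputs give a ratio of roughly 775, somewhat below the reported factor of more than 800. That gap is consistent with rounding, since the expensive figure is stated only as a lower bound and the cheap one only as an approximation.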
Source: thedecoder