
AA-Briefcase Benchmark Shows AI Struggles With Real Knowledge Work

A new benchmark reveals that AI models fail to complete most real-world knowledge tasks, with the top model fully solving just 3 percent of them.


A new benchmark called AA-Briefcase has exposed how poorly AI models handle real-world knowledge work. The benchmark tests models on multi-week projects using thousands of fragmented source files, such as Slack threads, emails, and meeting transcripts. Even the best model, Anthropic's Claude Fable 5, fully solves only 3 percent of tasks, highlighting significant gaps in AI's ability to perform complex, real-world tasks.

The benchmark shows that the kind of error depends on the model's strength. Weaker models struggle with basic execution, often missing relevant files or producing unusable results. Stronger models fail more subtly: they meet the obvious requirements but miss crucial details that can only be found by cross-referencing multiple sources. This suggests that AI still lacks the ability to synthesize information effectively in real-world scenarios.

The benchmark also highlights a wide cost disparity: the most expensive model costs more than 800 times as much per task as the cheapest. The cheapest model, DeepSeek V4 Flash, costs about $0.04 per task, while the most expensive, Claude Fable 5, exceeds $31 per task. These findings underscore the challenges AI faces in handling complex, real-world knowledge work.
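As a quick sanity check on the cost gap, the ratio can be recomputed from the rounded figures quoted above. This is only a sketch: the constant and function names are illustrative, and the inputs are the article's rounded numbers, not the benchmark's raw data. The rounded figures give roughly 775x, so the reported gap of more than 800x presumably rests on unrounded costs (a cheapest cost slightly under $0.04, a highest cost above $31).

```python
# Rounded per-task costs quoted in the article, expressed in US cents
# so the division below is exact. These are NOT the benchmark's raw figures.
CHEAPEST_CENTS = 4      # DeepSeek V4 Flash, "about $0.04"
PRICIEST_CENTS = 3100   # Claude Fable 5, "exceeds $31" (treated as a lower bound)


def cost_ratio(high_cents: int, low_cents: int) -> float:
    """Return how many times more expensive the high cost is than the low cost."""
    if low_cents <= 0:
        raise ValueError("low_cents must be positive")
    return high_cents / low_cents


ratio = cost_ratio(PRICIEST_CENTS, CHEAPEST_CENTS)
print(f"Cost ratio from rounded figures: {ratio:.0f}x")  # prints 775x
```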

Source: The Decoder

Key points

The AA-Briefcase benchmark tests AI models on multi-week projects using thousands of fragmented source files.
Anthropic's Claude Fable 5, the top performer, fully solves just 3 percent of tasks.
Weaker models fail on basic execution by missing relevant files or producing unusable results.
Stronger models meet obvious requirements but miss details requiring cross-referencing multiple sources.
Per-task costs differ by a factor of more than 800, from about $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5.


WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.