An AI model ran continuously for 19 days on a single MirrorCode task, a run that cost $2,600, according to Epoch AI. The benchmark, developed in collaboration with METR, challenges AI systems to reimplement complete programs from a range of computer science domains without access to the original source code. In another task, a model recreated a bioinformatics toolkit of 16,000 lines of code in just 14 hours, work that would take a human engineer 2 to 17 weeks without AI assistance. The model received no human input and finished that task for $251, which suggests AI can handle some complex programming tasks efficiently. Source: thedecoder
In the MirrorCode benchmark, Claude Opus 4.7 led with a 56 percent solve rate, followed by GPT-5.5 at 44 percent and Gemini 3.1 Pro Preview at 32 percent. Models performed well on smaller programs but struggled with the largest and most complex ones. The benchmark includes 25 target programs covering areas such as Unix utilities, cryptography, and compression. Each solution must exactly reproduce the output of the original program, including on hidden end-to-end tests not seen during development. Source: thedecoder
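The pass criterion described above, exact reproduction of the original program's output, can be illustrated with a minimal Python sketch. It is not MirrorCode's actual harness, which the source does not describe. The function names (`outputs_match`, `task_solved`, `solve_rate`), the toy commands, and the choice to compare standard output and exit code byte for byte are all assumptions made for illustration.

```python
import subprocess
import sys
from typing import Sequence

def outputs_match(reference_cmd: Sequence[str], candidate_cmd: Sequence[str],
                  stdin_data: bytes, timeout: float = 30.0) -> bool:
    """Run both programs on the same input and compare results byte for byte."""
    ref = subprocess.run(reference_cmd, input=stdin_data,
                         capture_output=True, timeout=timeout)
    cand = subprocess.run(candidate_cmd, input=stdin_data,
                          capture_output=True, timeout=timeout)
    # Assumption: "exact" means identical stdout bytes and identical exit code.
    return ref.stdout == cand.stdout and ref.returncode == cand.returncode

def task_solved(reference_cmd: Sequence[str], candidate_cmd: Sequence[str],
                cases: Sequence[bytes]) -> bool:
    """A task counts as solved only if every test case matches."""
    return bool(cases) and all(
        outputs_match(reference_cmd, candidate_cmd, case) for case in cases
    )

def solve_rate(results: Sequence[bool]) -> float:
    """Fraction of tasks solved."""
    return sum(results) / len(results)

# Toy stand-ins for an original program and two reimplementations (hypothetical).
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
GOOD_CANDIDATE = [sys.executable, "-c",
                  "import sys; print(sys.stdin.read().upper(), end='')"]
BAD_CANDIDATE = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper())"]  # extra newline
CASES = [b"hello\n", b"", b"mixed Case 123\n"]

if __name__ == "__main__":
    print(task_solved(REFERENCE, GOOD_CANDIDATE, CASES))
    print(task_solved(REFERENCE, BAD_CANDIDATE, CASES))
```

In this sketch a single stray newline is enough to fail a task. That strictness is what makes an exact-output requirement hard to meet on large programs.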
Epoch AI noted that the largest task in the benchmark cost $2,600 for a single run, with the AI working nonstop for 19 days. The researchers emphasized that although models are making progress, the hardest tasks still stump every AI tested. They also said costs follow no clear pattern from one model generation to the next: GPT-5.5 costs three times as much as GPT-5 on the same tasks, while Claude Opus 4.7 costs about a third as much as its predecessor. Source: thedecoder