
MirrorCode Benchmark Tests AI's Programming Skills

Epoch AI's MirrorCode benchmark assesses AI models' ability to recreate programs from scratch, with one task costing $2,600 to run.

A single MirrorCode task kept an AI model running continuously for 19 days and cost $2,600 to complete, according to Epoch AI. The benchmark, developed in collaboration with METR, challenges AI systems to reimplement complete programs across various computer science domains without access to the original source code. In a separate task, Claude Opus 4.7 recreated a bioinformatics toolkit with 16,000 lines of code in just 14 hours, work that would take a human engineer 2 to 17 weeks without AI assistance. The model received no human input and finished that task for $251, which suggests AI can handle some complex programming tasks efficiently. Source: The Decoder

In the MirrorCode benchmark, Claude Opus 4.7 led with a 56 percent solve rate, followed by GPT-5.5 at 44 percent and Gemini 3.1 Pro Preview at 32 percent. Models performed well on smaller programs but struggled with the largest and most complex tasks. The benchmark includes 25 target programs covering areas like Unix utilities, cryptography, and compression. Each solution must exactly reproduce the output of the original program, including on hidden end-to-end tests that are not seen during development. Source: The Decoder
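The exact-output requirement described above can be illustrated with a small differential test. This is a minimal sketch under stated assumptions, not Epoch AI's actual scaffold: the function names `run_program` and `outputs_match` are hypothetical, and it assumes both programs are command-line tools that read standard input and write standard output.

```python
import subprocess
from typing import Iterable, Sequence


def run_program(cmd: Sequence[str], stdin_data: bytes) -> tuple[int, bytes]:
    """Run one command on one input and return (exit code, raw stdout bytes)."""
    proc = subprocess.run(cmd, input=stdin_data, capture_output=True, timeout=60)
    return proc.returncode, proc.stdout


def outputs_match(
    reference_cmd: Sequence[str],
    candidate_cmd: Sequence[str],
    test_inputs: Iterable[bytes],
) -> bool:
    """Return True only if the candidate reproduces the reference exactly.

    Hypothetical helper: "exactly" is taken here to mean the same exit code
    and byte-for-byte identical stdout on every test input.
    """
    for data in test_inputs:
        if run_program(reference_cmd, data) != run_program(candidate_cmd, data):
            return False
    return True
```

In a setup like this, the hidden tests would simply be a second list of inputs that the model never sees while it is writing the candidate program.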

Epoch AI noted that the largest task in the benchmark cost $2,600 for a single run, with the AI working nonstop for 19 days. The researchers emphasized that although current models show progress over those from a year ago, the hardest tasks still stump every AI tested. They also said that costs do not follow a clear trend from one model generation to the next: GPT-5.5 costs three times as much as GPT-5 for the same tasks, while Claude Opus 4.7 runs at about one-third the cost of its predecessor. Source: The Decoder

Key points

Epoch AI and METR developed the MirrorCode benchmark to test AI models' ability to recreate complete programs from scratch without access to the original source code.
Claude Opus 4.7 achieved a 56 percent solve rate in the MirrorCode benchmark, leading all tested models.
One of the largest tasks in the MirrorCode benchmark cost $2,600 to run and required the AI to work continuously for 19 days.
Claude Opus 4.7 reimplemented a bioinformatics toolkit with 16,000 lines of code in 14 hours, a task that would take a human engineer 2 to 17 weeks to complete without AI assistance.
The hardest tasks in the MirrorCode benchmark still stump every model tested, even though current models show progress over those from a year ago.
Epoch AI has open-sourced the scaffold and 22 of the 25 target programs, covering 132 task instances across six programming languages.

Source: The Decoder

WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.