Research

CEO-Bench Test: Only Three AI Models Finish Above Their Starting Capital

In a 500-day startup survival simulation, only three AI models finished with more than the initial $1 million in capital, according to a Princeton University study.

Researchers at Princeton University built CEO-Bench, a benchmark that tests how well AI models can manage a fictional software company over 500 simulated days. Performance is measured by the cash remaining at the end of the run. Most current AI models failed the task, and only three finished above the $1 million starting capital: Claude Fable 5, Claude Opus 4.8, and GPT-5.5. To illustrate the kind of strategic steering they want AI agents to show, the researchers point to a famous example from 1997, when Steve Jobs steered Apple away from bankruptcy by narrowing its focus to specific product quadrants.

In the simulation, an agent manages a fictional subscription software company called NovaMind, starting with $1 million in the bank and zero customers. Agents control the company through a Python API with 34 tools and a database of 19 tables. They write their own code, query the database with SQL, and build custom workflows. The task is hard because feedback is delayed, some variables are hidden, and conditions change over time, including shifting customer preferences and simulated business cycles. The researchers deliberately used fixed, transparent rules rather than a language model as referee to avoid issues seen in other benchmarks.
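To make that setup more concrete, here is a minimal Python sketch of the two ideas described above: an agent reading company state with SQL, and a fixed rule deciding the outcome. It is not taken from CEO-Bench. The `subscriptions` table, its columns, and both function names are hypothetical stand-ins, and the benchmark's real 19-table schema and 34-tool API are not shown in the article.

```python
import sqlite3

STARTING_CAPITAL = 1_000_000  # dollars, the starting cash reported in the article


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical subscriptions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions ("
        "customer_id INTEGER, monthly_price REAL, active INTEGER)"
    )
    conn.executemany(
        "INSERT INTO subscriptions VALUES (?, ?, ?)",
        [(1, 49.0, 1), (2, 99.0, 1), (3, 49.0, 0)],  # made-up sample rows
    )
    return conn


def monthly_recurring_revenue(conn):
    """Sum the monthly price of all active subscriptions (hypothetical schema)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(monthly_price), 0) "
        "FROM subscriptions WHERE active = 1"
    ).fetchone()
    return row[0]


def finished_above_start(final_cash, starting_capital=STARTING_CAPITAL):
    """Fixed, transparent rule: did the run end with more cash than it began with?"""
    return final_cash > starting_capital


if __name__ == "__main__":
    conn = build_demo_db()
    print("MRR:", monthly_recurring_revenue(conn))
    print("Above starting capital:", finished_above_start(1_200_000))
```

The point of the second function is that the verdict depends only on a numeric comparison, so two runs with the same final cash always get the same result, which is the property a rule-based referee offers over a language-model judge.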

Source: The Decoder

Key points

CEO-Bench, a test from Princeton University researchers, has AI models run a fictional software company for 500 simulated days.
Most current AI models failed the task; only Claude Fable 5, Claude Opus 4.8, and GPT-5.5 finished above the $1 million starting capital.
The researchers point to a famous example from 1997, when Steve Jobs steered Apple away from bankruptcy by narrowing its focus to specific product quadrants.
The simulated company is a fictional subscription software business called NovaMind, which starts with $1 million in the bank and zero customers.
The researchers deliberately used fixed, transparent rules rather than a language model as referee to avoid issues seen in other benchmarks.

WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.