Researchers at Princeton University have tested how well AI models can manage a fictional software company using a benchmark called CEO-Bench. Each model runs the startup for 500 simulated days, and performance is measured by the cash remaining at the end. Most current AI models failed the task: only three finished above the starting capital of $1 million, namely Claude Fable 5, Claude Opus 4.8, and GPT-5.5. To illustrate the kind of strategic steering they want AI agents to show, the researchers point to a famous example from 1997, when Steve Jobs steered Apple away from bankruptcy by focusing the company on a few specific product quadrants.

The simulation puts the agent in charge of a fictional subscription software company called NovaMind, which starts with $1 million in the bank and zero customers. Agents control the company through a Python API with 34 tools and a database of 19 tables; they write their own code, query the database with SQL, and build custom workflows. The task is hard because feedback is delayed, key variables are hidden, and conditions change over time, including shifting customer preferences and simulated business cycles. The researchers deliberately used fixed, transparent rules instead of a language model as referee to avoid issues seen in other benchmarks.
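To make the setup more concrete, here is a toy Python sketch of such a simulation loop. It is not the CEO-Bench API: the class `ToyStartupSim`, the single `ledger` table, the 30-day lag, the churn rate, and all prices are invented for illustration. It only shows the general idea of fixed rules, delayed feedback, a hidden variable, SQL queries by the agent, and a score based on final cash.

```python
import sqlite3
from collections import deque

START_CASH = 1_000_000  # starting capital, as in the article
DAYS = 500              # simulated days, as in the article


class ToyStartupSim:
    """Hypothetical stand-in for the benchmark; every rule here is made up."""

    def __init__(self, lag_days=30):
        self.cash = START_CASH
        self.day = 0
        self.customers = 0
        self._churn = 0.002                     # hidden variable: never exposed to the agent
        self._pipeline = deque([0] * lag_days)  # delayed feedback: leads convert after a lag
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE ledger (day INTEGER, cash REAL, customers INTEGER)"
        )

    def step(self, marketing_spend):
        """Advance one day using fixed, deterministic rules (no LLM referee)."""
        spend = max(0.0, min(float(marketing_spend), self.cash))
        # Fixed rule: every $500 of spend yields one customer, but only after the lag.
        self._pipeline.append(int(spend // 500))
        self.customers += self._pipeline.popleft()
        self.customers -= int(self.customers * self._churn)
        revenue = self.customers * 3.0          # $3 per customer per day
        fixed_costs = 1500.0                    # daily overhead
        self.cash += revenue - spend - fixed_costs
        self.day += 1
        self.db.execute(
            "INSERT INTO ledger VALUES (?, ?, ?)",
            (self.day, self.cash, self.customers),
        )


def idle_policy(sim):
    """Baseline agent: never spends anything."""
    return 0.0


def cautious_policy(sim):
    """Agent that queries the ledger with SQL before deciding what to spend."""
    row = sim.db.execute(
        "SELECT AVG(customers) FROM ledger WHERE day > ?", (sim.day - 7,)
    ).fetchone()
    recent = row[0] or 0  # AVG is NULL on day 0, when the ledger is empty
    return 2000.0 if recent < 2000 and sim.cash > 200_000 else 0.0


def run(policy):
    """Run until day 500 or until the cash runs out; the score is final cash."""
    sim = ToyStartupSim()
    while sim.day < DAYS and sim.cash > 0:
        sim.step(policy(sim))
    return sim


if __name__ == "__main__":
    for policy in (idle_policy, cautious_policy):
        sim = run(policy)
        print(policy.__name__, sim.day, round(sim.cash, 2))
```

Because the rules are fixed, the same policy always produces the same final cash, which is the property a rule-based referee is meant to provide. In this sketch the agent can read the ledger but never sees `_churn`, and money spent today only shows up as customers 30 days later, so a policy has to act before it can observe the effect of its earlier decisions.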

Source: thedecoder