A new survey paper by Tencent's Youtu Lab and several Chinese universities argues that AI systems will not become reliable coworkers until they shift from generating answers to completing tasks. The researchers say this requires reusable 'skills' and persistent work environments that support real-world task execution. They frame the change as a transition from chatbots to digital colleagues, where the measure of success is whether a model can turn intent into finished work.

The study traces the evolution of large language models (LLMs) through five stages, from basic chatbots to autonomous digital colleagues. In the chatbot era, models generated text quickly by following the most likely continuation, without checking intermediate steps or searching for alternative solutions. Thinking LLMs, in contrast, invest extra compute at inference time: they explore solution paths, verify intermediate steps, and correct errors before producing a final answer. The paper frames this shift as a move from fast, intuitive 'System 1' thinking to slow, deliberate 'System 2' reasoning, borrowing psychologist Daniel Kahneman's framework.
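The contrast between the two modes can be illustrated with a toy sketch. This is not the paper's method or any real model API: the function names, the ranked list of guesses, and the verifier are all hypothetical stand-ins. The 'fast' path returns the top-ranked guess unchecked, while the 'deliberate' path spends extra work verifying candidates before answering.

```python
from typing import Callable, Iterable, Optional


def answer_fast(candidates: Iterable[int]) -> int:
    """'System 1' style: return the top-ranked candidate without checking it."""
    return next(iter(candidates))


def answer_deliberate(
    candidates: Iterable[int], verify: Callable[[int], bool]
) -> Optional[int]:
    """'System 2' style: check candidates in order and return the first that verifies."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None  # no candidate passed verification


# Toy task (an assumption for illustration): find an integer root of x^2 - 5x + 6 = 0.
ranked_guesses = [1, 2, 3]  # hypothetical model proposals, most likely first


def is_root(x: int) -> bool:
    return x * x - 5 * x + 6 == 0


fast = answer_fast(ranked_guesses)                     # unchecked top guess
deliberate = answer_deliberate(ranked_guesses, is_root)  # first verified guess
print(fast, is_root(fast))
print(deliberate, is_root(deliberate))
```

In this toy case the unchecked guess is wrong and the verified one is correct, at the cost of additional checks. Real thinking models do this inside the generation process rather than through an external loop, so the sketch only conveys the trade-off between speed and verification.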

The researchers highlight four structural bottlenecks in first-generation agents: limited perception of their environment, no lasting state between tool calls, unexpected behavior, and low task completion rates. In the OpenClaw era, by contrast, models operate in persistent, secure workspaces with files, terminals, and reusable skills, and they run verification loops until completion can be verified. The paper cites OpenHands and SWE-agent as examples of agents embedded in controlled development environments.
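A minimal sketch of that pattern, under stated assumptions: a temporary directory stands in for the persistent workspace, a small dictionary stands in for a registry of reusable skills, and a file check stands in for verification. None of these names come from the paper, OpenHands, or SWE-agent; they are hypothetical and exist only to show the loop of acting, verifying, and retrying.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Hypothetical skill signature: a skill acts on the workspace for a given attempt.
Skill = Callable[[Path, int], None]


def write_report(workspace: Path, attempt: int) -> None:
    """Hypothetical skill: the first attempt leaves a TODO, later attempts finish the file."""
    text = "summary: TODO" if attempt == 0 else "summary: done"
    (workspace / "report.txt").write_text(text)


# Reusable skills are looked up by name instead of being regenerated each time.
SKILLS: Dict[str, Skill] = {"write_report": write_report}


def verify(workspace: Path) -> bool:
    """Completion check: the report exists and contains no unfinished marker."""
    report = workspace / "report.txt"
    return report.exists() and "TODO" not in report.read_text()


def run_task(workspace: Path, skill_name: str, max_attempts: int = 3) -> int:
    """Apply a skill until verification passes; return the number of attempts used."""
    for attempt in range(max_attempts):
        SKILLS[skill_name](workspace, attempt)  # state persists in the workspace files
        if verify(workspace):
            return attempt + 1
    raise RuntimeError("task not verified within the attempt budget")


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    attempts_used = run_task(workspace, "write_report")
    final_text = (workspace / "report.txt").read_text()
    print(attempts_used, final_text)
```

The point of the design is that the task ends when the check passes, not when the model stops producing text, and that work from earlier attempts remains on disk for later ones.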

Source: thedecoder