Google DeepMind has introduced a framework for managing its internal AI agents that treats them as potential insider threats rather than fully trusted entities. The framework, called the AI Control Roadmap, grants agents permissions based on verified behavior, much as a company would manage an employee with access to sensitive information. The approach is designed to keep AI agents from acting against the company's interests, even when they are highly capable. It is intended as a blueprint for the broader industry and stresses the need for proactive safety measures as AI systems become more advanced. Source: The Decoder

The framework builds on the MITRE ATT&CK cybersecurity model, which breaks potential attacks down into specific tactics and techniques. This lets DeepMind track risks systematically, spot suspicious behavior early, and test defenses in controlled environments. The system has three parts: a threat model for AI, a detection system in which trusted AI systems act as supervisors, and a prevention system that can block harmful actions before they cause damage. DeepMind gauges effectiveness by tracking how much agent traffic is monitored, how much misconduct is detected, and how quickly the system responds. Source: The Decoder
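The three measurements named above (monitoring coverage, detection rate, and response time) can be illustrated with a small calculation. This is only a sketch under stated assumptions: the `AgentAction` record, its fields, and the `control_metrics` function are hypothetical names invented for illustration, and the sample data is made up. Nothing here reflects how DeepMind actually computes its figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentAction:
    """Hypothetical record of one agent action (illustrative only)."""
    monitored: bool   # did any monitor observe this action?
    harmful: bool     # ground-truth label, e.g. from a red-team exercise
    flagged: bool     # did the monitor flag the action?
    response_seconds: Optional[float] = None  # time from action to response


def control_metrics(actions):
    """Compute coverage, detection rate, and mean response time."""
    total = len(actions)
    monitored = sum(a.monitored for a in actions)
    harmful = [a for a in actions if a.harmful]
    # An action counts as caught only if it was both observed and flagged.
    caught = [a for a in harmful if a.monitored and a.flagged]
    times = [a.response_seconds for a in caught if a.response_seconds is not None]
    return {
        "coverage": monitored / total if total else 0.0,
        "detection_rate": len(caught) / len(harmful) if harmful else 0.0,
        "mean_response_seconds": sum(times) / len(times) if times else None,
    }


# Made-up sample: four actions, three monitored, three harmful, two caught.
sample = [
    AgentAction(monitored=True, harmful=False, flagged=False),
    AgentAction(monitored=True, harmful=True, flagged=True, response_seconds=2.0),
    AgentAction(monitored=False, harmful=True, flagged=False),
    AgentAction(monitored=True, harmful=True, flagged=True, response_seconds=4.0),
]
metrics = control_metrics(sample)
print(metrics)
```

In this toy data the unmonitored harmful action lowers the detection rate, which shows why coverage and detection are worth tracking separately: a monitor cannot catch what it never sees.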

DeepMind tested an internal prototype of an asynchronous monitoring system for coding agents on one million tasks and found that most flagged issues stemmed from overzealousness rather than malicious intent. The system, which runs as a live monitor for the Gemini Spark agent, can detect issues such as accidental data deletion in real time. DeepMind warns that the window for establishing global safety standards for AI agent systems is closing fast and urges AI labs, governments, and researchers to prioritize layered agent security. Source: The Decoder
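To make the idea of asynchronous monitoring concrete, here is a minimal sketch in which an agent pushes its commands onto a queue and a separate monitor task reviews them without blocking the agent. Everything in it is an assumption for illustration: the regular-expression patterns, the function names, and the sample commands are hypothetical, and a simple rule-based check stands in for the trusted AI supervisor the article describes.

```python
import asyncio
import re

# Hypothetical patterns for destructive commands (illustrative, not DeepMind's rules).
DESTRUCTIVE = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bDROP\s+TABLE\b"),
]


def is_destructive(command: str) -> bool:
    """Return True if the command matches any destructive pattern."""
    return any(p.search(command) for p in DESTRUCTIVE)


async def monitor(queue: asyncio.Queue, flagged: list) -> None:
    """Review commands as they arrive; None is the shutdown signal."""
    while True:
        command = await queue.get()
        if command is None:
            break
        if is_destructive(command):
            flagged.append(command)


async def run(commands):
    queue: asyncio.Queue = asyncio.Queue()
    flagged: list = []
    task = asyncio.create_task(monitor(queue, flagged))
    for command in commands:
        # The agent enqueues and moves on; it does not wait for the review.
        await queue.put(command)
    await queue.put(None)
    await task
    return flagged


flagged = asyncio.run(run([
    "ls -la",
    "rm -rf build/",
    "pytest -q",
    "psql -c 'DROP TABLE users'",
]))
print(flagged)
```

Because review happens off the agent's critical path, a design like this can only detect and report after the fact; blocking a harmful action before it runs would require a synchronous check in front of execution.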