Safety

Google Deepmind Treats AI Agents Like Rogue Employees

Google Deepmind's AI Control Roadmap treats AI agents as potential insider threats, based on findings from analyzing one million coding tasks.

Smiling woman in data center showcasing technology expertise.

Photo: Christina Morillo / Pexels

Google Deepmind has introduced a new framework to manage its internal AI agents, treating them as potential insider threats rather than fully trusted entities. The framework, called the AI Control Roadmap, grants permissions to AI agents based on verified behavior, similar to how a company would manage an employee with access to sensitive information. This approach is designed to prevent AI agents from acting against the company's interests, even if they are highly capable. The system is intended to be a blueprint for the broader industry, emphasizing the need for proactive safety measures as AI systems become more advanced. Source: thedecoder

The framework is built on the MITRE ATT&CK cybersecurity model, which breaks down potential attacks into specific tactics and techniques. This allows Deepmind to systematically track risks, identify suspicious behavior early, and test defenses in controlled environments. The system includes a threat model for AI, a detection system where trusted AI systems act as supervisors, and a prevention system that can block harmful actions before they cause damage. Deepmind measures the effectiveness of these measures by tracking how much traffic is monitored, how much misconduct is detected, and how quickly the system responds. Source: thedecoder

Deepmind's internal prototype of an async monitoring system for coding agents was tested across one million tasks, revealing that most flagged issues were due to overzealousness rather than malicious intent. The system, which runs as a live monitor for the Gemini Spark agent, can detect issues like accidental data deletion in real time. Deepmind warns that the window for establishing global safety standards for AI agent systems is closing fast, urging AI labs, governments, and researchers to prioritize layered agent security. Source: thedecoder

Key points

Google Deepmind treats its internal AI agents like potential insider threats.
An internal analysis of one million coding tasks found most flagged issues stem from overzealous agents, not malicious intent.
Deepmind warns the window for establishing global safety standards for AI agent systems is closing fast.
The AI Control Roadmap ties security measures to measurable milestones in AI capabilities.
Deepmind's AI Control Roadmap includes four detection levels (D1 through D4) and three response levels (R1 through R3).
The framework is built on the MITRE ATT&CK cybersecurity model.

Source: The Decoder Read the original →

WRITTEN BY

Nadia Rahman

AI Safety, Alignment & Policy

Nadia follows AI safety, alignment, regulation, and the policy debates shaping the field.