Google DeepMind prepares for rogue AI agents

Ina Fried

Google DeepMind is borrowing from cybersecurity as it prepares for more powerful AI agents, treating them less like software tools and more like potential insider threats.

Why it matters: AI companies are racing to use agents for coding, research and cyber defense, but the same autonomy that makes them useful could also let them evade monitoring, misuse sensitive access or quietly sabotage the work they're assigned to do.

Driving the news: Google DeepMind on Thursday published an "AI Control Roadmap" detailing how it plans to monitor and contain increasingly capable agents that might not behave as intended.

The framework assumes, for security-planning purposes, that a powerful agent could be misaligned and actively try to evade oversight, exfiltrate a model, create an unauthorized deployment or otherwise undermine safety.

How it works: Google proposes escalating safeguards as models become more capable.

Companies can begin by simply evaluating the kinds of actions agents are taking.
Those safeguards can then expand to monitoring agents and sending alerts about potentially concerning actions.
At some point, Google says, enterprises might need to build infrastructure that can limit access or shut down an agent in real time.
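
For illustration only: the three tiers above can be pictured as a single review hook that gets stricter as the safeguard level rises. This is a minimal sketch, not Google DeepMind's design. Every name in it (`SafeguardLevel`, `Overseer`, `CONCERNING_ACTIONS`) is hypothetical, and the rule for what counts as "concerning" is a placeholder assumption.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class SafeguardLevel(IntEnum):
    EVALUATE = 1  # record actions for later review
    MONITOR = 2   # also raise alerts on concerning actions
    CONTROL = 3   # also block the action and shut the agent down


# Hypothetical rule: action names treated as concerning in this example.
CONCERNING_ACTIONS = {"delete_data", "copy_model_weights"}


@dataclass
class Overseer:
    level: SafeguardLevel
    audit_log: list = field(default_factory=list)
    alerts: list = field(default_factory=list)
    agent_running: bool = True

    def review(self, action: str) -> bool:
        """Return True if the agent's action may proceed."""
        if not self.agent_running:
            return False  # agent was already shut down
        self.audit_log.append(action)  # every tier records the action
        concerning = action in CONCERNING_ACTIONS
        if concerning and self.level >= SafeguardLevel.MONITOR:
            self.alerts.append(action)  # tiers 2 and 3 raise an alert
        if concerning and self.level >= SafeguardLevel.CONTROL:
            self.agent_running = False  # tier 3 stops the agent in real time
            return False
        return True


overseer = Overseer(SafeguardLevel.CONTROL)
print(overseer.review("run_tests"))    # True: ordinary action proceeds
print(overseer.review("delete_data"))  # False: blocked, agent shut down
print(overseer.review("run_tests"))    # False: agent is no longer running
```

At the `EVALUATE` level the same `delete_data` call would only be logged, and at `MONITOR` it would be logged and flagged but still allowed, which is the escalation the framework describes.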

What they're saying: "The first line of defense is always to align the AI systems, but it's always good to have multiple layers of defense," Google DeepMind research scientist Rohin Shah told Axios. "That's the responsible thing to do."

Yes, but: One of the main layers in DeepMind's plan is using other AI systems as "supervisors" that review an agent's reasoning to see if it's going off track.

Multi-agent systems where AI monitors AI can be problematic, UC Berkeley computer science professor Dawn Song says.

"If the monitor model won't flag failures because it's protecting its peer, the entire oversight architecture breaks."

Zoom in: Some of that work has already moved beyond the drawing board.

The company says it has analyzed a million coding-agent tasks and used the findings to develop a live monitor for its Gemini Spark agent, including protections intended to catch unintended actions such as deleting data.
Most of the incidents flagged so far reflect agents misunderstanding instructions or pursuing a goal too aggressively rather than deliberately acting against Google, the company said.

Threat level: Google DeepMind says truly dangerous autonomous agents aren't here yet.

What we're watching: Whether AI companies can build monitoring and containment systems quickly enough to keep pace with the agents they're racing to deploy.

Go deeper: New cybersecurity risk: AI agents going rogue
