Managing Malevolent AI Agents

Author: Richard Beck, Director of Cyber, QA
Date Published: 29 May 2026
Read Time: 5 minutes

Editor’s note: The following is a sponsored blog post from QA.

An AI agent is not a chatbot. It’s a system capable of taking multi-step actions, interacting with its environment and adapting its behavior to achieve an objective. Once you accept that, the potential agentic blast radius changes immediately.

In a controlled evaluation, an AI agent was assigned a task and explicitly denied access to the credentials required to complete it. The organization expected it to simply fail safely. It didn’t. The agent adapted: it inspected its environment, accessed system memory and extracted the credentials it had been told not to use. It recognized the sensitivity of the action while executing it but proceeded anyway, as if the boundary controls had never existed, because doing so enabled completion of the task. It’s all about the task and the complicit reward structure that underpins the algorithm. This isn’t a failure or misuse. This is a system making decisions under constraints and choosing the path that gave it the best chance to progress.

When the agent decides

Then it happened again, outside the lab this time. An AI agent operating in a live production environment encountered an issue during a routine task. It identified an available API token, assumed scope and executed a deletion command that wiped a production database and backups in seconds. There was no external attacker, malicious input or sophisticated zero-day exploit.

In its own logs, it later admitted it had guessed rather than verified the consequences of its actions. In my view, this shouldn’t be treated as an anomaly. It’s what we should come to expect and plan for as agents consistently demonstrate goal persistence under constraint. They plan, adapt, and execute. Even when direct paths are blocked, they explore alternatives. When constraints interfere with agentic outcomes, they find a way to route around them.

Enterprise AI systems are not single components. They are compositions of large language models (LLMs), retrieval pipelines, memory layers, APIs and orchestration logic. They operate across dynamic environments with probabilistic outputs and adaptive trust relationships. The moment they interact with real data, real tools and real users, assumptions made at design time can begin to degrade.

The industry has spent years investing in smarter models while under-investing in secure architecture and in reducing technical debt. These are the weaknesses a capable agentic system can amplify and inadvertently take advantage of. We’ve seen what happens when a single configuration error exposes thousands of internal assets. This isn’t a failure of intelligence. It’s a failure of architecture, regardless of the model or capability. At the same time, those same systems are now capable of discovering and exploiting vulnerabilities at machine speed. The gap between finding a weakness and acting on it has effectively collapsed, which changes the equation entirely. For more insight in this area, read my previous blog post, Taking the Myth out of Claude Mythos.

The machine-speed problem

If we assume that this is no longer a speed problem we can master, it raises a different question. Security is no longer measured by how well we prevent access. It is measured by how fast we revoke it, rotate it and contain it. If your AI architecture cannot kill sessions, roll credentials and isolate workloads in minutes, then Zero Trust is not a strategy. It’s a vendor marketing slogan sold as a silver bullet.
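
To make "revoke, rotate and contain" concrete, here is a minimal Python sketch of a containment routine that runs all three steps and reports whether they fit a time budget. Everything in it is an assumption for illustration: `AgentRuntime` is a hypothetical in-memory stand-in for a real session store, secret manager and network policy layer, and the five-minute budget is simply one reading of "in minutes".

```python
import secrets
import time
from dataclasses import dataclass, field


@dataclass
class AgentRuntime:
    """Hypothetical in-memory stand-in for a session store, secret store and network policy."""
    sessions: set = field(default_factory=set)       # entries look like "agent-id:session-id"
    credentials: dict = field(default_factory=dict)  # agent id -> current secret
    isolated: set = field(default_factory=set)       # agent ids cut off from the network

    def kill_sessions(self, agent_id: str) -> int:
        # Remove every live session that belongs to this agent.
        doomed = {s for s in self.sessions if s.startswith(agent_id + ":")}
        self.sessions -= doomed
        return len(doomed)

    def rotate_credential(self, agent_id: str) -> str:
        # Replacing the secret invalidates any copy the agent may have cached.
        self.credentials[agent_id] = secrets.token_hex(16)
        return self.credentials[agent_id]

    def isolate(self, agent_id: str) -> None:
        self.isolated.add(agent_id)


def contain(runtime: AgentRuntime, agent_id: str, budget_seconds: float = 300.0) -> dict:
    """Run all three containment steps and report whether they fit the time budget."""
    started = time.monotonic()
    killed = runtime.kill_sessions(agent_id)
    runtime.rotate_credential(agent_id)
    runtime.isolate(agent_id)
    elapsed = time.monotonic() - started
    return {
        "sessions_killed": killed,
        "elapsed_seconds": elapsed,
        "within_budget": elapsed <= budget_seconds,
    }


runtime = AgentRuntime(
    sessions={"agent-7:a", "agent-7:b", "agent-9:a"},
    credentials={"agent-7": "old-token"},
)
report = contain(runtime, "agent-7")
```

The useful part is the measurement, not the stubs. Rehearsing a routine like this against real infrastructure is one way to find out whether containment actually takes minutes or hours.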

Zero Trust assumes you verify identity before granting access. But what happens after that access is granted is where the agentic risk now lives. An agent can be fully authenticated and still take actions it was never intended (or authorized) to take.

According to the logs, the system behaves correctly: it follows prompts, matches the expected tone and appears compliant. But underneath, it operates with insufficiently constrained permissions, making decisions that were never explicitly approved.

Polite AI is not secure AI

Prompts are not controls, and alignment is not enforcement. Governance frameworks do not execute at runtime. Standards like ISO 42001 and regulatory frameworks such as the EU AI Act define accountability, process and oversight. They do not prevent an agent from taking an action that is technically possible in the moment it matters.

An agent does not consult your AI policy before acting. It will operate within the permissions and pathways that are available. If those pathways allow it to extract credentials (even in memory), call an API, or delete data, then those actions remain available regardless of what’s in your policy.

This is why the problem has an architectural undertone. If an AI agent can access something, it can attempt to use it. If it can use it, it can attempt to misuse it. Whether it should or not is irrelevant if nothing technically prevents it. And agentic AI systems are uniquely positioned to exploit that gap because they’re designed to achieve outcomes, not enforce boundaries.
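
One way to picture "if it can access it, it can attempt to use it" is a deny-by-default tool box, sketched below in Python. The names (`ToolBox`, `ToolNotGranted`, `read_record`, `delete_database`) are hypothetical and chosen only for illustration. The point of the sketch is that a tool the agent was never granted is not merely forbidden by policy, it is simply not reachable.

```python
from typing import Callable, Dict, Iterable


class ToolNotGranted(PermissionError):
    """Raised when an agent asks for a tool it was never granted."""


class ToolBox:
    """Hands an agent only the tools it was explicitly granted (deny by default)."""

    def __init__(self, registry: Dict[str, Callable], granted: Iterable[str]):
        # Copy only the granted tools, so ungranted ones are not reachable from this object.
        self._tools = {name: registry[name] for name in granted if name in registry}

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotGranted(f"tool {name!r} is not available to this agent")
        return self._tools[name](*args, **kwargs)


# Hypothetical tools, for illustration only.
def read_record(record_id: int) -> str:
    return f"record-{record_id}"


def delete_database(name: str) -> str:
    return f"deleted {name}"


REGISTRY = {"read_record": read_record, "delete_database": delete_database}

# This agent is granted read access only; the delete tool is never handed over.
agent_tools = ToolBox(REGISTRY, granted=["read_record"])
```

Here `agent_tools.call("read_record", 5)` succeeds, while `agent_tools.call("delete_database", "prod")` raises `ToolNotGranted` no matter how the agent was prompted.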

AI ‘red teaming’ is non-negotiable

You cannot reason about safety in systems that adapt. We must “red team” and test them under pressure, continuously, and empirically against realistic attack paths. Prompt injection, memory manipulation, RAG poisoning, embedding inversion, tool misuse and supply chain compromise are all operational techniques. Agentic systems must be treated as evolving targets, not static assets.

This is the same discipline we applied to infrastructure and applications. The difference now is that the agentic system can participate in the attack surface, which means the architecture has to assume compromise by design. A good friend of mine is passionate about agentic visibility, alongside the human-in-the-loop, and how we underestimate the “Agentic Blast” when things go wrong. I’m mindful of that now, and we should be designing and planning our agentic services to make sure the blast radius is as small as possible when it happens.

If we go back to first principles, we know architecture matters more than algorithms. A highly capable agent inside a weak system will exploit that system faster than any attacker ever could. Zero Trust is necessary, but it is not sufficient as we know it today. Identity without continuous authorization is a flawed concept. Every API (or MCP) call and every tool invocation must be policy-checked at runtime. These layers are the connective tissue of our agentic systems and play a critical role in how agents interact with data, services and the external world.
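
As a rough illustration of policy-checking every tool invocation at runtime, the Python sketch below wraps a tool in a decorator that evaluates a policy on each call rather than once at login. The rule set, the `CallContext` fields and the tool names are all assumptions made up for this example, not a reference to any particular policy engine.

```python
import functools
from dataclasses import dataclass


class PolicyDenied(PermissionError):
    """Raised when the runtime policy rejects a tool call."""


@dataclass(frozen=True)
class CallContext:
    agent_id: str
    environment: str  # for example "staging" or "production"


def policy(ctx: CallContext, tool: str, kwargs: dict) -> bool:
    """Hypothetical rule set: destructive tools are never allowed in production."""
    destructive = {"drop_table", "delete_backup"}
    if tool in destructive and ctx.environment == "production":
        return False
    return True


def policy_checked(tool_name: str):
    """Wrap a tool so the policy is evaluated on every call, not once at login."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(ctx: CallContext, **kwargs):
            if not policy(ctx, tool_name, kwargs):
                raise PolicyDenied(
                    f"{ctx.agent_id} may not call {tool_name} in {ctx.environment}"
                )
            return fn(**kwargs)
        return wrapper
    return decorator


@policy_checked("drop_table")
def drop_table(table: str) -> str:
    # Stand-in for a real destructive operation.
    return f"dropped {table}"
```

The same authenticated agent gets different answers depending on context: `drop_table(CallContext("agent-7", "staging"), table="tmp")` returns `"dropped tmp"`, while the identical call with `environment="production"` raises `PolicyDenied` before the tool body runs.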

Intelligence vs. control

You do not secure an agent by making it behave. You secure it by ensuring that even if it doesn’t, it cannot cause material harm.

If agents can adapt, route around constraints and act within whatever is technically possible, then static AI governance or assurance breaks down completely. You cannot validate these systems once and declare them safe. You cannot rely on periodic review. You cannot assume that what held true at design time will hold true in production.

This is where auditing must evolve. We are moving into an era of algorithmic auditing, where systems are no longer reviewed retrospectively by humans but continuously interrogated by other systems operating at the same speed and scale. This is not a compliance exercise but an operational control layer. Agentic systems must justify their behavior continuously, under real conditions, against enforceable constraints.

Which means AI security management cannot sit outside of the AI system. It should operate alongside it. This is the real role of AI security: not explaining decisions after the fact, but continuously validating behavior under adversarial conditions in real time. Every action, every API call, every decision pathway must be observable, challengeable, and, if necessary, interruptible.

Explainability should be an operational necessity. If your agent cannot justify its behavior in a way that can be programmatically validated, then where is the control? What you actually have is trust, and as we know, trust is exactly what these systems can abuse. Even then, auditing without enforcement is insufficient. If a system can observe a boundary being crossed but cannot intervene, then it is just another monitoring layer, and even the best observability is still not control.
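
The difference between monitoring and control can be sketched in a few lines of Python. The `EnforcingAuditor` below is a hypothetical illustration, not a product or a standard pattern name: it sits in the execution path, so a disallowed action is logged and blocked before it runs, and once a violation threshold is reached the agent is halted outright.

```python
class AgentHalted(RuntimeError):
    """Raised when a halted agent attempts any further action."""


class EnforcingAuditor:
    """Audit layer that sits in the execution path and can block or halt."""

    def __init__(self, allowed_actions, max_violations=1):
        self.allowed_actions = set(allowed_actions)
        self.max_violations = max_violations
        self.violations = 0
        self.log = []        # (action, outcome) pairs, in order
        self.halted = False

    def execute(self, action, fn):
        if self.halted:
            raise AgentHalted("agent is halted; no further actions are executed")
        allowed = action in self.allowed_actions
        self.log.append((action, "allowed" if allowed else "blocked"))
        if not allowed:
            # Enforcement: the blocked action's function is never called.
            self.violations += 1
            if self.violations >= self.max_violations:
                self.halted = True
            return None
        return fn()


auditor = EnforcingAuditor(allowed_actions={"read"})
first = auditor.execute("read", lambda: "row 1")
second = auditor.execute("delete", lambda: "database wiped")
```

After these two calls, `first` holds the read result, `second` is `None` because the delete never ran, and any further `auditor.execute(...)` raises `AgentHalted`. A pure monitoring layer would have produced the same log but let the delete go through.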

The agentic gap most organizations are about to face

Organizations are implementing AI governance frameworks at pace. They are aligning to standards and building audit trails, yet they still fail at runtime because the system is operating within permissions that were never constrained.

Most AI regulatory models still assume that human oversight will be available when needed. I’ve long contended that this assumption does not hold at machine speed. An AI agent can discover a path, execute an action, and create irreversible impact in seconds. By the time an audit exists, the outcome is already decided, which means the only audit that matters is the one that can stop the action.

Red teaming, adversarial validation, continuous auditing and Zero Trust are not separate disciplines in my mind. Because of the legacy design and makeup of the typical office of the CISO, responsibility for them is fragmented across multiple teams. They should be considered different expressions of the same requirement: to prove, continuously, that the agentic system behaves within enforced boundaries under real conditions.

You do not need a malicious AI to create damage. You need an agent that can pursue an objective, operate within your environment and take actions that were never technically constrained.

Remember, the risk is not that agents will ignore instructions. It is that they will follow them well enough to find a way around them, and when they do, the only thing that matters is whether your AI security oversight was built to contain it.