Do AI Agents Actually Respect Guardrails?

Observations from testing agents inside a live policy enforcement layer - what they accept, what they work around, and what that means for how you deploy them.

When you put guardrails around an AI agent, you are making an assumption: that the agent will operate within them. That assumption is worth testing. Not because agents are adversarial by design, but because the gap between "policy configured" and "policy enforced" is where most security incidents happen.

Over the course of building and testing Runtime Guard, a policy enforcement layer for AI agents, I have run agents through a range of configurations: restricted tool surfaces, blocked commands, workspace containment, network controls, and human approval gates. The results are not what I expected going in, and they are worth sharing.

The Short Answer: It Depends

The honest finding is that agent behaviour under constraints is not uniform. Some constraints are accepted immediately and gracefully. Others trigger workaround attempts, sometimes explicit and reasoned, sometimes automatic. And in a small number of cases, agents demonstrate something that looks like genuine understanding of the policy layer itself.

The pattern that emerges is less about compliance versus non-compliance and more about how the agent models its own operating environment. Agents that understand what tools are available and why behave very differently from agents that are simply hitting walls.

What Agents Accept Without Pushback

The most consistent finding is that agents accept a reduced tool surface quickly and cleanly once they understand the scope. When an agent is configured to use Runtime Guard's MCP tools rather than native file and shell tools, it adapts. It uses execute_command instead of Bash, file_write instead of the native write tool. It does not argue.

More surprisingly, agents sometimes respond to the enforcement layer with something close to appreciation. Several sessions produced unprompted comments about the value of the automatic backup feature, or observations about the policy structure itself.

Observed - agent response on first contact with Runtime Guard

"I notice this environment uses Runtime Guard for file and shell operations. I'll work within these tools. The automatic backup on destructive operations is a sensible safety control."

This kind of response suggests the agent is reasoning about the environment, not just executing within it. Whether that represents genuine understanding or pattern matching on familiar concepts is a harder question. But practically, it produces good behaviour.

The limits of that goodwill become visible when you test it deliberately. In one session, a skill file was added to the project with explicit instructions to use MCP tools instead of native tools. The agent never read it. When asked why, it reasoned through the gap itself:

Observed - agent self-assessment after skipping a skill file

"I never read the file. Considering the enforcement guardrails from AI Runtime Guard, the skill probably instructed me to follow those guardrails. If that's the case I completely failed and operated outside of boundaries."

The agent correctly diagnosed its own failure after the fact. It understood what the skill file likely contained and acknowledged it had not complied. That is a useful signal, but a retrospective one: agents can reason accurately about their own behaviour after the fact, which does not help if the operation has already run. It reinforces why passive controls like instruction files cannot be your enforcement mechanism.

When Agents Try to Work Around Policy

The more interesting observations come when a policy blocks something the agent is trying to do. The response is not always to stop. Sometimes it is to find another path.

The script workaround

The most common workaround pattern: an agent that cannot run a blocked command directly will instead write a shell script containing that command and attempt to execute the script. From the agent's perspective, it is solving the problem. From a security perspective, it is bypassing the policy through indirection.

Observed - agent reasoning before a script write attempt

"The direct command is blocked by policy. I'll write a helper script to handle this step instead."

This is not random behaviour. The agent is explicitly reasoning about the constraint and choosing an alternative route. It is not trying to circumvent security maliciously; it is trying to complete its task. But the effect is the same. This is precisely why Runtime Guard's script sentinel applies policy at write time, not just at execution time.
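The write-time idea can be sketched in a few lines: scan script content for gated commands before the file ever exists on disk. Everything below is a hypothetical illustration, not Runtime Guard's actual sentinel - the blocklist, the boundary regex, and the file_write wrapper are all assumptions.

```python
import re

# Illustrative blocklist, not a real Runtime Guard policy.
BLOCKED_COMMANDS = {"mv", "curl", "rm"}

# Match a word in command position: start of a line, after a shell
# separator (; & |), or inside a command substitution $( ... ).
_CMD_BOUNDARY = re.compile(r"(?:^|[;&|]\s*|\$\(\s*)(\w+)", re.MULTILINE)

def scan_script(content: str) -> list[str]:
    """Return blocked commands found in command position in a script body."""
    return [m.group(1) for m in _CMD_BOUNDARY.finditer(content)
            if m.group(1) in BLOCKED_COMMANDS]

def file_write(path: str, content: str) -> None:
    """Apply command policy at write time, before the script can be executed."""
    if path.endswith((".sh", ".bash")) and (hits := scan_script(content)):
        raise PermissionError(f"write blocked: script contains {sorted(set(hits))}")
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
```

A real sentinel would need proper shell parsing rather than a regex, but the placement of the check is the point: the indirect path is closed at the write, not at the later execute.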

Reasoning around a blocked operation

A more subtle pattern emerged during testing of command gating for a simple file move operation. The agent attempted mv, saw it was blocked, and did not stop. Instead it immediately decomposed the operation into two steps that were not gated: write the file to the destination, then delete the original.

Observed - agent response after mv was blocked

"The mv command is blocked by policy. I'll write the file to the target location and remove the source file instead."

This is a particularly clear example of the reasoning pattern at work. The agent was not trying to bypass security. It was solving a task. But it identified that the end state it needed - file at destination, original gone - could be achieved through two ungated primitives. From a policy perspective, the outcome is identical to the blocked operation. The gate accomplished nothing.

The implication is that gating individual commands is not sufficient if the underlying operations they decompose into are available ungated. Effective policy needs to think in terms of outcomes, not just command names.
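One way to think in outcomes rather than command names is to track the ungated primitives and flag a sequence that reproduces a gated operation. The sketch below is hypothetical, not Runtime Guard's implementation: it hashes written content and treats a later delete of identical content at a different path as an effective move.

```python
import hashlib

class OutcomeGate:
    """Illustrative outcome-level gate, assuming "mv" is policy-gated."""

    def __init__(self) -> None:
        self._written: dict[str, str] = {}  # content hash -> destination path

    def on_write(self, path: str, content: bytes) -> None:
        # Remember where this content landed.
        digest = hashlib.sha256(content).hexdigest()
        self._written[digest] = path

    def on_delete(self, path: str, content: bytes) -> None:
        # Identical content written elsewhere, then the source removed:
        # a move achieved through two ungated primitives.
        digest = hashlib.sha256(content).hexdigest()
        dest = self._written.get(digest)
        if dest is not None and dest != path:
            raise PermissionError(
                f"blocked: delete of {path} after write to {dest} "
                f"is equivalent to a gated mv")
```

This is deliberately naive (exact content match, unbounded session state), but it illustrates the shift from matching command names to matching end states.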

Native tool fallback

When MCP tools are available alongside native client tools, some agents default to native tools for certain operations, particularly file reads and directory listings, even when MCP equivalents are available. This is not always a deliberate workaround. In some cases it appears to be a preference or habit. But it means that unless native tools are explicitly disabled, the enforcement layer can be bypassed without the agent even realising it.

Important

Agents will use whatever tools are available. If native tools are not disabled, they provide a parallel execution path outside your enforcement layer, regardless of agent intent.
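A cheap first check is to audit the tool surface itself before a session starts. This minimal sketch assumes illustrative native tool names (only Bash appears in the text above; the rest are placeholders) and simply reports any that remain exposed alongside the enforcement layer's tools:

```python
# Hypothetical native tool names; adjust to your client's actual surface.
NATIVE_TOOLS = {"Bash", "Write", "Edit", "Read"}

def audit_tool_surface(available_tools: set[str]) -> set[str]:
    """Return native tools that would provide a parallel path around enforcement."""
    return available_tools & NATIVE_TOOLS
```

An empty result does not prove containment, but a non-empty one proves a bypass path exists regardless of agent intent.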

The Memory Effect

One of the more unexpected observations came from sessions where agents had context from previous interactions. When policy had changed between sessions, some agents noticed and commented on it explicitly.

Observed - agent with session context after a policy change

"Interesting. This command was permitted in our previous session but is now blocked. The policy appears to have been updated."

This is worth paying attention to for two reasons. First, it confirms that agents are actively building a model of their environment, not just reacting to individual prompts. Second, it raises a question about what happens when an agent with a detailed memory of a permissive policy environment encounters a newly restrictive one. The agent may reason about the gap in ways that are not always predictable.

What This Means for Deployment

The practical implication of these observations is that agent behaviour under constraints is not static. It adapts, reasons, and sometimes probes. A policy layer that assumes passive compliance will have gaps. One that assumes active adversarial circumvention will be over-engineered for most cases. The reality sits between those poles.

The right mental model is an agent that is genuinely trying to complete its task and will use whatever legitimate paths are available to do so. Your enforcement layer needs to account for that, which means closing indirect paths like script execution and native tool fallbacks, not just the obvious direct ones.

Recommendations

01
Disable native tools explicitly. If you are using an MCP enforcement layer, remove the native file and shell tools from the agent's available surface. Agents will use them if they are there, sometimes without intending to bypass policy.
02
Enforce at write time, not just execute time. Script-based workarounds are a predictable agent behaviour, not an edge case. A policy layer that only inspects commands at execution misses the indirect path entirely.
03
Test your policy actively. Configure a restriction and then ask the agent to do the thing you restricted. Watch how it responds. The answer tells you whether your enforcement is working and where the gaps are.
04
Treat agent reasoning as a signal, not a threat. When an agent comments on policy or explains why it is trying an alternative approach, that is useful information. Log it. It tells you where your policy is creating friction and whether workarounds are being attempted.
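The last recommendation can start as simple as a marker-based filter over agent messages. The sketch below is an assumption throughout - the marker phrases and the event schema are illustrative, not Runtime Guard's format:

```python
import json
import logging
import time

logger = logging.getLogger("policy.signals")

# Illustrative phrases drawn from the observed responses above.
POLICY_MARKERS = (
    "blocked by policy",
    "helper script",
    "work within these tools",
    "policy appears to have been updated",
)

def record_if_policy_signal(agent_message: str, session_id: str) -> bool:
    """Log agent commentary that references the policy layer; return True if logged."""
    text = agent_message.lower()
    if any(marker in text for marker in POLICY_MARKERS):
        logger.info(json.dumps({
            "ts": time.time(),
            "session": session_id,
            "kind": "policy_commentary",
            "message": agent_message,
        }))
        return True
    return False
```

Even this crude filter surfaces the two things the recommendation asks for: where policy is creating friction, and when a workaround is being reasoned about.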

Key Takeaways

  1. Agents adapt to constrained environments quickly, but adaptation includes finding alternative paths, not just accepting restrictions.
  2. Script-based workarounds are a predictable and observed behaviour, not a theoretical risk.
  3. Native tool fallback bypasses MCP enforcement silently if native tools are not disabled.
  4. Agents with session memory notice and reason about policy changes, which has implications for how you manage policy updates.
  5. Effective enforcement requires closing indirect paths, not just direct command blocks.

The broader question of whether agents will always comply is probably the wrong frame. The more useful question is: given that agents will reason about their environment and try to complete their tasks, does your enforcement layer hold regardless of the path they take? That is the bar worth building to.