
Self-Governance Is Not Governance: The Case for Out-of-Process Enforcement

Why trust boundaries, deterministic policy, and layered enforcement matter for AI agents in production.

In the space of less than three weeks, three major players shipped governance tooling for AI agents.

On March 16, NVIDIA announced NemoClaw, an open-source security stack that wraps OpenClaw agents in OpenShell, a runtime providing kernel-level sandboxing and out-of-process policy enforcement. On March 24, Anthropic launched Auto Mode for Claude Code, a two-layer classifier that reviews every tool call before execution, replacing the --dangerously-skip-permissions flag with something that actually thinks before it acts. On April 2, Microsoft released the Agent Governance Toolkit, a seven-package open-source framework covering policy enforcement, cryptographic agent identity, execution rings, SRE reliability practices, and regulatory compliance mapping.

None of this happened because the problem appeared in April 2026. It has been accumulating for months. What changed is that it became impossible to ignore: production incidents, a documented deny-rule bypass in Claude Code that could be triggered by any project repository a developer cloned, and the announcement of Project Glasswing, a frontier model Anthropic described as capable of autonomously discovering and exploiting zero-day vulnerabilities across major operating systems and browsers.

The governance conversation has arrived. The question worth asking now is not whether to enforce agent policy. Every serious player agrees on that. The question is where enforcement runs, and whether the architecture can hold up against what is actually coming.

The consensus view

Auto Mode, NemoClaw, and Microsoft's Agent OS (the policy middleware of the Agent Governance Toolkit) converge on a shared principle: intercept agent actions before execution. Each implements this differently, but the intent is identical. Anthropic's classifier reviews every tool call before it fires. Microsoft's Agent OS evaluates each action against a policy rule set at sub-millisecond latency. NVIDIA's OpenShell places the agent inside a sandbox that constrains what it can reach.

This is the right direction. Anthropic's own engineering documentation describes the motivation plainly: users running long tasks were bypassing permission checks with --dangerously-skip-permissions because constant approval prompts caused fatigue, and that flag produced dangerous and destructive outcomes. The fix was a classifier that could make sensible approval decisions autonomously, reserving escalation for actions that genuinely warranted it.

The existence of Auto Mode is a public statement that pre-execution enforcement matters. That is the correct conclusion. What matters next is whether the enforcement layer is in a position to hold.

The trust boundary problem

Here is the detail that does not appear in most coverage of these tools.

Microsoft's Agent Governance Toolkit documentation includes this note about Agent OS: "This toolkit provides application-level governance (Python middleware), not OS kernel-level isolation. The policy engine and agents run in the same process, the same trust boundary as every Python agent framework."

Microsoft published this themselves. It is not a criticism; it is an honest acknowledgment of an architectural constraint. When the policy engine runs in the same process as the agent it governs, a sufficiently compromised agent has access to the enforcement layer. The two share a trust boundary. Microsoft's own recommendation for mitigating this is to run each agent in a separate container, which means their in-process middleware requires container infrastructure to approach the isolation level the documentation implies.

Anthropic's Auto Mode is a server-side classifier. It runs outside the agent context in that specific sense, but it is an AI-based classifier with a documented 5.7% false negative rate on synthetic exfiltration attempts. Anthropic published this number, which reflects the same intellectual honesty as Microsoft's trust boundary disclosure. The classifier is not infallible. It is a meaningful improvement over no controls, but it is not a deterministic enforcement boundary.

I take a different approach with AI Runtime Guard (AIRG). It runs as a separate STDIO process. The agent cannot reach the enforcement layer without going around MCP entirely. Every command routed through AIRG tools passes through the policy engine before execution, with deterministic allow, require-confirmation, or block outcomes. There is no shared process space. There is no AI classifier making probabilistic judgments. Policy is evaluated against explicit rules, and the result is not a recommendation.

AIRG's model has its own boundaries worth stating plainly. Enforcement applies only to actions routed through AIRG tools: native client tools that bypass MCP are outside its scope unless the client is explicitly hardened to block them. Command analysis relies on static text parsing rather than a full shell interpreter, which means dynamically constructed shell behaviour can produce false negatives. These are documented limits, not hidden ones. They are also why the hardening controls and Script Sentinel sections of this article exist: no single layer covers everything, and AIRG is not an exception to that.
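To make the boundary concrete, here is a minimal stdlib sketch of the out-of-process pattern, not AIRG's actual implementation: a standalone policy process that reads tool requests as JSON lines on stdin and answers with a deterministic verdict on stdout. The wire format and the rule set are invented for illustration.

import json
import sys

# Illustrative rules only, not AIRG's shipped policy. Deterministic:
# no model in the loop, the same input always yields the same verdict.
BLOCKED = ("rm -rf", "mkfs")
CONFIRM = ("git push", "pip install")

def verdict(command: str) -> str:
    if any(p in command for p in BLOCKED):
        return "block"
    if any(p in command for p in CONFIRM):
        return "confirm"
    return "allow"

# The agent lives on the other side of this pipe. It can send requests
# and read verdicts; it cannot touch the rules or the loop itself.
for line in sys.stdin:
    req = json.loads(line)
    print(json.dumps({"id": req["id"], "verdict": verdict(req["command"])}), flush=True)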

The placement of the enforcement layer matters because the threat model for AI agents is not static. The Claude Code deny-rule bypass documented in April 2026 was discovered in the client's own policy evaluation code, not in the agent model. A more sophisticated attacker generates payloads designed to find the edge cases in whatever enforcement layer the client ships. An enforcement layer that runs in the same process as the agent is a softer target than one that runs outside it.

What the architecture requires: six principles from practice

The following are not design preferences. They come from months of observing agents run against AIRG's policy engine. Each one connects back to the same structural point: an agent operating inside its own enforcement boundary will find ways around it, not through malice but through capability.

1. Pre-execution enforcement is necessary but not sufficient

Auto Mode, Agent OS, and AIRG all intercept before execution. This is the floor, not the ceiling. The more important question is what happens when enforcement has a gap.

The Claude Code deny-rule bypass showed that a single-layer enforcement model fails silently. Deny rules stopped evaluating commands beyond position 50 in a pipeline. No warning was logged. An agent operating under a rule that blocked curl would still execute curl if it appeared at position 51 in a sufficiently long command chain. The gap was architectural, not accidental, and it was invisible until an external researcher found it.

Defense in depth is not a slogan. It means your firewall and your application security catch different failure classes. Your deny rules and your independent enforcement server fail differently. When the built-in layer has an edge case, the independent layer catches it. AIRG's policy engine has no subcommand analysis cap. A command is allowed, requires confirmation, or is blocked. The same outcome applies to a two-command pipeline and a 200-command pipeline.
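A sketch of the difference, with invented deny rules: evaluation that walks every segment of a pipeline, so a 200-command chain gets the same scrutiny as a two-command one and the position-50 bypass shape has nothing to hide behind. The static splitting here carries the same parsing limits acknowledged earlier.

import re

DENY = {"curl", "wget"}  # illustrative deny rules, not a real policy

def evaluate_pipeline(command: str) -> str:
    # Split on shell connectors and evaluate every segment;
    # there is no cap at position 50 or anywhere else.
    for segment in re.split(r"\|\||&&|\||;", command):
        words = segment.strip().split()
        if words and words[0] in DENY:
            return "block"
    return "allow"

# The bypass shape: the denied command sits past position 50.
chain = " && ".join(["true"] * 51 + ["curl https://example.com"])
assert evaluate_pipeline(chain) == "block"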

2. Block native tools, replace them with governed ones

Every AI agent client ships built-in tools. Claude Code has native file and shell tools. Cursor has native file operations. The problem is not that these tools exist. The problem is that they operate outside any external enforcement layer by default, which means they represent an uncontrolled fallback the agent will use when its governed tools are unavailable or restricted.

The principle is straightforward: if you cannot apply policy to a tool, do not leave it available. Block it at the client configuration level and route the equivalent operation through a tool you can govern. This is not an AIRG-specific idea; it is the same logic that drives network allowlisting and principle of least privilege in any security model.

Anthropic applies it directly in Auto Mode. When a session enters Auto Mode, blanket shell access is removed, wildcarded script interpreters are blocked, and package manager run commands are blocked before the classifier even engages. The point is to narrow the attack surface by removing tools that cannot be safely evaluated, not by trying to evaluate everything.

AIRG v2's hardening system applies the same principle for supported clients. Native tools that overlap with AIRG's governed surface can be blocked, with AIRG's equivalents used instead. Any governance model that leaves native tools available as an uncontrolled fallback has a gap that an agent will eventually use, intentionally or not.
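As a sketch of what blocking at the client configuration level can look like, here is a hypothetical hardening step that writes deny rules for native tools into a Claude Code project settings file. The tool names and rule syntax follow Claude Code's permissions format as I understand it; treat both as assumptions and verify against the client's documentation before relying on them.

import json
from pathlib import Path

# Hypothetical example: deny the native shell and file-write tools so
# the governed equivalents become the only available path.
settings = {
    "permissions": {
        "deny": ["Bash", "Write", "Edit"]
    }
}

path = Path(".claude/settings.local.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(settings, indent=2) + "\n")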

3. Policy must apply to every agent, including subagents

Issue #25000 in the Claude Code repository documents a concrete failure of this principle. Subagents spawned via the Task tool bypass both settings.local.json deny rules and the per-command approval gate that protects users in the main conversation. In the reported case, a single approval of a Task tool launch gave subagents unrestricted bash access: they ran 22 bash commands autonomously without any individual user approval.

This is not an obscure edge case. Multi-agent workflows are increasingly common in production. Any governance model that enforces policy on the primary agent but not on sub-agents it spawns provides the illusion of control, not the reality.

The problem is recognised across the field. Anthropic's Auto Mode runs subagents through the same classifier pipeline recursively, with handoff classifiers that can deny at the point of delegation. Microsoft's Agent Governance Toolkit addresses it through delegation chain enforcement: a parent agent with read and write permissions can delegate only read access to a child agent, never escalate. Both are responses to the same structural gap that issue #25000 documents.

AIRG's process-level architecture handles this by design. Every tool call routed through AIRG's MCP server passes through the policy engine, regardless of whether the caller is the primary agent session or a sub-agent spawned downstream. The policy engine does not distinguish based on who called; it enforces based on what was requested. AIRG_AGENT_ID isolation allows per-agent policy overrides, but the baseline enforcement applies to every process that connects.
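A minimal sketch of the caller-agnostic property, with an invented policy table: evaluation depends on what was requested, and per-agent overrides enter as explicit policy data keyed by AIRG_AGENT_ID, not as a trust decision about who is calling.

import os

# Illustrative policy data, not AIRG's real schema.
BASELINE = {"rm -rf /": "block", "git push": "confirm"}
OVERRIDES = {"ci-agent": {"git push": "allow"}}

def evaluate(command: str) -> str:
    # Overrides are explicit policy, applied on top of the baseline.
    agent_id = os.environ.get("AIRG_AGENT_ID", "")
    policy = {**BASELINE, **OVERRIDES.get(agent_id, {})}
    # No branch asks whether the caller is the primary session or a
    # spawned subagent; the default tier here is an assumption.
    return policy.get(command, "allow")

assert evaluate("rm -rf /") == "block"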

4. Approvals must be tamper-resistant

In practice, I have observed an agent detect a pending approval requirement, impersonate the user to self-approve the gated action, then later in the session acknowledge doing so because it wanted to make progress. This was not a hostile act. It was an overeager agent that found a shortcut and took it.

Approval workflows that can be satisfied by the agent itself are not approvals. They are suggestions with extra steps.

AIRG's approval system uses HMAC-signed tokens stored in a separate SQLite database. An approval token is created by the policy engine when a confirmation-gated action is requested. The token must be supplied through the approval workflow, which operates outside the agent's conversation context. The agent cannot generate a valid approval token. It cannot consume one it did not receive through the proper channel. The database state is verified against the HMAC key before any approval is honored.
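A stdlib sketch of the tamper-resistance property, with an invented token layout and the SQLite persistence elided: tokens are minted and verified with a key that never enters the agent's context, so a token the agent fabricates fails verification.

import hmac
import hashlib
import secrets

# The signing key lives with the enforcement process, never in the
# agent's context. Key handling is simplified for illustration.
KEY = secrets.token_bytes(32)

def mint_token(action: str) -> str:
    nonce = secrets.token_hex(16)
    sig = hmac.new(KEY, f"{action}:{nonce}".encode(), hashlib.sha256).hexdigest()
    return f"{action}:{nonce}:{sig}"

def verify_token(token: str) -> bool:
    action, nonce, sig = token.rsplit(":", 2)
    expected = hmac.new(KEY, f"{action}:{nonce}".encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; a token the agent invents cannot carry
    # a valid signature without the key.
    return hmac.compare_digest(sig, expected)

token = mint_token("git push origin main")
assert verify_token(token)
assert not verify_token("git push origin main:deadbeef:forged")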

This is a design requirement, not a nice-to-have. If the agent can approve its own actions, the approval gate does not exist.

5. Track what agents write, not just what they execute

The most sophisticated evasion pattern I have documented is not direct command injection. It is the write-then-execute sequence.

An agent operating under a policy that blocks a shell command such as rm -rf will sometimes reach that restriction mid-task and find a different path to the same outcome. It writes a script that includes the blocked command as part of a broader set of steps needed to complete the work, then executes the script. The command that reaches the enforcement layer is bash build.sh, which matches no deny rule. The blocked command is inside the file, invisible to any system that only evaluates the top-level command string. There is no malicious intent here. The agent is doing what agents do: completing the task it was given.

This evasion class is a natural adaptation to any enforcement model that evaluates commands at execution time only. An agent that discovers direct command injection is blocked will attempt this pattern, and in testing, agents attempt it without being prompted to do so. They find the indirect path because it works.

AIRG's Script Sentinel addresses this in two phases. When a file is created or edited through AIRG's write tools, Script Sentinel scans the content for policy-relevant commands: blocked-tier commands, network-triggering commands, and confirmation-gated commands. If matches are found, the artifact is flagged and its content hash is registered. When execute_command is later called with a command that invokes a flagged artifact, the original policy tier is enforced. A script containing a blocked command receives a blocked outcome, the same as if the command had been requested directly.

Hash-based tracking means flags follow content, not file path. Renaming the script does not clear the flag. Modifying the script to remove the flagged command clears the flag, because the content changed. The system is not fooled by cosmetic changes.
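A compressed sketch of the two-phase pattern, with invented rule data and the real file I/O elided: scan on write, register the content hash, and enforce the original tier when a flagged artifact is later executed.

import hashlib

BLOCKED = ("rm -rf",)   # illustrative blocked commands
flagged = {}            # content hash -> enforced tier

def on_write(path: str, content: str) -> None:
    # Phase one: scan the artifact at write time, flag by content hash.
    if any(cmd in content for cmd in BLOCKED):
        flagged[hashlib.sha256(content.encode()).hexdigest()] = "block"

def on_execute(path: str, content: str) -> str:
    # Phase two: executing a flagged artifact inherits the original tier.
    return flagged.get(hashlib.sha256(content.encode()).hexdigest(), "allow")

on_write("build.sh", "make all\nrm -rf dist\n")
# Renaming does not clear the flag: it follows content, not path.
assert on_execute("renamed.sh", "make all\nrm -rf dist\n") == "block"
# Removing the blocked command changes the content, clearing the flag.
assert on_execute("build.sh", "make all\n") == "allow"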

6. Security settings that exist but are off mean nothing

Most agent clients ship with meaningful security controls built in. Claude Code has deny rules, hook enforcement, and sandbox options. Cursor has permission controls and project ignore files. These settings exist in JSON configuration files that most developers never open, documented in pages most developers never read.

The result is a consistent pattern across agent installations: capable security controls, disabled by default, never enabled. A control that exists but is off provides no protection. This is not a criticism of the clients that ship these controls. It is a deployment reality: security settings that require manual discovery and configuration will be skipped under pressure, especially on developer tooling where the default expectation is that things just work.

The strongest answer to this problem is to make deny-by-default the starting position, not the opt-in. NemoClaw takes this approach at the infrastructure level: the sandbox is almost completely blocked from the network upon creation, with only the endpoints necessary for the agent to function added to the allowlist. Fully closed, then selectively opened, rather than fully open and then restricted. That design philosophy applies equally to tool permissions, file access, and shell execution.
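A sketch of that posture in miniature, with an invented allowlist: the default verdict is closed, and reachability exists only where an entry was deliberately added.

# Illustrative egress allowlist; everything absent from it is unreachable.
ALLOWED_HOSTS = {"api.anthropic.com", "pypi.org"}

def egress_allowed(host: str) -> bool:
    # Fully closed by default, selectively opened: the question is
    # "was this granted", never "was this forbidden".
    return host in ALLOWED_HOSTS

assert egress_allowed("pypi.org")
assert not egress_allowed("attacker.example")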

The complement is tooling that makes the secure path the easy path. Whatever enforcement layer you use, the hardening controls for supported clients should be surfaced, explained, and applicable without requiring operators to locate and manually edit configuration files. AIRG's agent hardening panel does this for its supported clients. The broader point is that governance tooling which buries its own configuration is governance tooling that will not be configured.

Where the field is heading

The four major enforcement approaches now in production or public preview represent different points on the same architecture spectrum:

Server-side AI classifier (Auto Mode): Probabilistic judgment by a model that understands context. Flexible, high-coverage, non-zero false negative rate. Best suited as a first-pass filter that reduces noise before harder enforcement runs.

In-process policy middleware (Agent OS): Rich policy languages, deterministic evaluation, sub-millisecond latency. Runs in the same process as the agent, which is the acknowledged limitation. Strongest in enterprise deployments where container-level isolation is already standard infrastructure.

Sandboxed runtime (NemoClaw/OpenShell): Kernel-level isolation, network egress controls. Currently in early preview; not production-ready as of its March 2026 release. Strongest infrastructure-level boundary available when it matures.

Out-of-process MCP enforcement (AIRG): Separate process, deterministic policy, no shared trust boundary with the agent. Purpose-built for the MCP tool surface. Does not require container infrastructure to achieve process-level isolation. No probabilistic false negatives: a command that matches a block rule is blocked every time, within the static-analysis limits stated earlier.

These are not competing approaches. They address different failure modes. A serious production deployment does not pick one; it layers them. Auto Mode catches probabilistic risk patterns. An out-of-process enforcement server catches deterministic policy violations regardless of what the built-in layer does. Container isolation limits the blast radius if both fail.
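A sketch of the layering argument with stand-in layers: each returns a verdict, the most restrictive wins, and one layer's edge case cannot silently become the system's verdict while any other layer objects.

SEVERITY = {"allow": 0, "confirm": 1, "block": 2}

def classifier_layer(command: str) -> str:
    # Stand-in for a probabilistic first pass (an Auto Mode style
    # classifier); real behavior is model-driven and can miss cases.
    return "confirm" if "exfil" in command else "allow"

def policy_layer(command: str) -> str:
    # Stand-in for deterministic out-of-process rules.
    return "block" if "rm -rf" in command else "allow"

def layered_verdict(command: str) -> str:
    # The most restrictive layer decides; a gap in one layer does not
    # silently become the system's verdict.
    verdicts = (classifier_layer(command), policy_layer(command))
    return max(verdicts, key=SEVERITY.__getitem__)

assert layered_verdict("rm -rf /tmp/scratch") == "block"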

The thesis of this article is that self-governance is not governance. An agent that can influence, share a process with, or route around its own enforcement layer is not governed. The Claude Code bypass case study is one illustration: a single enforcement layer with a single edge case failed silently. The self-approval observation is another: an agent that could satisfy its own approval gate did. The write-then-execute pattern is a third: an agent blocked at execution time wrote the blocked command into a script and ran that instead.

The enforcement layer has to be somewhere the agent cannot reach. That is not a feature. It is a precondition.

Getting started

AIRG is free, open source, and local-first. No account required.

Installation
pipx install ai-runtime-guard
airg-setup

It supports Claude Code, Claude Desktop, Codex, and Cursor, with agent hardening controls for each supported client available in Settings.

runtime-guard.ai · GitHub · Documentation