
Configuring AI Agent Security Is Harder Than It Should Be

Policy controls are scattered, scopes overlap, and documentation assumes you already know what you are doing. Here is how to make sense of it.

If you have spent time trying to properly secure an AI coding agent (Claude Code, Cursor, Codex, or similar), you have probably hit the same wall. There are controls to configure, but they live in different places, serve purposes that are not always clearly distinct, and interact in ways that are rarely documented end to end. You configure one layer, assume you are covered, and later discover a gap you did not know existed.

This is not a complaint about any specific tool. It is a structural problem across the ecosystem. Security policy for AI agents is currently split across at least four distinct layers, each with its own configuration format, scope, and failure mode. Understanding those layers is the first step to getting coverage right.

The Four Layers

Layer 1: Model-level instructions

Most AI coding tools support some form of instruction file that shapes agent behaviour. In Claude Code this is CLAUDE.md. In Codex it is AGENTS.md. These files tell the model what to do, what to avoid, and how to behave in a given project context.

They are useful for shaping intent, but they are not enforcement. They are instructions, and a sufficiently distracted or manipulated model can ignore or misinterpret them. Relying on instruction files as your primary security control is the same as relying on a sign that says "do not enter" to stop an intruder.

Important

Instruction files influence model behaviour. They do not constrain it. Never treat them as a security boundary.

Layer 2: Client-level deny rules and permissions

Most clients add a second layer: declarative rules that restrict which tools the agent can use or which commands it can run. Claude Code uses a deny list in settings.json (the permissions.deny array) covering shell commands and other tools. Cursor has similar controls. These sit closer to enforcement because they are applied by the client, not interpreted by the model.

However, they come with important caveats. In Claude Code, settings.json lives inside the project directory, which means it is user-writable and version-controlled alongside the code the agent is modifying. An agent with write access to the project can, in principle, modify the file that governs its own restrictions. There are also documented cases of deny rules being inconsistently applied, ignored entirely for file operations, and bypassed across multiple tool types.

claude code - .claude/settings.json
{
  "permissions": {
    "deny": [
      "Bash(rm -rf:*)",
      "Bash(curl:*)",
      "Bash(wget:*)",
      "WebFetch"
    ]
  }
}
Known limitation

In Claude Code, settings.json is stored in the project directory alongside agent-writable files. Deny rules should not be your only enforcement layer.

Layer 3: Runtime enforcement via hooks or MCP

The third layer is where actual enforcement happens: intercepting actions at the point of execution, before they reach the system. In Claude Code, this is done via hooks, shell scripts that run before or after tool invocations and can block, log, or modify them. A more general approach is to route agent actions through an MCP server that applies policy before execution.

This is the layer that holds regardless of what the model intends or what the client config says. If the hook or MCP server blocks a command, it does not run. Full stop.

claude code - hooks in .claude/settings.json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "/path/to/policy-check.sh"
          }
        ]
      }
    ]
  }
}
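To make the hook concrete, here is a minimal sketch of what a policy-check script could do, assuming Claude Code's documented PreToolUse contract: the tool call arrives as JSON on stdin, exit code 2 blocks the call, and anything printed to stderr is returned to the model as the reason. The patterns are illustrative, not a complete policy.

```python
#!/usr/bin/env python3
"""Minimal PreToolUse policy check (sketch, not a complete policy)."""
import json
import re

# Illustrative deny patterns; tune these for your own project.
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\b",   # destructive recursive deletes
    r"\bcurl\b",       # outbound network fetches
    r"\bwget\b",
]

def find_violation(command: str):
    """Return the first blocked pattern the command matches, else None."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            return pattern
    return None

def decide(event: dict) -> int:
    """Map a PreToolUse event to an exit code: 0 allows, 2 blocks."""
    command = event.get("tool_input", {}).get("command", "")
    return 2 if find_violation(command) else 0

# As an actual hook script this would be driven by stdin, e.g.:
#   sys.exit(decide(json.load(sys.stdin)))
```

Keeping the decision logic in a pure function makes the policy easy to test in isolation, which matters given how silently a broken hook fails.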

The complication here is that hooks and MCP configuration add their own setup overhead. They require absolute paths, correct environment variables, and careful scoping. Get them wrong and they fail silently: you think you have enforcement when you do not.

Layer 4: Sandboxing

Both Claude Code and Codex support sandboxed execution environments that isolate the agent from the broader host system. Claude Code offers a --sandbox flag that restricts the agent to a contained environment with limited filesystem and network access. Codex runs in a sandboxed container by default.

Sandboxing is the outermost containment layer. Where hooks and MCP enforcement control what the agent is allowed to do within your system, a sandbox controls what the agent can reach at all. It limits the blast radius if every other layer fails. An agent running inside a sandbox that exfiltrates a file can only exfiltrate what is visible inside that sandbox.

Key concept

Sandboxing is not a substitute for enforcement layers. It is the last line of containment. An agent can still cause damage inside a sandbox. Use it alongside hooks and MCP policy, not instead of them.

The practical limitation is that sandboxing can conflict with legitimate agent tasks. An agent that needs to access files outside the sandbox, run build tools, or make network calls will need those capabilities explicitly granted, which means carefully scoping what the sandbox permits. Getting this right takes time but is worth doing for any agent with significant system access.
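For example, the Codex CLI exposes this scoping through its config file. The keys below are a sketch based on my reading of the Codex configuration schema; verify them against the current documentation before relying on them:

```toml
# ~/.codex/config.toml (sketch; confirm key names against current Codex docs)
sandbox_mode = "workspace-write"  # e.g. read-only | workspace-write | danger-full-access

[sandbox_workspace_write]
network_access = false  # keep outbound network disabled unless a task needs it
```

Starting from the most restrictive mode and granting capabilities only as tasks demand them is the scoping exercise the paragraph above describes.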

Why the Overlap Is Confusing

The four layers are not redundant by design. They exist at different levels of the stack and catch different things. But from the outside, they look like they are solving the same problem, which leads people to configure one layer and assume they are done.

The mental model that helps is this: instructions shape intent, client rules reduce surface, runtime enforcement provides the guarantee, sandboxing limits the blast radius. You need all four working together for meaningful coverage. Any one of them in isolation leaves gaps.

Key concept

Think of it as defence in depth. Instructions shape intent. Client rules reduce surface area. Runtime enforcement is the actual guarantee. Sandboxing limits the blast radius if everything else fails. Each layer compensates for the weaknesses of the one above it.

The Script Bypass Problem

One gap that catches people off guard: an agent can work around command-level restrictions by writing a shell script that contains the blocked command and then executing the script. The deny rule sees a bash script.sh invocation, which may not be on the deny list, rather than the underlying command.

Addressing this properly requires policy enforcement that inspects script contents at write time, not just at execution time. This is a good example of why surface-level controls are not enough: the attack surface includes indirect paths, not just direct invocations.
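A write-time check can reuse the same pattern list a command-level deny rule would use, but inspect the content of files the agent writes rather than the command line. The sketch below assumes the same hook event shape as above, with Write tool calls carrying the file body in tool_input.content; verify the field names against your client's docs:

```python
import re

# Illustrative patterns; the same list a command-level deny rule would use.
BLOCKED_PATTERNS = [r"\brm\s+-rf\b", r"\bcurl\b", r"\bwget\b"]

def scan_write_event(event: dict) -> list:
    """Return blocked patterns found in content the agent is writing.

    Catches the script bypass: a blocked command hidden inside a file
    that would later run via an innocuous-looking `bash script.sh`.
    """
    if event.get("tool_name") not in ("Write", "Edit"):
        return []
    content = event.get("tool_input", {}).get("content", "")
    return [p for p in BLOCKED_PATTERNS if re.search(p, content)]
```

Flagging at write time means the violation surfaces before `bash script.sh` ever appears in a command.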

Tools like Runtime Guard handle this through a script sentinel that applies policy checks at the point a script is written, flagging violations before the script ever runs.

Recommendations

  1. Use all four layers, not just one. Instruction files, client deny rules, runtime enforcement, and sandboxing each cover different failure modes. Skipping any one of them leaves a gap the others cannot fully compensate for.
  2. Disable native tools when using MCP enforcement. If you are routing agent actions through an MCP policy server, the client's native file and shell tools must be explicitly disabled. Otherwise they provide a parallel execution path that bypasses enforcement entirely.
  3. Use absolute paths everywhere. Hooks, MCP server commands, workspace roots, and log paths all need absolute paths. Relative paths cause silent failures that are difficult to diagnose.
  4. Verify enforcement is active, not just configured. After setup, test with a known-blocked command and confirm it is actually blocked. A misconfigured hook fails silently. You want to know before the agent does something you did not intend.
  5. Account for indirect execution paths. Deny rules that only check direct command invocations miss script-based bypasses. Ensure your enforcement layer inspects what is being written, not just what is being directly executed.
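The recommendation to disable native tools can be expressed as a deny rule set. The sketch below uses Claude Code's permissions syntax to turn off the native shell and file tools so that all actions route through the MCP policy server; treat the exact entries as illustrative:

```json
{
  "permissions": {
    "deny": [
      "Bash",
      "Write",
      "Edit",
      "WebFetch"
    ]
  }
}
```

Remember the caveat from Layer 2, though: in some setups this file is agent-writable, so it reduces surface area rather than providing the guarantee. The MCP layer remains the enforcement point.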

Key Takeaways

  1. AI agent security policy spans four distinct layers: instruction files, client-level rules, runtime enforcement, and sandboxing. Each has a different scope and different failure modes.
  2. Instruction files are not enforcement. They influence model behaviour but do not constrain it.
  3. Client deny rules can be bypassed through indirect paths such as scripts, and in some clients are stored in agent-writable locations.
  4. Runtime enforcement via hooks or MCP is the only layer that provides a hard guarantee at execution time.
  5. Defence in depth requires all four layers working together, with native tools disabled when MCP enforcement is active, and sandboxing scoped to what the agent legitimately needs.

The good news is that the controls exist. The challenge is understanding how they fit together and ensuring there are no gaps between them. As tooling matures, this should get easier. For now, it requires deliberate configuration and verification at each layer.