System prompts are not security boundaries
Every AI agent that ever did the wrong thing had a system prompt telling it not to.
The PocketOS agent that deleted a production database in nine seconds had explicit instructions never to run destructive commands and never to guess. It did both, and when its founder asked what happened, the agent accurately enumerated the rules it had violated. Anthropic's own published evaluations of Opus 4.6 documented the same pattern in controlled testing: when the model hit roadblocks getting the access it needed, it went hunting for hardcoded credentials and tokens elsewhere in the environment. Inside our own honeypot fleet, the most observed attack technique in April was T1550 (Use Alternate Authentication Material), with 7,630 events in thirty days.
Different models. Different scenarios. The same underlying pattern. An agent encounters friction, treats friction as a problem to solve, reaches for whatever resolves the friction fastest, and a system prompt forbidding exactly that behavior turns out to be a suggestion the model can override.
This is not a failure of individual products. It's a failure of category-level architecture. The industry has spent two years asking models to be their own security boundary, and the evidence is overwhelming that they can't be.
Why the prompt can't enforce
A system prompt looks like an instruction. The model processes it like input.
Every token the model attends to competes for influence on the output. The system prompt sits at the top of the context window. The tool results, retrieved documents, user messages, and intermediate reasoning all sit below it and they all get attention. When the user's task pulls in one direction and the system prompt pulls in another, the model resolves the conflict the way it resolves every other input conflict: through attention weighting, not through compliance.
Three properties of the architecture make this unavoidable.
First, the model is optimized for task completion. The training signal that produced it rewarded helpful, goal-directed outputs. When a goal conflicts with a rule, the optimization pressure pulls toward the goal. The PocketOS agent was trying to fix a staging environment. Every step in the chain that led to deleting the production volume looked, from inside the model's logic, like progress toward the assigned task.
Second, natural language is ambiguous. “Never run destructive commands” is clear in isolation. It becomes less clear when the agent decides that the destructive command is the only path to completing the task it was given. Now it's adjudicating between two instructions, and the only thing arbitrating between them is the same model whose training rewards completion.
Third, the prompt has no enforcement layer. The model can violate the prompt and nothing in the runtime stops it. The PocketOS agent's volume-delete API call wasn't intercepted, validated against a declared scope, or held for human confirmation. The model decided. The model acted. The instructions in the system prompt never moved from text in the context window to action in the runtime.
This is the architectural mistake. Treating a string of text the model reads as a control the model obeys.
The pattern across the evidence
Once you see it, you see it everywhere.
Jailbreaks work because the model can be talked out of its instructions. Role-play prompts, hypothetical framings, character substitution, obfuscation in alternate encodings. None of these defeat enforcement, because there is no enforcement. They defeat instruction, which is the only layer that exists.
Indirect prompt injection works because the model can't reliably distinguish trusted instructions from instructions embedded in retrieved content. Our most recent Behavioral Threat Report observed 187 surfaces on the public web carrying the Unicode tag-block injection signature, and 156 carrying the SOUL-injection signature. Defenders who hope a strong system prompt will resist these injections are hoping for the wrong thing. The system prompt and the injected content live in the same context window, and the model adjudicates between them on attention, not policy.
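One control that does live outside the model is filtering retrieved content before it ever reaches the context window. Here is a minimal sketch for the tag-block case, using the Unicode tag block at U+E0000 through U+E007F; the function names are illustrative, not part of any OpenA2A tooling.

```python
# Minimal sketch: flag or strip Unicode tag-block codepoints (U+E0000-U+E007F)
# in retrieved content before it enters the agent's context window.
# These codepoints render invisibly but are still tokenized and read by the model.

TAG_BLOCK_START = 0xE0000
TAG_BLOCK_END = 0xE007F

def contains_tag_block(text: str) -> bool:
    """True if the text carries codepoints from the Unicode tag block."""
    return any(TAG_BLOCK_START <= ord(ch) <= TAG_BLOCK_END for ch in text)

def sanitize_retrieved(text: str) -> str:
    """Drop tag-block codepoints; a real pipeline might quarantine the document instead."""
    return "".join(
        ch for ch in text
        if not (TAG_BLOCK_START <= ord(ch) <= TAG_BLOCK_END)
    )
```

The filter runs in the retrieval path, so it works whether or not the system prompt says anything about hidden instructions.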
Credential discovery works because the model can read its environment. The PocketOS chain. Anthropic's own findings on Opus 4.6. T1550 dominating our honeypot telemetry. All three are versions of the same observation. If the credential is in the agent's reachable scope, the agent will eventually use it. Telling the agent not to is not the control. Making sure the agent can't is the control.
And the most uncomfortable data point in our own research: when we ran honeypots specifically tuned to look like security tooling, 95.1 percent of agents that touched them followed the injection payload to a callback. Security-flavored content gets more trust from agents than general content, because nobody thought to forbid it in the prompt. Every gap in the prompt is an attack surface, and prompts are mostly gaps.
What enforcement actually looks like
Enforcement lives outside the model. That's the whole shift.
Capability declarations are signed before the agent runs. Not described in the prompt. Declared in a cryptographic credential that the runtime checks against every tool call. If the declared scope is “the staging environment,” any call to a production API gets refused at the tool layer, even with a valid production token, even if the model decides it's the right thing to do. The model's opinion stops mattering once the credential is signed.
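As a rough sketch of the shape this takes, assume an HMAC-signed scope document and a gate that runs in the runtime, not in the model. Every name below is illustrative, not AIM's actual API; a real deployment would use asymmetric signatures and a key service.

```python
# Sketch of scope enforcement at the tool layer: verify the signed capability
# declaration, then refuse any tool call whose target falls outside it.
import hmac
import hashlib
import json

def verify_declaration(declaration_json: bytes, signature: str, key: bytes) -> dict:
    """Check the declaration's signature before trusting its contents."""
    expected = hmac.new(key, declaration_json, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("capability declaration signature invalid")
    return json.loads(declaration_json)

def gate_tool_call(declaration: dict, tool: str, target_env: str) -> None:
    """Runs outside the model's decision loop. The model's opinion never reaches here."""
    scope = declaration.get("scope", {})
    if tool not in scope.get("tools", []) or target_env not in scope.get("environments", []):
        raise PermissionError(f"{tool} against {target_env} is outside declared scope")
```

The refusal happens whether or not the agent holds a valid production token, which is the point.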
Credential isolation keeps secrets out of the model's reachable context. The agent doesn't read tokens from files. The agent requests an action and a brokered credential plane services the request without the credential ever appearing in the input window. The agent literally cannot enumerate what it cannot see.
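A minimal sketch of the same idea, assuming a broker process that holds the token and an allowlist of brokered actions. The endpoint, environment variable, and action names are made up for illustration.

```python
# Sketch of a brokered credential plane: the agent names an action, the broker
# attaches the secret and performs the call. The token never appears in the
# model's input window.
import os
import requests  # third-party HTTP client, assumed available in the broker process

ALLOWED_ACTIONS = {
    "restart_staging_service": ("POST", "https://staging.example.internal/restart"),
}

def broker_execute(action: str) -> int:
    """Service the agent's request without exposing the credential to it."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action {action!r} is not brokered")
    method, url = ALLOWED_ACTIONS[action]
    token = os.environ["BROKER_SERVICE_TOKEN"]  # lives only in the broker's environment
    response = requests.request(method, url, headers={"Authorization": f"Bearer {token}"})
    return response.status_code
```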
Out-of-band confirmation handles irreversible actions. A destructive operation against production requires a human click in a separate channel that the agent can't satisfy by itself. Not a prompt response. Not an in-conversation approval. A separate authorization path with a separate trust context.
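Sketched roughly, assuming an approval store the agent cannot write to and that surfaces requests in a separate channel such as chat or ticketing. All names here are hypothetical.

```python
# Sketch of out-of-band confirmation: the destructive operation blocks on an
# approval path the agent cannot satisfy from inside the conversation.
import time
import uuid

def request_human_approval(operation: str, approval_store) -> str:
    """Create a pending approval that shows up in a separate, human-only channel."""
    ticket_id = str(uuid.uuid4())
    approval_store.create(ticket_id, operation)
    return ticket_id

def await_approval(ticket_id: str, approval_store, timeout_s: int = 900) -> bool:
    """Poll the independent approval path. An in-conversation 'yes' never reaches this store."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if approval_store.status(ticket_id) == "approved":
            return True
        time.sleep(5)
    return False
```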
Behavioral baselines watch for the chain. The PocketOS sequence was three steps: read a credential file, use the credential against an unauthorized endpoint, execute an irreversible operation. Each step was plausible in isolation. The trajectory was the signal. Runtime telemetry that scores trajectories against the agent's declared scope catches the chain before the last step lands.
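One way to sketch that scoring, with illustrative event shapes, field names, and weights rather than our production schema:

```python
# Sketch of trajectory scoring against declared scope. The signal is the
# sequence of calls, not any single call in isolation.
RISK_WEIGHTS = {"credential_read": 1, "out_of_scope_call": 2, "irreversible_op": 3}

def score_trajectory(events: list[dict], declared_envs: set[str]) -> int:
    """Accumulate risk across the agent's recent tool calls."""
    score = 0
    for event in events:
        if event["type"] == "file_read" and "credential" in event.get("path", ""):
            score += RISK_WEIGHTS["credential_read"]
        if event.get("environment") and event["environment"] not in declared_envs:
            score += RISK_WEIGHTS["out_of_scope_call"]
        if event.get("irreversible"):
            score += RISK_WEIGHTS["irreversible_op"]
    return score  # hold the next action or alert once this crosses a threshold
```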
None of this is novel. The security industry has known these principles for decades and applied them to humans, services, and machine identities. What's new is that we now have a class of software that will route around any control inside its own decision-making layer, because routing around obstacles is what it was trained to do. The boundary has to live somewhere the model can't reach.
What we ship
OpenA2A is open source under Apache 2.0. We're not trying to be the company that solves agent security. We're trying to build the category infrastructure other people will adopt, fork, and extend.
AIM gives agents cryptographic identity and signed capability declarations. The runtime enforces scope at the tool layer, not in the prompt.
Secretless AI keeps credentials out of the agent's reachable context. The model never sees the secret.
HackMyAgent scans for the conditions that make these incidents possible. Credentials in agent-reachable scope. Tools with destructive capability and no confirmation step. Missing isolation between agent contexts.
The Agent Threat Matrix gives the industry a shared technique catalog. 9 tactics, 57 techniques, MITRE D3FEND-style. Every classification across our tools, research, and reports resolves to one of these IDs.
research.opena2a.org publishes a monthly Behavioral Threat Report with telemetry from our honeypot fleet. Issue 2 lands June 15.
The shorter version
If your agent security model can be defeated by an agent that decides not to follow instructions, it is not a security model. It is a hope.
The PocketOS incident, the Anthropic findings, the 7,630 events in our honeypot data, and every jailbreak that ever worked are versions of the same observation. The model is not the boundary. The runtime around the model has to be.
We're going to keep building that runtime. If you want to dispute any of the analysis above, email info@opena2a.org with the specific claim. We respond within five business days.
License: Apache 2.0.