The safety check inside an AI coding agent is supposed to be the thing that stops it from running a destructive command on your machine. New research shows that check is a text filter, and a shell trick older than the agents it guards walks right past it. Adversa AI calls the technique GuardFall, and it defeated the command guardrail in ten of the eleven open-source AI agents the firm put through its tests. Read it as the security story it is: this is not an AI problem, it is command injection wearing a new costume.
A word filter cannot referee a shell
The agents mostly defend themselves the same way. Before a command runs, they match its text against a list of dangerous patterns and reject anything that hits. The problem is that the shell never runs that text as written. Before execution, bash rewrites it. It strips quotation marks, splits words on a separator it controls called the internal field separator, and expands variables and shortcuts.
So a command typed as r''m looks harmless to a filter hunting for the string rm, because the two are not the same characters. The shell then removes the empty quotes and runs rm anyway. Adversa described other flavors of the same idea: smuggling a command in through base64 decoding piped into a shell, or bolting a destructive flag onto an ordinary tool like find or dd. The filter and the interpreter look at two different strings, and the space between them is the entire attack.
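A minimal sketch of that gap, not any tested agent's real guardrail. The `DANGEROUS` pattern and both function names are hypothetical, and Python's `shlex` only approximates bash quote removal; it does not perform variable expansion or globbing.

```python
import re
import shlex

# Hypothetical denylist of the kind described above: it scans the raw text.
DANGEROUS = re.compile(r"\brm\b")

def text_filter_blocks(command: str) -> bool:
    """Return True if the raw command text matches the denylist."""
    return bool(DANGEROUS.search(command))

def shell_view(command: str) -> list[str]:
    """Approximate the words the shell sees after quote removal."""
    return shlex.split(command)

command = "r''m -rf ./build"
print(text_filter_blocks(command))  # False: the text never contains "rm"
print(shell_view(command))          # ['rm', '-rf', './build']
```

The filter and the tokenizer are handed the same input and come away with different answers, which is the whole bypass in two lines of output.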
| AI coding agent | Safety check bypassed |
|---|---|
| opencode | Yes |
| Goose | Yes |
| Cline | Yes |
| Roo-Code | Yes |
| Aider | Yes |
| Plandex | Yes |
| Open Interpreter | Yes |
| OpenHands | Yes |
| SWE-agent | Yes |
| Hermes | Yes |
| Continue | No |
Why a lab trick is a software supply-chain problem
These agents execute shell commands using the full rights of whatever developer account is driving them, and in continuous integration that account often holds cloud keys. Point one at a repository you do not control and the danger stops being theoretical. The trigger paths Adversa lists are the everyday surfaces of open-source work: instructions buried in a build file that looks ordinary, a booby-trapped reply inside tool documentation, or a project config file such as the one Aider reads straight from the repo and trusts.
The agents in the test hold roughly 548,000 GitHub stars between them, so this is mainstream tooling, not a fringe experiment. Adversa drove the agents with Claude Sonnet 4.6 and carried full attacks through to completion against Plandex and eight of the others. One precondition matters: the agent has to be running with auto-execute turned on or its sandbox switched off, which is exactly how teams wire these into pipelines to get unattended runs. The setting that makes an agent useful in a clean repo is the same setting that makes GuardFall land in a hostile one, and hands an attacker a shell.
This is command injection's third act
We have watched this exact failure twice before. SQL injection worked because an application validated a string that the database parser then read differently. Classic command injection worked because a program cleaned input that the shell then re-expanded. GuardFall is the same bug a third time: a guardrail inspects one form of a command while the interpreter executes another.
The lesson security learned twenty-five years ago, that a denylist of dangerous strings can never keep pace with a parser, did not travel into the AI tooling that reinvented the pattern. A blocklist will always lose here, because a shell offers quote removal, word splitting, variable expansion, filename globbing and decode-and-pipe, and the number of ways to spell rm through those is effectively unbounded. Patching the filter to catch r''m only moves the game to the next spelling. We have seen the broader version of this argument before: the toolchain that ships your code is itself the attack surface.
The one agent that resisted did the textbook thing
Continue was the only tool that held, and how it held is the real fix. Instead of pattern-matching the command text, it parses the command the way bash would before deciding whether to allow it, then blocks destructive operations outright. That is the standard cure for every injection bug ever written: parse first, decide second, so the guardrail and the interpreter agree on what the command actually is.
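What parse first, decide second can look like, as a rough sketch. This is not Continue's implementation, which the research summary does not reproduce. The policy tables and the `decide` function are hypothetical, `shlex` stands in for a real bash parser, and the sketch refuses anything it cannot resolve rather than guessing.

```python
import shlex

# Hypothetical policy, for illustration only.
ALLOWED_PROGRAMS = {"ls", "cat", "grep", "find"}
BLOCKED_ARGS = {"find": {"-delete", "-exec", "-execdir"}}
# Expansion, chaining and globbing characters this sketch cannot model.
UNRESOLVED = set("$`|;&<>(){}[]*?\n")

def decide(command: str) -> str:
    """Return 'allow' or 'deny' based on the parsed words, not the raw text."""
    if any(ch in UNRESOLVED for ch in command):
        return "deny"  # refuse what the sketch cannot resolve
    try:
        argv = shlex.split(command)  # quote and backslash removal
    except ValueError:
        return "deny"  # unbalanced quotes
    if not argv or argv[0] not in ALLOWED_PROGRAMS:
        return "deny"
    if BLOCKED_ARGS.get(argv[0], set()) & set(argv[1:]):
        return "deny"  # destructive flag on an otherwise ordinary tool
    return "allow"

print(decide("r''m -rf ./build"))  # deny
print(decide('l"s" -la'))          # allow
```

The decision is made on the words the interpreter would run, so `r''m` is judged as `rm` and `l"s"` as `ls`. A production guard would need a full shell grammar; the point here is only the ordering.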
Adversa put the engineering cost at roughly two days for an experienced team. That number matters, because it means the other ten did not hit a hard research wall. They made a design choice, and it was the wrong one. The practical read for a defender: treat any command allowlist or dangerous-command blocklist feature in an agent as advisory, not a control, unless the vendor can tell you it parses a command before it checks it. This is the same trust mistake behind an assistant running a repo's own config file with your privileges.
Watch the agent like any other privileged process
Because the guardrail is bypassable and the agent runs as you, the place to catch GuardFall-style abuse is downstream of the agent, not inside it. The agent process is privileged and now untrusted, so monitor it that way. Two signals are worth an alert by tomorrow morning: the agent binary spawning a shell that reads credential paths, such as the directories holding SSH and cloud keys, and outbound connections from an agent run to hosts that are not on an allowlist.
Neither signal depends on knowing the exact bash trick, which is the point. You are detecting the consequence, not the syntax. A detection setup that already watches process lineage and network egress will see the credential read and the exfiltration attempt even when the agent's own safety check waved the command through. The same instinct applies when a single web page turns a local AI agent into remote code execution: assume the guardrail fails and watch what the process does next.
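The two signals can be sketched as a rule over process and network events. Everything here is an assumption for illustration: the `Event` shape, the `agent` process name, the credential directories and the egress allowlist are hypothetical, and real telemetry schemas differ by tool.

```python
from dataclasses import dataclass

# Hypothetical example values; substitute your own paths and hosts.
CREDENTIAL_DIRS = ("/.ssh/", "/.aws/")
EGRESS_ALLOWLIST = {"github.com", "pypi.org"}
SHELLS = {"sh", "bash", "zsh"}

@dataclass
class Event:
    kind: str                   # "file_read" or "connect"
    process: str                # name of the process that acted
    ancestors: tuple[str, ...]  # process lineage, nearest parent first
    target: str                 # file path or destination hostname

def alerts(event: Event, agent_name: str = "agent") -> list[str]:
    """Return alert names for events descended from the agent process."""
    if agent_name not in event.ancestors:
        return []
    found = []
    if (event.kind == "file_read" and event.process in SHELLS
            and any(d in event.target for d in CREDENTIAL_DIRS)):
        found.append("shell-read-credentials")
    if event.kind == "connect" and event.target not in EGRESS_ALLOWLIST:
        found.append("egress-off-allowlist")
    return found

read = Event("file_read", "bash", ("agent", "ci-runner"), "/home/dev/.ssh/id_ed25519")
print(alerts(read))  # ['shell-read-credentials']
```

Nothing in the rule inspects the command text, so it fires the same way whichever spelling got past the guardrail.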
Give the agent a throwaway identity before its next pull
The fix that does not wait on any vendor shipping a better parser is to assume the guardrail will fail and contain what happens when it does. Four concrete steps:
- Keep auto-execute and sandbox-skip flags off by default; turn them on only inside a disposable environment.
- Give the agent a throwaway home directory with none of your real SSH or cloud credentials in it, so a successful bypass steals nothing of value.
- Never let an agent run automatically on pull requests from forks. That is an attacker handing you the malicious repo and asking you to run it.
- Treat a repository's config files as code you do not trust, no different from a script you just downloaded off the internet. Merely opening the project is enough to set the attack off, as it was when opening a folder in the editor ran npm supply-chain malware.
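The throwaway-identity step can be approximated in a few lines, as a sketch only. `run_confined` is a hypothetical helper, and swapping `HOME` plus scrubbing the environment is not a sandbox: a process running as the same user can still open your real files by absolute path, so a separate user account, container or VM is the stronger version of the same idea.

```python
import os
import subprocess
import sys
import tempfile

def run_confined(argv: list[str]) -> subprocess.CompletedProcess:
    """Run a command with an empty temporary HOME and a scrubbed environment."""
    with tempfile.TemporaryDirectory(prefix="agent-home-") as home:
        # Pass only what the child needs; cloud keys held in env vars are dropped.
        env = {"HOME": home, "PATH": os.environ.get("PATH", os.defpath)}
        return subprocess.run(
            argv, env=env, cwd=home,
            capture_output=True, text=True, check=False,
        )

# Stand-in for an agent invocation: print the HOME the child process sees.
result = run_confined([sys.executable, "-c", "import os; print(os.environ['HOME'])"])
print(result.stdout.strip())
```

The temporary directory is deleted when the run ends, so anything the agent wrote there, or anything an attacker planted, does not survive to the next pull.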
The deeper point outlasts this one technique. We keep bolting AI features onto shells, browsers and package managers and assuming a text filter can referee them. It cannot, and the next bypass is already being written. Confinement is the control. The guardrail is a courtesy.