agent security might be a thing!
you might find this a good read if you think catastrophic accidents from coding agents are inevitable.
we have spent decades building security primitives for software that does exactly what it's told. sandboxes, permissions, least-privilege access, capability-based security - all of it assumes a simple threat model: either a malicious actor is trying to break in, or a bug is causing unintended behavior. the fix is containment and verification. but AI agents break this model in a fundamental way. they are not malicious and they are not exactly buggy. they're autonomous systems with judgment, operating in an action space that was designed for humans. when claude code decides to run rm -rf on what it believes is a temp directory, that's not a security vulnerability in the traditional sense. it's a new category of failure we don't have good language for yet.
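one direction that follows from this: instead of trying to make the agent infallible, make its destructive actions recoverable. a minimal sketch, assuming the agent deletes files through a python function we control rather than through raw shell access (safe_delete, undo, and the .agent-trash location are names made up for illustration):

```python
# hypothetical sketch: a soft-delete tool that moves targets into a
# session trash directory instead of unlinking them, so a mistaken
# "rm -rf on a temp directory" is an undo away, not a data-loss event.
import shutil
import time
from pathlib import Path

TRASH = Path.home() / ".agent-trash"  # illustrative location, not a real convention

def safe_delete(path: str) -> Path:
    """move path into the trash instead of deleting it; return the new location."""
    src = Path(path).resolve()
    dest = TRASH / f"{int(time.time())}-{src.name}"
    TRASH.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))  # shutil.move handles files and directories
    return dest

def undo(trashed: Path, original: str) -> None:
    """restore a trashed path to where it came from."""
    shutil.move(str(trashed), original)
```

the point isn't this particular mechanism - it's that reversibility becomes a property of the tool layer rather than of the model's judgment.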
the current state of agent deployment is remarkably casual given what is at stake. devs routinely grant terminal access, file system permissions, and network capabilities to systems that can hallucinate, misinterpret context, or simply pursue a reasonable-sounding goal through catastrophic means. the --dangerously-skip-permissions flag isn't an edge case. it's becoming the default workflow for anyone who wants to actually get work done in one go (of course, we believe in tools). and honestly, the flag exists because the permission prompts are friction and the accidents so far have been recoverable. someone loses a file, makes a weird git commit, or corrupts a local database - annoying rather than catastrophic. but this is sampling from a distribution that's about to shift. agents are getting more capable - which means the action space is expanding - which means the tail risks are growing in ways that are not linear.
simon put it well:
what makes this hard is that traditional security frameworks don't map cleanly onto the problem. code review doesn't help when the agent writes and executes code in a single loop. the security question isn't "how do we prevent unauthorized access" but something more like "how do we bound the consequences of authorized but misguided action." this points toward a different set of primitives: action reversibility as a first-class concern, explicit blast radius limits, semantic understanding of what operations are dangerous versus routine, maybe even something like capability budgets that decay over a session. we need agents that can be told "you can modify files in /src but never touch /prod" and have that constraint be robust to prompt injection, goal drift, and misinterpretation - not just a string in the system prompt.
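to make that concrete, here is a minimal sketch of what a tool-layer constraint could look like, assuming the agent's file writes are mediated by python code the model cannot see or edit (ToolGuard and CapabilityBudget are hypothetical names, not a real library):

```python
# hypothetical sketch: path constraints plus a decaying capability budget,
# enforced outside the model's context, so prompt injection or goal drift
# can't talk the agent past them.
from pathlib import Path

class CapabilityBudget:
    """a per-session budget that shrinks with every privileged action."""
    def __init__(self, writes_allowed: int = 20):
        self.writes_allowed = writes_allowed

    def spend(self) -> None:
        if self.writes_allowed <= 0:
            raise PermissionError("write budget exhausted for this session")
        self.writes_allowed -= 1

class ToolGuard:
    """mediates every file write the agent requests."""
    def __init__(self, allowed_root: Path, denied_roots: list[Path], budget: CapabilityBudget):
        self.allowed_root = allowed_root.resolve()
        self.denied_roots = [p.resolve() for p in denied_roots]
        self.budget = budget

    def write_file(self, path: str, content: str) -> None:
        target = Path(path).resolve()  # resolve() defeats ../ and symlink tricks
        if any(target.is_relative_to(d) for d in self.denied_roots):
            raise PermissionError(f"{target} is inside a denied root")
        if not target.is_relative_to(self.allowed_root):
            raise PermissionError(f"{target} is outside {self.allowed_root}")
        self.budget.spend()
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

# "you can modify files in /src but never touch /prod", as code, not prompt:
guard = ToolGuard(Path("/src"), [Path("/prod")], CapabilityBudget(writes_allowed=20))
```

the constraint holds no matter what ends up in the context window, because it lives in code the model can't rewrite. that's the difference between a guarantee and a suggestion.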
the field of agent security barely exists yet, but it needs to exist fast. the parallel to early web security is instructive: we built the interactive web first, then spent two decades retrofitting security after SQL injection and XSS became epidemic. with agents, we have a narrow window to get ahead of the curve. the startups spinning up in this space are mostly focused on observability and audit trails, which is necessary but insufficient. the harder problem is runtime constraints that are robust without destroying the utility of autonomous operation. and there's a research dimension here too - we need better formal models of what "safe autonomous action" even means when the agent's goal understanding is probabilistic and its world model is incomplete. this isn't a problem that's five years out. people are shipping production systems with agent loops today.
the ambition in this space is staggering and accelerating. the economic incentive is obvious - whoever cracks reliable autonomous coding unlocks a step-change in developer productivity that's worth hundreds of billions. but we are building the plane while flying it, and the security architecture is trailing the capability curve by a dangerous margin. the next few months will probably see agents that can manage entire codebases, interface with cloud infrastructure, and make decisions that used to require senior engineering judgment. whether that's exciting or terrifying depends entirely on whether we solve the containment problem before the capability problem solves itself.