Hot take on knowing when NOT to automate with AI agents, and the emerging skill of restraint in AI engineering

The Hardest Skill in AI Engineering Is Knowing When to Stop

I’ve been sitting with an uncomfortable observation for a few months now. We’ve crossed some kind of threshold in AI tooling where the answer to “can an agent do this?” is almost always yes. The models are capable enough, the frameworks are mature enough, the APIs are cheap enough. So we build the agent. Every time. Because we can.

And I think that’s becoming a real problem.

🔧 The Automation Debt Nobody Talks About

Technical debt gets a lot of attention. Automation debt doesn’t. But it’s real, and it compounds faster.

Here’s what it looks like in practice. You build a pipeline. It works. You add another skill invocation. Still works. Six months later, debugging a single failure requires tracing through eight different agent calls to find where things went sideways. The system does its job until it doesn’t, and when it doesn’t, nobody can explain why.

I watched a thread light up this week around the Anthropic team’s writeup on how they use skills in Claude Code. Smart people, genuinely valuable work. But the discourse around it was mostly “what else can I automate?” Very little “what should I leave alone?”

That asymmetry in how we think is the problem.

🎯 Where Agents Actually Belong

Agents are genuinely good at a specific kind of task. High volume, low variance, well-defined success criteria, and tolerant of occasional errors. Data transformation pipelines. Routine document classification. Scheduled report generation. Tasks where a human doing the same thing for the 400th time is the bigger risk.

They are not good at tasks where the definition of “correct” shifts depending on context that isn’t in the prompt. They struggle with low-frequency edge cases that carry high stakes. They break badly when the environment changes in ways nobody anticipated when writing the skill.

The failure mode I see constantly is engineers treating “technically possible” as equivalent to “worth automating.” Those are not the same thing.
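The criteria above can be sketched as a pre-build gate. This is a hypothetical checklist, not a real tool; every field and threshold here is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class Workflow:
    """Hypothetical description of a candidate workflow (illustrative fields)."""
    runs_per_week: int               # volume
    success_is_well_defined: bool    # can outputs be scored mechanically?
    tolerates_occasional_errors: bool
    needs_contextual_judgment: bool  # does "correct" shift with context not in the prompt?

def worth_automating(w: Workflow, min_volume: int = 50) -> bool:
    """'Technically possible' is not the bar; this is."""
    if w.needs_contextual_judgment:
        return False  # the definition of "correct" isn't in the prompt
    if not w.success_is_well_defined:
        return False  # you can't monitor what you can't score
    if not w.tolerates_occasional_errors:
        return False  # agents will occasionally be wrong
    return w.runs_per_week >= min_volume  # high volume is where the human is the bigger risk

# The twice-a-week, judgment-heavy workflow from the argument above:
review = Workflow(runs_per_week=2, success_is_well_defined=False,
                  tolerates_occasional_errors=False, needs_contextual_judgment=True)
print(worth_automating(review))  # → False
```

The point of writing it down, even as a toy, is that the default answer becomes no until the workflow clears every gate, rather than yes because the agent is buildable.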

When the Cost of Being Wrong Is High

There’s a threshold question every automation decision needs to pass: what happens when this fails silently?

Some failures are recoverable. A mis-tagged support ticket gets rerouted. Fine. But some failures cascade. An agent that autonomously touches financial records, customer communications, or deployment pipelines is operating in territory where silent errors can travel far before anyone notices.
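That threshold can be made mechanical: gate autonomous actions by blast radius, and make high-stakes domains fail loudly instead of silently. A minimal sketch, assuming a hypothetical domain taxonomy and approval flag:

```python
# Hypothetical sketch: domains where a silent error can cascade require a human gate.
# The domain names and the approval mechanism are illustrative assumptions.
HIGH_BLAST_RADIUS = {"financial_records", "customer_communications", "deployment"}

def execute(action: str, domain: str, approved_by_human: bool = False) -> str:
    """Recoverable domains run autonomously; cascading ones are held for review."""
    if domain in HIGH_BLAST_RADIUS and not approved_by_human:
        # A silent error here can travel far before anyone notices,
        # so refuse loudly rather than act autonomously.
        return f"HELD for review: {action} ({domain})"
    return f"EXECUTED: {action} ({domain})"

print(execute("reroute ticket #4521", "support_tickets"))   # → EXECUTED: ...
print(execute("issue refund", "financial_records"))         # → HELD for review: ...
```

A mis-tagged ticket goes through because rerouting it later is cheap; the refund waits for a human because unwinding it later isn’t.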

The 2x speed improvement in GPT-5.4 mini that OpenAI announced this week is impressive. But faster and cheaper agents will make this problem worse before they make it better: the economic argument for automating gets stronger, while the judgment about whether you should automate doesn’t improve alongside it.

Restraint as Engineering Discipline

The senior engineers I respect most have a habit that looks like laziness from the outside. They ask “do we actually need to automate this?” before asking how. They will look at a workflow that runs twice a week, involves genuine judgment calls, and affects something important, and say: a human should do this. Not because they can’t build the agent, but because they’ve thought about what happens when it breaks at 2am.

That’s not timidity. That’s architecture.

The skill that separates decent AI engineers from good ones right now isn’t prompt engineering or RAG optimization. It’s having an honest framework for when automation creates more risk than it removes, and being willing to say so when the answer is uncomfortable.

The teams that figure this out will have systems they can actually maintain in two years. The ones that don’t will be rewriting everything from scratch, wondering how it got so complicated.

Build the agent when it earns its place. Leave it out when it doesn’t.

#AIEngineering #MLOps #SoftwareEngineering #ArtificialIntelligence #AgentDesign

Watch the full breakdown on YouTube
