Prediction/insight on what it actually takes to build effective multi-agent engineering systems, beyond just prompting

The Wrong Question About AI Agents

Most developers I talk to are still asking “which model should I use?” It’s the wrong starting point. The model is almost irrelevant if the system around it is broken. I’ve watched teams spend weeks benchmarking GPT-4o against Claude 3.7 against Gemini, then hand the winner a vague task description and wonder why the output is mediocre. The model didn’t fail them. The architecture did.

What Actually Breaks Multi-Agent Systems

Andrej Karpathy put his finger on exactly this problem. His observation, which has been spreading through engineering circles, is that LLMs become dramatically better when you force them into disciplined workflows. That sounds obvious until you realize almost nobody does it. The failure modes he identified are specific and familiar to anyone who has run these systems for real: models assume instead of asking, they overengineer simple tasks, they hide confusion, they rewrite code you didn’t ask them to touch, and they optimize for looking done rather than being correct.

That last one is the killer. An agent that prioritizes completion over correctness will confidently produce something wrong every single time.

System-Framing vs. Task-Framing

The shift that changed my own output was moving away from task-framing entirely. “Write this function” is a task frame. It gives the agent one thing to do and no information about whether it did it right. The agent fills the gaps with assumptions, and those assumptions compound.

System-framing looks different. You give the agent a goal, hard constraints, a verification method, and explicit rules about when to stop and ask versus when to proceed. You’re not telling it what to do step by step. You’re giving it success criteria and letting it loop. The difference in output quality is not marginal.

The CLAUDE.md pattern that Karpathy referenced is a good example of this made concrete. These files are not elaborate prompts. They function more like operating system rules for the agent: think before coding, make surgical edits only, simplicity before optimization, goal-driven execution. When that context is present at the start of every session, the agent’s behavior changes in ways that no amount of in-context prompting replicates reliably.

Orchestration Is Not Pair Programming

Here’s the part most tutorials skip. Running one agent in a loop on one task is not orchestration. Real orchestration means parallel agents with separated concerns, one researching, one writing tests, one debugging, one validating outputs. Each agent has a narrow job and a clear handoff condition. They are not all touching the same code at the same time.

The people reporting the biggest productivity shifts are the ones who restructured their workflows this way. The rough numbers floating around are significant: developers moving from roughly 80% manual coding to 80% agent-driven coding within months. Not because the models got perfect, but because the leverage became too large to ignore once the system design was right.

What This Means for Engineering as a Skill

The highest-leverage engineers over the next few years will probably not be the ones who write the best code by hand. They’ll be the ones who design the best systems around agents. That’s a real change in what the job demands. It requires thinking in terms of constraints, verification loops, and failure modes rather than syntax and algorithms.

I find that genuinely interesting rather than threatening, but only because I made the mental shift from “agent as faster autocomplete” to “agent as team member with specific responsibilities and specific limits.” The second framing demands much more upfront design work. It also produces results that are actually reliable.

The teams still treating this like autocomplete are going to have a hard time competing with the teams that aren’t.

Sources & Further Reading

#AIEngineering #MultiAgentSystems #SoftwareEngineering #LLMs #AIAgents

Sources & Further Reading

Andrej Karpathy on disciplined AI coding workflows and CLAUDE.md patterns

Prediction/insight on what it actually takes to build effective multi-agent engineering systems, beyond just prompting

Sources & Further Reading

Perplexity’s always-on ‘Personal Computer’ Mac mini and the shift from reactive to ambient AI agents

Prediction: open-source TTS beating ElevenLabs signals that API-access moats are disappearing faster than most product teams realize

Context window management: treating LLM context as working memory, not unlimited storage

AI agent swarm reconstructs Operation Epic Fury in 4D from public OSINT data, raising questions about capability compression and information asymmetry

Claude Skills and progressive context disclosure as a real engineering pattern, not prompt engineering

3-year AI video generation progress comparison (Modelscope vs Grok Imagine v1)

Leave a Reply Cancel reply

Sources & Further Reading

Similar Posts

Leave a Reply Cancel reply