Context window management: treating LLM context as working memory, not unlimited storage
Context Windows Are Not Scratch Pads
Most engineers I talk to treat a 200K token context window like a gift. More space means more context, more context means better results, better results mean ship it. That logic feels intuitive. It is also how you end up with a production agent that confidently contradicts its own instructions at turn 47.
The mental model is wrong. A context window is not storage. It is working memory. And as anyone who has studied cognitive science, or simply tried to hold too many things in their head at once, can tell you: working memory degrades under load.
What the Research Actually Says
Anthropic’s work on attention patterns in transformer models makes this concrete. Attention is not uniform across a context window: models tend to weight tokens near the beginning and end of the window more heavily than content in the middle. This is the “lost in the middle” problem, which researchers at Stanford documented formally in 2023 (Liu et al., “Lost in the Middle: How Language Models Use Long Contexts”), finding that retrieval accuracy on multi-document QA tasks dropped significantly when the relevant information appeared in the middle of a long context rather than at the edges.
So when you dump 180K tokens of codebase into a Claude or GPT-4 session and the answer you need is buried in the middle, you are statistically more likely to get a wrong or vague response. The model sees all the tokens. It does not reason equally well over all of them.
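One practical response to this attention pattern is to control where things land in the window. The sketch below is a minimal, hypothetical prompt assembler (the function and section names are mine, not any library's API) that puts critical instructions at the attention-favored edges and relegates bulk material to the middle:

```python
def assemble_prompt(system_rules: str, documents: list[str], query: str) -> str:
    """Order prompt sections so the positions the model attends to most
    (the start and end of the window) carry the critical content."""
    parts = [
        system_rules,   # beginning: strongly attended
        *documents,     # middle: weakest attention region, bulk goes here
        system_rules,   # restate the rules near the end so they stay salient
        query,          # the actual question, last
    ]
    return "\n\n".join(parts)
```

Restating the rules costs a few tokens twice; burying them once in the middle can cost you the answer.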
The Production Agent Problem
I have watched this play out in real systems. A team builds an agentic loop, celebrates when they hit a high task-completion rate on short sessions, then deploys it and watches performance fall apart over longer runs. The post-mortem usually blames the model. The real culprit is context hygiene.
As a conversation or agent loop grows, a few things happen. Earlier instructions get diluted. Contradictory information accumulates, because no one pruned it. The model starts blending what it was explicitly told with what it inferred from the accumulated context, and those two things are not the same. By turn 50, you have a context that looks authoritative but is actually a mess of stale state and ambiguous signals.
Bigger context windows make this worse in a counterintuitive way. They lower the urgency to manage context actively, so teams don’t.
What You Should Do Instead
Treat context as a budget, not a buffer. Decide what actually needs to be in the window at each step and actively remove what doesn’t.
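A budget is easy to make literal. This is a sketch under assumptions (names are illustrative, and the word-count tokenizer is a stand-in for a real one): sections are admitted in priority order until the budget is spent, and everything else is dropped instead of filling the window by default.

```python
def fit_to_budget(sections, budget, count_tokens):
    """Admit (priority, text) sections in priority order until the token
    budget is spent; the rest are dropped, not deferred."""
    kept, used = [], 0
    for priority, text in sorted(sections):
        cost = count_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept

# Stand-in tokenizer; swap in your model's real tokenizer in practice.
count = lambda text: len(text.split())
```

The point is the forcing function: every candidate section has to justify its cost at every step.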
For agents, this means summarizing completed sub-tasks rather than keeping the full transcript. Write a short, dense summary of what was decided and why, then discard the raw back-and-forth. You keep the conclusions without paying for the noise.
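A minimal sketch of that compaction step, assuming a chat-message history of dicts (the summarizer shown is a toy; in a real agent it would be a cheap LLM call answering "what was decided and why"):

```python
def compact_finished_subtask(transcript: list[dict], summarize) -> list[dict]:
    """Collapse a finished sub-task's raw back-and-forth into one dense
    summary message: conclusions survive, chatter does not."""
    return [{
        "role": "system",
        "content": "Completed sub-task: " + summarize(transcript),
    }]

# Hypothetical summarizer for illustration only: keep messages flagged
# as decisions, drop everything else.
def keep_decisions(transcript: list[dict]) -> str:
    return "; ".join(m["content"] for m in transcript if m.get("is_decision"))
```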
For RAG systems, chunk and retrieve rather than inject full documents. A 50-page spec injected wholesale is not the same as the three paragraphs actually relevant to the current query. The model does not reward you for volume.
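The shape of that retrieval step, stripped to essentials. This is deliberately crude: fixed-size word chunks and lexical-overlap ranking stand in for the structural chunking and embedding search a production RAG system would use.

```python
def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size word chunks with overlap; real systems often chunk on
    document structure (headings, paragraphs) instead."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def retrieve(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Rank chunks by shared terms with the query; a stand-in for
    embedding similarity search."""
    terms = set(query.lower().split())
    return sorted(chunks, key=lambda c: -len(terms & set(c.lower().split())))[:k]
```

Injecting `retrieve(chunks, query)` instead of the whole document is the difference between three relevant paragraphs and a 50-page haystack.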
For long conversations, consider periodic context resets with a structured handoff. Summarize the session state, start a new context, inject the summary at the top. This feels manual and slightly awkward. It also works.
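The handoff can be as simple as a fixed schema serialized into the first message of the new context. The keys below (goal, done, open, constraints) are one possible schema I am assuming, not a standard:

```python
def handoff_note(state: dict) -> str:
    """Serialize session state into a compact, structured handoff note."""
    return "\n".join([
        "SESSION HANDOFF",
        "Goal: " + state["goal"],
        "Done: " + "; ".join(state["done"]),
        "Open: " + "; ".join(state["open"]),
        "Constraints: " + "; ".join(state["constraints"]),
    ])

def fresh_context(state: dict) -> list[dict]:
    # New conversation: the handoff goes first, where attention is strongest.
    return [{"role": "system", "content": handoff_note(state)}]
```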
The Bigger Picture
There is a version of this problem that compounds as models get more capable. Gemini 1.5 Pro supports a 1 million token context window. That is genuinely impressive engineering. It is also an invitation to stop thinking about what belongs in context at all.
I’d argue that context discipline is going to matter more as windows expand, not less. A model that can technically process 1M tokens will still exhibit attention drift, still blend inferred state with explicit instruction, and still produce degraded outputs when the context is a dumping ground. The failure mode just takes longer to appear, which makes it harder to debug.
The engineers I have actually seen ship reliable agentic systems are not the ones chasing the biggest window. They are the ones treating context like RAM on a constrained system: careful about what gets loaded, deliberate about what gets flushed, and suspicious of anything that sits in memory longer than it needs to.
That instinct is not new. Systems programmers have thought this way for decades. We just forgot to bring it with us into the LLM era.
#LLM #AIEngineering #MachineLearning #ContextWindows #AgentArchitecture #ProductionAI
