
Context windows: how much can the model hold — and what happens when it forgets

What context windows are, why they matter for your product, and the practical difference between 128K and 200K tokens in real applications.

Think of it as the model’s desk.

Everything on the desk, the model can see and work with. Your instructions, the conversation history, the document you uploaded, the code you pasted — if it is on the desk, the model uses it. The moment the desk is full, something has to come off before anything new can go on.

That is a context window. The maximum amount of text a model can hold in a single session — input and output combined. When you hit the limit, earlier messages fall off the edge. The model does not tell you this has happened. It simply stops knowing what was said before.
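To get a feel for the budget, you can estimate token counts without a real tokenizer. A minimal sketch, assuming the common rule of thumb of roughly 4 characters per token for English prose (actual tokenizers vary by model, so treat this as a ballpark only):

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# Input and output share one window, so budget both sides.
prompt = "Summarise the uploaded contract in three bullet points."
reply_budget = 500  # tokens reserved for the model's answer
print(estimate_tokens(prompt) + reply_budget)
```

For production use, swap the heuristic for your model's actual tokenizer; the point is that the prompt and the expected reply draw from the same pool.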

The numbers that matter

The context window range across current models is wide. Some models work with 8,000 tokens — roughly a dozen pages of text. Others handle 200,000 or more — about two full novels. The difference is not academic. It determines what you can build.

A product that summarises long legal documents needs a different context window than a product that answers one-sentence questions. A coding assistant that holds an entire codebase in context produces better suggestions than one that can only see the current file.

What actually happens when you hit the limit

Most applications hit context limits before their developers expect to. A chat history that feels short to a user — 30 messages back and forth — can easily consume 20,000 tokens. Add a system prompt, the document being discussed, and the model’s responses, and you are at the limit faster than you think.
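The arithmetic adds up faster than intuition suggests. A toy tally under the rough 4-characters-per-token heuristic — the system prompt, document, and turn contents below are all made-up stand-ins, not real measurements:

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough ~4 chars/token heuristic

# Illustrative stand-ins for a real session's contents.
system_prompt = "You are a careful contract-review assistant. " * 10
document = "WHEREAS the parties agree to the following terms... " * 400
turns = ["Can you check the liability cap in clause 4 for me?"] * 30

total = (estimate_tokens(system_prompt)
         + estimate_tokens(document)
         + sum(estimate_tokens(t) for t in turns))
print(f"approximate tokens in context: {total}")
```

Even with short user turns, the shared document dominates the budget — which is usually where the surprise comes from.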

When the context fills, you have three options: summarise earlier content and replace it, selectively remove older messages, or use a model with a larger window. Each has tradeoffs. The right choice depends on whether conversational continuity matters more than cost.
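The second option — dropping older messages — can be sketched as a sliding window. A sketch under stated assumptions: messages are plain strings in oldest-first order, tokens come from the crude 4-characters-per-token heuristic, and the system prompt is never dropped:

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough ~4 chars/token heuristic

def trim_to_budget(system_prompt: str, messages: list[str],
                   budget: int) -> list[str]:
    """Drop the oldest messages until everything fits the token budget."""
    kept = list(messages)

    def used() -> int:
        return estimate_tokens(system_prompt) + sum(
            estimate_tokens(m) for m in kept)

    while kept and used() > budget:
        kept.pop(0)  # the oldest message falls off the desk first
    return kept
```

Summarising instead of dropping preserves continuity at the cost of an extra model call per trim; dropping is cheaper but loses detail the user may still expect the model to remember.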

The desk analogy, continued

A larger desk is not always better. A cluttered desk slows you down. Models can lose focus in very long contexts — performance on information buried in the middle of the context degrades compared to information near the beginning or the end. Precise context — only what the model needs — outperforms stuffed context almost every time.

Start here: Count the tokens in your most recent production prompt. If you are above 50% of your model’s context window, you have a context management problem waiting to happen.
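That check can be automated. A sketch assuming a 128K-token window and the rough heuristic from above — for a real audit, substitute your model's tokenizer, its actual window size, and your actual production prompt in place of the stand-in string:

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough ~4 chars/token heuristic

def window_share(prompt: str, window_tokens: int = 128_000) -> float:
    """Fraction of the context window a prompt occupies."""
    return estimate_tokens(prompt) / window_tokens

# Stand-in for a real production prompt.
production_prompt = "You are a helpful assistant. " * 100
share = window_share(production_prompt)
if share > 0.5:
    print(f"warning: {share:.0%} of the window used")
else:
    print(f"ok: {share:.0%} of the window used")
```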

This article is part of a growing knowledge track. More depth, examples, and detail added continuously.