Context Rot
How Increasing Input Tokens Impacts LLM Performance
Context Rot started as an investigation into instruction retrieval for agents. I was running SWE-bench and analyzing traces to see where exactly instruction retrieval would be useful. What stood out more, however, was how the agent would make nonsensical choices the further it got into its trajectory (e.g. repeating the same mistake multiple times, or forgetting critical information from earlier on). Anecdotally, I had also been noticing quality degradation in long ChatGPT/Claude conversations and AI coding sessions.
This got us interested in investigating the impact of context length on LLM performance directly. We started with a simple experiment on text replication (which later turned into our Repeated Words experiment) across various context lengths, and found performance degradation as context length grew, which gave us our initial validation.
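The replication task can be sketched roughly as follows: the model is asked to reproduce a sequence of one word repeated N times, with a single different word inserted at a random position, so the task itself stays trivial while N (and therefore context length) grows. This is a minimal illustrative sketch, not our actual harness; the words and function name are made up for the example.

```python
import random

def make_repeated_words_prompt(word: str, unique: str, n: int, seed: int = 0) -> str:
    """Repeat `word` n times, replacing one position with `unique`.

    Only n changes between trials, so context length is the sole variable.
    """
    rng = random.Random(seed)
    words = [word] * n
    words[rng.randrange(n)] = unique
    return "Replicate the following text exactly:\n" + " ".join(words)

# Same task at three context lengths:
prompts = {n: make_repeated_words_prompt("apple", "apples", n) for n in (100, 1_000, 10_000)}
```

Scoring is then just a comparison of the model's output against the sequence it was asked to replicate.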
A key guiding principle in designing the experiments was to keep task difficulty consistent across context lengths. For many tasks, difficulty tends to scale with context length. Take a math problem, for instance: if the amount of relevant context increases, the problem probably becomes harder too, since there is more complexity to work through. Or in a retrieval task (i.e. finding a specific fact within a long piece of text), a longer context tends to mean more distractors, which in turn increases task difficulty. This shaped the setup for our Needle in a Haystack and Repeated Words experiments, where we scale the haystack with irrelevant content, repeat the same word, and so on, so that context length is the only variable that changes.
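The Needle in a Haystack side of this principle can be sketched the same way: one fixed needle and one fixed question, embedded in a haystack of irrelevant filler whose size is the only thing that varies. The needle, filler sentence, and helper below are illustrative placeholders, not the actual experiment code.

```python
import random

NEEDLE = "The special magic number mentioned in this document is 7421."
FILLER = "Nothing in this sentence is relevant to the question being asked."

def build_haystack(n_filler: int, shuffle: bool = False, seed: int = 0) -> str:
    """Embed one fixed needle among `n_filler` irrelevant sentences.

    The needle and question never change; only haystack size (and,
    optionally, sentence order) does.
    """
    rng = random.Random(seed)
    sentences = [FILLER] * n_filler
    sentences.insert(rng.randrange(n_filler + 1), NEEDLE)
    if shuffle:
        rng.shuffle(sentences)
    return " ".join(sentences)

# Same needle, same question, three context lengths:
prompts = {n: build_haystack(n) for n in (10, 100, 1_000)}
```

The `shuffle` flag corresponds to the structured-versus-shuffled comparison discussed below: with real haystack text in place of uniform filler, shuffling destroys the flow of ideas while holding content and length fixed.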
Across all experiments, we found performance degradation with increasing context length. Alongside this main finding, a few other observations stood out to me:
- Shuffling the contents of the surrounding haystack slightly improves task performance. I found this counterintuitive: I had expected a random fact that breaks the flow of logical ideas to stand out more, and thus make the retrieval task easier. It turned out to be the opposite.
- Distinct behaviors across model families. For example, when analyzing failures in the NIAH experiments, Claude models tended to abstain the most when uncertain, while GPT models abstained the least.
It’s also worth noting that these results don’t necessarily mean that LLMs with their current architecture will always degrade on long context. More training may well lead to more consistent performance down the line. This just reflects the state of the models as of July 2025, when the results were published.
A direction of future work I’d be particularly interested in is understanding the mechanisms behind this performance degradation, both generally around increasing context length and for specific observations like how the structure of ideas (e.g. shuffled context) impacts performance.
Featured in
- Anthropic’s Opus 4.6 release and Effective context engineering for AI agents
- OpenAI’s Codex subagents
- Hamel Husain & Shreya Shankar’s AI evals course
- Paper analysis: Yannic Kilcher, Matt Palmer
- Recursive Language Models (RLM)