Anyone who has used a frontier AI model for a long, complex task has run into this problem. You're a hundred messages deep into a Claude Code session, or you've pasted a huge document into ChatGPT, and somewhere along the way the model starts making more mistakes. It forgets earlier details. It contradicts itself. Its performance falls off a cliff.
This phenomenon has a name: context rot. As the number of tokens in a model's context window grows, its ability to accurately recall and reason over earlier information degrades steeply. Most needle-in-a-haystack benchmarks miss this entirely because of how they are designed: they check whether the model can find a single fact hidden in filler text, which frontier models handle with ease. Real tasks, however, require a model to hold everything in view at once and reason over all of it together, and that is exactly what these benchmarks fail to measure.
Enter Recursive Language Models, a general inference framework that sidesteps context rot by never forcing a model to ingest huge inputs directly. The January 2026 preprint behind the system reports results that are, at times, hard to believe: the researchers got a much smaller model to approach GPT-5 performance on long-context tasks. Let's unpack the mechanism behind those results.
Treat the Prompt as an Object, Not an Input
The key idea of Recursive Language Models (RLMs) is simple. Instead of feeding a long prompt into the transformer, an RLM loads it into a persistent Python REPL environment as a variable. In effect, the model never "sees" the raw context the way a traditional model does. It sees metadata, such as how long the context is and what it looks like, and then writes code to:
- take a look at parts of the input
- split it into chunks
- retrieve and summarize relevant snippets
- recursively invoke the model on those snippets to create an answer
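The setup behind these steps can be sketched as a toy example: the long input lives in a Python variable, and the root model is shown only metadata plus instructions. The variable names and the `sub_llm` interface here are hypothetical illustrations, not the paper's actual prompt:

```python
# Sketch of the RLM setup: the long prompt is stored as a REPL variable,
# and the root model only ever sees metadata about it, never the raw text.

context = "..." * 1_000_000  # stand-in for millions of tokens of input

# What the root model is actually shown at the start of the loop:
metadata_prompt = (
    f"A variable `context` holds the user's input "
    f"({len(context):,} characters). You cannot read it directly. "
    "Write Python code to inspect slices, search it with regexes, or call "
    "sub_llm(snippet, task) on small pieces, then produce an answer."
)

print(metadata_prompt[:80])
```

From here, everything the model does is ordinary code execution against `context`, which is why the raw text never needs to enter its attention window.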
The models are called "recursive" not because they use a recursive neural architecture, but because the model is invoked recursively over subproblems.
Think of it like a senior engineer who doesn't read an entire codebase word-by-word, but writes scripts to find relevant patterns, delegates sub-investigations to junior colleagues, and synthesizes results at the end. From the outside, the interface looks exactly the same as if you were using a standard approach, meaning you send a prompt and get back a response. However, recursively calling the model on sub-problems and synthesizing the results at the end is precisely what allows RLMs to outperform standard approaches and sidestep the degradation that sets in when a model's context window starts to fill up.
Detailed Breakdown Of The Process
The easiest way to understand what is happening is to look at an RLM trajectory: a step-by-step record of the code the model writes, what it reads from the long text, and what sub-calls it makes.
In an RLM setup, the full text is not placed inside the model's attention window. Instead, it is stored in an external workspace, such as a sandboxed Python REPL, as a variable or a file. At the start, the main model sees only a small amount of information, like the total length of the text and often a short opening snippet, plus instructions for how to read the text from the workspace.
Because the model only sees a tiny part of the text at first, it usually starts by printing a small slice, often the first few hundred characters or lines, to figure out what the data looks like. It is trying to spot things like headings, separators, repeated patterns, or record formats without pulling the whole text into its prompt.
After it gets a basic idea of the structure, it often uses simple, cheap tools to narrow things down. This can include regex searches, keyword searches, or scanning lines for certain patterns. When the text is structured and the question matches clear keywords or fields, this can quickly point to the right areas and make them easy to double check.
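These two moves, peeking at a small slice and then narrowing with cheap searches, look roughly like this inside the REPL. The log-like context below is a toy stand-in, not real RLM output:

```python
import re

# Toy "long context": log-like records stored as one big string in the REPL.
context = "\n".join(
    f"[2024-01-{d:02d}] user_{d % 5} action={'login' if d % 3 else 'error'}"
    for d in range(1, 31)
)

# Step 1: peek at a small slice to learn the record format
# without pulling the whole text into the prompt.
print(context[:120])

# Step 2: cheap narrowing with a regex instead of reading everything.
error_lines = [ln for ln in context.splitlines() if re.search(r"action=error", ln)]
print(len(error_lines))
```

Only the short printed outputs enter the model's context; the full string stays in the workspace.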
For harder tasks, keyword search is not enough. If the model needs to understand the meaning of many entries, it often breaks the text into smaller chunks or selects likely sections first. Then it sends those chunks to sub-LLM calls with a clear mini-task, like pulling out relevant facts, labeling items, or computing partial totals. It saves the results in the workspace and then combines them into a final answer.
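This chunk-and-delegate step is essentially map-reduce. Here is a minimal sketch with a stubbed `sub_llm` standing in for a real recursive model call; the interface and the counting mini-task are hypothetical:

```python
def sub_llm(snippet: str, task: str) -> int:
    # A real RLM would invoke the model recursively here; this stub just
    # "solves" the mini-task of counting ERROR lines in the snippet.
    return snippet.count("ERROR")

# Toy context: 1,000 log lines stored in the workspace.
context = "\n".join("ERROR" if i % 7 == 0 else "OK" for i in range(1000))

# Split the stored context into chunks small enough for one sub-call each.
lines = context.splitlines()
chunks = ["\n".join(lines[i:i + 100]) for i in range(0, len(lines), 100)]

# Map: one mini-task per chunk. Reduce: combine the partial results.
partials = [sub_llm(c, task="count ERROR lines") for c in chunks]
total = sum(partials)
print(total)
```

Each sub-call sees only its own chunk, so no single model invocation ever holds the full text.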
For tasks that need a long or highly structured output, the model usually does not try to write everything in one go. Instead, it builds the answer in pieces over multiple steps, saving partial drafts, lists, or tables in the workspace and assembling them at the end. This gets around single-response length limits, though practical constraints remain: tool-output caps, logging limits, and cost and latency budgets.
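A sketch of that piecewise assembly: partial rows accumulate in the workspace across steps and are only joined into the final output at the end (the variable names and the toy table are illustrative):

```python
# Building a long, structured answer in pieces instead of one completion.
workspace: dict[str, list[str]] = {"rows": []}

records = [{"id": i, "score": i * i} for i in range(1, 6)]

# Each "step" appends one formatted row; a real RLM would spread this across
# several REPL turns, so no single response has to emit the whole table.
for rec in records:
    workspace["rows"].append(f"| {rec['id']} | {rec['score']} |")

table = "\n".join(["| id | score |", "| -- | ----- |"] + workspace["rows"])
print(table)
```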
One big benefit of this setup is that it is fully transparent. Since the system records the full trajectory, you can review exactly what the model did, what it looked at, and what sub-calls it made. That transparency comes from the surrounding system and its logs, not from the base language model itself.
Why Is This Approach Different From What Already Exists
Many long context agent systems use context compaction. When the running history gets too long, they compress older turns into a shorter summary so the model can keep going. The downside is that summarization is lossy, so some details can be dropped and cannot be reliably recovered later.
Retrieval agents and coding agents often work by selecting relevant snippets from documents or codebases and copying those snippets into the model context window. This helps, but if the task needs lots of snippets or lots of intermediate reasoning, the window can still fill up. At that point the system is forced to either drop information, compress it, or stop adding context.
RLMs solve both problems at once. The full prompt is stored outside the model in an execution environment, and the model interacts with it by running code that inspects small parts at a time. This gives the model a symbolic handle to the prompt so it can work with very large inputs without copying them into the root context window.
They also let the model build intermediate results and even large outputs inside the environment across many steps, including sub-calls on smaller slices of the data. This reduces pressure on both the context window and single-completion output limits, while still being bounded by practical system limits.
How Much Better Do RLMs Actually Perform
The paper reports results for GPT-5 and Qwen3-Coder-480B-A35B on three benchmarks:
- BrowseComp-Plus
- OOLONG
- OOLONG-Pairs
On BrowseComp-Plus, a multi-hop research task over 1,000 documents totaling 6 to 11 million tokens, base GPT-5 cannot run on the full input and is reported as 0.0 due to context limits. RLM(GPT-5) scores 91.3 percent, beating the benchmark's summary-agent baseline by 20.8 points.
On OOLONG, which requires semantically labeling many entries and aggregating results, an RLM using GPT-5 beat base GPT-5 by 12.5 points. It also beat the summary agent by 10.5 points. On the same benchmark, an RLM using Qwen3-Coder-480B-A35B beat base Qwen3-Coder by 12.0 points.
On OOLONG-Pairs, which is constructed to require pairwise aggregation and therefore scales quadratically, an RLM using Qwen3-Coder-480B-A35B beat base Qwen3-Coder by about 23.0 F1 points.
The Costs
Of course, there are downsides to using RLMs. On a typical query, RLMs cost about the same as a direct model call or even less. But some trajectories, particularly on complex tasks with many sub-calls, can end up significantly more expensive. So while the average cost looks reasonable, the worst cases do not.
RLMs also aren't always the right tool. On short inputs that fit easily within a model's context window, they actually perform slightly worse than just calling the model directly. The overhead of setting up the REPL and managing sub-calls adds unnecessary complexity when the task didn't need any of that in the first place. They work best when the context is genuinely long and difficult, not as a blanket replacement for standard model calls.
Speed is another limitation worth noting. Right now, every sub-call waits for the previous one to finish before starting. On long trajectories, that adds up. Running sub-calls in parallel rather than one at a time would likely cut runtimes significantly, though that improvement hasn't been implemented yet.
Implementation
An official implementation of the framework is available under an MIT license at https://github.com/alexzhang13/rlm, and you can try out RLMs quickly by installing from PyPI. The default RLM client uses a REPL environment that runs in the host process and shares its virtual environment, with some limitations on the globally available modules.
The implementation supports two types of REPL environments: isolated and non-isolated. Non-isolated environments run on the same machine as the RLM and are fine for low-risk tasks, but are a poor choice when the generated code poses a security risk. Fully isolated environments use cloud-based sandboxes, such as Modal Sandboxes, to run RLM-generated code, ensuring complete isolation from the host process.
Finally, the system supports the use of models from OpenAI and Anthropic, as well as models provided by router platforms such as OpenRouter, Portkey and LiteLLM. In terms of local deployment, vLLM is recommended.
Are RLMs The Definitive Solution To Context Window Problems
Recursive Language Models do not give transformers infinite memory. They change the workflow. Instead of forcing a model to absorb millions of tokens at once, the prompt lives in an external workspace and the model interacts with it through small, targeted reads plus code that searches, chunks, and delegates sub-tasks. That is why they can avoid context rot.
Of course, this approach is not free. There is overhead from orchestration, some runs can become expensive when many sub-calls are needed, and sequential tool use can slow things down. Execution also raises security and isolation concerns, which is why sandboxing matters. So RLMs are not a blanket replacement for direct prompting. They are most useful when the input is genuinely long, messy, and interdependent.
To summarize, RLMs are an interesting way to tackle context window limits by working with the text in smaller pieces instead of trying to fit everything into one prompt. They probably won't end up being "the solution" to context window problems, but they demonstrate how thinking outside the box can lead to big improvements.