New AI Reasoning System Shocks Researchers: Unlimited Context Window

By AI Revolution

Recursive Language Models: A Deep Dive

Key Concepts:

  • Context Window: The amount of text a language model can process at once.
  • Context Rot: The degradation of performance in language models as input length increases.
  • Recursive Language Models (RLMs): A new approach where the model interacts with a large prompt as an external world, exploring and processing it in steps rather than attempting to ingest it all at once.
  • OOLONG-Pairs: A benchmark task measuring a model’s ability to transform or compare large numbers of entry pairs.
  • REPL Environment: The external workspace (a read-eval-print loop) where the large input resides in RLMs.
  • LLM Batch: A technique used by Prime Intellect to dispatch multiple small helper-model tasks simultaneously.
  • Ablation Study: A method of testing by removing components to assess their impact.
  • Out-of-Core Algorithms: Computer science algorithms designed to process datasets larger than available memory.

The Problem with Expanding Context Windows

For the past few years, the trend in language models has been toward ever-larger context windows, growing from 8,000 to 32,000, 100,000, and even 1 million tokens. While this seems like a solution for processing vast amounts of information, simply increasing the context window doesn’t translate into improved performance. Performance degrades with longer inputs (“context rot”), costs escalate, and models eventually “lose the plot.” The degradation is particularly pronounced in complex tasks that require relating multiple parts of the input. Benchmarks like OOLONG and OOLONG-Pairs demonstrate this: models like GPT-5 suffer a sharp performance drop, especially on tasks with linear or quadratic complexity. On OOLONG-Pairs specifically, GPT-5’s F1 scores fall close to zero as input length increases, well before the hard context limit is reached. This indicates the issue isn’t solely token capacity, but how the model processes the information.

Introducing Recursive Language Models (RLMs)

MIT and Prime Intellect propose a fundamentally different approach: Recursive Language Models (RLMs). Instead of forcing the model to process a massive prompt at once, RLMs treat the prompt as an external “world” the model can explore. The model doesn’t read everything; it “pokes around,” inspects pieces, searches, and even utilizes smaller versions of itself for assistance. This shifts the focus from memory capacity to intelligent information navigation.

How RLMs Work: A Step-by-Step Process

  1. External Workspace: The entire input text resides outside the model in a REPL environment, like a large document on a desk.
  2. Initial Glance: The model begins with a quick overview to understand the input’s structure (e.g., list, documents, code).
  3. Selective Search: The model searches for relevant keywords, patterns, or lines, ignoring irrelevant information.
  4. Chunking & Delegation: The input is broken down into smaller chunks, which can be processed by cheaper, smaller AI models.
  5. Iterative Refinement: The main model collects useful information from these chunks and combines it into an answer.
  6. Piecewise Output: For long answers, the model builds the response piece by piece, saving segments and assembling them at the end.
  7. Recursive Calls: The model can revisit, refine, or verify its work using smaller, focused calls to itself.

This process resembles flipping through a book, highlighting key passages, and consulting an assistant for summaries, rather than memorizing the entire text.
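The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: `call_small_model` stands in for a cheap helper LLM and is stubbed here as simple keyword extraction, and the chunk size is arbitrary.

```python
def call_small_model(task: str, chunk: str) -> str:
    """Hypothetical helper-model call; stubbed as keyword extraction."""
    return "\n".join(line for line in chunk.splitlines() if task in line)

def rlm_answer(context: str, query: str, lines_per_chunk: int = 50) -> str:
    # The context stays external; the model only ever sees small slices.
    lines = context.splitlines()
    # Chunking: split the external input into manageable pieces.
    chunks = ["\n".join(lines[i:i + lines_per_chunk])
              for i in range(0, len(lines), lines_per_chunk)]
    # Delegation: each chunk goes to a cheaper helper model.
    findings = [call_small_model(query, c) for c in chunks]
    # Iterative refinement: keep only the useful findings.
    useful = [f for f in findings if f]
    # Piecewise output: assemble the answer from saved segments.
    return "\n".join(useful)

docs = "\n".join(f"doc {i}: filler text" for i in range(100))
docs += "\ndoc 100: the answer is 42"
print(rlm_answer(docs, "answer"))  # → doc 100: the answer is 42
```

In a real system the helper call would be an API request to a smaller model, and the main model would decide dynamically which chunks merit a closer look rather than scanning all of them.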

Performance & Cost Analysis

RLMs demonstrate significant performance improvements, particularly on complex tasks.

  • Accuracy: When paired with GPT-5, RLM achieves over 91% accuracy on a benchmark involving up to 1,000 documents (millions of words). This contrasts sharply with traditional methods.
  • Cost: The average cost per query with RLM + GPT-5 is under $1, compared to $1.50 - $3 for directly processing the entire input.
  • Code QA (LongBench V2): GPT-5 alone achieves 24% accuracy, a summarization agent improves it to 41.33%, while RLM reaches 62%.
  • OOLONG-Pairs: GPT-5 scores ~0.04 F1, summarization agents score near zero, CodeAct with retrieval reaches 24.6, and RLM achieves 58.0.
  • Qwen3-Coder: The base model scores below 0.1 F1; RLM reaches 23.11.

An ablation study revealed that even without recursive sub-calls, simply moving the context into an external environment (the “REPL-only” variant) significantly improved performance, reaching 66% accuracy on some tasks. This highlights the importance of offloading the context from the model’s memory.
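The REPL-only idea can be made concrete with a small sketch: the context lives in an environment object, and the model issues short inspection commands against it instead of loading the whole text into its prompt. The command names here (`peek`, `grep`, `count`) are illustrative, not from the paper.

```python
class ReplEnvironment:
    """External workspace: holds the full input outside the model."""

    def __init__(self, context: str):
        self._lines = context.splitlines()

    def peek(self, n: int = 5) -> list[str]:
        """Initial glance: look at the first n lines to learn the shape."""
        return self._lines[:n]

    def grep(self, pattern: str) -> list[str]:
        """Selective search: return only the lines that match."""
        return [line for line in self._lines if pattern in line]

    def count(self) -> int:
        """Cheap metadata the model can use to plan its exploration."""
        return len(self._lines)

env = ReplEnvironment("\n".join(
    ["INFO boot ok"] * 3 + ["ERROR disk full"] + ["INFO heartbeat"] * 3
))
print(env.count())        # → 7
print(env.grep("ERROR"))  # → ['ERROR disk full']
```

Even with no helper models at all, the main model answering from `grep` results sees a handful of relevant lines instead of the entire log, which is the effect the ablation isolates.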

Prime Intellect’s Implementation (RLM Env)

Prime Intellect built upon the MIT blueprint to create RLM Env, a concrete RLM system. Key features include:

  • Focused Workspace: The main AI has access only to a simple workspace, avoiding web browsing or large tool outputs.
  • Helper Models: Heavy lifting tasks (web search, file access) are delegated to smaller, specialized models.
  • LLM Batch: Enables parallel processing of multiple small tasks.
  • Strict Output Rule: The model must clearly write its final answer to a designated location, preventing wandering thoughts and ensuring a concise output.
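The batching feature can be sketched with `asyncio`: many small helper tasks are launched concurrently instead of one large sequential call. `helper_model` is a stand-in stub; a real system would await an API client here.

```python
import asyncio

async def helper_model(task: str) -> str:
    """Hypothetical small-model call; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return f"done: {task}"

async def llm_batch(tasks: list[str]) -> list[str]:
    # gather() launches every helper call at once, so total wall time is
    # roughly one call's latency rather than the sum of all of them.
    return await asyncio.gather(*(helper_model(t) for t in tasks))

results = asyncio.run(llm_batch(["summarize ch. 1", "summarize ch. 2"]))
print(results)  # → ['done: summarize ch. 1', 'done: summarize ch. 2']
```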

Model-Specific Behavior & Future Directions

Testing across various scenarios (deep-dive web research, math/Python, OOLONG) showed that both GPT-5 and Qwen3-Coder benefited from the RLM structure but behaved differently. GPT-5 was more cautious and selective in its exploration, while Qwen3-Coder was more aggressive, splitting tasks into smaller units. A simple warning to Qwen3-Coder to avoid overusing helper calls significantly altered its behavior, demonstrating how much depends on the base model’s judgment.

Current RLMs are limited to one level of recursion and operate sequentially. The authors emphasize the potential for improvement through reinforcement learning, which could teach models to explore inputs efficiently, determine recursion depth, and optimize stopping criteria. They view RLM runs as a new form of “reasoning trace” that can be used for training. This approach mirrors out-of-core algorithms in computer science, utilizing small, fast working memory with symbolic access to a large external store.
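The out-of-core analogy can be shown in miniature: aggregate over a data source too large for working memory by streaming fixed-size chunks and keeping only a small running state, much as an RLM keeps only its working notes while the full input stays external. This is a generic illustration of the technique, not code from either project.

```python
import io

def out_of_core_sum(stream, chunk_lines: int = 2) -> int:
    """Sum integers from a line-oriented stream without loading it all."""
    total = 0  # small, fast working memory
    while True:
        # Read a fixed-size chunk from the large external store.
        chunk = [stream.readline() for _ in range(chunk_lines)]
        chunk = [line for line in chunk if line]  # drop EOF blanks
        if not chunk:
            break
        total += sum(int(line) for line in chunk)  # process, then discard
    return total

data = io.StringIO("\n".join(str(i) for i in range(1, 101)))
print(out_of_core_sum(data))  # → 5050
```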

Conclusion

RLMs represent a significant shift in how language models handle large inputs. By moving beyond the limitations of context windows and focusing on intelligent information navigation, RLMs unlock the potential for handling massive datasets, solving complex tasks, and achieving comparable or lower costs. This approach offers a promising path towards building agents capable of processing entire codebases, knowledge graphs, and extensive logs without losing crucial details. The future of language model scaling may lie not in simply making models bigger, but in making them smarter about how they access and process information.
