DeepSeek Engram: We’ve Been Building LLMs Wrong

DeepSeek’s Engram: Conditional Memory for Efficient LLMs

Key Concepts:

Large Language Models (LLMs): AI models trained on massive datasets of text to generate human-like text.
Transformer Architecture: The dominant architecture for LLMs, based on self-attention mechanisms.
Complex Reasoning: Tasks requiring synthesis, evidence weighing, and argument building.
Simple Recall: Tasks involving retrieving factual information.
Mixture of Experts (MoE): A conditional computation technique where tokens are routed to different “expert” networks.
Engram: DeepSeek’s conditional memory mechanism utilizing scalable lookup tables.
N-grams: Sequences of n tokens extracted from text.
Conditional Memory: A memory mechanism activated based on the input context.
System 1 & System 2 Thinking: A cognitive psychology framework describing fast, automatic thinking (System 1) and slow, deliberate thinking (System 2).
CKA (Centered Kernel Alignment): A metric used to compare the representational similarity between different neural network layers.

1. The Inefficiency of Traditional LLMs

The core problem with current transformer-based LLM architectures is computational waste. An LLM performs the same computation, a pass through dozens of transformer layers, regardless of whether the task requires deep reasoning or simple factual recall. For example, recalling that Paris is the capital of France involves the same processing as answering “Why did the Roman Empire fall?” This is inefficient because simple recall could in principle be handled with a dictionary-like lookup. The Engram paper from DeepSeek addresses this by proposing a mechanism that treats the two kinds of task differently.

2. Distinguishing Between Reasoning and Recall

LLMs are generally tasked with two types of operations: complex reasoning and simple recall. Complex reasoning, exemplified by questions like “Why did the Roman Empire fall?”, demands synthesis, evidence evaluation, and argument construction. Simple recall, such as “Who is Princess Diana?”, requires only factual retrieval. Traditional transformers treat both identically, applying the full computational pipeline (e.g., 30+ layers) to both. This is a “hidden inefficiency” as it applies deep thinking to tasks that should be instant lookups.

The paper illustrates this inefficiency with the example of processing “Diana, Princess of Wales.” A traditional model requires six layers to reconstruct this information, starting with identifying Wales as a country and progressively building context, when a simple lookup would suffice. This demonstrates that transformers are using deep computation to simulate memory.

3. The Foundation: Feed Forward Layers as Key-Value Memories

A 2021 paper by Geva et al. showed that the feed-forward layers within transformers already function as key-value memories. The first projection of a feed-forward layer acts as a pattern detector (the “keys”), while the second projection writes the associated information into the output (the “values”). This suggests that transformers are already performing memory lookup through computation, leading to the question: why not provide them with an actual hash table?
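
The key-value reading of a feed-forward layer can be illustrated with a toy example. This is a minimal sketch, not code from either paper: the function name `ffn_as_memory`, the ReLU activation, and the tiny hand-written `keys` and `values` are assumptions chosen to make the idea visible.

```python
def relu(x):
    return max(0.0, x)

def ffn_as_memory(x, keys, values):
    """Read a two-projection feed-forward layer as a soft key-value memory.

    keys[i]   : pattern detector (one row of the first projection)
    values[i] : vector written to the output when keys[i] fires
    """
    # First projection + activation: how strongly does the input match each key?
    scores = [relu(sum(xi * ki for xi, ki in zip(x, key))) for key in keys]
    # Second projection: the output is the score-weighted sum of the values.
    dim = len(values[0])
    return [sum(s * v[d] for s, v in zip(scores, values)) for d in range(dim)]

# Hypothetical toy memory with two entries.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0, 0.0], [0.0, 20.0, 0.0]]

# An input that matches only the first key retrieves only the first value.
out = ffn_as_memory([1.0, 0.0], keys, values)
```

The lookup here is "soft": every key is compared against the input on every call, which is exactly the cost a real hash table avoids.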

4. Introducing Engram: Conditional Memory via Scalable Lookup

DeepSeek’s Engram addresses this by introducing a conditional memory mechanism designed to complement, not replace, the existing transformer architecture. Many LLMs already use Mixture of Experts (MoE) for conditional computation, routing each token to a small subset of experts. Engram adds a second axis of sparsity: conditional memory. Instead of routing to an expert, the model performs a direct lookup into massive embedding tables. This creates two forms of sparsity: one for computation (reasoning) and one for memory (recall).

5. How Engram Works: A Step-by-Step Process

N-gram Extraction: Given an input (e.g., “Alexander the Great”), the model extracts n-grams (e.g., “Alexander,” “the,” “Great,” “Alexander the,” “the Great,” “Alexander the Great”).
Hashing: Each n-gram is hashed with several independent hash functions (multi-head hashing), so that a collision under one hash is unlikely to repeat under the others.
Lookup in Embedding Table: These hashes point to a massive embedding table (billions of parameters) where embeddings are retrieved. This lookup is computationally inexpensive.
Context-Aware Gating: Retrieved embeddings pass through a context-aware gate. This gate assesses whether the retrieved memory aligns with the model’s current hidden state. If it does, the gate opens, allowing the memory to flow through; otherwise, it suppresses the memory.
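
The first three steps above can be sketched as follows. This is an illustrative sketch only: the hashing scheme (BLAKE2b salted with a head index), the table size, the embedding dimension, and all function names are assumptions, not details taken from the paper, and the tables are filled with random numbers rather than trained values.

```python
import hashlib
import random

def extract_ngrams(tokens, max_n=3):
    """All contiguous n-grams of length 1..max_n, shortest first."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def hash_ngram(ngram, head, table_size):
    """Deterministic slot index for an n-gram under one hash head (assumed scheme)."""
    data = f"{head}|" + "\x1f".join(ngram)
    digest = hashlib.blake2b(data.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % table_size

def make_table(table_size, dim, seed):
    """Stand-in for a trained embedding table (random values here)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(table_size)]

def lookup(ngram, tables):
    """Retrieve one embedding per hash head and concatenate them."""
    out = []
    for head, table in enumerate(tables):
        out.extend(table[hash_ngram(ngram, head, len(table))])
    return out

tokens = ["Alexander", "the", "Great"]
ngrams = extract_ngrams(tokens)  # six n-grams, as in the example above
tables = [make_table(table_size=1024, dim=4, seed=h) for h in range(2)]
memory = lookup(("Alexander", "the", "Great"), tables)
```

Because each head uses a different hash, two n-grams that collide in one table will usually land in different slots in the other, which is the intuition behind using multiple hashes.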

6. The Context-Aware Gating Mechanism: Handling Ambiguity

The gating mechanism is crucial for handling ambiguity. For example, if the input contains “Apple” and the surrounding context is about a tech company’s stock, the gate opens for the memory associated with the company. If the context is a fruit salad recipe, the gate closes, suppressing that irrelevant memory. This allows the model to disambiguate words with multiple meanings based on context.
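
A gate of this kind can be sketched as a sigmoid over the agreement between the hidden state and the retrieved memory. The scaled dot product, the function name `gate_memory`, and the toy vectors are assumptions for illustration; the paper’s exact gating formula may differ.

```python
import math

def gate_memory(hidden, memory):
    """Scale a retrieved memory vector by how well it agrees with the hidden state."""
    d = len(hidden)
    # Scaled dot product as an (assumed) alignment score.
    score = sum(h * m for h, m in zip(hidden, memory)) / math.sqrt(d)
    gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid, always in (0, 1)
    return gate, [gate * m for m in memory]

# Hypothetical vectors: a retrieved "Apple the company" memory and two contexts.
company_memory = [4.0, 0.0, 4.0, 0.0]
stock_context = [3.0, 0.0, 3.0, 0.0]     # hidden state aligned with the memory
recipe_context = [-3.0, 0.0, -3.0, 0.0]  # hidden state pointing the other way

g_open, passed = gate_memory(stock_context, company_memory)
g_closed, suppressed = gate_memory(recipe_context, company_memory)
```

With these toy values the gate is close to 1 in the stock context and close to 0 in the recipe context, so the same retrieved vector is either passed through almost unchanged or almost entirely suppressed.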

7. Parameter Allocation: Finding the Optimal Balance

The paper found a U-shaped curve when determining the optimal parameter allocation between computation and memory. Allocating all parameters to either computation (0% to memory) or memory (100% to memory) is suboptimal. The sweet spot lies at roughly 75-80% of parameters for computation and 20-25% for Engram. This U-shape confirms the need for both mechanisms to work in tandem.
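
To make the split concrete, the arithmetic can be written out for a given budget. The 20-billion-parameter budget below is a hypothetical figure chosen for round numbers, not a value from the paper, and `split_budget` is an illustrative helper.

```python
def split_budget(total_params, memory_fraction):
    """Split a parameter budget into (compute, memory) parameter counts."""
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be between 0 and 1")
    memory = round(total_params * memory_fraction)
    return total_params - memory, memory

budget = 20_000_000_000  # hypothetical budget, for illustration only
low = split_budget(budget, 0.20)   # 20% of parameters to memory
high = split_budget(budget, 0.25)  # 25% of parameters to memory
```

Under this hypothetical budget, the reported sweet spot corresponds to between 4 and 5 billion parameters in lookup tables, with the remaining 15 to 16 billion left for computation.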

8. Experimental Results: Performance Gains

DeepSeek compared Engram 27B to a baseline 27B model without Engram, using the same total parameter count, computational FLOPs, and training data. The only differences were the architecture and the parameter allocation. Results showed:

Knowledge Tasks: Improved performance on MMLU (up 3 points) and Chinese knowledge (up 4 points).
Reasoning Capabilities: Surprisingly, reasoning benchmarks also improved, with ARC-Challenge increasing by 3.7 points and HumanEval (code) and mathematics showing gains.
Functional Depth: The Engram model achieved at layer 5 a representation equivalent to the baseline model’s at layer 12 (as measured by representational similarity such as CKA), suggesting that its early layers no longer spend capacity reconstructing memorized patterns.
Long Context Performance: Significant improvements in long context handling, as Engram handles local dependencies, freeing up attention for long-range dependencies.
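
The Functional Depth comparison relies on a representational similarity metric such as CKA. Below is a minimal pure-Python sketch of the linear variant of CKA; the choice of the linear kernel and the toy activation matrices are assumptions, since this summary does not say which variant the paper used.

```python
import math

def center_columns(m):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices computed on the same n examples."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(x, y)
    denominator = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return numerator / denominator

# Toy activations for four examples (hypothetical values).
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# The same representation rotated by 90 degrees and scaled by 2.
layer_b = [[2.0 * b, -2.0 * a] for a, b in layer_a]

similarity = linear_cka(layer_a, layer_b)
```

Linear CKA is invariant to rotation and uniform scaling of the features, so the two toy layers score as equivalent even though their raw values differ. That invariance is what makes it usable for comparing layers from different models.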

9. Hardware Implications and Efficiency

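Engram’s deterministic lookup enables pre-fetching. Because the table indices depend only on the input tokens and not on intermediate activations, they are known before the forward pass reaches the layer that needs them. While the GPU computes layer 1, the CPU can fetch embeddings for layer 2, minimizing communication bottlenecks. The paper demonstrated less than a 3% throughput penalty for a 100 billion parameter table residing in host RAM (not GPU memory). This lets most of the table live outside scarce GPU memory, which reduces hardware requirements and deployment costs.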
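
The overlap of fetching and computing can be sketched with a background thread. This is a schematic illustration only: `fetch_embeddings` and `compute_layer` are hypothetical stand-ins using plain Python numbers, whereas a real system would issue asynchronous host-to-device copies.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_embeddings(host_table, slots):
    """Stand-in for gathering table rows from host RAM."""
    return [host_table[s] for s in slots]

def compute_layer(hidden, memory):
    """Stand-in for one transformer layer that consumes the fetched memory."""
    return hidden + sum(memory)

def forward(hidden, host_table, slots_per_layer):
    """Run all layers, fetching layer i+1's embeddings while layer i computes."""
    if not slots_per_layer:
        return hidden
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_embeddings, host_table, slots_per_layer[0])
        for layer in range(len(slots_per_layer)):
            memory = pending.result()  # wait for this layer's embeddings
            if layer + 1 < len(slots_per_layer):
                # Slots are known up front, so the next fetch can start now.
                pending = pool.submit(
                    fetch_embeddings, host_table, slots_per_layer[layer + 1])
            hidden = compute_layer(hidden, memory)  # overlaps with the fetch
    return hidden

host_table = [0.5, 1.0, 1.5, 2.0]
result = forward(0.0, host_table, [[0, 1], [2, 3]])
```

The key property is that `slots_per_layer` is fully known before the loop starts. A mechanism whose lookups depended on the previous layer’s output could not be scheduled this way.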

10. Limitations and Future Directions

Hash Collisions: Multiple n-grams can map to the same slot, although multi-head hashing and gating mitigate this.
Static Embeddings: Once trained, a table entry is a fixed vector that is retrieved the same way in every context, unlike activations computed by neural layers, though the gating mechanism provides some compensation.
Limited N-gram Order: The paper used 2-gram patterns; higher-order n-grams might be missed.
Not RAG: Engram doesn’t connect to external knowledge bases; lookup tables are constructed during training.
Domain Specificity: Performance in highly specialized domains remains an open question.

Synthesis & Conclusion

The Engram paper presents a compelling argument for separating concerns in LLM architecture. By providing a dedicated memory mechanism alongside existing computational capabilities, DeepSeek demonstrates a path towards more efficient and powerful LLMs. The core insight is that LLMs have been using computation to simulate memory. Engram offers a solution by giving LLMs an “address book,” mirroring the dual-system thinking observed in human cognition. This approach promises to improve memory efficiency, reduce hardware costs, and open new avenues for architectural research. The paper highlights the importance of matching the mechanism to the task: lookup for static patterns and deep computation for dynamic reasoning.