DeepSeek Just Killed Visual Reasoning (And It's 10× Cheaper)

By Prompt Engineering

Key Concepts

  • Visual Primitives: A technique where the model uses special tokens (bounding boxes and coordinate points) within its chain-of-thought to "point" at objects, solving the "reference gap."
  • Reference Gap: The limitation where language is too imprecise to accurately describe or track specific entities in complex visual scenes.
  • KV Cache Compression: A method to drastically reduce the memory footprint of visual data, allowing for significantly faster and cheaper inference.
  • DeepSeek V4 Flash: The underlying language backbone, a Mixture-of-Experts (MoE) model with 284B total parameters but only 13B active parameters.
  • Topological Reasoning: The ability to understand spatial relationships, such as pathfinding in mazes or tracing trajectories, where traditional language models often struggle.

1. The Lineage of DeepSeek Vision

DeepSeek has maintained a consistent research trajectory over the last 24 months, focusing on the core question: "What is the cheapest representation that still works?"

  • DeepSeek VL (March 2024): Established the foundation using hybrid SigLIP and SAM encoders.
  • Janus (October 2024): Introduced decoupled visual encoders for understanding versus generation, avoiding the "single encoder bottleneck."
  • V2/V3 Vision (December 2024): Integrated Mixture-of-Experts (MoE) and Multi-Head Latent Attention (MLA) into vision, achieving high performance with low parameter activation.
  • Janus Pro 7B (January 2025): Gained popularity for its ability to run on consumer GPUs while outperforming DALL-E 3 in specific benchmarks.
  • DeepSeek OCR (October 2025): A breakthrough in compression where text is rendered as images, encoded, and then decoded back to text with 97% accuracy, achieving 10x compression.

2. The "Thinking with Visual Primitives" Framework

The paper argues that current multimodal models suffer from a perception gap (failing to resolve fine visual details) and a reference gap (having no precise way to point at specific entities).

  • Methodology: Instead of relying on imprecise natural language to describe objects, the model outputs "reference tags" and "box tags" containing coordinate data directly into its chain-of-thought.
  • Application: This allows the model to perform complex tasks like counting in dense crowds or navigating mazes by "pointing" at specific coordinates, effectively mimicking human finger gestures.
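To make the "pointing" mechanism concrete, here is a minimal sketch of a coordinate-bearing chain-of-thought and how downstream code could parse it. The `<point>`/`<box>` tag names, attribute syntax, and coordinates are illustrative assumptions, not the paper's exact tokens.

```python
import re

# Hypothetical chain-of-thought with coordinates as first-class tokens.
# The tag names and attribute format are assumptions for illustration only.
cot = (
    "I need to count the people near the gate. "
    "<point x=412 y=287>person 1</point> "
    "<point x=455 y=301>person 2</point> "
    "The gate itself is <box x1=380 y1=240 x2=520 y2=360>gate</box>, "
    "and both points fall inside it, so the answer is 2."
)

# A verifier or reward model can recover the exact references instead of
# re-parsing an ambiguous natural-language description like "the person on the left".
points = re.findall(r"<point x=(\d+) y=(\d+)>(.*?)</point>", cot)
boxes = re.findall(r"<box x1=(\d+) y1=(\d+) x2=(\d+) y2=(\d+)>(.*?)</box>", cot)

print(points)  # [('412', '287', 'person 1'), ('455', '301', 'person 2')]
print(boxes)   # [('380', '240', '520', '360', 'gate')]
```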

3. Architecture and Efficiency

The model achieves extreme efficiency through a custom-built vision transformer and advanced compression:

  • Vision Encoder: Uses 14x14 patches. A 756x756 image therefore yields 2,916 patch tokens, which are reduced via 3x3 spatial compression to 324 tokens (the arithmetic is reproduced in the sketch after this list).
  • KV Cache Optimization: By applying the compressed sparse attention mechanism from the V4 paper, the model reduces the KV cache to only 81 entries for an 80x80 resolution image.
  • Comparison: At 81 entries, the KV cache is roughly 10x smaller than that of Gemini 3 Flash (~1,000 entries) and Sonnet 4.6 (~870 entries).
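The token-budget arithmetic above can be reproduced directly. The sketch below only recomputes the counts quoted in this section; the implied ~4x step from 324 vision tokens down to 81 KV-cache entries is an inference from those numbers, not a mechanism stated in the summary.

```python
# Recomputing the vision-token budget quoted above (756x756 input, 14x14 patches,
# 3x3 spatial compression). The final 324 -> 81 ratio is inferred, not specified.
image_size = 756
patch_size = 14

patches_per_side = image_size // patch_size        # 756 / 14 = 54
patch_tokens = patches_per_side ** 2               # 54 * 54 = 2,916

spatial_compression = 3                            # 3x3 merging of neighboring patches
vision_tokens = patch_tokens // spatial_compression ** 2  # 2,916 / 9 = 324

kv_cache_entries = 81                              # reported KV-cache footprint
print(patch_tokens, vision_tokens, vision_tokens // kv_cache_entries)  # 2916 324 4
```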

4. Training Pipeline

The team utilizes a five-stage training process to ensure high-quality reasoning:

  1. Pre-training: Trillions of multimodal tokens.
  2. Specialized SFT: Training two separate models—one for grounding (boxes) and one for pointing (points).
  3. RL with GRPO: Using Group Relative Policy Optimization (GRPO) with three reward heads: format, quality, and accuracy (a minimal sketch of this reward setup follows the list).
  4. Unified RFD: Merging the two specialist models.
  5. On-policy Distillation: Consolidating the knowledge into a single student model.
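As referenced in the GRPO stage above, the reward heads and group-relative advantages can be sketched in a few lines. The weights and reward shapes here are placeholders; the summary only names the three heads, not how they are combined.

```python
import numpy as np

def total_reward(format_ok: bool, quality: float, accuracy: float,
                 w_format: float = 0.2, w_quality: float = 0.3,
                 w_accuracy: float = 0.5) -> float:
    """Combine the three reward heads; the weights are illustrative placeholders."""
    return w_format * float(format_ok) + w_quality * quality + w_accuracy * accuracy

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO normalizes rewards within a group of responses sampled for the same
    prompt, so advantages come from the group itself rather than a value model."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One prompt, four sampled responses scored by the three reward heads.
group_rewards = np.array([
    total_reward(True,  0.9, 1.0),   # well-formatted, high quality, correct
    total_reward(True,  0.6, 0.0),   # well-formatted but wrong
    total_reward(False, 0.4, 0.0),   # malformed and wrong
    total_reward(True,  0.8, 1.0),   # well-formatted, good quality, correct
])
print(grpo_advantages(group_rewards))  # above-average responses get positive advantage
```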

5. Benchmarks and Limitations

  • Performance: The model significantly outperforms competitors (Gemini 3 Flash, GPT-5.4, Sonnet 4.6) in topological reasoning and maze navigation, often doubling the scores of the lowest competitor.
  • Honesty/Transparency: The authors explicitly state that these scores cover only a subset of dimensions relevant to the research, rather than claiming universal superiority over all frontier models.
  • Known Limitations:
    1. Resolution Bound: Fine-grained scenes can still cause errors.
    2. Manual Triggering: The "visual primitives" mode must be explicitly triggered; it is not yet autonomous.
    3. Generalization: Point-based topological reasoning does not yet generalize well across all scenarios.

Synthesis

DeepSeek’s latest approach represents a shift from simply "seeing better" to "reasoning better" through spatial grounding. By treating coordinates as first-class tokens in the chain-of-thought, they have bypassed the inherent imprecision of language. The most significant takeaway is the extreme efficiency of the architecture, which provides frontier-grade reasoning at a fraction of the computational cost of current industry standards, signaling a move toward more accessible, high-performance multimodal AI.
