Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind

By AI Engineer

Key Concepts

  • Gemma 4: The latest open-source model family from Google DeepMind.
  • Mixture of Experts (MoE): An architecture in which a router selects a small subset of "expert" sub-networks for each token, so only a fraction of the total parameters is active in a forward pass.
  • Grouped Query Attention (GQA): An attention mechanism in which a group of query heads shares a single key/value head, reducing the memory needed for the key/value cache.
  • Per-Layer Embeddings (PLE): A technique storing embedding tables in flash memory rather than VRAM to optimize on-device performance.
  • Effective Parameters: The number of parameters active during a single forward pass versus the total representational parameters.
  • Native Multimodality: Integrated support for text, vision, and audio inputs.

1. Model Architecture and Sizes

Gemma 4 is released in four distinct sizes, categorized by their primary use cases:

  • Large Models (Cloud-focused):
    • 31B Dense: A state-of-the-art multimodal model for advanced reasoning, featuring a 256k context length. It supports native thinking, function calling, and structured JSON outputs.
    • 26B Mixture of Experts (MoE): The first MoE in the Gemma family. It utilizes 128 total experts but only activates 8 per forward pass (3.8B active parameters), balancing efficiency with high performance.
  • Effective Models (On-device):
    • Effective 2B & 4B: Designed for local execution on phones, laptops, and tablets. These models support text, vision, and audio inputs.
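
The 128-expert, 8-active routing described for the 26B MoE can be sketched in a few lines. This is a toy illustration, not the Gemma 4 implementation: the expert functions, the random router logits, and the choice to apply softmax over only the selected experts are all assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 128  # total experts, as stated for the 26B MoE
TOP_K = 8          # experts activated per forward pass


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, router_logits, experts, top_k=TOP_K):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in route(router_logits, top_k):
        y = experts[idx](x)  # only the selected experts are ever evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def make_expert(scale):
    """Hypothetical toy expert: scales its input by a constant."""
    return lambda x: [scale * v for v in x]


rng = random.Random(0)
experts = [make_expert(1.0 + i / NUM_EXPERTS) for i in range(NUM_EXPERTS)]
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route(router_logits)
output = moe_forward([1.0, 2.0], router_logits, experts)
```

The point of the design is that the other 120 experts are never evaluated for this input, which is how a 26B-parameter model can run with roughly 3.8B active parameters.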

2. Architectural Improvements

  • Attention Mechanism: Gemma 4 utilizes a mix of local and global layers.
    • Interleaving: A 5:1 ratio of local to global layers (4:1 for the 2B model).
    • Sliding Window: Local layers use a sliding window (512 tokens for small models, 1,024 for large) to maintain efficiency.
    • Global Layer Priority: The final layer is always a global layer, ensuring it attends to all preceding tokens.
  • Grouped Query Attention (GQA): To reduce the memory cost of global layers, GQA has each group of 8 query heads share one key/value head. To preserve quality, the key/value head dimension was doubled to 512.
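
The layer interleaving, sliding window, and GQA grouping above can be combined into a rough key/value cache estimate. The sketch below assumes a hypothetical 12-layer model with 32 query heads, and it applies the 512 head dimension to every layer for simplicity; only the 5:1 ratio, the 1,024-token window, the group size of 8, and the final-layer rule come from the summary.

```python
def layer_schedule(num_layers, local_per_global=5):
    """Repeat `local_per_global` local layers followed by one global layer,
    and force the final layer to be global."""
    kinds = []
    for i in range(num_layers):
        is_global = (i + 1) % (local_per_global + 1) == 0
        kinds.append("global" if is_global else "local")
    kinds[-1] = "global"
    return kinds


def visible_positions(query_pos, kind, window):
    """Causal positions that a query at `query_pos` may attend to."""
    if kind == "global":
        start = 0
    else:
        start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


def gqa_kv_heads(num_query_heads, group_size=8):
    """Number of key/value heads when `group_size` query heads share one."""
    return num_query_heads // group_size


def kv_cache_entries(seq_len, kinds, window, kv_heads, head_dim):
    """Count cached key and value scalars across all layers for one sequence."""
    total = 0
    for kind in kinds:
        cached_tokens = seq_len if kind == "global" else min(seq_len, window)
        total += 2 * cached_tokens * kv_heads * head_dim
    return total


# Hypothetical configuration, chosen only for illustration.
NUM_LAYERS = 12
NUM_QUERY_HEADS = 32
SEQ_LEN = 4096
WINDOW = 1024
HEAD_DIM = 512

schedule = layer_schedule(NUM_LAYERS)
full_heads = kv_cache_entries(SEQ_LEN, schedule, WINDOW, NUM_QUERY_HEADS, HEAD_DIM)
grouped = kv_cache_entries(SEQ_LEN, schedule, WINDOW,
                           gqa_kv_heads(NUM_QUERY_HEADS), HEAD_DIM)
all_global = kv_cache_entries(SEQ_LEN, ["global"] * NUM_LAYERS, WINDOW,
                              gqa_kv_heads(NUM_QUERY_HEADS), HEAD_DIM)
```

Under these assumptions, `grouped` is one eighth of `full_heads`, and the mixed schedule caches fewer entries than `all_global` because local layers never keep more than one window of tokens.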

3. Per-Layer Embeddings (PLE)

To solve VRAM constraints on edge devices, Google DeepMind introduced PLE:

  • Mechanism: Instead of storing the entire embedding table in VRAM, the per-layer tables are stored in flash memory.
  • Efficiency: The embedding dimension is reduced to 256. At the end of each decoder block, the model performs a lookup and projects the 256-dimension vector up to the full model embedding size. This allows for high performance without the memory overhead of standard large embedding tables.
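
A minimal sketch of the lookup-and-project step, under stated assumptions: the model width, vocabulary size, and layer count are hypothetical toy values, plain Python lists stand in for tables held in flash, and adding the projected vector to the block output is an assumed way of combining the two (the summary does not say how they are merged).

```python
import random

PLE_DIM = 256    # per-layer embedding width, from the summary above
MODEL_DIM = 512  # hypothetical model width, for illustration only
VOCAB_SIZE = 16  # hypothetical toy vocabulary
NUM_LAYERS = 2   # hypothetical layer count

rng = random.Random(0)


def random_matrix(rows, cols):
    """Small random matrix as a list of rows."""
    return [[rng.uniform(-0.01, 0.01) for _ in range(cols)] for _ in range(rows)]


# Stand-in for the per-layer tables kept in flash: one table per layer,
# read one row (one token) at a time.
ple_tables = [random_matrix(VOCAB_SIZE, PLE_DIM) for _ in range(NUM_LAYERS)]

# Up-projection weights, one MODEL_DIM x PLE_DIM matrix per layer
# (assumed to stay resident alongside the rest of the model weights).
up_proj = [random_matrix(MODEL_DIM, PLE_DIM) for _ in range(NUM_LAYERS)]


def ple_lookup(layer, token_id):
    """Fetch the 256-dimension per-layer embedding for one token."""
    return ple_tables[layer][token_id]


def project_up(layer, vec):
    """Project a PLE_DIM vector up to MODEL_DIM."""
    return [sum(w * v for w, v in zip(row, vec)) for row in up_proj[layer]]


def apply_ple(hidden, layer, token_id):
    """Combine the projected embedding with the block output (assumed: addition)."""
    projected = project_up(layer, ple_lookup(layer, token_id))
    return [h + p for h, p in zip(hidden, projected)]


hidden = [0.0] * MODEL_DIM
for layer in range(NUM_LAYERS):
    # A real decoder block would transform `hidden` here before the lookup.
    hidden = apply_ple(hidden, layer, token_id=3)
```

Only one 256-wide row per layer is read for each token, which is why the tables can live in slower storage while the accelerator holds just the projection weights.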

4. Multimodal Capabilities

  • Vision:
    • Encoders: 550M parameter encoder for large models; 150M for effective models.
    • Variable Resolution/Aspect Ratio: Developers can select from five resolutions/token budgets. This replaces the older "pan and scan" method, allowing the model to process images based on their native aspect ratio rather than splitting them into fixed squares.
    • Patching: Images are divided into 16x16 patches, grouped into 3x3 grids, and projected into soft tokens.
  • Audio:
    • Supported in E2B and E4B models for translation and speech recognition.
    • Pipeline: Raw audio → mel spectrogram → Conformer (35M parameters) → soft tokens.
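
The vision patching arithmetic can be worked through directly: with 16x16-pixel patches grouped 3x3, each soft token covers a 48x48-pixel region. The sketch below counts soft tokens for an image and picks an aspect-preserving token grid under a budget. The budget value of 256 and the resizing rule are assumptions for illustration; the summary does not list the five supported budgets or the exact resizing procedure.

```python
import math

PATCH = 16  # patch side in pixels, from the summary above
GROUP = 3   # patches per side in each pooled grid, from the summary above
TOKEN_SIDE = PATCH * GROUP  # pixels covered by one soft token per side (48)


def soft_token_count(height, width):
    """Soft tokens for an image whose sides are multiples of 48 pixels."""
    if height % TOKEN_SIDE or width % TOKEN_SIDE:
        raise ValueError("sides must be multiples of 48 pixels in this sketch")
    patches_h = height // PATCH
    patches_w = width // PATCH
    return (patches_h // GROUP) * (patches_w // GROUP)


def fit_to_budget(height, width, budget):
    """Pick a (rows, cols) soft-token grid that roughly keeps the aspect
    ratio and never exceeds `budget` tokens (hypothetical resizing rule)."""
    scale = math.sqrt(budget * TOKEN_SIDE * TOKEN_SIDE / (height * width))
    rows = max(1, math.floor(height * scale / TOKEN_SIDE))
    cols = max(1, math.floor(width * scale / TOKEN_SIDE))
    rows = min(rows, budget)
    cols = min(cols, budget // rows)  # guard for extreme aspect ratios
    return rows, cols


square_tokens = soft_token_count(768, 768)
rows, cols = fit_to_budget(1080, 1920, budget=256)  # 256 is a hypothetical budget
resized_height, resized_width = rows * TOKEN_SIDE, cols * TOKEN_SIDE
```

A 768x768 image yields 48x48 patches and therefore 16x16 = 256 soft tokens. For a landscape input, `fit_to_budget` returns more columns than rows, so the image keeps its shape instead of being cut into fixed squares.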

5. Licensing and Accessibility

  • Apache 2.0 License: Gemma 4 has moved to an Apache 2.0 license to facilitate easier integration for developers from testing to production.
  • Deployment:
    • Self-hosting: Available via Hugging Face, Kaggle, and Ollama.
    • Cloud: Larger models (31B/26B) are accessible via Google AI Studio and Vertex AI.

6. Synthesis and Conclusion

Gemma 4 pairs efficiency gains with native multimodality. The 26B MoE model activates only 3.8B of its parameters per forward pass, and Per-Layer Embeddings keep embedding tables out of VRAM so that the effective models can run on phones, laptops, and tablets. For vision, replacing "pan and scan" with variable resolution and aspect ratio gives developers direct control over token budgets, which suits specialized tasks such as OCR and object detection.
