Embedding Gemma: On-Device RAG Made Easy

By Prompt Engineering


Embedding Gemma: On-Device RAG and NLP

Key Concepts:

  • On-device Retrieval Augmented Generation (RAG)
  • Embedding model
  • Lightweight model
  • Gemma 3
  • Multilingual support
  • Dimensionality reduction
  • Matryoshka representation
  • MTEB benchmark
  • Dense embedding models
  • Theoretical limits of embedding-based retrieval
  • Prompt engineering
  • Fine-tuning
  • Triplet loss

Introduction

Google has released Embedding Gemma, a lightweight embedding model built on the Gemma 3 architecture, designed to facilitate on-device RAG and other NLP tasks. Its small size (300 million parameters, ~200MB VRAM requirement) makes it suitable for resource-constrained environments.

Technical Details

  • Architecture: Built on the Gemma 3 architecture.
  • Size: 300 million parameters.
  • Multilingual Support: Supports over 100 languages.
  • Output Dimensions: Customizable, ranging from 128 to 768.
  • Matryoshka Representation: Uses Matryoshka representation learning, a dimensionality reduction technique. Truncating dimensions reduces accuracy but improves speed and lowers compute cost: roughly a 3% accuracy drop on average from the highest to the lowest dimension, and around 6% for code-related tasks (see the sketch after this list).
  • Quantization: Supports quantization, but its effect is less pronounced than dimension changes.
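
A minimal sketch of the truncation idea is below. It assumes the sentence-transformers package and the Hugging Face model id google/embeddinggemma-300m; check the official model card for the exact id and recommended usage.

```python
# Minimal sketch of Matryoshka-style truncation (not the official recipe).
# Assumptions: sentence-transformers is installed and the
# "google/embeddinggemma-300m" checkpoint is available on Hugging Face.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# Full-size embedding: 768 dimensions by default.
full = model.encode("How do I reset my password?")
print(full.shape)  # (768,)

# Matryoshka representation: keep only the leading k dimensions,
# then re-normalize so cosine similarity remains meaningful.
k = 256
truncated = full[:k] / np.linalg.norm(full[:k])
print(truncated.shape)  # (256,)
```

Because Matryoshka-trained embeddings pack the most important information into the leading dimensions, simply slicing and re-normalizing is enough; no re-encoding is needed.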

Performance

  • Embedding Gemma is best-in-class for its weight.
  • Compared to the Qwen embedding model (600 million parameters), Embedding Gemma is almost half the size while maintaining comparable performance on the MTEB benchmark.

Google's AI Strategy

Google offers both open-weight models (Gemma 3, Gemma 3n) and frontier models (Gemini 2.5 Pro, Gemini embeddings) to cater to different developer needs. The Gemini embeddings are multimodal and state-of-the-art.

Theoretical Limits of Dense Embedding Retrieval

DeepMind research indicates that dense embedding-based retrieval in RAG systems has theoretical limitations, regardless of model size or power. This suggests that relying solely on dense embeddings might not be optimal. Embedding Gemma, being a dense embedding model, is also subject to these limitations.

Using Embedding Gemma for Different Tasks

Embedding Gemma supports retrieval, classification, and topic modeling. Effective use requires careful prompt engineering.

  • Prompt Structure: Each input is prefixed with the nature of the task, followed by the query or document text (see the sketch after this list).
  • Task Examples:
    • Retrieval (User Query): "Search result: [user query]"
    • Document Embedding: "Title: [document title], Text: [document text]" (Title can be "None" if unavailable)
    • Question Answering: "Question answering: [user query]"
    • Other Tasks: Fact checking, classification, topic modeling, clustering, semantic similarity, code retrieval, reranking, summarization, multi-label classification, and instruction retrieval.
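
The helpers below sketch this task-prefix pattern using the wording from the list above; the exact prefix strings are illustrative, so consult the EmbeddingGemma documentation for the canonical prompts.

```python
# Illustrative helpers for the task-prefixed prompts described above.
# The exact prefix wording is an assumption; check the EmbeddingGemma
# docs for the officially recommended prompt strings.
def build_query_prompt(user_query: str, task: str = "search result") -> str:
    # Queries are prefixed with the nature of the task.
    return f"{task}: {user_query}"

def build_document_prompt(text: str, title: str | None = None) -> str:
    # Documents carry a title (or "none" if unavailable) plus the text.
    return f"title: {title or 'none'}, text: {text}"

print(build_query_prompt("How do I reset my password?"))
# -> "search result: How do I reset my password?"
print(build_document_prompt("Passwords can be reset from the IT portal.",
                            title="IT and Security Policy"))
```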

RAG Setup with Embedding Gemma

The video demonstrates a RAG setup using the transformers package.

  • Components:
    • Text Generation: A 4-bit quantized, instruction-tuned Gemma 3 model.
    • Embedding: Embedding Gemma (300 million parameters).
  • Sequence Length: Maximum sequence length is 2,048 tokens.
  • Output Dimension (Default): 768 (can be truncated).
  • Prompt Engineering (Transformers Package):
    • Query Embedding: Encode the query with the retrieval-query prompt (via prompt_name), which prefixes the user query with the task description.
    • Document Embedding: Encode documents with the document prompt: "title: [document title], text: [document text]" (see the retrieval sketch below).
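
A minimal retrieval sketch follows. It assumes the sentence-transformers package, the google/embeddinggemma-300m checkpoint, and that the model's prompt configuration defines prompt names "query" and "document"; the names may differ between releases, so verify them against the model card.

```python
# Minimal retrieval sketch with sentence-transformers (not the full video setup).
# Assumptions: "google/embeddinggemma-300m" is the model id, and the model's
# prompt configuration defines "query" and "document" prompt names.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("google/embeddinggemma-300m")

documents = [
    "Leave requests must be submitted two weeks in advance.",
    "Passwords can be reset from the IT self-service portal.",
    "Expense reports are due by the 5th of each month.",
]

# Embed the corpus once with the document prompt.
doc_embeddings = embedder.encode(documents, prompt_name="document")

# Embed the user query with the retrieval-query prompt.
query = "How do I reset my password?"
query_embedding = embedder.encode(query, prompt_name="query")

# Rank documents by similarity and pick the best match.
scores = embedder.similarity(query_embedding, doc_embeddings)
best = scores.argmax().item()
print(documents[best])
```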

RAG Example: HR and Leave Policies

  • Corpus: HR and leave policies, IT and security, finance and expenses, office and facilities.
  • Process:
    1. User query (e.g., "How do I reset my password?").
    2. Category identification (e.g., "account password management").
    3. Document retrieval within the identified category.
    4. Generation using Gemma 3n with a prompt template: "Answer the question based on the provided context. Context: [retrieved documents], Question: [user query]" (see the generation sketch below).
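
The sketch below covers only the generation step. It uses the transformers text-generation pipeline with google/gemma-3-1b-it as a small, text-only stand-in checkpoint (not necessarily the one used in the video), and retrieved_docs stands in for the documents returned by the retrieval step.

```python
# Sketch of the generation step only. Assumptions: transformers installed,
# "google/gemma-3-1b-it" as a small text-only stand-in checkpoint, and
# `retrieved_docs` as the output of the retrieval step above.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-3-1b-it")

retrieved_docs = ["Passwords can be reset from the IT self-service portal."]
question = "How do I reset my password?"

# Prompt template from the walkthrough above.
prompt = (
    "Answer the question based on the provided context.\n"
    f"Context: {' '.join(retrieved_docs)}\n"
    f"Question: {question}"
)

messages = [{"role": "user", "content": prompt}]
output = generator(messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```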

Fine-Tuning Embedding Gemma

  • Data Set: Requires a triplet data set (anchor, positive example, negative example).
  • Loss Function: A triplet-style loss, implemented with the sentence-transformers training package (see the training sketch after this list).
  • Training Parameters: Output directory, number of epochs, batch size, learning rate.
  • Benefits: Fine-tuning improves similarity scores and retrieval accuracy.
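
The following is a minimal fine-tuning sketch with the sentence-transformers trainer; the dataset rows, output directory, and hyperparameters are illustrative placeholders rather than the settings used in the video.

```python
# Minimal fine-tuning sketch (illustrative data and hyperparameters).
# Assumptions: sentence-transformers >= 3.0 and the datasets package;
# TripletLoss expects (anchor, positive, negative) columns in this order.
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import TripletLoss

model = SentenceTransformer("google/embeddinggemma-300m")

# Tiny placeholder triplet dataset: anchor query, relevant doc, irrelevant doc.
train_dataset = Dataset.from_dict({
    "anchor":   ["How do I reset my password?",
                 "How many vacation days do I get?"],
    "positive": ["Passwords can be reset from the IT self-service portal.",
                 "Full-time employees receive 20 vacation days per year."],
    "negative": ["Expense reports are due by the 5th of each month.",
                 "Meeting rooms can be booked via the facilities portal."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="embeddinggemma-hr-finetuned",  # output directory
    num_train_epochs=1,                        # number of epochs
    per_device_train_batch_size=2,             # batch size
    learning_rate=2e-5,                        # learning rate
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=TripletLoss(model),
)
trainer.train()
```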

Conclusion

Embedding Gemma is a valuable lightweight embedding model for on-device RAG and NLP tasks, especially when dealing with a relatively small number of documents and requiring quick retrieval. Fine-tuning can further enhance its performance for specific applications.
