Running Google's Gemma LLMs in the browser with MediaPipe Web
By Chrome for Developers
Key Concepts
- Gemma: Google's family of open, lightweight, state-of-the-art large language models (LLMs).
- MediaPipe Web: A framework for running C++ research and machine learning models in the browser using technologies like WebGL, WebGPU, and WebAssembly.
- LLM (Large Language Model): A type of artificial intelligence model trained on vast amounts of text data to understand and generate human-like text.
- Parameters: The learnable variables within a machine learning model, indicating its complexity and capacity.
- Gemma 3: A Gemma model family offering wide language support and developer-friendly sizes, suitable for various text processing tasks.
- Gemma 3N: A mobile-first Gemma model architecture optimized for low-latency audio and visual understanding, designed for multimodal input.
- Gemini Nano: Google's on-device model for Chrome's built-in AI, sharing its architecture with Gemma 3N.
- WebAssembly (Wasm): A binary instruction format for a stack-based virtual machine, enabling high-performance applications in web browsers.
- WebGL/WebGPU: Web standards for rendering interactive 2D and 3D graphics within any compatible web browser without the use of plug-ins.
- QAT (Quantization-Aware Training): A technique used to train models that are robust to compression, resulting in smaller file sizes.
- Bfloat16 (Brain Floating Point): A 16-bit floating-point format that trades some precision for a larger exponent range, commonly used in ML applications.
- Float16: A 16-bit floating-point format with a smaller maximum value compared to Bfloat16.
- Float32: A 32-bit floating-point format offering higher precision and a larger range than Float16.
- Instruction Tuning: A process of fine-tuning LLMs to follow specific instructions and respond in a desired format.
- Multimodality: The ability of a model to process and understand information from multiple types of data, such as text, images, and audio.
- Streaming Loading: A system that allows models to be loaded piece by piece, essential for handling large models in memory-constrained environments like browsers.
- Per-layer Embeddings Caching: A technique where some model weights are computed and stored on the CPU to free up GPU memory.
Running Gemma in the Browser with MediaPipe Web
This presentation details the advancements made by the Google AI Edge team in running Gemma, Google's open large language model family, entirely within the web browser using MediaPipe Web. The focus is on enabling LLMs to operate efficiently and accessibly on the web.
Gemma Model Family Overview
Gemma is described as a collection of lightweight, state-of-the-art, open models built using the same technology as Google's Gemini models. The year 2025 saw the launch of two new model families:
- Gemma 3:
- Offers wide language support and is available in developer-friendly sizes.
- Ideal for pushing text processing capabilities across a variety of use cases.
- Variants range from the Gemma 3 270M (270 million parameters), which is fast and easy to fine-tune, to the Gemma 3 27B (27 billion parameters), providing enhanced understanding for sophisticated applications.
- Trained on material from over 140 languages, making it multilingual.
- The text-only MedGemma 27B model, a variant of Gemma 3 27B tuned for medical text, is used for applications like differentiating between bacterial and viral pneumonia.
- Gemma 3N:
- Features a mobile-first architecture optimized for low-latency audio and visual understanding.
- While Gemma 3 also has vision capabilities, Gemma 3N is presented as the choice for handling mixed or multimodal input types.
- Shares architecture with Gemini Nano, enabling simultaneous development of a web Gemma 3N runner and a Gemini Nano GPU implementation for Chrome's built-in AI.
MediaPipe Web's Role and Technical Approach
MediaPipe Web is highlighted for its ability to power ML in the browser and provide real-time segmentation. It leverages technologies like WebGL, WebGPU, and WebAssembly to bring C++ research and ML to the browser in a cross-platform and scalable manner.
The cross-platform approach of MediaPipe Web is crucial, allowing deep engine code to be written once in C++ and then deployed across various target platforms.
Implementation Challenges and Solutions
1. Gemma 3 Implementation:
- Transformer Stack: Required adding one transformer architecture feature to improve performance with larger working memory or context sizes.
- Auxiliary System: Significant system work was needed to support the wide variety of models in the Gemma 3 family. This work is ongoing, with the web inference API currently supporting only some QAT (Quantization-Aware Training) models.
- Float Precision:
- Challenge: Gemma 3 models were trained using Bfloat16, which has a larger maximum value than the Float16 used by GPU inference backends. This mismatch can lead to overflow issues with large values generated internally by the models.
- Solution: A special transition system was implemented. Parts of the model with smaller operations that don't overflow are run in Float16 for speed. Operations that accumulate results and can generate large values are run in Float32 to prevent overflow.
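The overflow problem behind this transition system can be illustrated with a small sketch. The helpers below are hypothetical (plain JavaScript stand-ins, not MediaPipe code): one emulates a float16 accumulator by treating anything past float16's maximum finite value (65504) as overflow, while the float32 path (JavaScript numbers rounded with Math.fround) handles the same reduction without issue.

```javascript
// Float16's max finite value is 65504; bfloat16 and float32 reach ~3.4e38.
const F16_MAX = 65504;

// Hypothetical helper: emulate accumulation on a float16 GPU path, where any
// partial sum past F16_MAX overflows to Infinity.
function accumulateF16(values) {
  let sum = 0;
  for (const v of values) {
    sum += v;
    if (Math.abs(sum) > F16_MAX) return Infinity; // overflow
  }
  return sum;
}

// The same reduction in float32 (rounded per step with Math.fround) stays finite.
function accumulateF32(values) {
  let sum = 0;
  for (const v of values) sum = Math.fround(sum + v);
  return sum;
}

const partials = Array(1000).fill(100); // large accumulated values, as inside a model
console.log(accumulateF16(partials)); // Infinity: overflows float16
console.log(accumulateF32(partials)); // 100000: fine in float32
```

This is why small element-wise operations can safely stay in Float16 for speed, while accumulating operations are routed through Float32.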
2. Gemma 3N Implementation:
- Model Variants: Focused on supporting two variants: E4B (larger) and E2B (smaller).
- Auxiliary System: Minimal auxiliary system work was needed, except for handling multimodality.
- Low-Level Architecture: Featured significant cutting-edge research focused on mobile-first efficiency, aiming to increase speed and reduce compute resource usage, especially for mobile devices with constrained GPU memory.
- Per-layer Embeddings Caching: An example of mobile-first efficiency, allowing some weights to be computed and kept on the CPU, freeing up GPU space.
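A rough sketch of the per-layer embeddings idea, under the assumption (all names hypothetical, not MediaPipe's internals) that per-layer embedding tables live on the CPU and only the rows needed for the current tokens are gathered into a small buffer for the GPU:

```javascript
// Hypothetical sketch: keep per-layer embedding tables CPU-resident and gather
// only the rows a given set of tokens needs, freeing the GPU from holding the
// full tables.
class PerLayerEmbeddingCache {
  constructor(numLayers, vocabSize, dim) {
    // One CPU-resident table per layer (zero-filled placeholder weights).
    this.tables = Array.from({length: numLayers},
        () => new Float32Array(vocabSize * dim));
    this.dim = dim;
  }
  // Gather rows for the given token ids; this small buffer is what would be
  // uploaded to the GPU instead of the whole table.
  rowsForTokens(layer, tokenIds) {
    const out = new Float32Array(tokenIds.length * this.dim);
    tokenIds.forEach((id, i) =>
        out.set(this.tables[layer].subarray(id * this.dim, (id + 1) * this.dim),
                i * this.dim));
    return out;
  }
}

const cache = new PerLayerEmbeddingCache(2, 1000, 4);
console.log(cache.rowsForTokens(0, [5, 42]).length); // 8 floats vs 4000 for the full table
```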
Performance Results and Streaming Loading
- Performance: On a 2024 MacBook Pro, even the largest Gemma models can run in the browser, generating content at near human reading speeds with remarkably little CPU memory.
- Generation Speed (decode): Approaching human reading speeds.
- Input Processing Speed (prefill): Significantly faster than output generation across the board.
- Streaming Loading System: Launched last year, this system is essential for loading large models in browsers with limited WebAssembly memory (2GB or 4GB).
- For Gemma 3: Enabled a tiny overall CPU footprint even with the 27-billion-parameter models.
- For Gemma 3N: Allowed combining all model components into a single file and loading parts on demand, making vision and audio optional modalities.
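The core idea of streaming loading can be sketched in a few lines. This is an assumed, simplified illustration (not MediaPipe's actual loader): split the model data into fixed-size chunks and process one at a time, so only a single chunk is ever resident in the memory-constrained Wasm heap.

```javascript
// Assumed, simplified sketch of streaming loading: yield a large buffer in
// fixed-size chunks so only one chunk is resident at a time.
function* chunked(buffer, chunkSize) {
  for (let offset = 0; offset < buffer.byteLength; offset += chunkSize) {
    yield buffer.subarray(offset, Math.min(offset + chunkSize, buffer.byteLength));
  }
}

// In a real loader the chunks would come from fetch()'s ReadableStream, and each
// piece would be uploaded to the GPU and released before reading the next.
const model = new Uint8Array(10 * 1024 * 1024); // stand-in for a model file
let peak = 0;
for (const chunk of chunked(model, 1024 * 1024)) {
  peak = Math.max(peak, chunk.byteLength); // only one chunk held at a time
}
console.log(peak); // 1048576 bytes resident vs 10485760 total
```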
API and Example Usage
- Web Inference API: Documentation is available at goo.gle/mediapipe-llm-inference-web.
- JavaScript Demo: A small demo shows loading an LLM and generating responses. It requires URLs for the MediaPipe JavaScript library, the MediaPipe WebAssembly files, and the LLM model file.
- Instruction Tuning: Web conversions of Gemma 3 models use instruction-tuned versions, requiring a specific template (prefix and postfix) around queries for correct operation.
- Example: Adding an exact prefix and postfix around a query.
- Multimodality with Gemma 3N:
- Inference Options: Enable multimodality by setting
maximumImages> 0 for vision andsupportAudiototruefor audio. - Query Structure: The API accepts an ordered list of prompt pieces (audio, text, image) for flexible interleaving.
- Supported Inputs:
- Vision: Image URLs, most common image, video, or canvas objects.
- Audio: Single-channel audio buffer, audio file URLs.
- Example Output:
- Vision: Gemma 3N describes the user, backpack, and background, inferring involvement in tech/AI/web.
- Audio: Gemma 3N transcribes an audio file and expands on its significance.
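The flow above can be sketched as follows. The turn template is the standard Gemma instruction-tuned format; the MediaPipe calls are shown as comments because they need a browser with WebGPU, and the option names in them are assumptions rather than confirmed by the talk.

```javascript
// Instruction-tuned Gemma models expect an exact prefix and postfix per turn.
function wrapGemmaQuery(query) {
  return `<start_of_turn>user\n${query}<end_of_turn>\n<start_of_turn>model\n`;
}

// Hedged usage sketch (runs only in a WebGPU-capable browser; option names are
// assumptions):
// import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';
// const genai = await FilesetResolver.forGenAiTasks(wasmFilesUrl);
// const llm = await LlmInference.createFromOptions(genai, {
//   baseOptions: {modelAssetPath: modelFileUrl},
//   maxNumImages: 1,     // > 0 enables vision (assumed option name)
//   supportAudio: true,  // enables audio (assumed option name)
// });
// const reply = await llm.generateResponse(wrapGemmaQuery('Summarize this page.'));

console.log(wrapGemmaQuery('Hi'));
// <start_of_turn>user
// Hi<end_of_turn>
// <start_of_turn>model
```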
Open-Sourced Demos and Applications
Two demos are open-sourced and hosted as Hugging Face Spaces, requiring a device with sufficient resources, WebGPU support, a Hugging Face account, and acceptance of the Gemma terms of use.
- Gemma 3 Chat Suite:
- Allows chatting with cached or remotely downloaded models.
- Model loading can be fast from cache but slow if downloaded remotely.
- Requires signing in with a Hugging Face account to download new models.
- The MedGemma 27B model requires accepting a different license.
- Gemma 3N Multimodal Demo:
- Uses text or microphone to ask questions about the webcam feed.
- Demonstrates handling text and microphone input, transcribing speech, and translating it to English.
- The source code is approximately 400 lines of JavaScript, including model caching, authentication, webcam/microphone usage, UI, and LLM running.
- Portability: Potentially portable to some mobile devices (e.g., Chrome on Pixel 9) with text and vision. Audio is more challenging.
- An E2B version of the demo is offered for more resource-constrained devices, compared to the original using the E4B variant.
Conclusion
The presentation concludes by emphasizing the successful integration of Gemma LLMs into the browser via MediaPipe Web, showcasing their performance, flexibility, and the innovative solutions developed to overcome technical challenges. The open-sourcing of demos aims to empower developers to build their own LLM-powered web applications.