Bring the power of on-device AI to life with Google AI Edge and Gemma
By Google for Developers
Key Concepts
- On-Device AI: Running AI models locally on the device itself (mobile, desktop, IoT) without requiring internet connectivity.
- LiteRT (formerly TensorFlow Lite): Google’s core runtime for deploying AI models on edge devices.
- LiteRT-LM: A specialized runtime for Large Language Models (LLMs) that provides a text-in/text-out interface optimized for edge devices.
- MediaPipe Tasks: A suite of pre-built, plug-and-play AI solutions for common tasks like vision, audio, and gesture recognition.
- NPU (Neural Processing Unit): Dedicated hardware accelerators for efficient AI inference.
- Quantization: The process of reducing model precision to decrease memory footprint and improve speed.
- Agentic Coding: Using AI coding assistants to generate boilerplate code and implementation plans based on existing repositories.
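To make the quantization entry above more concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It illustrates the general idea only and is not how LiteRT or any Gemma checkpoint is actually quantized; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Fall back to a scale of 1.0 when every weight is zero (or the list is empty).
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most about scale / 2.
```

Storing one byte per weight instead of a four-byte float32 is where the memory saving comes from; the cost is the small rounding error visible in `restored`.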
1. The Evolution of On-Device AI
The landscape of on-device AI has shifted significantly in 2026. Where on-device models were previously limited to basic computer vision, modern small language models (SLMs) like Gemma 2B and 4B now outperform much larger models from previous years.
- Benefits: Reduced Cloud API costs, offline functionality, lower latency, and enhanced data privacy.
- Hardware Acceleration: Modern devices utilize CPU, GPU, and dedicated NPUs. Google has expanded NPU support to include Qualcomm, MediaTek, Google Tensor, and Intel chipsets, with ongoing integration for Broadcom, Raspberry Pi, and Exynos.
2. Developer Personas and Frameworks
The Google AI Edge stack is categorized into three primary workflows based on developer needs:
A. The LLM Explorer (Meghan’s Persona)
- Goal: Integrating dynamic, conversational LLMs into applications (e.g., NPCs in a game).
- Tooling: LiteRT-LM. It abstracts boilerplate code, manages chat sessions, and handles KV caching.
- Workflow: Developers can download pre-quantized models from Hugging Face, use the Gallery App as a blueprint, and utilize the Google AI Edge portal to benchmark performance across a fleet of physical devices.
- Real-world Case: Kakao deployed a 1.3B parameter model in their Android app, reducing the runtime footprint by 600MB through memory mapping and custom OpenCL priority settings.
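The memory mapping mentioned in the Kakao case can be illustrated in general terms with Python's standard `mmap` module. This is a sketch of the technique, not Kakao's or LiteRT-LM's implementation; the file name `weights.bin`, the fake weight values, and the helper `read_weight` are all placeholders.

```python
import mmap
import os
import struct
import tempfile

# Write a stand-in "weights" file: 1,000 little-endian float32 values.
values = [float(i) for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000f", *values))


def read_weight(mapped: mmap.mmap, index: int) -> float:
    """Read one float32 directly from the mapping by byte offset."""
    return struct.unpack_from("<f", mapped, index * 4)[0]


with open(path, "rb") as f:
    # Map the file read-only rather than copying its contents into a bytes object.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first = read_weight(mapped, 0)
        last = read_weight(mapped, 999)
```

With a mapping, the operating system typically pages file contents in on demand, so a process does not need a second full copy of the weights in its own heap.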
B. The Plug-and-Play Developer (Rob’s Persona)
- Goal: Quickly implementing common AI features without building custom models.
- Tooling: MediaPipe Tasks.
- Application: Used for tasks like pose detection, hand gesture recognition, and image classification.
- Example: A selfie app that uses Pose Landmarker to track shoulder y-coordinates to trigger a camera shutter at the peak of a jump.
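The jump-detection idea in the selfie example can be sketched without any MediaPipe code. The snippet below assumes the shoulder y-coordinates have already been extracted per frame and are in normalized image coordinates, where y grows downward, so the top of a jump is the smallest y. The function name `find_jump_peak`, the `min_rise` threshold, and the sample values are hypothetical.

```python
from typing import Iterable, Optional


def find_jump_peak(shoulder_ys: Iterable[float],
                   baseline: float,
                   min_rise: float = 0.05) -> Optional[int]:
    """Return the frame index of the jump peak, or None if no jump is seen.

    baseline is the shoulder y while standing; min_rise is how far the
    shoulders must lift above it before a peak counts as a jump.
    """
    best_index: Optional[int] = None
    best_y = baseline
    for index, y in enumerate(shoulder_ys):
        if y < best_y:
            # Shoulders are still moving up in the image.
            best_y, best_index = y, index
        elif best_index is not None and baseline - best_y >= min_rise:
            # Shoulders stopped rising after a large enough lift: peak found.
            return best_index
    return None


frames = [0.50, 0.50, 0.46, 0.40, 0.36, 0.35, 0.37, 0.45, 0.50]
peak = find_jump_peak(frames, baseline=0.50)
```

In a live camera feed the peak is only confirmed one frame after it happens, so a real app would either accept that small delay or keep a short buffer of recent frames to save the peak frame itself.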
C. The Custom ML Engineer (Chris’s Persona)
- Goal: Full control over the ML pipeline using custom architectures (PyTorch, JAX, Keras).
- Tooling: LiteRT CLI and Compiled-Model API.
- Workflow:
- Convert custom models to the portable .tflite format.
- Perform ahead-of-time compilation for specific NPUs.
- Use the Compiled-Model API to automatically route inference to the most efficient hardware.
- Real-world Applications:
- Adobe Lightroom: 30% performance boost in image editing.
- Epic Games (Unreal Engine): Enables 30 FPS AR experiences via NPU.
- Argmax: Real-time audio transcription.
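The hardware-routing step in the workflow above can be pictured as a preference-ordered fallback. The sketch below is a conceptual illustration only and does not use or mirror the Compiled-Model API; the function `pick_backend`, the backend names, and the preference order are assumptions.

```python
from typing import Iterable, Sequence

# Assumed preference order for illustration: most efficient accelerator first.
PREFERENCE: Sequence[str] = ("npu", "gpu", "cpu")


def pick_backend(available: Iterable[str],
                 preference: Sequence[str] = PREFERENCE) -> str:
    """Return the most preferred backend that the device reports as available."""
    available_set = {name.lower() for name in available}
    for backend in preference:
        if backend in available_set:
            return backend
    raise RuntimeError("no supported backend available")


# A device without an NPU falls back to the GPU, then to the CPU.
chosen = pick_backend(["CPU", "GPU"])
```

Keeping the CPU as the last entry gives every device a working path even when no accelerator is present.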
3. Step-by-Step Implementation Framework
- Prototyping: Use the Gallery App or MediaPipe Studio to test capabilities and generate code via AI agents.
- Model Selection: Choose between pre-built tasks (MediaPipe), optimized LLMs (Gemma via LiteRT-LM), or custom models (LiteRT).
- Optimization: Use the Gemma Cookbook for fine-tuning and Google AI Edge Torch to convert weights into optimized formats.
- Benchmarking: Utilize the Google AI Edge portal to test performance on real physical hardware.
- Deployment: Integrate the library (Kotlin, Swift, C++, JS, or Flutter) and leverage hardware-specific delegates (GPU/NPU) for power efficiency.
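For the benchmarking step above, a local timing harness can complement device-fleet testing during development. The sketch below is a generic latency benchmark and is not part of the Google AI Edge portal or any LiteRT tooling; `benchmark` and the `run_inference` callable are hypothetical stand-ins for whatever invokes your model.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_inference: Callable[[], object],
              warmup: int = 3,
              runs: int = 20) -> Dict[str, float]:
    """Time repeated calls and report median, p95, and max latency in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Warm-up calls are not timed, so one-off setup cost does not skew results.
    for _ in range(warmup):
        run_inference()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    p95_index = (95 * len(latencies_ms) + 99) // 100 - 1
    return {
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "max_ms": latencies_ms[-1],
    }


# Stand-in workload; replace the lambda with a real model invocation.
stats = benchmark(lambda: sum(range(10_000)))
```

Reporting a high percentile alongside the median helps surface occasional slow runs, such as those caused by thermal throttling, that an average would hide.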
4. Notable Quotes
- "The hardware is so fast now that the bottleneck is no longer the processors; it's how fast the phone can move data from memory to the chip." — Sachin Kotwani
- "LiteRT-LM abstracts away the complex boilerplate of on-device AI." — Erin Walsh
5. Synthesis and Conclusion
The Google AI Edge ecosystem provides a tiered approach to on-device AI, catering to developers ranging from those needing simple, pre-built tasks to those requiring deep, custom-engineered pipelines. By leveraging LiteRT and MediaPipe, developers can achieve high-performance, low-latency, and privacy-focused AI features that function entirely offline. The key takeaway is that the barrier to entry for high-quality on-device AI has been significantly lowered through standardized runtimes, hardware-agnostic APIs, and robust community resources like the Gemma Cookbook and Hugging Face model repositories.