TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google
By AI Engineer
Key Concepts
- Edge AI: Running AI models directly on local devices (mobile, IoT, laptops) rather than in the cloud to improve latency, privacy, and offline functionality.
- Tiny LLMs (TLMs): Models typically under 1 billion parameters, optimized for specific, narrow tasks via fine-tuning.
- System-Level GenAI: Large models (2B–5B parameters) integrated into the OS (e.g., Android AI Core, Apple Intelligence) for broad, system-wide tasks.
- Agent Skills: A framework allowing models to dynamically load and execute tools (JavaScript, system intents) to extend their capabilities beyond static knowledge.
- MediaPipe & LiteRT (formerly TensorFlow Lite): Google’s cross-platform inference frameworks for deploying models on Android, iOS, Web, and IoT.
- Quantization & AOT/JIT Compilation: Techniques used to compress models and optimize them for specific hardware (CPU, GPU, or NPU).
- LoRA (Low-Rank Adaptation): A fine-tuning technique for adapting models to specific tasks with minimal memory overhead (8–100MB).
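A rough size estimate shows why a LoRA adapter can land in the tens-of-megabytes range. For a weight matrix of shape d_out × d_in, a rank-r adapter adds two small matrices with r × (d_in + d_out) parameters in total. The sketch below applies that count to a hypothetical model; the layer shapes, layer count, rank, and byte width are illustrative assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptedMatrix:
    """Shape of one weight matrix that receives a LoRA adapter (hypothetical)."""
    d_in: int
    d_out: int


def lora_param_count(matrices, rank, num_layers):
    """Count adapter parameters: A is (rank x d_in), B is (d_out x rank)."""
    per_layer = sum(rank * (m.d_in + m.d_out) for m in matrices)
    return per_layer * num_layers


def size_in_mb(param_count, bytes_per_param):
    """Convert a parameter count to megabytes (1 MB = 1e6 bytes)."""
    return param_count * bytes_per_param / 1e6


# Assumed shapes: four square attention projections of width 2048 per layer.
attention = [AdaptedMatrix(2048, 2048)] * 4
params = lora_param_count(attention, rank=16, num_layers=24)

print(params)                 # 6291456 adapter parameters
print(size_in_mb(params, 2))  # 12.582912 MB with 16-bit weights
```

Under these assumptions the adapter is about 12.6 MB, which sits inside the 8–100MB range quoted above. Raising the rank or adapting more matrices scales the size linearly.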
1. Overview of Edge AI Deployment
The speaker, Cormac Brick of Google AI Edge, emphasizes that Edge AI is essential for latency-sensitive applications (e.g., live voice translation on Pixel) and privacy-centric workflows where data must remain encrypted on the device.
- Deployment Stack: Google uses MediaPipe and LiteRT to enable a "write once, deploy everywhere" approach. A single model file can run on CPU/GPU across Android, iOS, macOS, Linux, and Windows.
- NPU Optimization: While CPU/GPU deployment uses JIT (Just-In-Time) compilation, NPUs require AOT (Ahead-Of-Time) compilation, resulting in specialized artifacts for specific hardware.
2. System-Level vs. In-App GenAI
The presentation distinguishes between two primary deployment patterns:
- System-Level GenAI: Large models (2B–5B parameters) pre-loaded in the OS. These are customized via prompting or "skills" and are intended for general-purpose assistance.
- In-App GenAI: Smaller, task-specific models (100M–500M parameters) bundled with an application. These require fine-tuning to achieve production-level reliability (85–90% accuracy).
3. Gemma 4 and Agentic Workflows
The recently launched Gemma 4 models (E2B and E4B) are designed for edge devices.
- Memory Efficiency: The "2B" and "4B" designations refer to the parameters that must remain resident in RAM. Other parameters (per-layer embeddings) are memory-mapped and loaded only as needed, significantly reducing the memory footprint.
- Agent Skills: By combining "thinking" capabilities with function calling, models can now use "skills."
- Mechanism: The model uses a "load skill" function to fetch metadata (skill.md) only when needed. This "progressive disclosure" keeps the context window small and efficient.
- Execution: Skills can include JavaScript for UI rendering (e.g., showing a map or a flashcard) or native system intents (e.g., toggling Wi-Fi).
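The progressive-disclosure idea can be sketched as a small registry: the prompt carries only one-line skill summaries, and a skill's full instructions enter the context only when the model calls a load function. This is a minimal illustration in plain Python. `SkillRegistry`, `load_skill`, and the skill contents are hypothetical names chosen for the example, not the LiteRT-LM or Gemma API.

```python
class SkillRegistry:
    """Holds skills; exposes short summaries up front, full text on demand."""

    def __init__(self):
        self._skills = {}  # name -> (summary, full instructions)

    def register(self, name, summary, instructions):
        self._skills[name] = (summary, instructions)

    def catalog(self):
        """Short listing that would be placed in the system prompt."""
        return "\n".join(
            f"- {name}: {summary}"
            for name, (summary, _) in sorted(self._skills.items())
        )

    def load_skill(self, name):
        """Tool the model calls to pull one skill's full text into context."""
        if name not in self._skills:
            return f"Unknown skill: {name}"
        return self._skills[name][1]


registry = SkillRegistry()
registry.register(
    "show_map",
    "Render a map for a place.",
    "Full instructions for show_map. " * 50,  # stands in for a long skill.md
)
registry.register(
    "toggle_wifi",
    "Turn Wi-Fi on or off.",
    "Full instructions for toggle_wifi. " * 50,
)

# At the start of a conversation the context holds only the summaries.
context = [registry.catalog()]

# The long instructions are appended only once the model asks for that skill.
context.append(registry.load_skill("toggle_wifi"))
```

The catalog stays a few dozen characters per skill regardless of how long each skill file is, so adding skills grows the baseline prompt slowly.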
4. Development Workflow and Tools
- LiteRT Torch: A package that enables native PyTorch optimizations and quantization for edge deployment.
- Synthetic Data: For tiny models, the recommended workflow is to use a large cloud-based LLM to generate synthetic data, then fine-tune the tiny base model on that data.
- Google AI Edge Gallery: An open-source app that serves as a sandbox for prototyping models and testing community-developed skills. It lets developers benchmark performance (tokens per second) on their own hardware.
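The synthetic-data workflow described above can be sketched as a loop that sends seed inputs to a teacher model and stores the resulting input/output pairs for fine-tuning. In this sketch the teacher is a local placeholder function; in practice it would be a call to a large cloud LLM. The function names, the prompt template, and the JSONL record layout are assumptions made for illustration.

```python
import json
from typing import Callable, Iterable


def build_synthetic_dataset(
    seeds: Iterable[str],
    teacher: Callable[[str], str],
    prompt_template: str,
) -> list[dict]:
    """Ask the teacher for a target per seed; drop empty and duplicate rows."""
    rows, seen = [], set()
    for seed in seeds:
        prompt = prompt_template.format(text=seed)
        target = teacher(prompt).strip()
        if not target or (seed, target) in seen:
            continue
        seen.add((seed, target))
        rows.append({"input": seed, "output": target})
    return rows


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def fake_teacher(prompt: str) -> str:
    """Placeholder for a large cloud LLM call (assumption for this sketch)."""
    text = prompt.split("Text: ", 1)[1]
    return text.replace("um, ", "").replace("uh, ", "")


rows = build_synthetic_dataset(
    seeds=["um, open the settings", "uh, call my sister", "um, open the settings"],
    teacher=fake_teacher,
    prompt_template="Remove filler words. Text: {text}",
)
print(to_jsonl(rows))
```

The duplicate seed is filtered out, so two rows are written. A real pipeline would likely add quality filtering of the teacher's outputs before fine-tuning the tiny model on them.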
5. Case Study: "AI Edge Eloquent"
This iOS app demonstrates the power of tiny models for transcription:
- Problem: Standard transcription captures "ums," "ahs," and speech errors.
- Solution: A two-step pipeline. First, an ASR (Automatic Speech Recognition) engine generates raw text. Second, a dedicated "tiny" LLM acts as a Text Polishing Engine to remove interjections and apply a "biasing list" (custom vocabulary/technical terms).
- Modularity: By keeping the ASR and the Polishing Engine as separate, modular models, developers can reuse the ASR component for other features, optimizing the total memory cost.
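The two-step pipeline can be sketched with the ASR engine and the polishing engine as separate, swappable callables. In the sketch below both stages are stand-ins: the ASR is a stub that returns fixed text, and the polisher is a rule-based function in place of the fine-tuned tiny LLM. All names and the example biasing list are hypothetical and do not reflect the app's actual implementation.

```python
import re
from typing import Callable

# Common spoken fillers, with optional trailing punctuation and whitespace.
FILLERS = re.compile(r"\b(?:um+|uh+|ah+|er+)\b[,.]?\s*", re.IGNORECASE)


def rule_based_polish(text: str, biasing_list: dict[str, str]) -> str:
    """Stand-in for the tiny LLM: strip fillers, then apply preferred terms."""
    cleaned = FILLERS.sub("", text)
    for heard, preferred in biasing_list.items():
        pattern = rf"\b{re.escape(heard)}\b"
        # A function replacement avoids backslash handling in `preferred`.
        cleaned = re.sub(pattern, lambda _m: preferred, cleaned, flags=re.IGNORECASE)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned[:1].upper() + cleaned[1:]


def transcribe_and_polish(
    audio: bytes,
    asr: Callable[[bytes], str],
    polish: Callable[[str, dict[str, str]], str],
    biasing_list: dict[str, str],
) -> tuple[str, str]:
    """Run ASR, then polishing; return both so the raw text stays reusable."""
    raw = asr(audio)
    return raw, polish(raw, biasing_list)


def fake_asr(audio: bytes) -> str:
    """Placeholder ASR engine returning a fixed raw transcript."""
    return "um we deployed the model with light RT uh on the phone"


raw, polished = transcribe_and_polish(
    audio=b"",
    asr=fake_asr,
    polish=rule_based_polish,
    biasing_list={"light RT": "LiteRT"},
)
print(polished)  # We deployed the model with LiteRT on the phone
```

Because the stages only share a string, the same ASR callable can feed other features, which is the reuse benefit the modularity point describes.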
6. Key Takeaways
- Fine-tuning is essential for tiny models: For models under 500M parameters, fine-tuning typically yields a 20–40 point improvement in reliability.
- Modular Architecture: Pragmatic mobile development favors chaining small, specialized models rather than one massive model.
- Community-Driven: The "AI Edge Gallery" encourages developers to share skills, which can be integrated into the app to extend functionality without updating the core model.
- Future Outlook: As mobile RAM remains constrained by cost, the industry is shifting toward "tiny" models and LoRA-based hot-swapping for task-specific performance.
"For the really, really tiny models, certainly less than 500 million parameters, in our experience, you need to fine-tune to get production-level reliability." (Cormac Brick)