
From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google

By AI Engineer

On-Device AI, LLM Optimization, Mobile AI Development


Key Concepts

AI Edge: Google’s stack for running AI models locally on devices (Android/iOS) to ensure privacy, low latency, and offline functionality.
Tiny LLMs (TLMs): Language models with fewer than 1 billion parameters, optimized for specific, boutique tasks on mobile hardware.
System GenAI: Pre-installed AI capabilities (e.g., Gemini Nano via AI Core) available for system-wide use.
App GenAI: Custom AI models bundled directly within an application, for specialized tasks that need a high degree of customization.
LiteRT (formerly TensorFlow Lite): A cross-framework runtime for executing models on CPU, GPU, or NPU.
Skill Harness: A framework that lets LLMs perform specific actions (e.g., opening a map, picking a restaurant) via tool calling.
Function Gemma: A specialized 270M-parameter model optimized for robust function calling.

1. AI Edge Architecture and Deployment

The speaker, Cormac Brick, a tech lead on the Google AI Edge team, outlines two primary paths for developers to integrate generative AI:

System-Level GenAI: Leveraging pre-installed models like Gemini Nano (Gemma 4 E2B/E4B) via AI Core. This is ideal for general tasks without increasing app size.
App-Level GenAI: Using the LiteRT runtime to bundle custom models directly into an app or web page. This requires more development effort but offers full customization and control.

Technical Infrastructure:

LiteRT: Supports over 2.7 billion devices. It acts as a cross-platform runtime that allows developers to target specific hardware (CPU, GPU, or NPU).
Model Format: The .tflite (LiteRT) format is the standard way to package a model for efficient on-device execution; the package can include the tokenizer alongside the model.

2. Agent Skills and Tool Calling

The presentation highlights the new "Skill Harness" capability, which allows models like Gemma 4 to interact with app-specific functions.

Methodology: The model is provided with a system prompt containing skill descriptions. It does not load all function details at once; instead, it uses a "load skill" tool call to selectively activate specific functions (e.g., Maps, Restaurant Roulette) only when needed.
Implementation: Developers can use JavaScript within the skill to render UI components (like a map or a roulette wheel) directly within the app.
Development: The team uses Gemini CLI to create and test skills. The process is iterative: developers can publish a skill to GitHub and load it into the Google AI Edge Gallery app via its URL.
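
The selective loading described under Methodology can be sketched roughly as follows. This is a minimal illustration and not the actual Skill Harness API: the `Skill` and `SkillHarness` classes, the `load_skill` argument name, the dictionary shape of a tool call, and the `open_map` tool are all assumptions made for the example. The idea it shows is that the system prompt carries only short skill descriptions, and a skill's functions become callable only after the model asks to load that skill.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Skill:
    name: str
    description: str  # one-line summary; the only part placed in the system prompt
    # Full tool implementations, kept out of the prompt until the skill is loaded.
    tools: Dict[str, Callable[..., dict]] = field(default_factory=dict)


class SkillHarness:
    """Hypothetical harness: routes tool calls and loads skills on demand."""

    def __init__(self, skills):
        self._skills = {s.name: s for s in skills}
        self._active = {}  # tools made callable by earlier load_skill calls

    def system_prompt(self) -> str:
        lines = ["Call load_skill(skill=<name>) before using a skill. Available skills:"]
        lines += [f"- {s.name}: {s.description}" for s in self._skills.values()]
        return "\n".join(lines)

    def handle_tool_call(self, call: dict) -> dict:
        name, args = call["name"], call.get("arguments", {})
        if name == "load_skill":
            skill = self._skills.get(args.get("skill"))
            if skill is None:
                return {"error": f"no such skill: {args.get('skill')}"}
            self._active.update(skill.tools)
            return {"loaded": skill.name, "tools": sorted(skill.tools)}
        if name in self._active:
            return self._active[name](**args)
        return {"error": f"unknown or unloaded tool: {name}"}


# Hypothetical skill for illustration; a real skill would call into the app.
maps = Skill("maps", "Show a place on a map.",
             {"open_map": lambda place: {"action": "open_map", "place": place}})
harness = SkillHarness([maps])
```

In this sketch, calling `open_map` before `load_skill` returns an error, which mirrors the point that function details are not available to the model until the skill is activated. Keeping the prompt short in this way matters most for small models with limited context.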

3. Tiny LLMs (TLMs) and Fine-Tuning

For tasks requiring extreme efficiency, the speaker advocates for models under 1 billion parameters.

Function Gemma (270M parameters): Designed for high-reliability function calling.
Fine-Tuning Workflow:
1. Synthetic Data Generation: Use larger models (like Gemini) to generate training data.
2. Fine-Tuning: Use tools like the "Function Gemma fine-tuning lab" (available on Hugging Face) to train the model on specific app intents.
3. Evaluation: Measure the function-calling success rate before and after fine-tuning. The reported improvement from this workflow is from ~46% to over 90%.
Real-World Application: The "Eloquent" app chains two tiny models, one for ASR (automatic speech recognition) and one for text polishing, to provide offline, personalized transcription that removes filler words and recognizes custom jargon.
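
For the fine-tuning workflow above, one way to store synthetic examples and score function-calling success is sketched below. Everything here is an assumption made for illustration: the `TOOLS` table, the JSONL record layout, and above all the premise that the model emits its call as a JSON object (Function Gemma's real output format is not described in this summary). The metric is strict exact match on function name and arguments, which is one plausible reading of "success rate", not necessarily the one used in the talk.

```python
import json

# Hypothetical app intents; real tool names and schemas would come from the app.
TOOLS = {"set_alarm": ["time"], "open_map": ["place"]}


def make_example(utterance, name, arguments):
    """Package one synthetic (utterance -> function call) pair as a JSONL line."""
    return json.dumps({"prompt": utterance,
                       "call": {"name": name, "arguments": arguments}})


def parse_call(text):
    """Parse raw model output into a call dict; None if malformed or unknown tool."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or call.get("name") not in TOOLS:
        return None
    return call


def success_rate(outputs, expected):
    """Fraction of outputs whose function name and arguments both match exactly."""
    hits = 0
    for raw, want in zip(outputs, expected):
        got = parse_call(raw)
        if (got is not None
                and got["name"] == want["name"]
                and got.get("arguments") == want["arguments"]):
            hits += 1
    return hits / len(expected) if expected else 0.0
```

Running `success_rate` on the same held-out prompts with the base model's outputs and then the fine-tuned model's outputs gives a before-and-after comparison. Counting malformed output as a failure is a deliberate choice in this sketch, since an app cannot act on a call it cannot parse.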

4. Notable Statements

"If you have a more specific task that you want to do that's kind of highly customized or something really boutique, you can use an App Gen AI."
"You can use skills to write skills." (Referring to the ability to use LLMs to generate the code/prompts for new agent capabilities).
Regarding the limits of current agent technology: "The thing we're still working on that's harder is through a single interaction with the app for the app to know to call multiple skills as part of a single answer."

5. Synthesis and Takeaways

The presentation emphasizes a shift toward on-device intelligence where developers choose between system-provided models for convenience or custom-tuned tiny models for specialized performance. The core takeaway is that robustness in small models is achieved through fine-tuning on synthetic datasets rather than relying solely on large-model prompting. Developers are encouraged to use the Google AI Edge Gallery as a sandbox to test these models and contribute to the open-source ecosystem of "skills."
