Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI
By AI Engineer
Key Concepts
- MLX: An Apple-developed machine learning framework specifically optimized for Apple Silicon (iPhone, iPad, and Mac).
- MLX Swift LM: A GitHub repository/framework used to integrate and run Large Language Models (LLMs) on iOS, iPadOS, and macOS.
- Quantization: The process of reducing the precision of model weights (e.g., 4-bit, 8-bit) to make them smaller and faster for on-device execution.
- Hugging Face MLX Community: A repository hub where quantized versions of popular models are hosted for easy integration.
- Locally AI: An iOS application that allows users to run on-device models; recently acquired by LM Studio.
- Tool Calling: The ability of an LLM to interact with external systems or APIs to perform specific tasks.
1. Running LLMs on iPhone with MLX
The presentation outlines how developers can deploy models like Google’s Gemma 4 on Apple devices using the MLX framework. MLX is designed to leverage the specific architecture of Apple Silicon, ensuring high performance and efficiency.
- Implementation: Developers should use the MLX Swift LM repository. The API is described as "straightforward," allowing integration in under 10 minutes.
- Ecosystem: Beyond text models, the MLX ecosystem includes MLX VLM (vision), MLX Audio, and MLX Video for image/video generation, enabling "omni-model" capabilities on mobile.
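To make the "under 10 minutes" claim concrete, here is a minimal sketch of loading and querying a model with the MLX Swift LM libraries. The model ID is an illustrative placeholder from the MLX Community, and the API names reflect the package at the time of writing and may differ between versions:

```swift
import MLXLLM
import MLXLMCommon

// Load a quantized model from the Hugging Face MLX Community by its ID.
// (Illustrative ID; substitute any MLX-converted checkpoint.)
let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "mlx-community/gemma-2-2b-it-4bit"))

// Run a single prompt and collect the generated text.
let result = try await container.perform { context in
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Name two things to see in San Francisco."))
    return try MLXLMCommon.generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.6),
        context: context
    ) { _ in .more }  // return .stop to end generation early
}
print(result.output)
```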
2. Model Sourcing and Optimization
- Hugging Face: This is the primary source for model weights. Developers are directed to the "MLX Community" section on Hugging Face, which hosts thousands of models already quantized for MLX.
- Quantization Strategy:
- Recommended Range: 4-bit to 8-bit.
- Trade-offs: Anything below 4-bit significantly degrades output quality. 8-bit is recommended for smaller models to maintain accuracy.
- Performance: On modern iPhones, a 4-bit quantized Gemma 4 model can achieve speeds of approximately 40 tokens per second, which is highly efficient for real-time streaming.
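The bit width maps directly onto download size and memory: weight storage scales linearly with bits per weight. A back-of-the-envelope sizing helper (a sketch that ignores the KV cache, activations, and quantization metadata, which add some overhead):

```swift
// Rough weight-memory estimate for a quantized model.
// Ignores KV cache, activations, and quantization group scales/biases.
func approxWeightGB(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8 / 1_000_000_000
}

// A 4B-parameter model: ~2 GB at 4-bit, ~4 GB at 8-bit, ~8 GB at float16.
print(approxWeightGB(parameters: 4e9, bitsPerWeight: 4))   // 2.0
print(approxWeightGB(parameters: 4e9, bitsPerWeight: 8))   // 4.0
print(approxWeightGB(parameters: 4e9, bitsPerWeight: 16))  // 8.0
```

This arithmetic is why 4-bit is the sweet spot on phones: halving the bits roughly halves the footprint, while below 4-bit the quality loss outweighs the savings.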
3. Development Workflow
- Integration: Install the MLX Swift LM package into the iOS/macOS project.
- Model Selection: Identify a model ID from the Hugging Face MLX Community.
- Deployment: Pass the model ID to the framework, which handles the download and integration automatically.
- Tool Calling: The framework supports tool calling, allowing models to interact with external systems, a feature that has seen significant improvement in recent model iterations.
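For the integration step, the package is added with Swift Package Manager. A sketch of a manifest, assuming the MLX Swift LM libraries are consumed from the ml-explore/mlx-swift-examples repository (the URL and product name below are what that repository vends at the time of writing and may change):

```swift
// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "OnDeviceChat",
    platforms: [.iOS(.v16), .macOS(.v14)],
    dependencies: [
        // Assumption: MLX Swift LM is distributed via the mlx-swift-examples repo.
        .package(url: "https://github.com/ml-explore/mlx-swift-examples", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "OnDeviceChat",
            dependencies: [
                .product(name: "MLXLLM", package: "mlx-swift-examples")
            ]
        )
    ]
)
```

With the dependency in place, the workflow above reduces to picking a model ID and passing it to the loader; the framework downloads and caches the weights on first use.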
4. Real-World Applications and Tools
- Locally AI App: A native chatbot app that demonstrates the capability of running models offline. It supports various models, including Gemma 4 and smaller models (e.g., 350M parameters) that can be integrated into iOS Shortcuts for automation.
- LM Studio: Following the acquisition of Locally AI, LM Studio serves as an AI studio for local models. It allows users to run models via llama.cpp or MLX, host local servers, and connect apps using standard API formats (OpenAI or Anthropic response types).
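Because LM Studio's local server speaks the standard OpenAI chat-completion format, connecting an app is a plain HTTP request. A minimal sketch, assuming LM Studio's default server address (http://localhost:1234) and a placeholder model name:

```swift
import Foundation

// Minimal chat-completion request against a local LM Studio server.
// The port is LM Studio's default and the model name is a placeholder;
// both are configurable in the LM Studio UI.
var request = URLRequest(url: URL(string: "http://localhost:1234/v1/chat/completions")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.httpBody = try JSONSerialization.data(withJSONObject: [
    "model": "local-model",
    "messages": [["role": "user", "content": "Say hello from a local model."]]
])

let (data, _) = try await URLSession.shared.data(for: request)
// Response follows the OpenAI schema: choices[0].message.content holds the reply.
print(String(decoding: data, as: UTF8.self))
```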
5. Notable Quotes
- "In less than 10 minutes, you can have an iOS app with a model that is running on your device." — Adria, on the simplicity of the MLX Swift LM framework.
- "40 tokens per second is more than acceptable for a lot of use cases." — Regarding the performance of 4-bit quantized models on current-generation iPhones.
6. Synthesis and Conclusion
Running LLMs on-device is becoming increasingly accessible due to the synergy between Apple’s MLX framework and the active quantization efforts of the Hugging Face community. The primary barrier remains the model size (typically 1GB to 3GB), but as hardware improves and models become more efficient, on-device AI is reaching a high level of usability. Developers can now easily integrate sophisticated models into native apps, enabling offline, private, and fast AI interactions with support for advanced features like tool calling.