Web AI Summit 2025: State of Client-Side AI

Key Concepts

Web AI: The practice of running machine learning models client-side within a web browser, utilizing the user's device resources (CPU, GPU, NPU) and web technologies like JavaScript, WebAssembly, WebGPU, and WebNN.
Cloud AI: AI models that execute on server-side infrastructure and are accessed via cloud-based APIs, requiring an active internet connection.
Agentic Behavior: The ability of a system to autonomously perform advanced tasks on behalf of a user by breaking them down into structured steps, dynamically selecting and using tools, and interacting with the outside world.
Tools: Functions, APIs, or data sources that an AI agent can call to gather information or perform actions, bridging the gap between the language model and the external world.
Client-Side AI: AI processing that happens directly on the user's device, as opposed to on a server.
Low Latency: Reduced delay in processing and response times, crucial for real-time applications.
Frictionless Experience: User interactions that require no installation or complex setup.
Agent Orchestration Layer: The component that manages and coordinates the AI agent's components, including language models, memory, and tools.
WebGPU: A web API that provides access to the GPU for high-performance graphics and general-purpose computation.
WebAssembly (Wasm): A binary instruction format for a stack-based virtual machine, designed as a portable compilation target for high-level languages, enabling near-native performance in web browsers.
WebNN: An emerging web standard for neural network inference on the client side.
LLM (Large Language Model): A type of AI model trained on vast amounts of text data, capable of understanding and generating human-like text.
Multimodal Models: AI models that can process and understand information from multiple modalities, such as text, images, and audio.
Vector Database: A database designed to store and query vector embeddings, often used for semantic search and similarity matching.
Caching Proxy Library: A tool that helps manage the caching of large models fetched from a server, reducing repeated downloads.
Web MCP (Web Model Context Protocol): A proposed web protocol that lets websites expose their tools to AI agents in a standardized way.
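To make the caching concept above more concrete, here is a minimal sketch of the check-then-download pattern that such a layer typically follows. The names `ModelStore`, `Downloader`, `createMemoryStore`, and `loadModel` are hypothetical and are not the API of any particular caching proxy library. The in-memory store stands in for persistent browser storage such as the Cache API or IndexedDB.

```typescript
// Hypothetical storage interface; a browser build might back this with the Cache API.
interface ModelStore {
  get(url: string): Promise<Uint8Array | undefined>;
  put(url: string, bytes: Uint8Array): Promise<void>;
}

// Function that actually downloads the bytes (injected so the sketch has no network dependency).
type Downloader = (url: string) => Promise<Uint8Array>;

// Simple in-memory store used here in place of persistent browser storage.
function createMemoryStore(): ModelStore {
  const entries = new Map<string, Uint8Array>();
  return {
    get: async (url) => entries.get(url),
    put: async (url, bytes) => {
      entries.set(url, bytes);
    },
  };
}

// Return stored bytes when present; otherwise download once and remember the result.
async function loadModel(
  url: string,
  store: ModelStore,
  download: Downloader,
): Promise<Uint8Array> {
  const cached = await store.get(url);
  if (cached !== undefined) return cached;
  const bytes = await download(url);
  await store.put(url, bytes);
  return bytes;
}
```

Calling `loadModel` twice with the same URL and store should trigger the downloader only once, which is the behavior that avoids repeated downloads of a large model.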

Web AI Summit: Kicking Off a New Era of Agentic Web Experiences

This summary details the opening of the second Web AI Summit, focusing on the evolution of client-side AI and the emergence of agentic capabilities within web browsers. The presenter, Jason Mayes, Web AI Lead at Google, highlights the rapid growth of Web AI and its potential to revolutionize how users interact with the internet.

Introduction and Vision

Jason Mayes opens the summit with a rap performance, showcasing the integration of AI with human creativity. He emphasizes that the visuals in his rap video are real-world client-side Web AI demos shared by the community. He introduces himself as "the web AI guy at Google" and sets the stage for a discussion on frugal AI usage through client-side processing, which offers real-time results, total privacy, and low costs.

The Rise of Web AI

Growth and Adoption: Web AI has been a focus since 2017. The summit, initiated internally in 2022, has grown from 10 engineers to over 1,500 Googlers. Model and library usage has grown exponentially, with yearly downloads rising from 1 million in 2019 to over 1.2 billion four years later, an increase of more than 1,000x.
Summit Expansion: The summit is now public for the second time, having doubled in size since the previous year, bringing together a global community of innovators.
Community Showcase: The summit features presenters from over 20 teams and individuals already utilizing Web AI, aiming to accelerate innovation in the JavaScript community for client-side AI.

Defining Web AI and its Advantages

Mayes formally defines Web AI as the art of running machine learning models client-side in a web browser. This contrasts with Cloud AI, where models run on servers and require constant internet connectivity.
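Running client-side means a page has to work with whatever the user's device offers. As a rough sketch, the function below picks an execution path from a set of detected capabilities. The `DeviceCapabilities` shape, the `pickBackend` name, and the preference order are assumptions for illustration; real Web AI libraries apply their own detection and heuristics.

```typescript
// Capabilities a page might detect at startup (the detection itself is out of scope here).
interface DeviceCapabilities {
  webgpu: boolean; // e.g. whether `navigator.gpu` is present
  webnn: boolean; // e.g. whether `navigator.ml` is present
  wasm: boolean; // e.g. whether the `WebAssembly` global is present
}

type Backend = "webgpu" | "webnn" | "wasm" | "js";

// Prefer accelerated paths first and fall back to plain JavaScript on the CPU.
// This ordering is an assumption, not a recommendation from the talk.
function pickBackend(caps: DeviceCapabilities): Backend {
  if (caps.webgpu) return "webgpu";
  if (caps.webnn) return "webnn";
  if (caps.wasm) return "wasm";
  return "js";
}

// A device without GPU or NPU access still gets a working path.
const chosenBackend = pickBackend({ webgpu: false, webnn: false, wasm: true }); // "wasm"
```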

Key advantages of Web AI over Cloud AI include:

Privacy: No user data (camera, microphone, text) needs to be sent to remote servers, protecting personal information.
Offline Capability: Tasks can be performed on the device even in areas with low or no connectivity after the initial page load.
Low Latency: Real-time results are achievable because data does not need to travel to the cloud and back, which is crucial for mobile users. MediaPipe models for body pose and segmentation can run at over 120 frames per second on mid-range GPUs with high accuracy.
Lower Cost: Eliminates the need for expensive cloud-based graphics cards and processing.
Frictionless Experience: No installation is required; users simply access a link and it works.
Reach and Scale: Leverages the vast ecosystem of over 6 billion browser-enabled devices.
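The latency point can be checked with simple arithmetic: at 120 frames per second, each frame has roughly 8.3 ms available. The sketch below computes that budget and compares it against two illustrative paths. The 5 ms inference time and 50 ms network round trip are assumed figures chosen only to show the calculation, not measurements from the talk.

```typescript
// Time available per frame at a given frame rate, in milliseconds.
function frameBudgetMs(framesPerSecond: number): number {
  return 1000 / framesPerSecond;
}

// True when the total processing time for a frame fits within one frame interval.
function fitsInFrame(totalLatencyMs: number, framesPerSecond: number): boolean {
  return totalLatencyMs <= frameBudgetMs(framesPerSecond);
}

const budgetAt120 = frameBudgetMs(120); // about 8.33 ms per frame

// Assumed figures: 5 ms of inference, plus a 50 ms round trip for the cloud path.
const localFits = fitsInFrame(5, 120); // true
const cloudFits = fitsInFrame(5 + 50, 120); // false
```

Under these assumptions the network round trip alone exceeds the per-frame budget, which is why per-frame tasks such as pose tracking benefit from running on the device.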

The Future: Agentic Internet and On-Device AI

Mayes presents a vision of an agentic internet in which websites are compatible with AI agents, allowing users to complete tasks through natural language interaction. He predicts that websites that do not embrace this will struggle to compete, much as happened with the shift towards mobile-first design.

Key aspects of agentic behavior discussed:

Definition of an Agent: A system that autonomously performs advanced tasks by breaking them down into structured steps, dynamically selecting and using tools, and interacting with the outside world to fill knowledge gaps.
The Role of Tools: Language models, while knowledgeable, can hallucinate. Tools (e.g., API calls, function executions, vector data store lookups) allow agents to access real-time, accurate information and bridge the gap to the external world.
Language Models: Agents can utilize one or more LLMs or multimodal models capable of instruction-based reasoning.
Orchestration: An orchestration layer manages multiple models and delegates subtasks. Agents plan, define subtasks, choose appropriate tools, and use outputs as context for future steps, potentially cycling through steps to complete a job without constant human intervention.
Agentic System Components: Typically comprises one or more language models, a memory implementation, and tools, all coordinated by an agent runtime.
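The components listed above can be sketched as a small orchestration loop: a model chooses the next step, the runtime runs the selected tool, and the result is stored in memory as context for the following step. All names here (`AgentTool`, `AgentStep`, `PlannerModel`, `runAgent`) are hypothetical, and the model is a synchronous stand-in; a real model call would be asynchronous and far less predictable.

```typescript
// A tool is a named function the agent may call.
type AgentTool = (input: string) => string;

// What the model decides at each step: call a tool, or finish with an answer.
type AgentStep =
  | { kind: "tool"; tool: string; input: string }
  | { kind: "answer"; text: string };

// Stand-in for a language model; it sees the memory so far and returns the next step.
type PlannerModel = (memory: string[]) => AgentStep;

// Orchestration loop: ask the model, run the chosen tool, feed the result back as context.
function runAgent(
  task: string,
  model: PlannerModel,
  tools: Map<string, AgentTool>,
  maxSteps = 5,
): string {
  const memory: string[] = [`task: ${task}`];
  for (let i = 0; i < maxSteps; i++) {
    const step = model(memory);
    if (step.kind === "answer") return step.text;
    const tool = tools.get(step.tool);
    const result = tool !== undefined ? tool(step.input) : `unknown tool: ${step.tool}`;
    memory.push(`${step.tool}(${step.input}) -> ${result}`);
  }
  // Guard against endless cycling when the model never produces an answer.
  return "stopped: step limit reached";
}
```

The step limit is a design choice for the sketch: because agents may cycle through steps without human intervention, some bound on the loop keeps a confused model from running indefinitely.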

Web AI Prototype: Simulating a Flight Search Agent

Mayes demonstrates a Web AI prototype that puts agentic concepts into action. It runs locally on an Nvidia 1070 GPU that is five generations old, using Google's Gemma model and the MediaPipe Web LLM library.

Demo Walkthrough:

Flight Search Simulation: The prototype simulates a flight search page.
Initial Interaction: The user initiates a request for a holiday with a friend, but provides incomplete information.
Agent's Response: The agent infers that two people are traveling and identifies the need for more data to use the website's search tool. It asks a follow-up question.
User Input: The user provides departure location (San Francisco), a desired destination type (skiing in the French Alps), and dates (December 5th for one week).
Agent's Action: The agent extracts date information, interprets "one week later," and suggests a popular skiing destination (Chamonix). It then calls the search tool with the gathered data.
Fictional Results: For demonstration, the LLM generates fictional flight results, showcasing the model's knowledge.
Change of Plans: The user modifies the request: an extra friend joins (total 3 passengers), the destination changes to Tokyo, seats are upgraded to business class, and dates are shifted to autumn (September-November).
Successful Adaptation: The agent successfully increments the passenger count, sets seats to business, changes the destination to Tokyo, and suggests dates within the specified autumn range.
Cost-Effectiveness: All processing occurs locally, incurring no inference costs beyond the initial model download.
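The walkthrough above amounts to filling in, and later revising, a structured search request. The sketch below shows one way that bookkeeping could look. The `FlightQuery` fields, the helper names, and the year 2025 in the dates are assumptions for illustration; the demo's actual data structures are not described in the talk.

```typescript
// Parameters a hypothetical search tool needs before it can be called.
interface FlightQuery {
  origin?: string;
  destination?: string;
  departDate?: string; // ISO date, e.g. "2025-12-05"
  returnDate?: string;
  passengers?: number;
  cabin?: "economy" | "business";
}

const requiredFields: (keyof FlightQuery)[] = [
  "origin", "destination", "departDate", "returnDate", "passengers",
];

// Fields still missing; a non-empty result means the agent should ask a follow-up question.
function missingFields(query: FlightQuery): (keyof FlightQuery)[] {
  return requiredFields.filter((field) => query[field] === undefined);
}

// Add a number of days to an ISO date, e.g. to resolve "one week later".
function addDays(isoDate: string, days: number): string {
  const date = new Date(`${isoDate}T00:00:00Z`);
  date.setUTCDate(date.getUTCDate() + days);
  return date.toISOString().slice(0, 10);
}

// Merge newly extracted details into the existing request instead of starting over.
function applyUpdate(query: FlightQuery, update: Partial<FlightQuery>): FlightQuery {
  return { ...query, ...update };
}

// "A holiday with a friend" implies two passengers, but nothing else is known yet.
const initialQuery: FlightQuery = { passengers: 2 };
const stillNeeded = missingFields(initialQuery); // origin, destination, departDate, returnDate

const completeQuery = applyUpdate(initialQuery, {
  origin: "San Francisco",
  destination: "Chamonix",
  departDate: "2025-12-05",
  returnDate: addDays("2025-12-05", 7),
});

// Change of plans: one more passenger, a new destination, and a cabin upgrade.
const revisedQuery = applyUpdate(completeQuery, {
  passengers: (completeQuery.passengers ?? 0) + 1,
  destination: "Tokyo",
  cabin: "business",
});
```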

Versatility and Reusability

Web AI DJ Prototype: Mayes demonstrates the reusability of the agentic framework by creating a Web AI DJ in one day. This prototype uses public APIs for music services and an upgraded text-to-speech engine (Kokoro via Transformers.js) for a more natural voice.
Example Interaction: The DJ agent is asked to play music suitable for late-night coding sessions, described as "chill and not too many words."
Agent's Process: The agent uses its knowledge to select artists, announces the track using the Kokoro TTS model (running locally), and then calls the Spotify API to play a song by the chosen artist.
Rapid Prototyping: This example highlights how quickly a new prototype can be built by changing the agent's tools and persona.
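One way to picture this reuse is to treat an agent as configuration: the surrounding code stays the same while the persona and tool set change. The sketch below is an assumption about how such a setup could be organized, not a description of the demo's code. The tool names and bodies are placeholders and do not call any real music or flight service.

```typescript
// An agent definition is plain data: a persona prompt plus the tools it may call.
interface AgentConfig {
  persona: string;
  tools: Record<string, (input: string) => string>;
}

// Build the instruction text handed to the model from the persona and tool names.
function buildSystemPrompt(config: AgentConfig): string {
  const toolNames = Object.keys(config.tools).join(", ");
  return `${config.persona}\nAvailable tools: ${toolNames}`;
}

// Two prototypes that share the code above and differ only in configuration.
const flightAgent: AgentConfig = {
  persona: "You help users search for flights.",
  tools: {
    searchFlights: (query) => `flights for ${query}`, // placeholder body
  },
};

const djAgent: AgentConfig = {
  persona: "You are a radio DJ who picks and announces music.",
  tools: {
    announce: (text) => `speaking: ${text}`, // placeholder for a local TTS call
    playArtist: (artist) => `playing ${artist}`, // placeholder for a music service call
  },
};

const djPrompt = buildSystemPrompt(djAgent);
```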

Future Implications and Hybrid Approaches

Smaller, Tuned Models: Mayes predicts the rise of smaller LLMs (sub-billion to 9 billion parameters) that are fine-tuned for specific purposes and run on consumer hardware for agentic behaviors. These models can be optimized for tool selection and execution.
Agentic Internet: Websites could bid to perform useful work for end-user agents rather than relying on traditional advertising.
Hybrid Approaches: Combining Web AI with cloud AI offers the best of both worlds. For instance, a client-side Web AI agent can leverage a cloud-based vector database for domain-specific knowledge to answer complex user requests.
Standardization: The future may see AI-compatible websites exposing their tools via standardized protocols like the proposed Web MCP, enabling natural language interaction for task completion.
Browser as a Hub: The browser is an ideal platform for agentic interactions, as users are already signed in to various services, allowing agents to access tools from multiple sources seamlessly.
Next Generation Interaction: Future generations may find it alien not to be able to simply instruct agents to create things and perform tasks, much as current generations grew up taking the internet and search engines for granted.
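For the hybrid approach described above, the core operation of a vector database is ranking stored items by similarity to a query embedding. The sketch below shows that ranking with cosine similarity on tiny made-up vectors. The `DocEntry` type and function names are hypothetical. In a real hybrid setup the lookup would run in a cloud service, with embeddings produced by a model rather than written by hand.

```typescript
// One stored item: a text snippet and its embedding vector.
interface DocEntry {
  text: string;
  embedding: number[];
}

// Cosine similarity between two non-zero vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k entries most similar to the query embedding, best match first.
function topK(query: number[], entries: DocEntry[], k: number): DocEntry[] {
  return [...entries]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding),
    )
    .slice(0, k);
}

// Made-up two-dimensional embeddings, purely for illustration.
const knowledgeBase: DocEntry[] = [
  { text: "baggage policy", embedding: [1, 0] },
  { text: "refund rules", embedding: [0, 1] },
  { text: "carry-on limits", embedding: [0.9, 0.1] },
];

const bestMatches = topK([1, 0], knowledgeBase, 2).map((entry) => entry.text);
```

The matching text snippets would then be passed to the client-side agent as extra context for answering the user's request.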

Call to Action

Mayes encourages the audience to explore Web AI and its agentic applications in their respective industries. He emphasizes that the attendees have a unique opportunity to shape the future of the internet. He invites them to share their creations using the #WebAI hashtag and potentially present at future summits.

Acknowledgements

Mayes expresses gratitude to the presenters, companies providing demos, ARM for sponsoring an external speaker, and the behind-the-scenes team (over 46 individuals) who contributed to the summit's success. He also gives a shout-out to Thomas Steiner, whose code he built upon for his demos.