Don't let AI agents push your buttons - use WebMCP instead!

Key Concepts

Agents: Personal assistants that act on behalf of users to accomplish tasks on the web.
UI Actuation: The traditional method of agents interacting with websites by parsing pages, clicking UI elements, and waiting for animations.
Model Context Protocol (MCP): A proposed standard for connecting AI applications to external systems, aiming to be the "USB-C port of AI applications" for backend integrations.
Web MCP: An MCP-like API designed specifically for the web platform to enable cooperative interactions between users, web pages, and agents.
Agent-Specific Paths: Additions to websites that expand how agents can interact beyond human-designed UI.
Three Pillars of Agent-Specific Paths:
- Context: Data the agent needs to understand the user's current activity, along with their long-term memory.
- Capabilities: Ability for agents to take actions on behalf of users.
- Coordination: Optimization of control flow between user and agent.
Tools: Well-documented APIs declared by websites for agents to invoke.
Progressive Enhancement: Web MCP's design allows agents to prefer tools but fall back to web UI if necessary.

Cooperative Interaction: The Future of the Web with Agents

Khushal Sagar, a software engineer on the Chrome team with over a decade of experience in browser development, presented an idea incubating at the Web Machine Learning (WebML) Community Group that aims to shape how the web evolves alongside AI. The core concept revolves around agents, which function as personal assistants built directly into the browser and help users get things done on the web more efficiently.

The Problem with Current Agent Interactions

Currently, agents interact with websites through a process called UI actuation. This involves parsing web pages, clicking on user interface (UI) elements, and waiting for animations. Sagar argues that this method, which relies on a UI designed for human users, is inefficient and hinders agents from accomplishing user goals effectively. The problem of connecting agents to external systems is not unique to the web, and it led to the development of Model Context Protocol (MCP) as a standard for AI applications, akin to a "USB-C port." MCP can facilitate backend integrations, but human use of the web, with its rich UI experiences, will persist.

The Vision: A Shared Interface for Cooperative Interaction

The future envisioned is a shared interface that fosters visually rich, cooperative interactions between the user, the web page, and the agent. This model aims to enhance, rather than replace, the connection between the site and the user, making websites more useful. This cooperative model is built upon agent-specific paths added to websites, which go beyond human-designed UI.

Three Pillars of Agent-Specific Paths

These agent-specific paths are built on three fundamental pillars:

Context: This refers to the data an agent needs to understand the user's current activity and their long-term memory. Browsers can provide the application state visible to them as context, such as information within the DOM. However, this information is often limited. For instance, if a user is watching a lecture series, the browser might not have access to information from other chapters, which could be crucial for answering a user's question or navigating to a relevant section. Imperative rendering techniques such as drawing to a canvas element restrict the browser's visibility further, because the rendered content is not represented in the DOM at all.
Capabilities: This pillar empowers agents to take actions on the user's behalf, moving beyond simply answering questions. By exposing actions to agents, websites can help users navigate faster, similar to how web UI is optimized for humans. This allows agents to perform tasks for users, not just assist them.
Coordination: Website authors can optimize the control flow between the user and the agent. For example, an agent can drive an interaction until human input is required, such as when a user needs to make a choice between available options (e.g., whole milk vs. 2% milk).
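
The coordination pillar can be sketched as a tool that either finishes its work or hands control back to the user. This is only an illustration of the milk example above: the type names, the function names, and the callback used to ask the user are all hypothetical and do not come from the Web MCP proposal.

```typescript
// Hypothetical sketch: none of these names come from the Web MCP proposal.
type ToolOutcome =
  | { kind: "done"; summary: string }
  | { kind: "needs_user"; question: string; options: string[] };

interface CartState {
  items: string[];
}

// A site-defined action: add milk, but hand control back to the user
// when the request does not say which variant is wanted.
function addMilkToCart(cart: CartState, variant?: string): ToolOutcome {
  const variants = ["whole milk", "2% milk"];
  if (variant === undefined || variants.indexOf(variant) === -1) {
    return { kind: "needs_user", question: "Which milk would you like?", options: variants };
  }
  cart.items.push(variant);
  return { kind: "done", summary: `Added ${variant} to the cart` };
}

// The agent drives the interaction until the site signals that human input is required.
function driveInteraction(
  cart: CartState,
  askUser: (question: string, options: string[]) => string
): string {
  let outcome = addMilkToCart(cart);
  if (outcome.kind === "needs_user") {
    const choice = askUser(outcome.question, outcome.options);
    outcome = addMilkToCart(cart, choice);
  }
  return outcome.kind === "done" ? outcome.summary : "Still waiting for the user";
}

// Usage: the callback stands in for the user picking an option in the site's UI.
const demoCart: CartState = { items: [] };
const coordinationSummary = driveInteraction(demoCart, () => "2% milk");
```

Returning an explicit "needs_user" outcome keeps the decision about when to pause with the site author, which is the point of the coordination pillar.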

Web MCP: An API for Cooperative Web Interactions

Web MCP is an API designed with these three pillars in mind. The process generally works as follows:

A user or agent loads a web page.
The loaded page declares its agent-specific functionality as tools, which can be thought of as well-documented APIs.
This set of tools is sent to the browser's agent.
The agent selects the appropriate tool based on the user's query.
The request is routed back to the page, and the corresponding function is executed.
During execution, the website can:
- Request user input if needed.
- Present relevant information in its UI.
- Access local client state (e.g., selected text).
- Utilize cookies for authentication and authorization.
- Interact with its backend server.
The result of the execution is returned to the browser's agent, which then plans its next action.
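
The steps above can be modeled with a small in-memory sketch. It is not the Web MCP API surface: PageToolRegistry, pickTool, and the get_selected_text tool are hypothetical names chosen for illustration, and the keyword matching is a toy stand-in for how a real agent would select a tool.

```typescript
// Hypothetical in-memory model of the flow; this is NOT the Web MCP API surface.
interface PageTool {
  name: string;
  description: string;
  // Runs in the page, so it can reach page state, cookies, and the backend.
  execute: (args: Record<string, unknown>) => unknown;
}

type ToolSummary = { name: string; description: string };

class PageToolRegistry {
  private byName = new Map<string, PageTool>();

  // The page declares its agent-specific functionality as tools.
  declare(tool: PageTool): void {
    this.byName.set(tool.name, tool);
  }

  // Only the documentation, not the code, is sent to the agent.
  describe(): ToolSummary[] {
    return Array.from(this.byName.values()).map((t) => ({
      name: t.name,
      description: t.description,
    }));
  }

  // The request is routed back to the page and the function is executed there.
  invoke(toolName: string, args: Record<string, unknown>): unknown {
    const tool = this.byName.get(toolName);
    if (!tool) throw new Error(`Unknown tool: ${toolName}`);
    return tool.execute(args);
  }
}

const pageRegistry = new PageToolRegistry();
pageRegistry.declare({
  name: "get_selected_text",
  description: "Return the text the user currently has selected",
  execute: () => ({ text: "example selection" }), // stand-in for local client state
});

// Toy tool selection: pick the first tool whose description shares a word with the query.
function pickTool(query: string, offered: ToolSummary[]): string | undefined {
  const words = query.toLowerCase().split(/\s+/);
  for (const t of offered) {
    const desc = t.description.toLowerCase();
    if (words.some((w) => w.length > 3 && desc.indexOf(w) !== -1)) return t.name;
  }
  return undefined;
}

const chosenTool = pickTool("What is the selected text?", pageRegistry.describe());
// The result goes back to the agent, which would then plan its next action.
const flowResult = chosenTool ? pageRegistry.invoke(chosenTool, {}) : undefined;
```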

Example: Clothing Brand Interaction

An example illustrates Web MCP in practice: a user asks an agent to find dresses of their size suitable for a cocktail wedding, attaching an image of a preferred style.

The clothing brand's website registers a search_products tool. This tool includes a unique name, a description of its purpose, and a JSON schema defining the parameters the agent must pass.
The agent, recognizing the search_products tool, generates code to execute it, bypassing UI actuation.
The browser executes this function on the site.
Advantages over UI Actuation:
- The site can display minimal UI, focusing on relevant information like discount offers or related products.
- The agent receives the result in one step, whereas UI actuation might require multiple steps to handle lazy loading or pagination. This leads to faster agent performance and a better user experience.
The agent receives the JSON result and applies further filtering based on the provided image.
The agent then uses another registered tool, show_products, to display the filtered list within the site's UI, allowing the site to maintain its rich, branded experience.
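
A rough sketch of this example follows. Only the tool names search_products and show_products and the idea of a name, description, and JSON schema come from the talk; the catalog data, the schema contents, the execute functions, and the title-based filter (a placeholder for image-based matching) are assumptions made for illustration.

```typescript
// Illustrative only: the catalog, field names, and schema contents are assumptions.
interface Product {
  id: string;
  title: string;
  size: string;
  occasion: string;
}

const catalog: Product[] = [
  { id: "d1", title: "Satin midi dress", size: "M", occasion: "cocktail" },
  { id: "d2", title: "Linen sundress", size: "M", occasion: "casual" },
  { id: "d3", title: "Sequin mini dress", size: "S", occasion: "cocktail" },
  { id: "d4", title: "Velvet wrap dress", size: "M", occasion: "cocktail" },
];

// Declaration: a unique name, a description, and a JSON Schema for the parameters.
const searchProductsTool = {
  name: "search_products",
  description: "Search the catalog by size and occasion",
  inputSchema: {
    type: "object",
    properties: { size: { type: "string" }, occasion: { type: "string" } },
    required: ["size", "occasion"],
  },
  // Returns every match in one call, with no pagination or lazy loading to step through.
  execute: (args: { size: string; occasion: string }): Product[] =>
    catalog.filter((p) => p.size === args.size && p.occasion === args.occasion),
};

// Stand-in for what the site renders in its own branded UI.
const shownProductIds: string[] = [];

const showProductsTool = {
  name: "show_products",
  description: "Display the given products in the site's UI",
  inputSchema: {
    type: "object",
    properties: { ids: { type: "array", items: { type: "string" } } },
    required: ["ids"],
  },
  execute: (args: { ids: string[] }): number => {
    shownProductIds.length = 0;
    for (const id of args.ids) shownProductIds.push(id);
    return shownProductIds.length;
  },
};

// Agent side: search, filter further, then ask the site to display the result.
const sizeMatches = searchProductsTool.execute({ size: "M", occasion: "cocktail" });
// Placeholder for filtering against the user's image; here it is a simple title check.
const styleMatches = sizeMatches.filter((p) => p.title.indexOf("Satin") !== -1);
const shownCount = showProductsTool.execute({ ids: styleMatches.map((p) => p.id) });
```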

Progressive Enhancement and MCP Alignment

The API is designed as a progressive enhancement. Agents will prioritize using agent-specific tools but can fall back to using the web UI if tools are unavailable. This allows website authors to incrementally add agent-specific paths.
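
The fallback behavior might look like the following sketch, in which an agent checks for a matching tool first and only actuates the UI when none exists. SiteSurface, performTask, and actuateUi are hypothetical names, and the UI path is reduced to a stub that returns a string.

```typescript
// Hypothetical names throughout; the proposal does not prescribe this shape.
type Handler = (args: Record<string, string>) => string;

interface SiteSurface {
  // Agent-specific paths the site has added so far.
  tools: Map<string, Handler>;
  // The slower path: parsing the page and clicking through its UI.
  actuateUi: (goal: string, args: Record<string, string>) => string;
}

// Prefer a declared tool; fall back to UI actuation when the site offers none.
function performTask(
  site: SiteSurface,
  goal: string,
  args: Record<string, string>
): { via: "tool" | "ui"; output: string } {
  const handler = site.tools.get(goal);
  if (handler) return { via: "tool", output: handler(args) };
  return { via: "ui", output: site.actuateUi(goal, args) };
}

// A site that has only added a tool for search so far.
const partialSite: SiteSurface = {
  tools: new Map<string, Handler>([
    ["search_products", (a) => `results for ${a.query}`],
  ]),
  actuateUi: (goal) => `completed ${goal} by clicking through the page`,
};

const viaTool = performTask(partialSite, "search_products", { query: "dresses" });
const viaUi = performTask(partialSite, "checkout", {});
```

Because the fallback lives on the agent side, a site can ship one tool at a time without breaking tasks that still depend on its regular UI.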

Furthermore, the syntax of Web MCP is intentionally aligned with MCP's base tool parameters, such as the name, description, and input schema described above. This means agentic capabilities on the web can be used by any MCP-compatible client with minimal translation, and code can be reused between MCP services and Web MCP implementations.
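
To illustrate what minimal translation could mean, the sketch below maps a page-side tool onto the name, description, and inputSchema fields that MCP uses when listing tools, and answers an MCP-style tools/call payload with the same execute function. The WebTool shape and both helper functions are assumptions, not part of either specification.

```typescript
// Assumption: a page-side tool carries a name, description, input schema, and execute function.
interface WebTool {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
  execute: (args: Record<string, unknown>) => unknown;
}

// The fields an MCP server reports for each tool when a client lists tools.
interface McpToolListing {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
}

// Translation is a field-for-field copy that leaves the executable part behind.
function toMcpListing(tool: WebTool): McpToolListing {
  return { name: tool.name, description: tool.description, inputSchema: tool.inputSchema };
}

// Handle an MCP-style tools/call payload by delegating to the same execute function.
function handleToolsCall(
  tools: WebTool[],
  params: { name: string; arguments: Record<string, unknown> }
) {
  const tool = tools.filter((t) => t.name === params.name)[0];
  if (!tool) throw new Error(`Unknown tool: ${params.name}`);
  const output = tool.execute(params.arguments);
  // MCP tool results carry a list of content items; text is the simplest kind.
  return { content: [{ type: "text", text: JSON.stringify(output) }] };
}

// Usage with a hypothetical tool whose logic could be shared by both surfaces.
const sharedTool: WebTool = {
  name: "search_products",
  description: "Search the catalog",
  inputSchema: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
  execute: (args) => ({ echoed: args.query }),
};

const mcpListing = toMcpListing(sharedTool);
const mcpCallResult = handleToolsCall([sharedTool], {
  name: "search_products",
  arguments: { query: "dresses" },
});
```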

Call for Feedback and Future Development

The proposal is currently in an early incubation phase, and feedback from web developers, agent providers, and browser vendors is crucial. The full explainer is available on the WebML community group repository. A prototype is under development in Chrome, and interested parties can follow the Chrome status entry for WebMCP to be notified when it's ready for a dev trial. The presenter expressed excitement for the innovative experiences that will be built using this technology.