Wasm, WebGPU, & WebNN: How compute abstractions are enabling client-side AI

Key Concepts

Compute Abstractions: A unified term encompassing WebAssembly (Wasm), WebGPU, and WebNN, used to describe technologies that enable running complex computations, particularly AI, within web browsers.
WebAssembly (Wasm): A binary instruction format for a stack-based virtual machine. It's a compilation target for languages like C++, Python, Kotlin, Swift, and Dart, allowing them to run in the browser. Wasm executes on the CPU and supports threading via Web Workers and SIMD for optimized instructions.
WebGPU: A new graphics API that provides access to the GPU, succeeding WebGL. It explicitly supports compute shaders, making it ideal for AI operations requiring high throughput for large, complex computations.
WebNN: An API that provides access to the CPU, GPU, and Neural Processing Unit (NPU) by leveraging OS-provided frameworks (e.g., Windows ML on Windows, Core ML on macOS/iOS, Android NNAPI). It is currently in an early stage of development.
AI Runtimes: Frameworks like MediaPipe, ONNX Runtime, and Transformers.js that let web applications run AI models by targeting compute abstractions.
Bundle Size Cost: A significant consideration when using compute abstractions, as AI models can be large (tens to hundreds of megabytes), impacting download times and user experience, especially for smaller websites.
Agency and Autonomy: A key advantage of compute abstractions, allowing developers full control over model selection, updates, and implementation, unlike built-in AI APIs where these aspects are managed by the browser vendor.
Media Stream Manipulation: A prime use case for client-side AI processing via compute abstractions due to the need for low latency and handling large data volumes in real-time video and audio.

Compute Abstractions: Wasm, WebGPU, and WebNN

This section details the three primary compute abstractions discussed: WebAssembly, WebGPU, and WebNN.

WebAssembly (Wasm)

Definition: Wasm is a binary instruction format that serves as a compilation target for languages such as C++, Python, Kotlin, Swift, and Dart; it is not itself a source language.
Key Advantage: Beyond performance, Wasm offers broad language support and compatibility. Existing codebases, such as C++ implementations (e.g., llama.cpp) or Python SDKs (e.g., OpenAI SDK), can be compiled and run in the browser. Developers can supply JavaScript functions as imports, which lets a Wasm module call back into the surrounding page.
Execution Model: Fundamentally CPU-based execution.
Performance Features: Maximizes performance through access to threading primitives via Web Workers with shared memory and SIMD (Single Instruction, Multiple Data) for optimized computational instructions.
CPU vs. GPU Rationale: While AI is often associated with GPUs, CPUs offer a more reliable and stable target across diverse hardware. GPUs can vary significantly in performance from high-end gaming PCs to low-power mobile devices. For instance, some applications might perform AI operations on the CPU if the GPU is heavily utilized for other tasks.
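
As a minimal sketch of the JavaScript import hook described above, the following TypeScript instantiates a tiny hand-assembled Wasm module and passes it a JavaScript function as an import. The module bytes, the `env.log` import name, and the `instantiateAdd` helper are illustrative assumptions rather than part of any framework; a real project would load a compiler-generated `.wasm` file instead.

```typescript
// Hand-assembled Wasm module (illustrative only). It imports `env.log`
// (i32 -> void) and exports `add` (i32, i32 -> i32), which reports the sum
// through `log` before returning it.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // "\0asm", version 1
  0x01, 0x0b, 0x02, 0x60, 0x01, 0x7f, 0x00,             // types: (i32) -> ()
  0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,                   //        (i32, i32) -> i32
  0x02, 0x0b, 0x01, 0x03, 0x65, 0x6e, 0x76,             // import from "env"
  0x03, 0x6c, 0x6f, 0x67, 0x00, 0x00,                   //   "log", function, type 0
  0x03, 0x02, 0x01, 0x01,                               // one local function, type 1
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x01, // export "add" = function 1
  0x0a, 0x10, 0x01, 0x0e, 0x00,                         // code: one body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x10, 0x00,             //   call log(a + b)
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   //   return a + b
]);

type AddFn = (a: number, b: number) => number;

// Compile and instantiate the module synchronously (acceptable for a module
// this small), wiring the caller's JavaScript callback in as `env.log`.
function instantiateAdd(log: (value: number) => void): AddFn {
  const wasmModule = new WebAssembly.Module(wasmBytes);
  const instance = new WebAssembly.Instance(wasmModule, { env: { log } });
  return instance.exports.add as AddFn;
}
```

Calling `instantiateAdd(callback)(2, 3)` returns 5 and invokes `callback` once with 5. For modules of realistic size, the asynchronous `WebAssembly.instantiateStreaming` is usually the better choice because it compiles while the file downloads.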

WebGPU

Definition: The successor to WebGL, WebGPU is a new graphics API that mediates access to the GPU.
Key Feature: Explicit support for compute shaders, which is highly beneficial for running AI operations.
Performance: Generally offers the highest throughput, making it ideal for very large and complex AI operations.
Availability: Expected to be generally available across all browsers (except Firefox mobile) by November.

WebNN

Definition: The latest compute abstraction, offering access to the CPU, GPU, and NPU.
Stage: Currently in a very early stage, with an anticipated origin trial (beta) around Q4 of this year and into Q1 of next year.
Mechanism: Leverages OS-provided frameworks. For example, on Windows, it utilizes the Windows ML runtime, which then determines the optimal CPU, GPU, or NPU to use based on OS knowledge.
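
A page can check which of the three abstractions a browser exposes before committing to one. The sketch below assumes the standard entry points `navigator.gpu` (WebGPU) and `navigator.ml` (WebNN); the `NavigatorLike` type, the `detectBackends` helper, and the returned ordering are hypothetical choices made for illustration.

```typescript
type Backend = "webgpu" | "webnn" | "wasm";

// Minimal shape of the browser object this sketch inspects.
interface NavigatorLike {
  gpu?: unknown; // WebGPU entry point (navigator.gpu)
  ml?: unknown;  // WebNN entry point (navigator.ml)
}

// Returns the abstractions that appear to be exposed, in an assumed order:
// WebGPU first (highest throughput), then WebNN, then Wasm as the broad
// CPU fallback. Presence of an entry point is only a hint: for example,
// navigator.gpu.requestAdapter() can still resolve to null on a given device.
function detectBackends(nav: NavigatorLike, hasWasm: boolean): Backend[] {
  const found: Backend[] = [];
  if (nav.gpu !== undefined) found.push("webgpu");
  if (nav.ml !== undefined) found.push("webnn");
  if (hasWasm) found.push("wasm");
  return found;
}
```

In a browser, the first argument would be the page's `navigator` object (cast as needed for the type checker) and the second would be a check such as `typeof WebAssembly === "object"`.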

How to Use Compute Abstractions for AI

This section explains the practical application of compute abstractions in AI development.

Developer Workflow:
1. Start with the Model: Developers begin with their AI model.
2. Select a Web-Supported AI Runtime: Choose a framework like MediaPipe, ONNX Runtime, or Transformers.js.
3. Target Compute Abstractions: These frameworks then target one or more compute abstractions (Wasm, WebGPU, WebNN), optionally including JavaScript.
Browser Agnosticism: Compute abstractions are browser-agnostic, meaning they function consistently across different browsers. The framework and model do not need to know the specific browser they are running on.
Browser-Specific Implementation: The browser then handles the execution:
- Chrome Example: The V8 engine executes JavaScript and Wasm. The Dawn implementation executes WebGPU. WebNN ties into OS-specific runtimes like Windows ML, Core ML, or Android NNAPI.
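LiteRT.js Example: This framework plays two roles: it is a web-facing developer framework, and it also serves as a native runtime used to execute WebNN operations. The same technology can therefore sit both above and below the compute abstraction, a "sandwich situation" that adds complexity.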
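
The workflow above (model, then runtime, then compute abstraction) usually surfaces in code as a runtime option listing the preferred targets. The following is a sketch under the assumption that the runtime accepts an ordered list of execution providers named `webgpu`, `webnn`, and `wasm`, as ONNX Runtime Web does; `buildSessionOptions` and `SessionOptionsSketch` are hypothetical names, and the preference order is an illustrative choice.

```typescript
// Execution provider names, assumed to match those used by ONNX Runtime Web.
type ExecutionProvider = "webgpu" | "webnn" | "wasm";

interface SessionOptionsSketch {
  executionProviders: ExecutionProvider[];
}

// Keeps only the providers the current browser supports, in an assumed
// preference order, so the runtime can fall back from left to right.
function buildSessionOptions(
  available: ReadonlySet<ExecutionProvider>,
): SessionOptionsSketch {
  const preferred: ExecutionProvider[] = ["webgpu", "webnn", "wasm"];
  const chosen = preferred.filter((ep) => available.has(ep));
  if (chosen.length === 0) {
    throw new Error("no supported compute abstraction found");
  }
  return { executionProviders: chosen };
}
```

With ONNX Runtime Web, an object of this shape is what would be passed as the options argument of `InferenceSession.create(modelUrl, options)`; that wiring should be checked against the runtime's own documentation before relying on it.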

Compute Abstractions vs. Built-in AI APIs

This section contrasts the use of compute abstractions with built-in AI APIs, outlining when to use each.

Reasons NOT to Use Compute Abstractions

Bundle Size Cost: AI models can be very large (tens to hundreds of megabytes). That is prohibitive for sites such as blogs or e-commerce pages, whose total page weight is typically budgeted at hundreds of kilobytes to a few megabytes. Larger studio-style applications (e.g., Figma, Photoshop) can tolerate bigger downloads and rely on caching.
Not Ready-to-Use Out-of-the-Box: Compute abstractions are low-level building blocks requiring developers to construct higher-level functionalities. Frameworks help mitigate this complexity.
No Automatic Updates: Developers are responsible for updating models and AI systems when using compute abstractions, unlike built-in APIs where the browser vendor handles updates.
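
To make the bundle size cost concrete, a rough transfer-time estimate helps. The sketch below is plain arithmetic; the `downloadSeconds` helper and the 20 Mbit/s link speed used in the example are assumptions for illustration, and the estimate ignores latency, compression, and caching.

```typescript
// Estimated transfer time in seconds for a payload of the given size.
// 1 megabyte = 8 megabits, so seconds = (MB * 8) / (Mbit/s).
function downloadSeconds(sizeMegabytes: number, linkMegabitsPerSecond: number): number {
  if (sizeMegabytes < 0 || linkMegabitsPerSecond <= 0) {
    throw new RangeError("size must be >= 0 and link speed must be > 0");
  }
  return (sizeMegabytes * 8) / linkMegabitsPerSecond;
}
```

On an assumed 20 Mbit/s connection, `downloadSeconds(100, 20)` gives 40 seconds for a 100 MB model, whereas `downloadSeconds(0.5, 20)` gives about 0.2 seconds for a 500 KB page, which illustrates why a model download can dominate a small site's load time.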

Reasons TO Use Compute Abstractions

Agency and Autonomy: Developers have complete control over model selection, implementation, and update schedules, avoiding dependency on browser vendors for model choices.
Immediate Cross-Browser Support: Wasm and WebGPU are shipping across major browsers, offering immediate compatibility.
Differentiation on Model Choice and Quality: If the AI model is a key differentiator for an application, compute abstractions allow developers to implement their specific models, unlike built-in APIs where everyone uses the same model.
Enabling Niche Models: Allows for the deployment of small, task-specific models that are resource-efficient. An example is a messaging app using a simple text-based model for spam detection on calls, avoiding the high memory and battery consumption of a large LLM.

Use Cases for Compute Abstractions

This section explores various applications where compute abstractions excel.

Media Stream Manipulation:
- Benefit: Client-side processing offers significant latency wins and cost savings compared to server-side operations, especially for real-time video and audio with high data volumes.
Speech to Text (e.g., Whisper):
- Suitability: Ready for shipping today with reasonable-sized models available. Works well on WebGPU.
Image Recognition, Classification, Optical Character Recognition (OCR):
- Suitability: Well-supported and strong areas, integrable into applications today.
- Performance Example: SIMD operations significantly improve Wasm performance compared to older JavaScript implementations.
Photo and Video Editing:
- Examples: Photoshop's smart object selection and recolorization operations can be handled client-side when hardware permits.
- Suitability: Works well and is readily available.
Text Classification:
- Suitability: Works well with small models and is ready for implementation.
Photo and Video Generation:
- Considerations: Models can be larger, requiring more resources, making it less suitable for e-commerce or blogs. However, it's possible to generate images on the fly.
- Real-world Application: A partner doing image manipulation for a wedding needed to remove specific objects from a thousand photos and found it more scalable to do so client-side, where the token limitations of a hosted service do not apply.
Text Manipulation (Summarization, Q&A, LLM Functionality):
- Considerations: Models can become very large, requiring careful consideration of use case feasibility.
Coding:
- Challenges: Considered a complex task with large context windows where quality is paramount. It is expected to take time before this can be reliably run on compute abstractions.
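
For the media stream manipulation use case listed above, the low-latency requirement can be expressed as a per-frame time budget. The following sketch is simple arithmetic; `frameBudgetMs`, `fitsRealTime`, and the frame rates in the example are hypothetical and chosen only for illustration.

```typescript
// Time available to process one frame, in milliseconds, at a given frame rate.
function frameBudgetMs(framesPerSecond: number): number {
  if (framesPerSecond <= 0) throw new RangeError("frame rate must be > 0");
  return 1000 / framesPerSecond;
}

// True when inference plus any other per-frame work fits inside the budget.
function fitsRealTime(
  inferenceMs: number,
  framesPerSecond: number,
  otherWorkMs: number = 0,
): boolean {
  return inferenceMs + otherWorkMs <= frameBudgetMs(framesPerSecond);
}
```

At 30 frames per second the budget is roughly 33.3 ms per frame, so a model that needs 20 ms per frame fits, while one that needs 40 ms would have to skip frames or run at a lower rate. A server round trip would have to fit inside the same budget, which is one reason client-side processing is attractive here.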

Conclusion

Compute abstractions, encompassing WebAssembly, WebGPU, and WebNN, offer powerful capabilities for running AI and complex computations directly within web browsers. While they present challenges like bundle size and the need for manual updates, they provide unparalleled agency, autonomy, and the potential for differentiation through custom model implementation. For use cases demanding low latency, high throughput, or specific model performance, compute abstractions are a compelling choice, enabling advanced client-side AI functionalities across a wide range of applications.