Web AI leaps forward on Intel AI PCs
By Chrome for Developers
Key Concepts
- WebNN: An emerging web API for general AI inference on web browsers, designed to run on various execution engines (CPU, GPU, NPU).
- WebGPU: A web API for high-performance graphics and general-purpose computation on the GPU.
- Intel Lunar Lake: Intel's flagship AI PC product with CPU, GPU, and NPU execution engines.
- Intel Panther Lake: Intel's next-generation client platform, built on Intel 18A technology, featuring enhanced CPU, GPU, and NPU capabilities.
- W3C Web Machine Learning (WebML) Working Group: A group defining web standards for machine learning.
- Execution Provider (EP): A software component that allows AI models to run on specific hardware.
- OpenVINO: An Intel toolkit for optimizing and deploying AI inference.
- Windows ML: Microsoft's framework for running machine learning models on Windows devices.
- XMX (Intel® Xe Matrix Extensions): Intel's matrix extensions for AI acceleration.
- NPU (Neural Processing Unit): A specialized processor designed for AI workloads.
- GPU (Graphics Processing Unit): A processor designed for parallel processing, often used for graphics and AI.
- CPU (Central Processing Unit): The primary processor of a computer.
- Tera Operations Per Second (TOPS): A measure of computing performance for AI workloads.
WebNN Advancements
Performance Doubled on GPU, Increased Coverage on NPU
The Intel Web Platform Engineering team has made significant strides in WebNN, an emerging web API defined at the W3C. Key advancements include:
- Doubled Performance on GPU: This improvement is attributed to Microsoft's new Windows ML architecture, which supports execution providers from independent hardware vendors. Intel's implementation uses OpenVINO, an execution provider highly optimized for CPU, GPU, and NPU. Previously, execution took separate paths (XNNPACK via TFLite for CPU, and DirectML for GPU/NPU); now a single path through Windows ML and OpenVINO delivers major improvements across all engines.
- Major Increase in NPU Coverage: WebNN now has significantly expanded capabilities on NPUs.
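In application code, the engine choice surfaces as a WebNN context option. A minimal sketch, assuming the `navigator.ml.createContext()` entry point with a `deviceType` hint as in earlier spec drafts (the option has been in flux in the working group, and implementations may fall back to a different engine than the one requested):

```javascript
// Hedged sketch: requesting a WebNN context for a specific engine.
// `deviceType` ('cpu' | 'gpu' | 'npu') is a hint from earlier spec
// drafts; the browser may fall back to another engine.
async function createWebNNContext(deviceType = 'gpu') {
  if (typeof navigator === 'undefined' || !('ml' in navigator)) {
    throw new Error('WebNN is not available in this environment');
  }
  return navigator.ml.createContext({ deviceType });
}
```

The same graph-building and dispatch code then runs unchanged on whichever engine the context resolves to.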
Performance Demonstrations
Several demos illustrate the performance gains:
- General Inference: Tasks that previously took 30-38 milliseconds are now completed in 12-13 milliseconds, representing more than a twofold speedup.
- Whisper-based Speech Recognition: Performance has increased from 43 tokens per second last year to 100 tokens per second on GPU and 98 tokens per second on NPU.
- Stable Diffusion (Image Generation): Generation time has been reduced from approximately 900 milliseconds to 400 milliseconds on GPU and 600 milliseconds on NPU. This is significantly faster than cloud-based solutions like ChatGPT.
- Depth Estimation: Demos show comparable performance on both GPU and NPU, with task manager visualizations confirming the respective hardware being utilized.
- Background Removal: This transformer.js demo showcases quick background removal on both GPU and NPU, highlighting the versatility of the same API across different engines.
- Object Detection: The API achieves 30 frames per second on GPU and about 20 frames per second on NPU for object detection tasks.
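Numbers like these are straightforward to reproduce in your own demos. A small timing helper in plain JavaScript (no WebNN-specific APIs assumed):

```javascript
// Helpers for reporting the metrics quoted above: speedup ratios
// and tokens per second. performance.now() exists in both browsers
// and Node.
function speedup(beforeMs, afterMs) {
  return beforeMs / afterMs;
}

function tokensPerSecond(tokenCount, elapsedMs) {
  return (tokenCount * 1000) / elapsedMs;
}

// Time a single async inference call.
async function timeInference(run) {
  const start = performance.now();
  const result = await run();
  return { result, elapsedMs: performance.now() - start };
}

// For example, 30 ms down to 12 ms is a 2.5x speedup.
```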
Expanded Platform Support
- Windows 10 and Linux: WebNN is now enabled on GPU for Windows 10 (which lacks Windows ML) and Linux. Previously, these platforms were limited to CPU execution via XNNPACK. Thanks to optimized kernels from MLCommons, GPU execution is now possible there as well.
Ongoing Optimizations
- Buffer Reuse: Significant performance improvements are being made in buffer reuse between CPU, GPU, and NPU. This has already resulted in an 18% performance improvement in CPU-GPU communication and a 50% improvement in KV cache. These optimizations are expected to facilitate WebNN and WebGPU interoperability, allowing applications to use both APIs simultaneously with minimized communication overhead.
WebGPU Progress
Three major areas of progress have been made in WebGPU:
- Enhanced Memory Bandwidth Utilization: Improved how memory bandwidth is used.
- Improved Thread Occupancy with SIMD: Better utilization of threads with Single Instruction, Multiple Data (SIMD) operations.
- Enabling Intel Xe Matrix Extensions (XMX): This is a key development for AI acceleration.
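Subgroup operations are the WebGPU surface for the SIMD work mentioned above. A hedged sketch: the `subgroups` feature is optional, so feature-detect before requesting it, and the shader below is illustrative rather than Intel's actual kernel.

```javascript
// Hedged sketch: a WGSL compute shader using the optional WebGPU
// 'subgroups' feature to reduce values with one SIMD operation.
const reduceShader = `
  enable subgroups;

  @group(0) @binding(0) var<storage, read_write> data: array<f32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid: vec3u) {
    // Sum across the subgroup instead of a shared-memory loop.
    data[gid.x] = subgroupAdd(data[gid.x]);
  }
`;

// Request a device with subgroups enabled; null if unsupported.
async function requestSubgroupDevice() {
  if (typeof navigator === 'undefined' || !navigator.gpu) return null;
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter || !adapter.features.has('subgroups')) return null;
  return adapter.requestDevice({ requiredFeatures: ['subgroups'] });
}
```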
XMX Integration and Performance Gains
- Matrix Multiply Improvement: XMX integration has led to a nearly 2x performance improvement in matrix multiply operations, a fundamental primitive for AI.
- Demo Results: A demo showed a significant improvement in tokens per second, from 15 to 20, due to these optimizations.
- Vulkan Path: XMX has been enabled on the Vulkan path, providing a 1.8x speedup for WebGPU.
- D3D12 Path (Work in Progress): Efforts are underway with Microsoft to enable XMX through D3D12, which will bring these benefits to Chrome on Windows.
- Built-in AI: Existing built-in AI features can already leverage these improvements.
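As a concrete example, Chrome's built-in Summarizer API is one such feature. A hedged sketch following the published explainer; feature-detect first, since availability varies by browser channel and hardware:

```javascript
// Hedged sketch of Chrome's built-in Summarizer API, one of the
// built-in AI features that benefit from these GPU optimizations.
// Returns null where the API is not exposed (e.g. outside Chrome).
async function summarizeText(text) {
  if (typeof Summarizer === 'undefined') return null;
  if ((await Summarizer.availability()) === 'unavailable') return null;
  const summarizer = await Summarizer.create({ type: 'tldr', length: 'short' });
  return summarizer.summarize(text);
}
```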
Intel Panther Lake: Next-Generation Client Platform
Intel's next-generation client platform, code-named Panther Lake, is a significant leap forward:
- Intel 18A Technology: It is the first client platform built on Intel 18A technology.
- Performance Enhancements:
- 50% faster CPU
- 50% faster GPU
- Enhanced power-efficient NPU
- AI Performance: Delivers 180 tera operations per second (TOPS), a substantial increase from Lunar Lake's up to 120 TOPS.
- Xe3 GPU: Introduces the Xe3 GPU for scaled performance without compromising power efficiency.
- Supports up to 12 Xe cores.
- Each Xe core has 8 XMX engines.
- XMX engines are the primary integrated AI acceleration engines, capable of:
- Up to 1024 32-bit operations per clock cycle.
- Up to 2048 16-bit operations per clock cycle.
- Up to 4096 8-bit operations per clock cycle.
- This represents "supercomputer on an integrated GPU" level performance.
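Peak TOPS figures of this kind are derived by multiplying per-clock throughput by unit count and clock speed. A back-of-the-envelope helper; note that how the per-clock figures above group across cores and engines, and the clock speed used, are assumptions for illustration, not numbers stated in the talk:

```javascript
// Generic peak-throughput arithmetic:
//   TOPS = ops_per_clock_per_unit * units * clock_Hz / 1e12
function peakTops(opsPerClockPerUnit, unitCount, clockGHz) {
  return (opsPerClockPerUnit * unitCount * clockGHz * 1e9) / 1e12;
}

// E.g., reading 4096 8-bit ops/clock as a per-core figure with 12
// cores at a hypothetical 1 GHz gives ~49 TOPS from the GPU alone.
const example = peakTops(4096, 12, 1.0); // ≈ 49.2
```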
- NPU 5: A newly designed NPU 5 delivers high performance with a small footprint, optimized for area and power efficiency. It offers approximately 40% better performance per area compared to Lunar Lake.
- Release: Expected to be released in January.
- Web AI Focus: Panther Lake is poised to significantly enhance Web AI experiences.
W3C WebML Working Group Progress
The W3C WebML working group, chaired by Anssi Kostiainen, has seen substantial growth and development:
- Participation Growth: A 30% increase in participating companies and organizations.
- New Community Groups: Development of new community groups for agentic web experiences and WebMCP.
- Built-in AI: A dedicated group for built-in AI with around 200 participants.
- Key Contributors: Notable newcomers and active participants include Hugging Face (represented by Joshua Lochner), Qualcomm, ARM, Nvidia, and others.
- Goal: The group is focused on defining ubiquitous APIs that will run across all browsers and hardware platforms.
Conclusion and Call to Action
WebNN is now generally available on Windows ML, allowing users to download and experience these advancements. The speaker expresses excitement about the future of WebNN and encourages participation in shaping its development. The presentation concludes with an invitation for Joshua Lochner to speak, acknowledging his significant contributions.