Real-time video upscaling using CNNs and WebGPU

Key Concepts

WebGPU: A modern JavaScript API for accessing the GPU for general-purpose computing and high-performance graphics.
Compute Shaders: Programs that run on the GPU for general-purpose computation, distinct from traditional graphics shaders.
Super Resolution: A technique to upscale images or video in real-time, often using machine learning models.
Machine Learning (ML) Models: Algorithms that learn from data, used here for super resolution.
Convolutional Neural Networks (CNNs): A type of deep learning model commonly used for image processing tasks.
Frame Transformer: A component that processes video frames, running neural network models.
GPU Texture: A data structure on the GPU used to store image data.
Blitting: Copying pixel data from one texture or surface to another.
Work Group: A collection of threads that execute together on the GPU.
Dispatch Group: A collection of work groups, used to organize GPU computation.
Quantization: Reducing the precision of numbers (e.g., from float32 to float16 or int8) to reduce memory usage and potentially increase speed.
Loop Unrolling: A compiler optimization technique that replicates the body of a loop to reduce loop overhead.
WGSL (WebGPU Shading Language): The shading language used by WebGPU.
HLSL (High-Level Shading Language) / MSL (Metal Shading Language): Shading languages used by DirectX and Metal, respectively, which WebGPU can transpile to.
Request Video Frame Callback: A Chrome API that allows access to video frames from a video element.
MSE (Media Source Extensions): A JavaScript API for creating media sources for HTML5 media elements.
Web Codec: A newer API for decoding and encoding media.
YUV/RGB Color Spaces: Different ways of representing color information in images. YUV is often used in video compression, while RGB is common for displays.
Luminance Plane: The brightness component of a YUV color space.
Multiply-Accumulate (MAC): A fundamental arithmetic operation in neural networks.

Real-time Super Resolution with WebGPU on User Devices

This presentation details how Amazon IVS (Twitch's video platform) is leveraging WebGPU to implement real-time super resolution capabilities directly within user devices. The primary goal is to trade compute power for bandwidth, enabling the generation of higher-resolution video renditions on the fly, thereby reducing data consumption for users and potentially costs for high-volume customers.

Challenges in Video Streaming

Streaming video, especially live streams, presents several challenges:

Variable Network Conditions: Users may be on mobile networks or in developing countries with limited bandwidth.
Metered Connections: Users often have daily or monthly data allotments and need to control data usage.
Data Intensity: High-scale video streaming is inherently data-intensive.

The increasing power of GPUs and the emergence of NPUs (Neural Processing Units) on user devices open up possibilities for on-device processing.

Super Resolution Explained

Super resolution is a set of technologies that upscale images, video, or games in real-time. It can be:

Interpolative: Estimating missing pixels based on existing ones.
Generative: Creating new pixels using ML models, such as GANs (Generative Adversarial Networks).

Examples include Nvidia DLSS and AMD FSR in gaming, where games are rendered at a lower resolution and then upscaled using ML models. Twitch's approach aims for a similar outcome but within a web browser context, not integrated into a game engine.

Requirements and High-Level Overview

Twitch streams user-generated content, which often has high frame rates (e.g., 60 FPS) and resolutions (e.g., 1080p or higher), translating to approximately 125 million pixels per second. The super resolution solution must be performant enough not to overload the user's GPU.

The proposed workflow involves:

Decoding: Capturing the video frame as it's decoded.
Frame Transformation: Passing the frame to a component that runs a neural network for processing.
GPU Texture Output: The neural network outputs a GPU texture.
Blitting to Canvas: This upscaled texture is then blitted to the browser's canvas for display.
Performance Monitoring: The system continuously monitors performance. If deadlines are missed or errors occur, it falls back to a simple pass-through mode to ensure a good user experience.

Accessing Video Frames for GPU Processing

Getting video frames from the decoder into the GPU context is crucial. Two primary methods are discussed:

Web Codec: The newest and most robust API, guaranteeing access to every frame. However, migrating Twitch's existing player, which uses MSE (Media Source Extensions), to Web Codec would be a significant undertaking.
requestVideoFrameCallback (Chrome 116+): A newer API that allows capturing a VideoFrame object from a video element and passing it to WebGPU. While it doesn't guarantee every single frame, it has proven to be sufficiently close for practical use.

Building Neural Networks with WebGPU Compute Shaders

WebGPU enables general-purpose computing on the GPU through compute shaders. These are small programs that run many times in parallel.

A simple compute shader example demonstrates copying a pixel from an input texture to an output texture. For a 1080p image, this operation would run approximately 2 million times.

GPU Execution Model:

Work Group: A collection of threads (e.g., 256 threads) that execute together. Each invocation runs on a single thread.
Dispatch Group: Composed of any number of work groups, organized in a 3D volume. Multiple work groups can run concurrently.

The Super Resolution Neural Network Architecture

The implemented neural network is a simple Convolutional Neural Network (CNN) with six convolutional layers, separated by Leaky ReLU activations. Each layer is relatively small, requiring about 3,000 multiply-accumulates (MACs) per pixel.

The network structure includes:

RGB to YUV Conversion: Performed at the beginning to process only the luminance plane, reducing computational load.
Convolutional Layers: Standard depthwise 2D convolutions for feature extraction and upscaling.
YUV to RGB Conversion: Performed at the end to convert the processed luminance back to RGB for display.

The network is broken down into three compute shaders for implementation:

First Compute Shader: Handles bilinear upscaling, YUV conversion, and the first convolutional layer.
Second Compute Shader: Run multiple times to process the intermediate convolutional layers.
Final Exit Shader: Performs the last convolution and the YUV to RGB conversion.

WebGPU Integration and Data Handling

The process of integrating the neural network with the browser involves:

GPU.importInternalTexture: Used to convert a video frame into a GPU texture. On most devices, this is a zero-copy operation as the video frame is already on the GPU.
Automatic YUV to RGB Conversion: Chrome automatically applies a YUV to RGB shader when accessing the texture if the video frame is in YUV format.
Model Weights: Weights are passed as binary data, typically a Base64 blob, with a JSON sidecar file providing indices to locate specific layer weights within the blob.
Output: The process yields an upscaled GPU texture (e.g., 720p to 1080p).
Blitting to Canvas: The upscaled texture is blitted to a canvas. The canvas is instantiated to match the texture size to avoid multiple scaling operations and allow the browser to handle the final viewport scaling.

Performance Benchmarks

Performance testing shows:

The network runs in approximately 8-9 milliseconds when executed within the browser.
When compiled from PyTorch to ONNX and profiled with Xcode, the same network runs in about 7.2 milliseconds.
An NPU version of the model runs in approximately 5 milliseconds, suggesting WebN (Web Neural Network API) could be a future option for Macs.

Lessons Learned and Optimization Strategies

Several key lessons were learned during development:

Quantization:
- Float16 Quantization: Highly recommended. It was easy to implement (simple truncation) and significantly reduced memory requirements without accuracy loss. It also leverages hardware optimizations like FP16 MACs on Nvidia GPUs.
- Int8 Quantization: An option, especially with hardware support for operations like DP4A (4 int8 operations per FP32 cost). Wider support for cooperative matrices will make this more attractive.
Loop Unrolling: Crucial for performance in hot inner loops. WebGPU doesn't provide explicit hints for unrolling, so manual unrolling in the shader code is necessary. This is particularly important for Metal, as its compiler is less effective at automatic loop unrolling compared to DirectX. Explicit unrolling is always better.
Meta-programming: The need for meta-programming arises to manage explicit loop unrolling and other optimizations.
Shader Debugging: Chrome offers a command-line switch to dump the generated WGSL, MSL, or HLSL code, which is invaluable for debugging and understanding the transpilation process.

Conclusion

By utilizing WebGPU and compute shaders, Twitch has successfully implemented real-time super resolution on user devices. This approach allows for bandwidth savings and improved user experience by generating higher-quality video renditions dynamically. Key optimizations include float16 quantization and explicit loop unrolling, with ongoing exploration into future APIs like WebN. The ability to access video frames via requestVideoFrameCallback and efficiently process them on the GPU is central to this innovation.