Serving JAX Models with vLLM & SGLang

Key Concepts

JAX: A Python library for high-performance numerical computation and machine learning research.
Enterprise Machine Learning Workflows: Existing systems and processes used by organizations to develop, deploy, and manage machine learning models.
Open-Source Serving Frameworks: Tools like vLLM and SGLang that facilitate the deployment and serving of machine learning models.
PyTorch: A popular deep learning framework, often the basis of existing enterprise infrastructure.
safetensors: A file format for storing tensor weights, designed for safety and efficiency.
Flax: A neural network library built on JAX, often used for implementing models.
PyTree: A JAX concept that treats nested Python containers (like dictionaries and lists) as a single tree whose leaves are arrays.
NNX (Neural Network eXtensions): A newer API in Flax that makes modules native JAX PyTrees, simplifying state management.
vLLM: An open-source serving framework that can load and run large language models.
SGLang: Another open-source serving framework for language models.
NCHW vs. NHWC: Two dimension orderings for image tensors. NCHW (Batch, Channels, Height, Width) is common in PyTorch, while NHWC (Batch, Height, Width, Channels) is common in JAX/Flax.
Weight Transposition/Permutation: Adjusting the order of dimensions in weight tensors to match the expectations of different frameworks or layer implementations.
config.json: A configuration file that defines a model's architecture, size, and other parameters.
tokenizer.json: A file that defines how to convert text into numerical tokens for a model.

Integrating JAX Models into Enterprise Workflows with vLLM and SGLang

This discussion focuses on how to integrate JAX models into existing enterprise machine learning workflows by using popular open-source serving frameworks such as vLLM and SGLang. The goal is to let organizations keep their current infrastructure and adopt JAX gradually, lowering the barrier to entry.

The General Process for Serving JAX Models

The core process involves several key steps:

Load Model Weights into JAX: This typically involves loading pre-trained weights, often from sources like Hugging Face, into an equivalent JAX implementation.
Convert JAX Weights for Serving: The crucial step is converting the JAX weights into the format that serving frameworks like vLLM and SGLang expect, which usually means a flat dictionary of tensors stored in safetensors files.
Load Converted Model into Server: The converted model is then loaded into the chosen serving framework.
Generate Outputs: The server is used to generate outputs from the loaded model.

Technical Detail: Some layer types may require weights to be transposed or permuted during the conversion process to align with the expected format.

Weight Conversion: Handling Shape Differences

A significant aspect of weight conversion is addressing differences in layer definitions and data formats between frameworks like PyTorch and JAX/Flax.

Python Function load_safe_tensors: This function is used in both the vLLM and SGLang examples to load model weights. It iterates through the safetensors files in a specified directory. The safe_open context manager from the safetensors library, called with framework='flax', returns the tensors as JAX arrays.
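A minimal sketch of such a loader, assuming all weights sit in one or more *.safetensors files in a single directory. The helper list_safetensors_files and the lazy import are illustrative choices rather than part of the original examples; safe_open, keys() and get_tensor() are the safetensors calls involved.

```python
from pathlib import Path


def list_safetensors_files(model_dir):
    """Return the *.safetensors files in model_dir, sorted by name."""
    return sorted(Path(model_dir).glob("*.safetensors"))


def load_safe_tensors(model_dir):
    """Load every tensor from every file into one flat dict of JAX arrays."""
    # Imported lazily so the listing helper works without safetensors installed.
    from safetensors import safe_open

    tensors = {}
    for path in list_safetensors_files(model_dir):
        # framework="flax" makes get_tensor() return JAX arrays
        # (this requires jax and flax to be installed).
        with safe_open(str(path), framework="flax") as f:
            for key in f.keys():
                tensors[key] = f.get_tensor(key)
    return tensors
```

The returned dictionary is keyed by the flat tensor names stored in the files, so mapping those names onto the JAX model's parameter tree is a separate step.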
Weight Alterations by Layer Type:
- Linear/Fully Connected Layers: Only require a transpose of the weights.
- Convolutional Layers: Require a more complex permutation of weights.
- Batch Normalization Layers: Typically require no changes to the weights.
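The per-layer changes above can be sketched with NumPy (jax.numpy offers the same transpose call). The sketch assumes the usual layouts: a PyTorch Linear weight is (out_features, in_features) and a Flax Dense kernel is (in_features, out_features); a PyTorch Conv2d weight is (out, in, kH, kW) and a Flax Conv kernel is (kH, kW, in, out). The function names are hypothetical.

```python
import numpy as np


def linear_torch_to_flax(weight):
    """(out_features, in_features) -> (in_features, out_features)."""
    return np.transpose(weight, (1, 0))


def conv_torch_to_flax(weight):
    """(out, in, kH, kW) -> (kH, kW, in, out)."""
    return np.transpose(weight, (2, 3, 1, 0))


def conv_flax_to_torch(kernel):
    """(kH, kW, in, out) -> (out, in, kH, kW), the inverse permutation."""
    return np.transpose(kernel, (3, 2, 0, 1))


# Toy shapes chosen only for illustration.
torch_linear = np.zeros((8, 4))       # 4 inputs, 8 outputs
torch_conv = np.zeros((16, 3, 5, 7))  # 3 -> 16 channels, 5x7 kernel

print(linear_torch_to_flax(torch_linear).shape)  # (4, 8)
print(conv_torch_to_flax(torch_conv).shape)      # (5, 7, 3, 16)
```

Going from JAX back to a PyTorch-style checkpoint for serving applies the inverse permutations: the linear transpose is its own inverse, and conv_flax_to_torch undoes conv_torch_to_flax.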
Handling Convolution-to-Linear Transitions (e.g., ResNet, VGG):
- PyTorch: Activations after convolutions are NCHW. They are reshaped to (N, C * H * W) before being fed to a fully connected layer.
- JAX/Flax: Activations after convolutions are NHWC. When the fully connected weights come from PyTorch, these activations must first be transposed to NCHW so that the flattened features line up with the order those weights expect.
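A small NumPy sketch of why the transpose matters, under the assumption that the fully connected weights were trained on PyTorch's flattening order. The function name flatten_for_torch_linear is hypothetical.

```python
import numpy as np


def flatten_for_torch_linear(x_nhwc):
    """Flatten NHWC activations in the order a PyTorch-trained linear layer expects."""
    n = x_nhwc.shape[0]
    x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))  # NHWC -> NCHW
    return x_nchw.reshape(n, -1)                 # (N, C * H * W)


# Build the same activations in both layouts.
x_nchw = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

reference = x_nchw.reshape(2, -1)    # what PyTorch would feed the linear layer
correct = flatten_for_torch_linear(x_nhwc)
naive = x_nhwc.reshape(2, -1)        # flattening NHWC directly

print(np.array_equal(correct, reference))  # True
print(np.array_equal(naive, reference))    # False
```

The naive flatten has the right shape but the wrong feature order, so the model would run without errors and still produce wrong outputs.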

Preparing JAX Models for Serving

Once pre-trained weights are loaded and potentially fine-tuned (e.g., for a Llama 3.2 model), the JAX model needs to be prepared for serving.

Flattening Weight Dictionaries: Servers like vLLM expect a flat dictionary of weights. The flatten_weight_dict function performs this transformation.
- Methodology: This is essentially a PyTree traversal. It walks the nested dictionary structure of the JAX model's state.
- NNX Simplification: With the latest NNX API, this process is more direct because the entire NNX module is a native JAX PyTree, so its contained state can be flattened directly.
Saving Processed Weights: The save_file function from safetensors.flax saves the processed weights into a safetensors file, ready for serving.
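One way the flattening and saving could look, as a sketch rather than the original implementation. It assumes the model state is available as nested mappings with array leaves and that the server expects dot-separated names; the separator, the example key names, and the save_flat_weights wrapper are assumptions.

```python
from collections.abc import Mapping


def flatten_weight_dict(tree, prefix="", sep="."):
    """Recursively flatten nested mappings into {"a.b.c": leaf}."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Interior node: recurse with the extended name.
            flat.update(flatten_weight_dict(value, name, sep))
        else:
            # Leaf: in a real model this is a JAX array.
            flat[name] = value
    return flat


def save_flat_weights(flat_weights, path):
    """Write a flat dict of JAX arrays to a safetensors file."""
    # Imported lazily so the flattening helper has no extra dependencies.
    from safetensors.flax import save_file

    save_file(flat_weights, path)


# Placeholder leaves stand in for arrays; the key names are illustrative.
nested = {"model": {"embed": {"weight": "E"}, "layers": {"0": {"mlp": {"weight": "W0"}}}}}
print(flatten_weight_dict(nested))
```

Any transposes or permutations needed by individual layers would be applied to the leaves before calling save_flat_weights.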

Serving with vLLM

vLLM has specific requirements for serving models:

Required Files:
- safetensors file: Contains the model weights.
- config.json: Defines the model's architecture, size, and other essential parameters. vLLM reads this first for compatibility checks.
- tokenizer.json and related files: Define how text is converted into tokens.
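Before starting a server, it can help to confirm that the directory actually contains these files. The check below is a hypothetical convenience helper, not part of vLLM; it only looks for config.json, tokenizer.json, and at least one *.safetensors file, and ignores the model-dependent "related files".

```python
from pathlib import Path


def missing_serving_files(model_dir):
    """Return the names of required files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [
        name
        for name in ("config.json", "tokenizer.json")
        if not (root / name).is_file()
    ]
    # Weights may be split across several files, so accept any *.safetensors.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

An empty list means the basic layout is in place; it does not validate the contents of the files.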
Custom Models: If a model is not in vLLM's list of supported models, it can be treated as a custom model by meeting vLLM's requirements. Refer to the vLLM documentation for details.
Serving Steps with vLLM:
1. Install vLLM and prepare the weights.
2. Initialize the LLM class from the vLLM library.
3. Point the LLM class to the directory containing the converted safetensors file and specify the load format and data type.
4. Define input prompts and sampling parameters (e.g., temperature, top_p).
5. Call llm.generate() to perform inference.
6. The output object contains both the original prompt and the generated text.
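The steps above can be sketched as follows. The prompts, the sampling values, the bfloat16 default, and the wrapper name generate_with_vllm are placeholders chosen for illustration; the model directory is whatever path holds the converted files. vLLM is imported inside the function so the snippet can be read and loaded without it installed.

```python
PROMPTS = ["The capital of France is", "JAX is a library for"]
SAMPLING_KWARGS = {"temperature": 0.8, "top_p": 0.95}  # illustrative values


def generate_with_vllm(model_dir, prompts=PROMPTS, dtype="bfloat16"):
    """Load converted weights from model_dir and return (prompt, text) pairs."""
    from vllm import LLM, SamplingParams

    # model_dir must contain the safetensors, config.json and tokenizer files.
    llm = LLM(model=model_dir, load_format="safetensors", dtype=dtype)
    sampling_params = SamplingParams(**SAMPLING_KWARGS)
    outputs = llm.generate(prompts, sampling_params)
    # Each result carries the original prompt and one or more completions.
    return [(out.prompt, out.outputs[0].text) for out in outputs]
```

Calling generate_with_vllm with the path to the converted model directory returns one (prompt, generated text) pair per input prompt.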

Serving with SGLang

SGLang has similar requirements for serving models:

Required Files:
- config.json
- tokenizer.json
- Potentially tokenizer_config.json and special_tokens_map.json depending on the model.
Custom Models: Custom models can be supported by registering them. Refer to the SGLang documentation for details.
Serving Steps with SGLang:
1. Install SGLang and prepare the weights.
2. Initialize the SGLang engine, passing the path to the model directory.
3. Define prompts and sampling parameters, passed as a dictionary to the generate method.
4. Call llm.generate() to produce outputs.
5. Iterate through the results to display the prompt and generated text (accessed via output["text"]).
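A comparable sketch for SGLang's offline engine. As before, the prompts, sampling values, and the wrapper name generate_with_sglang are placeholders, and the import is deferred into the function.

```python
SGL_PROMPTS = ["The capital of France is", "JAX is a library for"]
SGL_SAMPLING = {"temperature": 0.8, "top_p": 0.95}  # illustrative values


def generate_with_sglang(model_dir, prompts=SGL_PROMPTS):
    """Run the SGLang engine on model_dir and return (prompt, text) pairs."""
    import sglang as sgl

    llm = sgl.Engine(model_path=model_dir)
    try:
        # Sampling parameters are passed as a plain dictionary.
        outputs = llm.generate(prompts, SGL_SAMPLING)
        # Each output is a dict; the generated text is under "text".
        return [(p, out["text"]) for p, out in zip(prompts, outputs)]
    finally:
        llm.shutdown()
```

Because the engine starts worker processes, a script that calls this function should do so under an `if __name__ == "__main__":` guard.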

Conclusion and Future Directions

Serving JAX models with standard open-source tools like vLLM and SGLang is achievable by converting JAX weights to a compatible safetensors format. This matters because it lets organizations reuse their existing infrastructure and expertise when exploring or adopting JAX.

Key Takeaways:

JAX models can be served using familiar enterprise tools by converting their weights to the safetensors format.
Weight conversion requires careful handling of shape differences, especially for convolutional and linear layers.
Both vLLM and SGLang require specific configuration and tokenizer files in addition to the model weights.
This integration strategy supports a gradual adoption of JAX within organizations.

Areas for Improvement:

Adding TPU support for vLLM and SGLang.
Optimizing the weight conversion process for better performance.

The video also points to resources for learning more about JAX, including coding exercises, documentation, and a Discord community. Upcoming episodes will cover more of the JAX AI stack.