Serving JAX Models with vLLM & SGLang
By Google for Developers
Key Concepts
- JAX: A Python library for high-performance numerical computation and machine learning research.
- Enterprise Machine Learning Workflows: Existing systems and processes used by organizations to develop, deploy, and manage machine learning models.
- Open-Source Serving Frameworks: Tools like vLLM and SGLang that facilitate the deployment and serving of machine learning models.
- PyTorch: A popular deep learning framework, often the basis of existing enterprise infrastructure.
- safetensors: A file format for storing tensor weights, designed for safety and efficiency.
- Flax: A neural network library built on JAX, often used for implementing models.
- PyTree: A JAX abstraction that treats nested Python containers (like dictionaries and lists) as a single tree whose leaves are arrays.
- NNX (Neural Network eXtensions): A newer API in Flax that makes modules native JAX PyTrees, simplifying state management.
- vLLM: An open-source inference and serving engine that can load and run large language models.
- SGLang: Another open-source serving framework for language models.
- NCHW vs. NHWC: Two dimension orderings for image tensors. NCHW (batch, channels, height, width) is common in PyTorch, while NHWC (batch, height, width, channels) is common in JAX/Flax.
- Weight Transposition/Permutation: Adjusting the order of dimensions in weight tensors to match the expectations of different frameworks or layer implementations.
- config.json: A configuration file that defines a model's architecture, size, and other parameters.
- tokenizer.json: A file that defines how to convert text into numerical tokens for a model.
Integrating JAX Models into Enterprise Workflows with vLLM and SGLang
This discussion focuses on how to integrate JAX models into existing enterprise machine learning workflows by using popular open-source serving frameworks like vLLM and SGLang. The primary goal is to let organizations keep their current infrastructure and adopt JAX gradually, lowering the barrier to entry.
The General Process for Serving JAX Models
The core process involves four steps:
- Load Model Weights into JAX: This typically involves loading pre-trained weights, often from sources like Hugging Face, into an equivalent JAX implementation.
- Convert JAX Weights for Serving: The crucial step is converting these JAX weights into the format that serving frameworks like vLLM and SGLang expect. For these frameworks, this usually means a flattened dictionary of tensors stored in `safetensors` files.
- Load Converted Model into Server: The converted model is then loaded into the chosen serving framework.
- Generate Outputs: The server is used to generate outputs from the loaded model.
Technical Detail: Some layer types may require weights to be transposed or permuted during the conversion process to align with the expected format.
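The first step above, reading weights from disk into JAX arrays, can be sketched as follows. This is a minimal sketch, assuming the `safetensors` package is installed together with Flax and JAX, and that the checkpoint is a local directory of `.safetensors` shards. The helper name `find_safetensors_files` is hypothetical, and the loader is an illustration rather than the exact function shown in the video.

```python
from pathlib import Path


def find_safetensors_files(model_dir):
    """Return the *.safetensors files in model_dir, sorted by file name."""
    return sorted(Path(model_dir).glob("*.safetensors"))


def load_safe_tensors(model_dir):
    """Load every tensor from every shard into one flat {name: array} dict."""
    # Imported lazily so the file-discovery helper works without safetensors.
    from safetensors import safe_open

    tensors = {}
    for path in find_safetensors_files(model_dir):
        # framework="flax" asks safetensors to hand back JAX arrays.
        with safe_open(str(path), framework="flax") as f:
            for name in f.keys():
                tensors[name] = f.get_tensor(name)
    return tensors
```

The result is a flat dictionary keyed by the tensor names stored in the checkpoint; mapping those names onto the parameters of a JAX model is model-specific and is not shown here.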
Weight Conversion: Handling Shape Differences
A significant aspect of weight conversion is addressing differences in layer definitions and data formats between frameworks like PyTorch and JAX/Flax.
- Python Function `load_safe_tensors`: This function is used in both the vLLM and SGLang examples to load model weights. It iterates through the `safetensors` files in a specified directory. The `safe_open` context manager from the `safetensors` library, called with the `framework='flax'` argument, ensures tensors are loaded in a format suitable for JAX.
- Weight Alterations by Layer Type:
  - Linear/Fully Connected Layers: Only require a transpose of the weights.
  - Convolutional Layers: Require a more complex permutation of the weight dimensions.
  - Batch Normalization Layers: Typically require no changes to the weights.
- Handling Convolution-to-Linear Transitions (e.g., ResNet, VGG):
  - PyTorch: Activations after convolutions are NCHW. They are then reshaped to `[N, C * H * W]` before being fed to a fully connected layer.
  - JAX/Flax: Activations after convolutions are NHWC. Before being reshaped and fed to a fully connected layer, these activations must be transposed to NCHW so that the flattened features follow the same order as in PyTorch.
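The layer-specific adjustments above can be illustrated with a short sketch. It uses NumPy so it runs without an accelerator stack; `jax.numpy` provides the same `transpose` and `reshape` calls. The sketch assumes the usual layouts, PyTorch `[out, in]` and `[out, in, kH, kW]` versus Flax `[in, out]` and `[kH, kW, in, out]`, and all function names are hypothetical.

```python
import numpy as np


def linear_kernel_to_flax(weight):
    """PyTorch Linear weight [out, in] -> Flax Dense kernel [in, out]."""
    return np.transpose(weight, (1, 0))


def conv_kernel_to_flax(weight):
    """PyTorch Conv2d weight [out, in, kH, kW] -> Flax Conv kernel [kH, kW, in, out]."""
    return np.transpose(weight, (2, 3, 1, 0))


def flatten_nhwc_like_pytorch(activations):
    """Flatten NHWC activations in the feature order PyTorch would produce."""
    n = activations.shape[0]
    nchw = np.transpose(activations, (0, 3, 1, 2))  # NHWC -> NCHW
    return nchw.reshape(n, -1)                      # [N, C * H * W]


# Sanity check: a transposed linear weight gives the same layer output.
rng = np.random.default_rng(0)
w_pt = rng.normal(size=(4, 3))                  # PyTorch layout: 4 outputs, 3 inputs
x = rng.normal(size=(2, 3))                     # batch of 2 inputs
y_pytorch_style = x @ w_pt.T                    # what a bias-free Linear computes
y_flax_style = x @ linear_kernel_to_flax(w_pt)  # what a bias-free Dense computes
```

Going the other way, from JAX to the PyTorch-style layout that the serving frameworks typically expect, uses the inverse permutations: `(1, 0)` again for linear kernels and `(3, 2, 0, 1)` for convolution kernels.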
Preparing JAX Models for Serving
Once pre-trained weights (e.g., for Llama 3.2) are loaded and potentially fine-tuned, the JAX model needs to be prepared for serving.
- Flattening Weight Dictionaries: Servers like vLLM expect a flat dictionary of weights. The `flatten_weight_dict` function performs this transformation.
  - Methodology: This is essentially a PyTree traversal. It walks the nested dictionary structure of the JAX model's state.
  - NNX Simplification: With the latest NNX API, this process is more direct because the entire NNX module is a native JAX PyTree, so its contained state can be flattened directly.
- Saving Processed Weights: The `save_file` function from `safetensors.flax` saves these processed weights into a `safetensors` file, making them ready for serving.
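The flatten-then-save step can be sketched as below. This is a minimal sketch, not the video's implementation: it assumes the model state is available as a nested dictionary of arrays, that dotted key names are what the server expects, and that `safetensors` is installed with Flax support. The `prefix` and `sep` parameters and the `save_flat_weights` helper are hypothetical.

```python
def flatten_weight_dict(tree, prefix="", sep="."):
    """Flatten a nested dict of arrays into a single-level {"a.b.c": array} dict."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-trees, carrying the path so far as the prefix.
            flat.update(flatten_weight_dict(value, name, sep))
        else:
            # Anything that is not a dict is treated as a leaf (a weight array).
            flat[name] = value
    return flat


def save_flat_weights(flat, path):
    """Write a flat {name: jax array} dict to a safetensors file."""
    # Imported lazily so flatten_weight_dict is usable without safetensors.
    from safetensors.flax import save_file

    save_file(flat, path)
```

For example, `flatten_weight_dict({"layers": {"0": {"kernel": k, "bias": b}}})` yields the keys `layers.0.kernel` and `layers.0.bias`. In practice the keys usually also need renaming to match the names the target server expects for the architecture, which depends on the model.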
Serving with vLLM
vLLM has specific requirements for serving models:
- Required Files:
  - `safetensors` file: Contains the model weights.
  - `config.json`: Defines the model's architecture, size, and other essential parameters. vLLM reads this first for compatibility checks.
  - `tokenizer.json` and related files: Define how text is converted into tokens.
- Custom Models: If a model is not in vLLM's list of supported models, it can be treated as a custom model by meeting vLLM's requirements. Refer to the vLLM documentation for details.
- Serving Steps with vLLM:
  - Install vLLM and prepare the weights.
  - Initialize the `LLM` class from the vLLM library.
  - Point the `LLM` class to the directory containing the converted `safetensors` file and specify the format and data type.
  - Define input prompts and sampling parameters (e.g., `temperature`, `top_p`).
  - Call `llm.generate()` to perform inference.
  - The output object contains both the original prompt and the generated text.
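The vLLM steps above can be sketched as follows. This is an illustration under stated assumptions rather than the code from the video: it assumes vLLM is installed on a supported accelerator, that the model directory holds the converted weights plus `config.json` and `tokenizer.json`, and that `bfloat16` suits the model. The helper `missing_model_files`, the function name `generate_with_vllm`, and the sampling values are hypothetical.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """List the required artifacts that are absent from model_dir."""
    root = Path(model_dir)
    missing = [
        name for name in ("config.json", "tokenizer.json")
        if not (root / name).is_file()
    ]
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


def generate_with_vllm(model_dir, prompts):
    """Run offline inference and return (prompt, generated_text) pairs."""
    missing = missing_model_files(model_dir)
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {missing}")

    # Imported lazily so the file check above works without vLLM installed.
    from vllm import LLM, SamplingParams

    # Point vLLM at the directory and state the weight format and data type.
    llm = LLM(model=str(model_dir), load_format="safetensors", dtype="bfloat16")
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    outputs = llm.generate(prompts, sampling_params)
    # Each result carries the original prompt and one or more completions.
    return [(out.prompt, out.outputs[0].text) for out in outputs]
```

Checking for the required files before constructing `LLM` gives a clearer error than a failed model load, since vLLM reads `config.json` before anything else.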
Serving with SGLang
SGLang has similar requirements for serving models:
- Required Files:
  - `config.json`
  - `tokenizer.json`
  - Potentially `tokenizer_config.json` and `special_tokens_map.json`, depending on the model.
- Custom Models: Custom models can be supported by registering them. Refer to the SGLang documentation for details.
- Serving Steps with SGLang:
  - Install SGLang and prepare the weights.
  - Initialize the SGLang engine, passing the path to the model directory.
  - Define prompts and sampling parameters; the sampling parameters are passed as a dictionary to the `generate` method.
  - Call `llm.generate()` to produce outputs.
  - Iterate through the results to display the prompt and generated text (the `text` field of each output).
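The SGLang steps above can be sketched in the same way. This is a hedged illustration, not the video's code: it assumes SGLang's offline `Engine` is installed on a supported accelerator and that each result is a dictionary with a `"text"` entry, which is how recent SGLang releases return offline results; adjust the access if your version differs. The function names and sampling values are hypothetical.

```python
def pair_prompts_with_text(prompts, outputs):
    """Pair each prompt with the generated text of the matching result."""
    # Assumes each result is a dict with a "text" entry.
    return [(prompt, output["text"]) for prompt, output in zip(prompts, outputs)]


def generate_with_sglang(model_dir, prompts):
    """Run offline inference with SGLang and return (prompt, text) pairs."""
    # Imported lazily so the pairing helper works without SGLang installed.
    import sglang as sgl

    llm = sgl.Engine(model_path=str(model_dir))
    try:
        # Sampling parameters are passed as a plain dictionary.
        sampling_params = {"temperature": 0.8, "top_p": 0.95}
        outputs = llm.generate(prompts, sampling_params)
        return pair_prompts_with_text(prompts, outputs)
    finally:
        # Release the engine's worker processes.
        llm.shutdown()
```

Because the engine starts worker processes, a script would normally call `generate_with_sglang` from under an `if __name__ == "__main__":` guard.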
Conclusion and Future Directions
Serving JAX models with standard open-source tools like vLLM and SGLang is achievable by converting JAX weights to a compatible `safetensors` format. This matters because it lets organizations reuse their existing infrastructure and expertise when exploring or adopting JAX.
Key Takeaways:
- JAX models can be served using familiar enterprise tools by converting their weights to the `safetensors` format.
- Weight conversion requires careful handling of shape differences, especially for convolutional and linear layers.
- Both vLLM and SGLang require specific configuration and tokenizer files in addition to the model weights.
- This integration strategy supports a gradual adoption of JAX within organizations.
Areas for Improvement:
- Adding TPU support for vLLM and SGLang.
- Optimizing the weight conversion process for better performance.
The video also points to resources for learning more about JAX, including coding exercises, documentation, and a Discord community. Upcoming episodes will cover more of the JAX AI stack.