Compilers in the Age of LLMs — Yusuf Olokoba, Muna

Key Concepts

AI Deployment Challenges: Difficulty in integrating diverse AI models (open-source, proprietary) into existing infrastructure without extensive re-engineering.
Hybrid Inference: The future paradigm of AI where smaller, local models work in conjunction with larger, cloud-based models.
Python Compiler: A tool that translates Python code into self-contained, executable binaries for various platforms.
Tracing: The process of analyzing Python code to generate a graph representation of its operations.
Intermediate Representation (IR): An internal representation of the Python function's logic, used for translation.
Type Propagation: A compiler technique to infer and assign static types to variables in dynamically typed Python code for lower-level language generation.
LLM-Assisted Code Generation: Using Large Language Models to generate C++ or Rust code for elementary operations, reducing manual effort.
Foreign Function Interface (FFI): A mechanism allowing code in one language to call functions written in another language.
OpenAI-Style Client: A client that recreates the familiar interface of OpenAI's client so that it can be used with any compiled AI model.

AI Deployment: Bridging the Gap Between Hype and Reality

The current landscape for AI engineers often involves juggling multiple codebases, Hugging Face models, and agentic workflows that are essentially chains of HTTP calls. While newer technologies like voice agents and the Model Context Protocol (MCP) are gaining traction, many engineering teams are still grappling with fundamental deployment issues. The core problem is how to integrate a wider variety of AI models into diverse environments without constant infrastructure overhauls.

The Problem with Current Deployment Workflows

Model Integration Complexity: Introducing a new open-source model typically requires writing Dockerfiles, setting up containers, and managing infrastructure.
Agentic Workflow Overhead: Integrating these models into AI agents necessitates additional tooling and exposure mechanisms (e.g., MCP).
Escalating Complexity: Each new model or integration adds complexity, and that complexity compounds as the system grows.

The Developer's Ideal Scenario

Developers want a simplified, standardized approach: an OpenAI-style client that can connect to any model, regardless of where it is deployed (local or remote) or which framework runs it (llama.cpp, TensorRT), with minimal code changes.

A Python Compiler for Universal AI Deployment

The presented solution is a compiler for Python that transforms plain Python inference code into tiny, self-contained binaries. These binaries can then be executed across a wide range of environments, from cloud servers to Apple Silicon devices. LLMs are integrated into several stages of the compiler pipeline, including testing and code generation. The aim is to run any AI model not only on servers but also on-device and at the edge.

Motivation for a Python Compiler Approach

Simplified Model Integration: The goal is to provide a standardized method for developers to easily execute internal or open-source AI models within their codebases. This mirrors the ease of switching models in services like OpenAI, where only a model argument needs to be changed. The compiler ingests Python inference code and outputs an executable artifact.
Enabling Hybrid Inference: The future of AI deployment is expected to be hybrid, with smaller models running closer to the user (on-device, edge) complementing larger, more capable cloud AI models. This requires a shift from Python code and Docker containers to lower-level execution environments that sit closer to the hardware and respond faster.

Case Study: Compiling an Embedding Model

The talk demonstrates this process with a Python function that runs Google's EmbeddingGemma model (cited as 270 million parameters). This model is suitable for tasks like text search and retrieval-augmented generation. The objective is to convert this Python function into equivalent C++ or Rust code, compile it into a self-contained binary, and then consume it using a familiar OpenAI-style client.

Step 1: Tracing the Python Function

The initial step involves generating a graph representation of the Python function's operations, a process called tracing.

Initial Attempts with PyTorch 2: Early prototypes used PyTorch 2's torch.compile and torch.fx for symbolic tracing. torch.fx takes the Python function and runs it with "fake" inputs, which carry shapes and types but no data, to record an execution graph without allocating real tensor memory.
Challenges with PyTorch FX:
- PyTorch-Centric: The PyTorch tracer is primarily designed for PyTorch code. Tracing arbitrary code involving libraries like NumPy or OpenCV would require significant extensions.
- Fake Input Generation: While creating fake tensors is straightforward, generating realistic fake inputs for complex data types like images or dictionaries proved challenging.
LLM-Based Tracing: An alternative approach explored was using LLMs for trace generation, leveraging their structured output capabilities. This achieved high accuracy, but trace generation was too slow to be practical.
In-House Tracing Infrastructure: The team ultimately built their own tracing infrastructure. This involves:
1. Code Analysis: Analyzing the Abstract Syntax Tree (AST) of the Python code.
2. Heuristics: Applying internal heuristics to construct an intermediate representation (IR) of the user's function.

Example IR Snippet: The IR for the embedding function includes input nodes for the list of strings, function calls to the tokenizer and the model, and output nodes for the embedding vectors.
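
The AST-analysis step can be illustrated with Python's standard ast module. This is only a minimal sketch, not the team's tracer: the Node layout, the handled statement forms, and the names PREFIXES, tokenizer, and model in the sample source are all assumptions made for illustration. The sample function is kept as source text and is parsed, never executed.

```python
import ast
from dataclasses import dataclass, field

# Hypothetical inference function, kept as source text so nothing is executed.
# `PREFIXES`, `tokenizer`, and `model` are illustrative names only.
SOURCE = '''
def embed(texts: list[str]) -> list[list[float]]:
    prefixed = [PREFIXES["query"] + t for t in texts]
    tokens = tokenizer(prefixed)
    vectors = model(tokens)
    return vectors
'''

@dataclass
class Node:
    kind: str                 # "input", "call", "listcomp", or "output"
    name: str                 # variable the node defines (or the returned expression)
    target: str = ""          # callee for "call", element expression for "listcomp"
    args: list[str] = field(default_factory=list)

def trace(source: str) -> list[Node]:
    """Walk the function's AST and emit one IR node per recognized statement."""
    func = ast.parse(source).body[0]
    assert isinstance(func, ast.FunctionDef)
    # Every positional parameter becomes an input node.
    nodes = [Node("input", a.arg) for a in func.args.args]
    for stmt in func.body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            name, value = stmt.targets[0].id, stmt.value
            if isinstance(value, ast.Call) and isinstance(value.func, ast.Name):
                args = [ast.unparse(a) for a in value.args]
                nodes.append(Node("call", name, value.func.id, args))
            elif isinstance(value, ast.ListComp):
                iterable = ast.unparse(value.generators[0].iter)
                nodes.append(Node("listcomp", name, ast.unparse(value.elt), [iterable]))
        elif isinstance(stmt, ast.Return) and stmt.value is not None:
            nodes.append(Node("output", ast.unparse(stmt.value)))
    return nodes

ir = trace(SOURCE)
for node in ir:
    print(node)
```

The printed nodes follow the shape described above: an input node for the list of strings, calls to the tokenizer and the model, and an output node for the vectors. A real tracer would need heuristics for far more statement forms than the two handled here.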

Step 2: Translating to Lower-Level Languages (C++/Rust)

The next phase is translating the IR into lower-level languages like C++ or Rust. A key challenge is bridging the gap between Python's dynamic typing and the static typing of C++ and Rust.

Dynamic vs. Static Typing: Python allows variables to change types (e.g., integer to string), while C++ and Rust require variables to have a fixed, declared type.
Type Propagation: This compiler technique is crucial for inferring and constraining types.
- Process: By analyzing input types (from function signatures) and the types of global constants, the compiler can propagate type information through operations.
- Example: If a function takes a list of strings and a prefix map containing strings, and an addition operation is performed between a prefix and a string from the list, type propagation determines that the output of this operation will also be a string. This allows for the generation of a C++ function that takes two strings and performs string concatenation.
Handling Elementary Operations:
- The Challenge: Manually implementing C++ or Rust equivalents for every possible Python operation and library function (e.g., NumPy, PyTorch operations) is a monumental task.
- Tractability through LLMs: The key to making this manageable is using LLMs to generate the C++ and Rust code for these elementary operations. This allows for mass production of the necessary native code.
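
The prefix example above can be worked through in a few lines. The sketch below assumes a hypothetical Op record and a small hand-written RULES table; the actual compiler's rules and type names are not given in the talk summary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str            # operation, e.g. "add" or "getitem"
    inputs: tuple        # names of the operand variables
    output: str          # name of the variable the op defines

# Hypothetical typing rules: (op, operand types) -> result type.
RULES = {
    ("getitem", ("dict[str,str]", "str")): "str",
    ("add", ("str", "str")): "str",
    ("add", ("int", "int")): "int",
    ("add", ("float", "float")): "float",
}

def propagate(ops, known):
    """Assign a static type to every variable, given input and constant types."""
    types = dict(known)
    for op in ops:
        operand_types = tuple(types[name] for name in op.inputs)
        try:
            types[op.output] = RULES[(op.name, operand_types)]
        except KeyError:
            raise TypeError(f"no rule for {op.name}{operand_types}") from None
    return types

# Models PREFIXES["query"] + text, where `text` is one element of the input list.
ops = [
    Op("getitem", ("PREFIXES", "key"), "prefix"),
    Op("add", ("prefix", "text"), "prefixed"),
]
known = {"PREFIXES": "dict[str,str]", "key": "str", "text": "str"}
types = propagate(ops, known)
print(types["prefixed"])  # str
```

Because the operand types of the addition resolve to two strings, the result is typed as a string, which is what lets a code generator pick string concatenation rather than numeric addition. An operand combination with no rule raises an error instead of guessing.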

Step 3: Generating C++ Code and Compilation

With type information propagated through the IR, the compiler can generate correct C++ source code.

Side-by-Side Comparison: The generated C++ code mirrors the Python function's logic, including list comprehension for prefixing, tokenization, model execution, and returning embedding vectors.
Compilation to Native Binary: The generated C++ code can be compiled with any standard C/C++ compiler, producing a self-contained dynamic library (shared object). Because the output is ordinary C++ source, it can be built for any platform that has a C/C++ toolchain, and each build runs natively on its target platform.
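
As a rough illustration of the code generation step, the sketch below renders one typed elementary operation as C++ source text. The CPP_TYPES mapping, the TEMPLATES table, and the function names are assumptions; the hand-written template stands in for a native implementation that, in the approach described here, would be produced by an LLM.

```python
# Map the type names used during propagation to C++ types (illustrative subset).
CPP_TYPES = {"str": "std::string", "int": "int64_t", "float": "double"}

# Hand-written stand-in for a native implementation of an elementary operation.
TEMPLATES = {
    "add": "{ret} {name}(const {a}& lhs, const {b}& rhs) {{\n    return lhs + rhs;\n}}",
}

def emit_function(op: str, name: str, arg_types: tuple[str, str], ret_type: str) -> str:
    """Render one elementary operation as C++ source text."""
    a, b = (CPP_TYPES[t] for t in arg_types)
    return TEMPLATES[op].format(ret=CPP_TYPES[ret_type], name=name, a=a, b=b)

def emit_translation_unit(functions: list[str]) -> str:
    """Prepend the headers the emitted functions rely on."""
    headers = "#include <cstdint>\n#include <string>\n"
    return headers + "\n" + "\n\n".join(functions) + "\n"

source = emit_translation_unit([
    emit_function("add", "add_str_str", ("str", "str"), "str"),
])
print(source)
```

The emitted text is plain C++ that a standard compiler could build into a shared library. Keying templates by operation and propagated types is one possible design; it keeps each generated snippet small enough to test in isolation.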

Step 4: Consuming the Compiled Model

The compiled binary can be invoked from various languages and environments.

Example: JavaScript on Node.js:
- Foreign Function Interface (FFI): Node.js can use FFI to load and call functions from the compiled native library.
- Scaffolding: This involves declaring the native library, its functions, and their signatures.
- Execution: Once loaded, the native function can be invoked directly from JavaScript, returning the embedding matrix.
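
The same declare-then-call pattern exists in Python through the standard ctypes module, shown here as a sketch. The talk's example uses Node.js, and the exported function name and signature of the compiled embedding library are not given, so the C runtime's strlen stands in for the compiled function. A compiled library would instead be opened with ctypes.CDLL and its file path.

```python
import ctypes
import os

def load_c_runtime() -> ctypes.CDLL:
    """Return a handle to a C runtime library that is already available."""
    if os.name == "posix":
        return ctypes.CDLL(None)      # symbols of the running process, including libc
    return ctypes.cdll.msvcrt         # Windows C runtime

# 1. Declare the native library.
lib = load_c_runtime()

# 2. Declare the function and its signature: size_t strlen(const char *s).
strlen = lib.strlen
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

# 3. Invoke the native function directly.
print(strlen(b"embedding"))  # 9
```

Declaring argtypes and restype up front is what lets the FFI layer convert arguments and results correctly; without them, ctypes falls back to treating the return value as a C int.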
Exposing via OpenAI-Style Client:
- Client Class: A Client class with a nested Embeddings class and a create function is implemented, mirroring the OpenAI client structure.
- Model Resolution: When a model name is provided, it's mapped to the path of the corresponding compiled binary.
- Invocation: The FFI mechanism is used to load and execute the compiled library.
- Output Formatting: The output from the compiled model is then reshaped to match the format returned by the official OpenAI client.
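
The four pieces above can be sketched in Python as follows. This is an illustrative outline, not the talk's implementation: the registry maps model names to plain callables so the example runs without a compiled binary, whereas the described design maps names to binary paths and loads them through FFI. The model name "embeddinggemma" and the fake_embedder function are hypothetical, and the response objects carry only a subset of the fields in OpenAI's embeddings response.

```python
from dataclasses import dataclass
from typing import Callable

Embedder = Callable[[list[str]], list[list[float]]]

def fake_embedder(texts: list[str]) -> list[list[float]]:
    """Stand-in for a function loaded from a compiled library via FFI."""
    return [[float(len(t)), 0.0] for t in texts]

@dataclass
class Embedding:
    index: int
    embedding: list[float]
    object: str = "embedding"

@dataclass
class CreateEmbeddingResponse:
    data: list[Embedding]
    model: str
    object: str = "list"

class Embeddings:
    def __init__(self, registry: dict[str, Embedder]):
        self._registry = registry

    def create(self, *, model: str, input: list[str]) -> CreateEmbeddingResponse:
        embedder = self._registry[model]      # model resolution
        vectors = embedder(input)             # invocation
        data = [Embedding(i, v) for i, v in enumerate(vectors)]
        return CreateEmbeddingResponse(data=data, model=model)  # output formatting

class Client:
    def __init__(self, registry: dict[str, Embedder]):
        self.embeddings = Embeddings(registry)

client = Client({"embeddinggemma": fake_embedder})
response = client.embeddings.create(model="embeddinggemma", input=["hello", "hi"])
print(response.data[0].embedding)
```

Calling client.embeddings.create with a different model name is then the only change needed to switch models, provided that name is present in the registry.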

Conclusion and Key Takeaways

The presented approach offers a robust solution to the challenges of AI model deployment. By building a Python compiler that translates high-level Python inference code into self-contained, native binaries, developers can achieve:

Universal Compatibility: Build and run AI models on any platform that has a C/C++ compiler.
Simplified Integration: Easily swap models by changing a model argument, similar to existing cloud services.
Support for Hybrid Inference: Enable the deployment of models closer to the user, paving the way for future AI architectures.
Reduced Infrastructure Overhead: Eliminate the need for extensive Dockerization and infrastructure management for each new model.
LLM-Powered Efficiency: Leverage LLMs to automate the generation of low-level code, making the compilation process scalable.

This system recreates the familiar OpenAI client experience and extends it to any open-source model that can be wrapped in a Python function and then compiled.