Self-Host Gemma 4: Deploy LLMs on Cloud Run GPUs

Key Concepts

Agentic System: An AI architecture where a model acts as the "brain," performing reasoning to select tools and make decisions.
Ollama: A serving framework optimized for development, rapid prototyping, and local experimentation.
vLLM: A high-performance serving framework for production, featuring PagedAttention for memory-efficient KV-cache management and continuous (dynamic) batching of incoming requests.
Cloud Run: A serverless platform for deploying containerized applications, capable of utilizing GPU accelerators (e.g., NVIDIA L4).
GCS FUSE: A tool that mounts Google Cloud Storage (GCS) buckets as local file systems, allowing Cloud Run to access model weights stored in the cloud as if they were local files.
Artifact Registry: A managed service for storing and managing container images.
Secret Manager: A secure service for storing sensitive information like API tokens (e.g., Hugging Face tokens).
Private Google Access: A VPC subnet setting that lets resources without external IP addresses reach Google Cloud APIs (like GCS) over Google's internal network rather than the public internet.

1. Model Strategy: Open vs. Closed

The presenters emphasize that choosing between open models (e.g., Gemma) and closed models (e.g., Gemini) depends on the use case:

Closed Models: State-of-the-art, fully managed, and easy to start with, but offer limited customization beyond prompting. Costs scale roughly linearly with usage, since every API call is billed.
Open Models: Ideal for industries like finance or healthcare where data privacy requires on-premises or isolated hosting. They allow for fine-tuning and domain-specific customization. Costs are tied to infrastructure rather than per-call usage.
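
The cost difference between the two options can be made concrete with a break-even calculation. The sketch below is illustrative only: the function name and both dollar figures are hypothetical, and it treats self-hosting as a flat monthly infrastructure cost, which ignores effects such as scale-to-zero.

```python
from decimal import Decimal, ROUND_CEILING


def break_even_requests(monthly_infra_cost: str, price_per_request: str) -> int:
    """Return the monthly request count at which a per-call API
    costs at least as much as a flat-rate self-hosted deployment.

    Amounts are passed as strings and handled as Decimal to avoid
    binary floating-point rounding in the division.
    """
    infra = Decimal(monthly_infra_cost)
    price = Decimal(price_per_request)
    if infra < 0 or price <= 0:
        raise ValueError("costs must be non-negative and price must be positive")
    return int((infra / price).to_integral_value(rounding=ROUND_CEILING))


# Hypothetical figures for illustration; these are not real prices.
print(break_even_requests("500", "0.002"))
```

Below the returned volume the per-call model is cheaper under these assumptions; above it, the fixed infrastructure cost is spread over enough requests to win.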

2. Deployment Methodologies

The lab covers two distinct architectural approaches for hosting Gemma 4 on Cloud Run:

Approach A: Ollama (Development/Prototyping)

Process: The model is "baked" directly into the Docker container image.
Pros: Fast cold starts relative to Approach B, because the weights ship inside the image and need no separate download or mount at startup.
Cons: Inflexible; changing the model version or parameters requires a full rebuild and redeployment of the container image.

Approach B: vLLM (Production)

Process: The container image contains only the vLLM serving code. Model weights are stored in a GCS bucket and mounted via GCS FUSE.
Pros: Highly flexible; swapping models simply involves updating the files in the GCS bucket. Optimized for high throughput and multi-user concurrency.
Cons: Longer cold start times due to the need to mount and load weights from storage upon initial invocation.
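
As a rough sketch of how Approach B attaches the bucket, the helper below assembles the extra `gcloud run deploy` flags for a Cloud Storage volume mount. The bucket name, service name, mount path, volume name, and the `MODEL_PATH` environment variable are all assumptions made for illustration; the container's entrypoint is assumed to read `MODEL_PATH` to locate the weights. Resource flags (CPU, memory, GPU) are left out to keep the focus on the mount.

```python
import shlex


def gcs_volume_flags(bucket: str,
                     mount_path: str = "/models",
                     volume_name: str = "model-weights") -> list:
    """Build the gcloud flags that mount a GCS bucket into the container."""
    return [
        # Declare a volume backed by the bucket (mounted through GCS FUSE).
        "--add-volume", f"name={volume_name},type=cloud-storage,bucket={bucket}",
        # Expose that volume inside the container at mount_path.
        "--add-volume-mount", f"volume={volume_name},mount-path={mount_path}",
        # Hypothetical variable the serving container is assumed to read.
        "--set-env-vars", f"MODEL_PATH={mount_path}",
    ]


# Hypothetical service and bucket names; IMAGE_URI is a placeholder.
command = ["gcloud", "run", "deploy", "gemma-vllm", "--image", "IMAGE_URI",
           *gcs_volume_flags("my-model-bucket")]
print(shlex.join(command))
```

Because the image never contains the weights, swapping models under this layout means replacing files in the bucket and starting a new revision or instance, not rebuilding the image.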

3. Step-by-Step Implementation Framework

Environment Setup: Initialize a Google Cloud project, enable the necessary APIs (Cloud Storage, Cloud Build, Artifact Registry, Secret Manager), and configure IAM permissions for the default service account.
Secret Management: Store Hugging Face tokens in Secret Manager to securely authenticate model downloads.
CI/CD Pipeline (Cloud Build):
- Define a cloudbuild.yaml blueprint.
- Build: Create the Docker image.
- Push: Upload the image to Artifact Registry.
- Deploy: Deploy the image to Cloud Run with specific resource allocations (e.g., 4 CPUs, 16 GB RAM, NVIDIA L4 GPU).
Verification: Use curl commands to send POST requests to the Cloud Run endpoint and verify the model's reasoning capabilities.
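
The build, push, and deploy stages of the pipeline can be sketched as a Cloud Build configuration. The example below generates the config as JSON (Cloud Build accepts JSON as well as YAML); it is not the lab's actual cloudbuild.yaml. The image URI, repository, service name, and region are hypothetical, and the resource flags mirror the example allocation mentioned above.

```python
import json


def build_pipeline_config(image_uri: str, service: str, region: str) -> dict:
    """Return a three-step Cloud Build config: build, push, deploy."""
    return {
        "steps": [
            # Build: create the Docker image from the current directory.
            {"name": "gcr.io/cloud-builders/docker",
             "args": ["build", "-t", image_uri, "."]},
            # Push: upload the image to Artifact Registry.
            {"name": "gcr.io/cloud-builders/docker",
             "args": ["push", image_uri]},
            # Deploy: roll the image out to Cloud Run with a GPU attached.
            {"name": "gcr.io/google.com/cloudsdktool/cloud-sdk",
             "entrypoint": "gcloud",
             "args": ["run", "deploy", service,
                      "--image", image_uri,
                      "--region", region,
                      "--cpu", "4",
                      "--memory", "16Gi",
                      "--gpu", "1",
                      "--gpu-type", "nvidia-l4",
                      "--no-cpu-throttling"]},
        ],
        "images": [image_uri],
    }


# Hypothetical names; $PROJECT_ID is substituted by Cloud Build at run time.
config = build_pipeline_config(
    "us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/gemma-service",
    "gemma-service",
    "us-central1",
)
print(json.dumps(config, indent=2))
```

Saved to a file, output like this could be passed to `gcloud builds submit --config`, assuming the repository and permissions from the setup step already exist.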

4. Key Technical Considerations

Resource Allocation: For Gemma 4 (2B version), a minimum of 16 GB of memory is recommended.
Networking: Enabling Private Google Access is critical when using vLLM to ensure that model weights are pulled from GCS over a private network, enhancing security and performance.
Stochasticity: The presenters note that AI models are stochastic; therefore, identical prompts may yield different, yet valid, responses.
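
The stochasticity noted above comes from sampling: at each step the model produces scores for candidate tokens and the server draws one at random in proportion to those scores. The toy sketch below illustrates temperature sampling over three made-up candidates. The logits are hypothetical, and real serving stacks sample over the full vocabulary, often with extra filters such as top-k or top-p.

```python
import math
import random


def sample_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick one token from a {token: logit} mapping.

    temperature <= 0 means greedy decoding (always the top-scoring token);
    higher temperatures flatten the distribution and increase variety.
    """
    if temperature <= 0:
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(value - peak) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical next-token scores for a single decoding step.
logits = {"Paris": 2.0, "The": 1.5, "It": 1.0}
rng = random.Random()  # unseeded, so repeated runs can differ

print([sample_token(logits, 1.0, rng) for _ in range(5)])  # varies run to run
print([sample_token(logits, 0.0, rng) for _ in range(5)])  # always the top token
```

When verifying a deployment, this is why a response that differs from a previous run is not by itself a sign of a fault.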

5. Notable Quotes

"The model you're choosing really determines the upper bound, the capability of your agentic system." — Annie
"Secret Manager is the best practice way for storing and managing your application secrets... rather than storing your API keys as environmental variables." — IO

6. Synthesis and Conclusion

The lab demonstrates that while Ollama is superior for rapid development and local testing due to its "baked-in" model approach, vLLM is the preferred choice for production environments requiring high concurrency and model flexibility. By leveraging Google Cloud’s CI/CD tools (Cloud Build) and storage solutions (GCS FUSE), developers can build robust, scalable agentic systems. Future sessions will focus on scaling via load balancers, security (Model Armor), and observability (Prometheus sidecars).