This AI agent runs on Cloud Run + NVIDIA GPUs

By Google Cloud Tech

Key Concepts

  • Cloud Run: A fully managed serverless platform that enables you to run stateless containers that are invocable via HTTP.
  • NVIDIA GPUs: Graphics Processing Units designed to accelerate complex computations, particularly relevant for AI and machine learning tasks.
  • Large Language Models (LLMs): AI models trained on vast amounts of text data, capable of understanding and generating human-like text.
  • Gemma3 (Gemma 3): An open-weight LLM family developed by Google DeepMind; the application runs it on a GPU.
  • Ollama: A tool that simplifies the process of downloading and running LLMs locally or in cloud environments.
  • RAG (Retrieval-Augmented Generation): A technique that enhances LLM responses by retrieving relevant information from external data sources before generating an answer.
  • Vectorization: The process of converting data (such as text) into numerical vectors, so that semantically similar items can be found by comparing their vectors.
  • Agent Workflow: A system where multiple AI agents collaborate to achieve a common goal, each performing a specific task.
  • LangGraph: A Python library (package name langgraph) for building complex agent workflows and state machines.
  • Gradio: A Python library for creating simple, customizable UI for machine learning models.
  • Federated Learning: A machine learning approach that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging their data.

Smart Health Agent Application

This section details a practical application built using NVIDIA GPUs on Google Cloud Run, designed to provide personalized wellness recommendations.

1. Application Functionality:

  • Personalized Recommendations: The AI agent generates tailored advice on exercises and diet based on user input.
  • User Input: The application collects information about the user's daily routine (e.g., desk job, daily walks, bedtime) and location (e.g., San Francisco).
  • Medical Records Integration: Users can upload dummy medical records (for demonstration purposes) which the AI agent processes.
  • Real-time Weather Consideration: The agent incorporates current weather conditions into its recommendations.
  • Follow-up Questions: Users can ask follow-up questions, such as about cholesterol levels, and receive relevant answers.

2. AI Model and Workflow:

  • AI Model: The application utilizes the Gemma3 model, developed by Google DeepMind and optimized for GPU performance.
  • Agent Workflow: The core of the application involves an agent workflow that orchestrates multiple specialized agents.
    • RAG Implementation: The application employs RAG to process uploaded medical records. This involves:
      • Document Retrieval: Identifying relevant medical records.
      • Chunking: Splitting documents into smaller, manageable segments.
      • Vectorization: Converting these chunks into numerical vectors to make them searchable by the AI.
      • Vector Store: Storing these vectorized chunks for efficient retrieval.
    • Specialized Agents:
      • Weather Agent: Fetches current weather data for the user's specified location.
      • Routine Analysis Agent: Processes the user's daily routine information.
      • Knowledge Agent: Analyzes the medical reports.
    • Collaboration: These agents work together to generate a comprehensive and personalized health plan.
    • Real-time Streaming: The generated plan is streamed in real-time to the user interface.
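
The RAG steps listed above (chunking, vectorization, vector store, retrieval) can be sketched in plain Python. This is a minimal illustration, not the application's code: the bag-of-words `vectorize` function is a stand-in for a real embedding model, the `VectorStore` class and the dummy record text are hypothetical, and the summary does not say which embedding model or vector store the developer used.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 12, overlap: int = 4) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def vectorize(text: str) -> Counter:
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory store holding (chunk, vector) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self.items.append((chunk, vectorize(chunk)))

    def search(self, query: str, k: int = 2) -> list[str]:
        """Return the k chunks most similar to the query."""
        q = vectorize(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# Dummy record, invented for this sketch only.
record = (
    "Patient: Jane Doe (dummy record). Total cholesterol 240 mg/dL, LDL 160 mg/dL. "
    "Blood pressure 118/76. Reports sitting at a desk for most of the day. "
    "No known allergies. Advised to increase aerobic activity."
)

store = VectorStore()
store.add(chunk_text(record))
top = store.search("What is my cholesterol level?", k=1)
```

The retrieved chunk in `top` would then be placed in the LLM prompt alongside the user's follow-up question, which is the "augmented" part of retrieval-augmented generation.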

3. Technical Implementation on Google Cloud Run:

  • Two-Service Architecture: The application is split into two distinct services on Cloud Run to cater to different hardware and scaling requirements:
    • smart-health-app-cpu Service:
      • Purpose: Hosts the user interface (UI) and handles user input.
      • Hardware: Utilizes CPUs only, no GPUs.
      • Technology: A traditional web application.
    • ollama-gemma Service:
      • Purpose: Runs the Gemma3 LLM using Ollama.
      • Hardware: Leverages NVIDIA L4 GPUs for accelerated AI processing.
      • Ollama Integration: Ollama simplifies the deployment and execution of the Gemma3 model.
  • GPU Utilization: The ollama-gemma service is configured to use a single NVIDIA L4 GPU, as observed in the Google Cloud Console.
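
To illustrate how the CPU service might call the GPU service, here is a minimal sketch that posts a prompt to Ollama's `/api/generate` endpoint using only the standard library. The service URL is a hypothetical placeholder, and the `gemma3:4b` model tag is an assumption, since the summary does not state which Gemma3 size was deployed. A Cloud Run service that requires authentication would also need an identity token on the request, which is omitted here.

```python
import json
import urllib.request

# Hypothetical URL of the ollama-gemma Cloud Run service (placeholder, not from the source).
OLLAMA_URL = "https://ollama-gemma-EXAMPLE.a.run.app"
# Assumed model tag; the source only says "Gemma3".
MODEL_TAG = "gemma3:4b"


def build_generate_request(
    prompt: str, base_url: str = OLLAMA_URL, model: str = MODEL_TAG
) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the GPU service and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # With "stream": False, Ollama returns one JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

Keeping the request construction separate from the network call makes the payload easy to inspect without a running GPU service.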

Developer Experience and Tooling

This section discusses the tools and frameworks used in building the application and the overall developer experience.

1. Agent Orchestration:

  • LangGraph Library: The developer orchestrated the agents with the langgraph library, which models a workflow as a state machine whose nodes (here, the agents) read from and write to a shared state.
  • Alternative: Google's ADK: Google's Agent Development Kit (ADK) was mentioned as another viable option for agent orchestration, but the developer's familiarity with langgraph led to its selection.
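
As a rough sketch of this kind of orchestration, the example below wires three placeholder agents and a planner into a LangGraph `StateGraph`. The state fields, node names, and agent bodies are all hypothetical; the real agents would call a weather API, the vector store, and the Gemma3 model, and the developer's actual graph layout is not described in the summary. The `run_sequentially` helper is an assumption added so the node logic can be exercised without langgraph installed.

```python
from typing import TypedDict


class HealthState(TypedDict, total=False):
    """Shared state passed between agents (field names are illustrative)."""
    location: str
    routine: str
    weather: str
    routine_summary: str
    record_notes: str
    plan: str


def weather_agent(state: HealthState) -> dict:
    # Placeholder: a real agent would fetch live weather for the location.
    return {"weather": f"[weather lookup for {state['location']}]"}


def routine_agent(state: HealthState) -> dict:
    # Placeholder: a real agent would ask the LLM to analyze the routine.
    return {"routine_summary": f"[analysis of routine: {state['routine']}]"}


def knowledge_agent(state: HealthState) -> dict:
    # Placeholder: a real agent would query the vector store of medical records.
    return {"record_notes": "[notes retrieved from uploaded records]"}


def planner(state: HealthState) -> dict:
    # Combine the three agents' outputs into one plan.
    parts = [state["weather"], state["routine_summary"], state["record_notes"]]
    return {"plan": " | ".join(parts)}


def build_graph():
    """Wire the agents into a LangGraph state machine (requires `pip install langgraph`)."""
    from langgraph.graph import StateGraph, START, END

    graph = StateGraph(HealthState)
    # Node names must differ from the state keys, hence the "_agent" suffixes.
    graph.add_node("weather_agent", weather_agent)
    graph.add_node("routine_agent", routine_agent)
    graph.add_node("knowledge_agent", knowledge_agent)
    graph.add_node("planner", planner)
    graph.add_edge(START, "weather_agent")
    graph.add_edge("weather_agent", "routine_agent")
    graph.add_edge("routine_agent", "knowledge_agent")
    graph.add_edge("knowledge_agent", "planner")
    graph.add_edge("planner", END)
    return graph.compile()


def run_sequentially(state: HealthState) -> HealthState:
    """Plain-Python equivalent of the linear graph, for testing without langgraph."""
    merged = dict(state)
    for node in (weather_agent, routine_agent, knowledge_agent, planner):
        merged.update(node(merged))
    return merged
```

With langgraph installed, `build_graph().invoke({"location": "San Francisco", "routine": "desk job"})` would run the same chain through the compiled graph.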

2. Model Hosting:

  • Self-Hosting Gemma3: The application hosts its own copy of the Gemma3 model rather than relying on an external API like Gemini.
  • Reasons for Self-Hosting:
    • Control: Provides greater control over the model's execution.
    • Customization: Makes it possible to fine-tune the model, for example in a federated learning setup.
  • External API Alternative: While calling the Gemini API is suitable for many applications, self-hosting is preferred for scenarios requiring more granular control.

3. User Interface (UI) Development:

  • Gradio Library: The UI was built using the Gradio library, which allows for the definition of web UIs directly in Python. This simplifies the process of creating interactive interfaces for AI models.
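
A minimal sketch of a Gradio UI with streamed output might look like the following. Gradio treats a generator function as a streaming handler and refreshes the output on each `yield`. The function names, labels, and placeholder plan text are hypothetical, and in the real application the pieces would come from the agent workflow rather than a fixed list.

```python
from typing import Iterator


def stream_plan(routine: str, location: str) -> Iterator[str]:
    """Yield the plan as it grows; each yielded value replaces the text shown so far."""
    # Placeholder pieces; the real app would stream text from the agents.
    pieces = [
        f"Plan for someone in {location}. ",
        f"Routine considered: {routine}. ",
        "[recommendations from the agents would appear here]",
    ]
    shown = ""
    for piece in pieces:
        shown += piece
        yield shown


def build_ui():
    """Create the Gradio interface (requires `pip install gradio`)."""
    import gradio as gr

    return gr.Interface(
        fn=stream_plan,
        inputs=[gr.Textbox(label="Daily routine"), gr.Textbox(label="Location")],
        outputs=gr.Textbox(label="Wellness plan"),
        title="Smart Health Agent (sketch)",
    )


# To serve it, for example inside a container:
# build_ui().launch(server_name="0.0.0.0", server_port=8080)
```

Importing gradio inside `build_ui` keeps the streaming handler usable and testable on its own.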

4. Developer Experience:

  • Serverless Benefits: The use of Cloud Run provided a serverless experience, eliminating the need for manual GPU reservation or infrastructure provisioning.
  • Seamless Integration: NVIDIA GPUs integrate with both the open-source AI ecosystem and Google Cloud's infrastructure, which simplified the developer workflow.

Conclusion and Key Takeaways

The video demonstrates NVIDIA GPUs on Google Cloud Run through a smart health agent that uses an LLM to generate personalized recommendations. The key takeaway is that GPU hardware, serverless infrastructure (Cloud Run), and open tools (Gemma3, Ollama, LangGraph, Gradio) can be combined to build this kind of agent. Splitting the application into a CPU service and a GPU service lets each part run on the hardware it needs. The developer reported that the serverless environment made the application straightforward to build and deploy.
