AgentOps: Operationalize AI Agents
By Google Cloud Tech
Key Concepts
- DevOps: Foundation of operational practices focusing on development best practices like repositories, pipelines, and automated testing.
- MLOps: Extension of DevOps for productionizing machine learning solutions, addressing the non-deterministic nature of ML.
- GenAI Ops: Umbrella term for operationalizing generative AI applications, encompassing prompt engineering, agent operations, and retrieval-augmented generation (RAG).
- Foundation Model Operations: Operationalizing the development and deployment of large foundation models.
- Prompt Engineering: Designing effective prompts for foundation models.
- Agent Ops: Operationalizing AI agents, including tool selection, function calling, and memory management.
- RAG (Retrieval-Augmented Generation): Enhancing LLMs with external knowledge retrieval for improved context and accuracy.
- Tool Registry: Centralized catalog for storing and managing tools (functions, APIs) used by AI agents.
- Agent Frameworks: Tools that simplify agent development by integrating models, tools, memory, and other components.
- Multi-Agent Systems: Systems composed of multiple specialized agents working together to achieve complex tasks.
1. Ops Definitions and Their Evolution
- DevOps: Emphasizes repository usage, pipelines, and automated testing.
- MLOps: Extends DevOps to handle the non-deterministic nature of machine learning, requiring model evaluation and specialized technologies.
- GenAI Ops: Focuses on productionizing AI-powered applications built on foundation models, encompassing prompt engineering, agent operations, and RAG.
- Foundation Model Operations: Pertains to the operationalization of foundation models themselves, typically by providers like Google.
- GenAI Ops as an Umbrella: Encompasses various sub-domains like "PromptOps," "AgentOps," and "RAG Ops," acknowledging the evolving landscape of generative AI.
2. MLOps: People, Process, and Technology
- Focus on People and Processes First: Technology should be derived from understanding the needs of people and processes.
- MLOps Platform Goals: Reduce time to value, ensure security, support private clouds, and standardize repositories and CI/CD pipelines.
- Templatization: Aim for roughly 80% templatization to ensure consistent practices across the organization.
- MLOps Environments:
- Cloud Architects and Security: Responsible for infrastructure, networking, and security.
- Data Engineering: Ingest, pre-process, catalog, and share data.
- Data Science: Experiment with models and collaborate with ML engineers.
- ML Governance: Centralized repository for models, data, performance metrics, and artifacts.
- MLOps Architecture Design:
- Infrastructure Project: Managed by architects and engineers, using Terraform for VPCs, networking, and IAM roles.
- Data Lake Project: Data engineers create ETL pipelines and make data available.
- Computational Layer: Data scientists experiment in a sandbox environment.
- Development, Staging, and Production Projects: Productionize ML solutions, with CI/CD pipelines for pre-processing, training, and post-processing.
- Model Registry: Stores model versions and metadata within the governance project (see the registration sketch after this list).
- Model Productionization: Models can be deployed as pre-baked artifacts or by productionizing the source code of the pipelines.
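To make the model-registry step concrete, here is a minimal sketch of registering a pre-baked model artifact in a central registry. It assumes the Vertex AI SDK (`google-cloud-aiplatform`); the project ID, artifact URI, serving image, and labels are all placeholders, not values from the video.

```python
# Minimal sketch: registering a model version in a central registry.
# Assumes the Vertex AI SDK; all IDs, URIs, and labels are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="governance-project", location="us-central1")

# Upload the pre-baked artifact produced by the training pipeline.
model = aiplatform.Model.upload(
    display_name="churn-classifier",
    artifact_uri="gs://my-bucket/models/churn-classifier/v3",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
    labels={"team": "data-science", "stage": "staging"},
)
print(model.resource_name, model.version_id)
```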
3. GenAI Ops: Specific Processes and Workflows
- Consumer vs. Provider/Fine-Tuner: GenAI developers primarily consume models, while providers and fine-tuners train them.
- Model Selection: Filter models based on terms and conditions, EULAs, and leaderboard performance.
- Deep Evaluation: Conduct use-case-specific evaluations using internal data, considering precision, latency, and cost.
- GenAI Application Layer: Includes prompt engineers and AI engineers who collaborate to optimize foundation model usage.
- Prompt Catalog: A repository for storing, versioning, and managing prompts used for model evaluation and testing.
- Evaluation Steps:
- Product owner defines use case.
- AI engineer and prompt engineer select top models and create initial prompts.
- Extend prompt catalog with hundreds of prompts.
- Evaluate models based on precision, speed, and cost.
- Synthetic Prompts: GenAI can be used to generate synthetic prompts for evaluation, especially when labeled data is scarce.
- LLMs as Judges/Evaluators: LLMs can evaluate model outputs in place of human evaluators.
- Back-End Components (see the sketch after this list):
- Guardrails: Filter inputs and outputs to prevent irrelevant questions or toxic responses.
- Caching Mechanisms: Store frequent answers to reduce LLM calls.
- Context Retrieval: Implement RAG or agents to gather real-time context.
- Rating Mechanism: Translate user feedback into prompts for further model testing.
- Monitoring: Continuously check for toxicity, hallucination, and grounding.
- Front-End: User interface for interacting with the AI application.
- GenAI Ops Architecture Design:
- Extends the MLOps architecture with GenAI application development, staging, and production projects.
- Prompt engineers and AI developers experiment in the development project.
- The back-end implements guardrails (e.g., using Model Armor in GCP), RAG, and connections to data tools.
- CI/CD pipelines promote code to staging and production.
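To illustrate the back-end components, the following is a framework-agnostic sketch of a request path that applies an input guardrail, consults a cache, calls the model, and filters the output. `call_model` and the keyword lists are stand-ins: a real deployment would call a foundation model and a managed safety service (such as Model Armor on GCP) rather than keyword checks.

```python
# Illustrative request path: guardrails -> cache -> model -> output filter.
# call_model() is a stub; a real back-end would call a foundation model and
# a managed safety service instead of these placeholder keyword checks.
import hashlib

BLOCKED_TOPICS = {"off-topic-keyword"}   # placeholder input guardrail
TOXIC_MARKERS = {"toxic-marker"}         # placeholder output guardrail
_cache: dict[str, str] = {}              # in production: a shared cache service

def call_model(prompt: str) -> str:
    return f"stubbed answer for: {prompt}"  # stand-in for the LLM call

def handle_request(user_input: str) -> str:
    # 1. Input guardrail: reject irrelevant or disallowed questions early.
    if any(t in user_input.lower() for t in BLOCKED_TOPICS):
        return "Sorry, I can only help with questions about our product."

    # 2. Cache: serve frequent answers without an LLM call.
    key = hashlib.sha256(user_input.encode()).hexdigest()
    if key in _cache:
        return _cache[key]

    # 3. Model call (context retrieval via RAG or agents would happen here).
    answer = call_model(user_input)

    # 4. Output guardrail: filter toxic or ungrounded responses.
    if any(t in answer.lower() for t in TOXIC_MARKERS):
        return "I can't provide an answer to that."

    _cache[key] = answer
    return answer

print(handle_request("How do I reset my password?"))
```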
4. Agent Ops: Operationalizing AI Agents
- Agent Definition: At its core, an agent is a model paired with a prompt that instructs it how to call different tools.
- Building an Agent (see the sketch after this list):
- Define tools as functions with parameters.
- Combine tools with a selected model and instructions.
- Execute the agent by providing an input query.
- The agent calls the appropriate function, retrieves the result, and generates a final response.
- Agent Evaluation:
- Tool Selection: Evaluate the success rate of tool selection and parameter creation.
- Answer Quality: Evaluate the accuracy and grounding of the final answer.
- Operational Metrics: Evaluate speed (latency) and cost.
- Prompt Catalog Extension: Extend the prompt catalog to include tools, tool calls, and parameters for agent evaluation.
- Tool Design Optimization:
- Provide proper descriptions of functions and parameters.
- Select tools that perform specific and non-overlapping tasks.
- Tool Registry: A centralized catalog for storing and managing tools, including metadata, authentication, and authorization details.
- Tool Types:
- Code Tools: Python or other language functions.
- API Tools: APIs in private or public clouds.
- Data Tools: Tools that access databases.
- Standardized Repository Structure: Organize code into folders for tools, agents, evaluation, and deployment.
- CI/CD Pipeline for Agents: Automate the process of building, testing, and deploying agents.
- Agent Ops Architecture Design:
- Integrate the agent into the back-end or deploy it as a separate service.
- Store tool details in the tool registry.
- Extend repositories to accommodate agents and tools.
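The sketch below ties this section together: tools defined as functions, a small tool registry with metadata, an agent step that routes a query to a tool, and a tool-selection success-rate check against labeled prompts. The model's tool choice is stubbed with keyword matching; in a real agent, a foundation model's function-calling API would return the tool name and parameters. All names and data here are illustrative.

```python
# Illustrative agent loop with a tool registry and tool-selection evaluation.
# select_tool() stubs the model's function-calling step with keyword matching;
# a real agent would let the foundation model choose the tool and parameters.
from dataclasses import dataclass, field
from typing import Callable

def get_weather(city: str) -> str:
    """Return current weather for a city (stubbed data tool)."""
    return f"Sunny in {city}"

def get_exchange_rate(base: str, target: str) -> str:
    """Return an FX rate (stubbed API tool)."""
    return f"1 {base} = 0.92 {target}"

@dataclass
class ToolEntry:
    fn: Callable
    description: str                              # used by the model to pick a tool
    metadata: dict = field(default_factory=dict)  # auth/authz details, owner, ...

TOOL_REGISTRY: dict[str, ToolEntry] = {
    "get_weather": ToolEntry(get_weather, "Current weather for a city"),
    "get_exchange_rate": ToolEntry(get_exchange_rate, "FX rate between currencies"),
}

def select_tool(query: str) -> tuple[str, dict]:
    # Stand-in for the model's function call: returns (tool name, parameters).
    if "weather" in query.lower():
        return "get_weather", {"city": "Zurich"}
    return "get_exchange_rate", {"base": "USD", "target": "EUR"}

def run_agent(query: str) -> str:
    name, params = select_tool(query)
    result = TOOL_REGISTRY[name].fn(**params)  # execute the chosen tool
    return f"Based on {name}: {result}"        # the model would phrase the final answer

# Tool-selection evaluation: success rate against a labeled prompt catalog.
eval_set = [
    ("What's the weather in Zurich?", "get_weather"),
    ("How many euros per dollar?", "get_exchange_rate"),
]
hits = sum(select_tool(q)[0] == expected for q, expected in eval_set)
print(run_agent("What's the weather in Zurich?"))
print(f"tool-selection success rate: {hits / len(eval_set):.0%}")
```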
5. Memory and Multi-Turn Conversations
- Multi-Turn Interactions: Agents can engage in multiple loops of function calls to gather enough information.
- Memory: Keeps track of events within an agent's interaction.
- Evaluation of Multi-Turn Interactions:
- Evaluate single-turn performance.
- Check the sequence of function calls.
- Evaluate the final output.
- Check for topic relevance.
- Short-Term Memory: Stores recent interactions, located close to the agent (see the sketch after this list).
- Long-Term Memory: Stores completed interactions, located in the data lake.
- Long-Term Memory and RAG: Ingest long-term memory data into the RAG system for efficient context retrieval.
- Multi-Agent Systems: Parallelize multiple agents for specific tasks, requiring orchestration.
- Routing Among Agents: Use a router agent or other mechanisms to direct users to the appropriate sub-agent.
- Agent Template Catalog: Store code templates for different agent types to accelerate development.
- Agent Frameworks: Simplify agent development by integrating models, tools, memory, and other components.
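Below is a small sketch of the memory split described above: a bounded short-term buffer kept next to the agent, with completed sessions flushed to long-term storage (in practice a data lake, from which they could later be ingested into the RAG index). The `archive_session` call is a stub standing in for that storage layer.

```python
# Illustrative memory split: a bounded short-term buffer near the agent,
# with completed sessions flushed to long-term storage (e.g., a data lake,
# later ingested into the RAG index). archive_session() is a stub.
from collections import deque

class AgentMemory:
    def __init__(self, short_term_size: int = 10):
        # Short-term memory: recent events, kept in-process with the agent.
        self.short_term: deque = deque(maxlen=short_term_size)

    def record(self, role: str, content: str) -> None:
        self.short_term.append({"role": role, "content": content})

    def context(self) -> list[dict]:
        # What the agent sees on each turn of a multi-turn conversation.
        return list(self.short_term)

    def end_session(self) -> None:
        # Long-term memory: persist the completed interaction, then clear.
        archive_session(list(self.short_term))
        self.short_term.clear()

def archive_session(events: list[dict]) -> None:
    print(f"archiving {len(events)} events to the data lake")  # stub

memory = AgentMemory()
memory.record("user", "What's the weather in Zurich?")
memory.record("agent", "Sunny in Zurich")
memory.end_session()
```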
6. End-to-End Architecture and Resources
- Complete Architecture: Integrates GenAI application development, data lake, MLOps (for fine-tuning), and AI governance.
- Resources:
- Blog posts with detailed information and diagrams.
- End-to-end starter pack for productionizing agents.
7. Synthesis/Conclusion
The video provides a comprehensive overview of operationalizing AI agents, starting from the foundations of DevOps and MLOps, progressing to GenAI Ops, and culminating in Agent Ops. It emphasizes the importance of people, processes, and technology, and provides actionable insights into building, evaluating, and deploying AI agents in a robust and scalable manner. The discussion covers key concepts like prompt engineering, tool registries, memory management, and multi-agent systems, offering a practical guide for organizations looking to leverage AI agents in their applications. The provided resources, including blog posts and a starter pack, enable viewers to begin their journey towards productionizing AI agents.