The Agent Factory - Episode 11: AI agents for data engineering and data science

By Google Cloud Tech

Key Concepts

  • AI Agents: Autonomous AI programs designed to perform tasks, often by interacting with other systems or data.
  • Gemini API: A platform for accessing Google's Gemini models, enabling various AI capabilities.
  • Computer Use Model (Gemini): A Gemini 2.5 model that can "see" a computer screen through screenshots and propose UI actions for a client to execute.
  • CodeMender: An autonomous AI agent focused on code security, offering both reactive patching and proactive code rewriting.
  • BigQuery Data Engineering Agent: An AI agent that assists data engineers by automating pipeline creation, data checks, and query generation using natural language.
  • Data Science Agent: An AI agent that helps data scientists with tasks like anomaly detection, data preprocessing, model training, and visualization using natural language.
  • ADK (Agent Development Kit): Google's open-source framework for building and deploying AI agents.
  • Spanner: A globally distributed, strongly consistent, and highly available database service.
  • Graph Database: A database that uses graph structures with nodes and edges to represent and store data.
  • RAG (Retrieval-Augmented Generation): A technique that combines retrieval of information from a knowledge base with generative AI models to produce more informed and accurate responses.
  • Dataform: A BigQuery service for defining and managing data engineering pipelines declaratively, with a lifecycle similar to software delivery pipelines.
  • Data Quality Assertions: Rules or checks implemented to ensure the accuracy, completeness, and consistency of data.
  • Time Dimension: A dataset that provides various temporal attributes (year, month, day, etc.) for time-based analysis.
  • Isolation Forest: An anomaly detection algorithm that isolates outliers by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature.
  • Nano Banana: The nickname of Google's Gemini image generation model, used in this episode to generate comics.

Agent Industry Pulse: New Releases

Gemini API with Computer Use Model

  • Main Topic: Introduction of the Gemini 2.5 Computer Use model in the Gemini API, enabling AI agents to "see" and act on a computer screen.
  • Key Points:
    • The model takes screenshots and determines the next UI action (click, scroll, type, open web page).
    • Developers write a local client to execute these actions.
    • The loop continues until the task is completed.
    • Enables automation of real-world browser tasks like form filling, data scraping, and user flow testing.
    • Includes robust safety layers with approval, blocking, or human confirmation for risky actions.
    • Ideal for research, testing, and prototyping visual agent interactions.
  • Example/Application: A demo shows the model navigating to the Gemini documentation page and then finding the pricing page.
  • Technical Terms: Computer Use, UI Action, Screenshot, Multimodal AI.
  • Quote: "Think of it as giving an agent a pair of eyes and hands on a computer." (Smitha Kolan)
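
The screenshot, action, execute loop described above can be sketched as follows. This is a minimal illustration, not the Gemini API: `FakeBrowser`, `scripted_model`, the `RISKY` set, and the `docs.example` URLs are all hypothetical stand-ins, and a real client would send image bytes to the model and drive an actual browser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    kind: str           # "click", "scroll", "type", "open_url", or "done"
    argument: str = ""


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for the local client that executes UI actions."""
    url: str = "about:blank"
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real client would return image bytes; a string keeps this runnable.
        return f"screen showing {self.url}"

    def execute(self, action: Action) -> None:
        self.log.append(f"{action.kind}:{action.argument}")
        if action.kind == "open_url":
            self.url = action.argument


def scripted_model(goal: str, screenshot: str) -> Action:
    """Hypothetical model stub: picks the next action from what it 'sees'."""
    if "about:blank" in screenshot:
        return Action("open_url", "docs.example/gemini")
    if "pricing" not in screenshot:
        return Action("open_url", "docs.example/gemini/pricing")
    return Action("done")


RISKY = {"type"}  # assumption: action kinds that need human confirmation


def run_agent(goal, browser, model, confirm=lambda action: True, max_steps=10):
    """Loop until the model reports completion or the step budget runs out."""
    for _ in range(max_steps):
        action = model(goal, browser.screenshot())
        if action.kind == "done":
            return True
        if action.kind in RISKY and not confirm(action):
            return False  # the human declined a risky action
        browser.execute(action)
    return False


browser = FakeBrowser()
finished = run_agent("find the pricing page", browser, scripted_model)
```

The `confirm` callback is where a safety layer of the kind mentioned above would plug in; here it approves everything by default.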

CodeMender: Autonomous AI Agent for Code Security

  • Main Topic: Introduction of CodeMender, an autonomous AI agent for code security.
  • Key Points:
    • Reactive Mode: Instantly patches new vulnerabilities as they are discovered.
    • Proactive Mode: Rewrites existing code to secure entire classes of flaws.
    • Addresses the challenge of human developers being unable to keep pace with AI-accelerated vulnerability discovery and patch volume.
    • Automates the creation and validation of high-quality security patches at scale.
    • Utilizes Gemini's reasoning power and sophisticated self-correction/validation tools (static analysis, fuzzing).
    • Ensures functional equivalence of patches through a multi-agent system.
    • Patches are currently human-reviewed.
  • Data/Statistics: Has already upstreamed 72 security fixes to open-source projects.
  • Technical Terms: Vulnerabilities, Zero-days, Patch Volume, Static Analysis, Fuzzing, Functional Equivalence.
  • Argument: AI is accelerating vulnerability discovery, but human developers struggle to keep up with patching, necessitating automated solutions like CodeMender.
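
One simple way to check that a patch preserves behavior is differential testing: run the original and patched code on many random valid inputs and compare the outputs. The sketch below illustrates that idea only. It is not how CodeMender is implemented, and `read_prefix_original`, `read_prefix_patched`, and `equivalent_on_valid_inputs` are hypothetical names invented for this example.

```python
import random


def read_prefix_original(buf: bytes, n: int) -> bytes:
    # Flaw: a negative n silently slices from the end instead of failing.
    return buf[:n]


def read_prefix_patched(buf: bytes, n: int) -> bytes:
    # Patch: reject the invalid length, keep valid behavior unchanged.
    if n < 0:
        raise ValueError("n must be non-negative")
    return buf[:n]


def equivalent_on_valid_inputs(old, new, trials=1000, seed=0):
    """Compare old and new on random valid inputs.

    Returns (True, None) if every trial agrees, otherwise
    (False, counterexample) for the first disagreement found.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randrange(0, 32)
        buf = bytes(rng.randrange(256) for _ in range(size))
        n = rng.randrange(0, len(buf) + 1)  # valid range only
        if old(buf, n) != new(buf, n):
            return False, (buf, n)
    return True, None


ok, counterexample = equivalent_on_valid_inputs(
    read_prefix_original, read_prefix_patched
)
```

Random testing of this kind can only find disagreements, it cannot prove equivalence, which is one reason to pair it with static analysis and human review.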

The Factory Floor: Data Agents in Action

BigQuery Data Engineering Agent

  • Main Topic: Demonstration of the BigQuery Data Engineering Agent for automating data pipeline creation and management.
  • Key Points:
    • Assists data engineers and analysts with massive datasets.
    • Takes natural language prompts via Gemini to perform tasks.
    • Can generate AI-powered SQL queries using the AI.GENERATE function in BigQuery.
    • Leverages Dataform for declarative pipeline definition and version control in a Git repository.
    • Supports tasks like adding new fields based on existing data, generating time dimensions, and creating data quality assertions.
  • Example/Application:
    • Prompt: "Using the account table that we just looked at, add a field to the account table that shows the sales region based on the billing country."
    • The agent creates a pipeline that uses AI.GENERATE to map countries (Argentina, Canada, Italy, Japan) to sales regions (North America, Latin America, EMEA, APAC).
    • Generates a time dimension table with various date attributes (date, year, quarter, month name, day name) for enhanced natural language to SQL queries.
    • Generates data quality assertions for tables, such as checks that IDs and account names are not null and that accounts are unique.
  • Step-by-Step Process (Pipeline Creation):
    1. Initiate a "new pipeline" in BigQuery.
    2. Use Gemini to generate SQL code, often by calling AI.GENERATE functions.
    3. Define the pipeline using Dataform's declarative language.
    4. Store system instructions in a create_instructions.yaml file.
    5. Declare tables and pipeline results in the definitions folder.
    6. Commit changes to a Git repository for version control and collaboration.
  • Technical Terms: AI.GENERATE, Dataform, Declarative Language, Git Repository, System Instructions, Time Dimension, Data Quality Assertions.
  • Quote: "I'm going to use AI to generate AI because BigQuery has the ability to call Gemini through the AI functions directly from SQL and I think this is mind-blowing." (Lucia Subatin)
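
To make the time dimension and data quality assertions concrete, here is a small plain-Python sketch of what such outputs contain. In the episode the agent generates these as SQL pipelines in BigQuery; the function names, column names, and sample rows below are assumptions made for illustration.

```python
from datetime import date, timedelta


def build_time_dimension(start: date, end: date) -> list:
    """One row per calendar day with common date attributes."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month_name": day.strftime("%B"),
            "day_name": day.strftime("%A"),
        })
        day += timedelta(days=1)
    return rows


def failing_not_null(rows: list, column: str) -> list:
    """Rows that violate a NOT NULL assertion on the given column."""
    return [row for row in rows if row.get(column) is None]


def failing_unique(rows: list, column: str) -> list:
    """Rows whose value in the given column was already seen."""
    seen, duplicates = set(), []
    for row in rows:
        value = row.get(column)
        if value in seen:
            duplicates.append(row)
        seen.add(value)
    return duplicates


# Hypothetical sample of an account table.
accounts = [
    {"id": 1, "name": "Acme"},
    {"id": 2, "name": None},      # violates name NOT NULL
    {"id": 2, "name": "Globex"},  # violates id uniqueness
]

time_dim = build_time_dimension(date(2025, 1, 1), date(2025, 1, 3))
null_names = failing_not_null(accounts, "name")
duplicate_ids = failing_unique(accounts, "id")
```

An assertion passes when its list of failing rows is empty, which mirrors the usual convention of writing an assertion as a query that should return no rows.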

Data Science Agent

  • Main Topic: Demonstration of the Data Science Agent for assisting with data analysis, anomaly detection, and model training.
  • Key Points:
    • Operates in Vertex AI Colab Enterprise.
    • Takes natural language prompts to perform data science tasks.
    • Can load, describe, and preprocess data.
    • Automates the training of models like Isolation Forest for anomaly detection.
    • Provides visualizations and summaries of findings.
    • Can identify anomalous data points and provide reasons or key findings.
    • Helps in understanding root causes of anomalies.
  • Example/Application:
    • Prompt: "Detect anomalies in the case table." (Later refined to specify the table).
    • The agent plans to load and describe data, preprocess it, train an Isolation Forest model, and provide visualizations.
    • Identified that 70% of the data set was anomalous, with specific findings about "word records" in the case table.
    • Provided insights and next steps, including understanding feature combinations of anomalous records and root causes.
  • Step-by-Step Process (Anomaly Detection):
    1. Provide a natural language prompt specifying the task (e.g., anomaly detection) and the target table.
    2. The agent generates a plan including data loading, preprocessing, model training, and visualization.
    3. Accept and run the generated steps.
    4. The agent trains a model (e.g., Isolation Forest).
    5. The agent displays anomalous data and provides a summary of findings, including the percentage of anomalous data and potential root causes.
  • Technical Terms: Anomaly Detection, Isolation Forest, Preprocessing, Boilerplate Code, Pandas DataFrames, BigQuery DataFrames.
  • Data/Statistics: 70% of the data set was identified as anomalous.
  • Argument: Automating boilerplate data science tasks like preprocessing and model training significantly saves time for data scientists and makes data science more accessible.
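
For readers unfamiliar with Isolation Forest, the following is a compact pure-Python sketch of the algorithm: build random trees on subsamples, then score each point by how quickly it gets isolated. It is an illustration only, not the code the Data Science Agent generates (which would more likely call an existing library); all function names and default parameters here are assumptions.

```python
import math
import random


def c_factor(n: int) -> float:
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # Euler-Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, rng, depth, max_depth):
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))      # random feature
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    if low == high:
        return {"size": len(points)}
    split = rng.uniform(low, high)               # random split value
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(point, node, depth=0):
    if "size" in node:  # leaf: add the expected depth of the unbuilt subtree
        return depth + c_factor(node["size"])
    branch = "left" if point[node["feature"]] < node["split"] else "right"
    return path_length(point, node[branch], depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Scores in (0, 1); values closer to 1 indicate likelier anomalies."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(points, sample_size), rng, 0, max_depth)
        for _ in range(n_trees)
    ]
    norm = c_factor(sample_size)
    scores = []
    for point in points:
        mean_depth = sum(path_length(point, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# Hypothetical data: a Gaussian cluster plus one far-away point.
data_rng = random.Random(1)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
points = cluster + [(12.0, 12.0)]
scores = anomaly_scores(points)
most_anomalous = scores.index(max(scores))
```

Which points count as anomalous depends on the score threshold (or contamination rate) chosen, so a reported anomaly percentage is best read alongside that setting.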

Connecting Agents to BigQuery and Spanner with ADK

  • Main Topic: Demonstrating how to use the ADK to connect AI agents to Spanner and BigQuery databases for data traversal and content generation.
  • Key Points:
    • Spanner: A globally distributed, consistent, and highly available database. Used here for its graph capabilities.
    • Graph Database: Data is represented as nodes and edges, allowing for traversal and knowledge extraction.
    • ADK: Used to build agents that interact with the Spanner graph database.
    • RAG Application: Agents traverse the graph database to retrieve information and then use that information to generate content (e.g., comics).
    • Nano Banana: An agent used to generate comics based on prompts derived from the retrieved data.
    • Iterative Image Generation: The system includes sub-agents to check image quality and text accuracy, with multiple iterations to refine the output.
  • Example/Application:
    • An agent is built using ADK to query a Spanner graph database containing Spanner documentation.
    • Prompt: "What are regions?"
    • The agent generates a graph query to traverse the knowledge graph.
    • The retrieved information about Spanner regions is used to generate a prompt for Nano Banana.
    • Nano Banana creates a six-panel comic strip explaining Spanner regions with characters Ada (a developer) and a robot.
    • The comic generation process includes checks for image quality and text clarity, with up to three iterations.
  • Step-by-Step Process (Comic Generation):
    1. Define an agent using ADK to interact with a Spanner graph database.
    2. Formulate a natural language question for the agent.
    3. The agent traverses the graph database to retrieve relevant information.
    4. The retrieved information is used to generate a detailed prompt for a comic generation agent (Nano Banana).
    5. Nano Banana generates a comic strip.
    6. Sub-agents check the quality and text accuracy of the generated comic.
    7. If necessary, the comic is regenerated with refinements based on the quality checks.
  • Technical Terms: Spanner, Graph Database, ADK, Knowledge Graph, Nano Banana, Iterations, Image Quality Checker, Text Accuracy.
  • Quote: "So now imagine if they had AI agents which could help them by automatically building pipelines, running checks, generating queries and even code for things like visualization and keeping everything tied in their warehouse or notebooks." (Smitha Kolan)
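
The comic generation steps above follow a retrieve, prompt, generate, check pattern that can be sketched without any cloud services. The code below is a toy illustration: the in-memory `GRAPH`, its placeholder text, and the stub generator and checker are all hypothetical, and it uses neither ADK, Spanner, nor an image model.

```python
from collections import deque

# Hypothetical toy knowledge graph; the text values are placeholders.
GRAPH = {
    "regions": {"text": "placeholder: what a region is",
                "edges": ["replication", "multi-region"]},
    "replication": {"text": "placeholder: how data is replicated",
                    "edges": ["regions"]},
    "multi-region": {"text": "placeholder: multi-region setups",
                     "edges": ["latency"]},
    "latency": {"text": "placeholder: latency trade-offs", "edges": []},
}


def traverse(graph, start, max_hops=1):
    """Breadth-first traversal collecting text from nodes within max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    facts = []
    while queue:
        node, hops = queue.popleft()
        facts.append(graph[node]["text"])
        if hops < max_hops:
            for neighbor in graph[node]["edges"]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
    return facts


def build_prompt(question, facts, panels=6):
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Draw a {panels}-panel comic answering: {question}\n{context}"


def generate_with_checks(prompt, generate, check, max_iterations=3):
    """Regenerate with checker feedback until it passes or attempts run out."""
    feedback = ""
    draft = None
    for attempt in range(1, max_iterations + 1):
        draft = generate(prompt, feedback)
        passed, feedback = check(draft)
        if passed:
            return draft, attempt
    return draft, max_iterations


def stub_generate(prompt, feedback):
    # Stand-in for an image model: "fixes" the text once it gets feedback.
    return "comic with clear text" if feedback else "comic with garbled text"


def stub_check(draft):
    # Stand-in for the quality and text-accuracy sub-agents.
    if "garbled" in draft:
        return False, "panel text is unreadable"
    return True, ""


facts = traverse(GRAPH, "regions")
prompt = build_prompt("What are regions?", facts)
comic, attempts = generate_with_checks(prompt, stub_generate, stub_check)
```

Capping the loop at three iterations matches the limit described in the demo and keeps a persistently failing check from regenerating forever.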

Developer Q&A

  • Question 1: Are the Data Science Agent and Data Engineering Agent generally available?
    • Answer: Both are currently in public preview. The Data Science Agent is available, and access to the Data Engineering Agent requires following a specific link.
  • Question 2: How scalable is the Data Engineering Agent for multiple tables and datasets, and what is the deployment strategy to higher environments?
    • Answer:
      • Scalability: The agent is built on BigQuery and Dataform, which are highly scalable. It can handle multiple tables and datasets as long as the executing pipeline has the necessary permissions. New tables and datasets need to be declared.
      • Deployment Strategy: Dataform is well suited to this. Dataform artifacts can be released and configured with workflows that deploy them to different projects, datasets, and environments (dev, staging, prod, QA).

Synthesis/Conclusion

This episode of the Agent Factory podcast highlights the transformative potential of AI agents in the data engineering and data science domains. The introduction of Gemini's Computer Use model marks a significant step towards multimodal AI that can interact with the digital world. CodeMender showcases advancements in AI-driven code security, addressing critical industry needs.

The core of the discussion revolves around practical applications of AI agents for data professionals. The BigQuery Data Engineering Agent and Data Science Agent, both in preview, demonstrate how natural language prompts can automate complex tasks like pipeline creation, data quality checks, anomaly detection, and model training. These agents build on BigQuery and Dataform, offering a scalable and version-controlled approach to data management.

Furthermore, the integration of ADK with databases like Spanner, particularly its graph capabilities, illustrates how agents can traverse complex data structures to retrieve information and generate creative content, such as comics. This showcases a powerful paradigm for building RAG applications and knowledge traversal agents. The discussion emphasizes that these tools are not just about automation but also about augmenting human capabilities, saving time, and making advanced data tasks more accessible. The Q&A session addresses practical concerns about availability and deployment, reinforcing the ongoing development and accessibility of these AI-powered solutions.