Build an AI Agent knowledge base using SQL (BigQuery + Gemini)

Key Concepts

ETL (Extract, Transform, Load): The process of gathering data from various sources, converting it into a usable format, and storing it for analysis.
Unstructured Data: Information that lacks a pre-defined data model (e.g., PDFs, text files, Word documents).
Structured Data: Data organized into a defined format (e.g., JSON, SQL tables) that allows for efficient querying and analysis.
BigQuery External Tables: A feature that allows querying data stored in Google Cloud Storage (GCS) without moving or copying the files into BigQuery.
Gemini (AI Model): A generative AI model used here to perform the "transformation" step by extracting structured JSON from unstructured text.
RAG (Retrieval-Augmented Generation): A framework for improving LLM responses by grounding them in external data (to be covered in Part 2).
Cloud Shell: A browser-based terminal environment for managing Google Cloud resources.

1. Lab Setup and Environment Preparation

The presenters emphasize using a personal Gmail account rather than corporate/educational accounts to avoid permission restrictions.

Authentication: Users confirm which account is active with gcloud auth list and set the active project with gcloud config set project [PROJECT_ID].
Repository Cloning: Two repositories are used: Agentverse-Agent-Engineer (starter code) and Agentverse-Dungeon (for the final boss fight deployment).
API Enablement: Essential services including BigQuery, Cloud Storage, Vertex AI (the aiplatform API, used for Gemini), and Dataflow must be enabled.
IAM Permissions: The default service account is granted roles/storage.objectViewer (to read GCS files) and roles/aiplatform.user (to invoke Gemini models).
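
The setup steps above can be sketched as a short list of gcloud invocations. This is a minimal sketch, not the lab's actual script: the project ID and service account address are hypothetical placeholders, and the API service names are assumed from the products listed above. The code only builds and prints the commands; it does not run them.

```python
import shlex

# Hypothetical placeholders; substitute your own values.
PROJECT_ID = "my-project-id"
SERVICE_ACCOUNT = "my-service-account@my-project-id.iam.gserviceaccount.com"

# Service names assumed from the products named in the notes.
APIS = [
    "bigquery.googleapis.com",
    "storage.googleapis.com",
    "aiplatform.googleapis.com",
    "dataflow.googleapis.com",
]
ROLES = ["roles/storage.objectViewer", "roles/aiplatform.user"]


def setup_commands(project_id, service_account):
    """Return the gcloud invocations for the setup steps as argument lists."""
    commands = [
        ["gcloud", "auth", "list"],
        ["gcloud", "config", "set", "project", project_id],
        ["gcloud", "services", "enable", *APIS],
    ]
    # One IAM binding per role granted to the service account.
    for role in ROLES:
        commands.append([
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=serviceAccount:{service_account}",
            f"--role={role}",
        ])
    return commands


if __name__ == "__main__":
    for cmd in setup_commands(PROJECT_ID, SERVICE_ACCOUNT):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them in Cloud Shell before pasting and running them one at a time.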

2. The ETL Process with Gemini

The core of the lab is converting unstructured battle reports into structured, queryable data.

Step-by-Step Methodology:

Extraction & Loading: Instead of moving files, an External Table is created in BigQuery. This acts as a "pointer" to the GCS bucket, allowing real-time querying of the raw text files.
Transformation: A remote model backed by Gemini Flash is registered in BigQuery through a remote connection, so the model can be called directly from SQL.
Prompt Engineering: The model is prompted to parse the raw text and output a strictly formatted JSON object containing specific keys: monsters, battles, and adventurers.
Normalization: The resulting JSON is parsed using BigQuery’s JSON_EXTRACT functions to create individual, clean tables for each entity type.
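
The first three steps might look roughly like the BigQuery statements assembled below. This is a sketch under stated assumptions, not the lab's code. Every project, dataset, bucket, connection, and table name is hypothetical. So are the model version, the prompt wording, the temperature value, and the choice to read the text files as single-column CSV. The Python wrapper only builds the SQL strings so they can be inspected or passed to a client of your choice.

```python
# Hypothetical identifiers; none of these names come from the lab itself.
PROJECT = "my-project-id"
DATASET = "knowledge_base"
BUCKET_URI = "gs://my-bucket/raw_reports/*"
CONNECTION = "my-project-id.us-central1.my-connection"
ENDPOINT = "gemini-2.5-flash"  # assumed model version

# Assumed prompt wording; it must not contain single quotes (see CONCAT below).
PROMPT = (
    "Extract the entities from this battle report. Return only a JSON object "
    'with the keys "monsters", "battles", and "adventurers". Report: '
)


def build_statements():
    """Return three BigQuery statements: external table, remote model, transform."""
    raw = f"`{PROJECT}.{DATASET}.raw_reports`"
    model = f"`{PROJECT}.{DATASET}.gemini_flash`"

    # 1. Extraction and loading: a pointer to the files in GCS, no copy made.
    external_table = f"""
CREATE OR REPLACE EXTERNAL TABLE {raw} (raw_text STRING)
OPTIONS (
  format = 'CSV',
  field_delimiter = '|',  -- assumed never to occur in the report text
  uris = ['{BUCKET_URI}']
);"""

    # 2. Transformation: register Gemini as a remote model via the connection.
    remote_model = f"""
CREATE OR REPLACE MODEL {model}
  REMOTE WITH CONNECTION `{CONNECTION}`
  OPTIONS (endpoint = '{ENDPOINT}');"""

    # 3. Prompting: a low temperature keeps the JSON output predictable.
    transform = f"""
SELECT raw_text, ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL {model},
  (SELECT raw_text, CONCAT('{PROMPT}', raw_text) AS prompt FROM {raw}),
  STRUCT(0.2 AS temperature, 1024 AS max_output_tokens)
);"""

    return [s.strip() for s in (external_table, remote_model, transform)]


if __name__ == "__main__":
    for statement in build_statements():
        print(statement, end="\n\n")
```

The normalization step would then read ml_generate_text_result with BigQuery's JSON functions and write one table per entity type.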

3. Real-World Application

The presenters use a "gamified" scenario (adventurers fighting monsters) to illustrate a common data engineering challenge:

Scenario: A researcher has hundreds of PDF articles and needs to extract specific insights (e.g., monster hit points or battle outcomes) to perform statistical analysis.
Advantage: By converting unstructured text to structured tables, complex questions (such as "Which adventurer defeated the most powerful monster?") can be answered with a single standard SQL query, rather than by calling an LLM again for every question.
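
To make the advantage concrete, here is a small sketch of that question as a SQL join. It uses Python's built-in sqlite3 as a local stand-in for BigQuery. The table layout, the sample rows, and the choice of hit points as the measure of "most powerful" are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the normalized BigQuery tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monsters (
  monster_id INTEGER PRIMARY KEY, name TEXT, hit_points INTEGER
);
CREATE TABLE adventurers (
  adventurer_id INTEGER PRIMARY KEY, name TEXT
);
CREATE TABLE battles (
  battle_id INTEGER PRIMARY KEY,
  adventurer_id INTEGER REFERENCES adventurers(adventurer_id),
  monster_id INTEGER REFERENCES monsters(monster_id),
  outcome TEXT
);
-- Made-up sample rows.
INSERT INTO monsters VALUES (1, 'Goblin', 30), (2, 'Hydra', 250), (3, 'Wraith', 120);
INSERT INTO adventurers VALUES (1, 'Aria'), (2, 'Borin');
INSERT INTO battles VALUES
  (1, 1, 1, 'victory'),
  (2, 2, 2, 'defeat'),
  (3, 1, 3, 'victory'),
  (4, 2, 1, 'victory');
""")

# Which adventurer defeated the most powerful monster?
# "Most powerful" is taken to mean highest hit_points among won battles.
QUERY = """
SELECT a.name AS adventurer, m.name AS monster, m.hit_points
FROM battles AS b
JOIN adventurers AS a ON a.adventurer_id = b.adventurer_id
JOIN monsters AS m ON m.monster_id = b.monster_id
WHERE b.outcome = 'victory'
ORDER BY m.hit_points DESC
LIMIT 1;
"""

strongest_kill = conn.execute(QUERY).fetchone()
print(strongest_kill)  # ('Aria', 'Wraith', 120)
```

The Hydra has the most hit points in the sample data, but that battle was lost, so the query returns the Wraith. Once the data is in tables, a filter like this costs one query and no model call.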

4. Key Arguments and Technical Insights

Data Governance: Using external tables prevents the "data sprawl" associated with copying files across development, staging, and production environments.
Dynamic Updates: BigQuery external tables read from GCS at query time; if new files matching the table's URI pattern are added to the bucket, the next query picks them up without manual re-importing.
Temperature Settings: The presenters note that default temperatures are sufficient for most tasks. Higher temperatures (closer to 1.0) suit creative writing, while lower temperatures are preferred for predictable, structured data extraction.
Metadata Management: When using ML.GENERATE_TEXT, the output includes significant metadata (token counts, log probabilities). The presenters demonstrate that a secondary parsing step is necessary to isolate the desired JSON payload.
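
The secondary parsing step can be illustrated locally. The sketch below assumes a response envelope in which the generated text sits under candidates[0].content.parts[0].text, next to usage metadata, and that the model may wrap its JSON in a Markdown code fence. The envelope shape, field names, and sample values are assumptions for illustration; check them against the actual ML.GENERATE_TEXT output.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap around JSON

# Made-up payload in the shape the prompt asks for.
_inner = {
    "monsters": [{"name": "Hydra", "hit_points": 250}],
    "battles": [{"adventurer": "Aria", "monster": "Hydra", "outcome": "victory"}],
    "adventurers": [{"name": "Aria"}],
}

# Assumed envelope: generated text plus metadata the analysis does not need.
SAMPLE_RESULT = json.dumps({
    "candidates": [{
        "content": {
            "role": "model",
            "parts": [{"text": FENCE + "json\n" + json.dumps(_inner) + "\n" + FENCE}],
        },
        "finish_reason": "STOP",
    }],
    "usage_metadata": {"prompt_token_count": 120, "candidates_token_count": 45},
})


def extract_payload(result_json):
    """Pull the generated text out of the envelope and parse it as JSON."""
    envelope = json.loads(result_json)
    text = envelope["candidates"][0]["content"]["parts"][0]["text"].strip()
    # Remove an optional leading fence (with "json" tag) and trailing fence.
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.lower().startswith("json"):
            text = text[4:]
        if text.endswith(FENCE):
            text = text[:-len(FENCE)]
    return json.loads(text)


def split_entities(payload):
    """Return one list of rows per entity type, mirroring the three tables."""
    return {key: payload.get(key, []) for key in ("monsters", "battles", "adventurers")}


tables = split_entities(extract_payload(SAMPLE_RESULT))
print(tables["monsters"])  # [{'name': 'Hydra', 'hit_points': 250}]
```

In BigQuery the same two steps would be done in SQL with the JSON functions: first select the text field out of the result column, then parse that text into per-entity rows.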

5. Notable Quotes

"BigQuery external tables allow you to leave your data in one place... you can query it without having to move or copy the files." — Io
"We're using Gemini to convert unstructured data to structured data... so that you can eventually analyze this structured data." — Annie

6. Synthesis and Conclusion

The lab successfully demonstrates a modern data pipeline where generative AI acts as a transformation engine. By leveraging BigQuery’s ability to interface with GCS and Gemini, users can bridge the gap between raw, unstructured information and high-performance analytical databases.

Main Takeaway: The combination of BigQuery and Gemini allows for scalable, cost-effective transformation of unstructured data into structured insights. The presenters conclude by noting that while this covers structured analytics, the next episode will address semantic search and RAG, which are required for more nuanced, context-aware reasoning.