How to build a data agent with BigQuery and CloudSQL
By Google Cloud Tech
Transforming Battle Reports into a Powerful Knowledge Engine: A Data Engineer's Mission
Key Concepts:
- RAG (Retrieval-Augmented Generation): A framework for enhancing LLM responses with information retrieved from an external knowledge source.
- Vector Embeddings: Numerical representations of text data capturing semantic meaning, enabling similarity searches.
- BigQuery (BQML): Google Cloud’s serverless, highly scalable, and cost-effective multi-cloud data warehouse with built-in machine learning capabilities.
- Cloud SQL for PostgreSQL with PGVector: A fully-managed relational database service with the PGVector extension for efficient vector storage and search.
- Dataflow: Google Cloud’s managed service for executing Apache Beam pipelines, enabling scalable data processing.
- ELT (Extract, Load, Transform): A data integration process where data is first loaded into the target system and then transformed.
- HNSW (Hierarchical Navigable Small World): An indexing algorithm used for fast approximate nearest neighbor search in vector databases.
- Cosine Distance: A metric used to measure the similarity between two vectors, commonly used in semantic search.
I. Initial Data Wrangling & Structuring with BigQuery
The mission begins with transforming unstructured battle reports stored in Google Cloud Storage into a structured format suitable for analysis. This is achieved using BigQuery and Gemini through an ELT approach.
- External Table Creation: An external table is created in BigQuery, acting as a “magic lens” to directly query the raw text files in Cloud Storage without importing them. This leverages BigQuery’s schema-on-read capability.
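As a minimal sketch, the external table over the raw reports in Cloud Storage might be declared like this (project, dataset, and bucket names are hypothetical, and the exact format options depend on how the report files are stored):

```sql
-- Hypothetical names throughout. The table definition only points at the
-- files in Cloud Storage; no data is imported (schema-on-read).
CREATE EXTERNAL TABLE `my_project.bestiary_data.raw_battle_reports`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-battle-reports-bucket/reports/*.txt']
);
```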
- BQML Remote Model & Gemini Integration: A BQML remote model is created, and the `ML.GENERATE_TEXT` function calls the Gemini model directly from a SQL query. This transforms the unstructured reports into JSON objects describing monsters, battles, and adventurers.
- JSON Parsing & Table Creation: A script parses, cleans, and normalizes the JSON, populating three separate tables: `monsters`, `adventurers`, and `battles`. These hold the structured data extracted from the original reports.
- Strategic Analysis via Joins: The three tables are joined into a `champions_greatest_feat` table, answering complex questions such as which was the most powerful monster each adventurer defeated and how long the battle took. An example query reveals that Ara is particularly effective against perfectionism.
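The remote-model step can be sketched as follows. All names here are illustrative (the Vertex AI connection and endpoint must already exist with the right permissions); `ML.GENERATE_TEXT` is the BQML function that routes the prompt to Gemini:

```sql
-- Hypothetical model, connection, and endpoint names.
CREATE OR REPLACE MODEL `bestiary_data.gemini_model`
  REMOTE WITH CONNECTION `us.vertex-ai-connection`
  OPTIONS (endpoint = 'gemini-2.0-flash');

-- Ask Gemini to structure each raw report as JSON, all from SQL.
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `bestiary_data.gemini_model`,
  (SELECT CONCAT('Extract monster, battle, and adventurer details as JSON: ',
                 raw_report) AS prompt
   FROM `bestiary_data.raw_battle_reports`),
  STRUCT(TRUE AS flatten_json_output)
);
```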
II. Enhancing Semantic Understanding with Vector Embeddings
While structured tables are useful, they lack the nuanced understanding needed for specific queries. The next step involves building a RAG pipeline to unlock deeper semantic meaning through vector embeddings.
- Chunking: The text is split into smaller, meaningful sentences with the `SPLIT` function, creating a `chunked_intel` table. Choosing a sensible chunking strategy is highlighted as important.
- Embedding Generation with BQML: The `ML.GENERATE_EMBEDDING` function generates a vector embedding for each chunk in the `chunked_intel` table. The embeddings are stored in a new `embedded_intel` table, forming the initial knowledge base.
- Semantic Search in BigQuery: A prompt such as "what are the tactics against a foe that causes paralysis?" is converted into an embedding with the same BQML function. BigQuery then performs a vector search, computing the cosine distance between the prompt embedding and each embedding in the `embedded_intel` table, and returns the rows with the smallest cosine distance (highest similarity).
- Cosine Distance Explained: Cosine distance is favored for semantic search because it finds texts with similar meaning even when they use different words. A smaller distance indicates greater semantic alignment.
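A hedged sketch of the embedding and search steps, assuming an embedding remote model (hypothetical names) has already been created the same way as the text model:

```sql
-- Build the knowledge base: one embedding per chunk (names are illustrative).
CREATE OR REPLACE TABLE `bestiary_data.embedded_intel` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `bestiary_data.embedding_model`,
  (SELECT chunk_text AS content FROM `bestiary_data.chunked_intel`)
);

-- Embed the question and return the closest chunks by cosine distance.
SELECT base.content, distance
FROM VECTOR_SEARCH(
  TABLE `bestiary_data.embedded_intel`, 'ml_generate_embedding_result',
  (SELECT ml_generate_embedding_result
   FROM ML.GENERATE_EMBEDDING(
     MODEL `bestiary_data.embedding_model`,
     (SELECT 'what are the tactics against a foe that causes paralysis?' AS content))),
  top_k => 3, distance_type => 'COSINE'
);
```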
III. Transitioning to an Operational Database: Cloud SQL for PostgreSQL
BigQuery is excellent for large-scale analytics, but a faster, more responsive database is needed for real-time agent interactions. The solution is Cloud SQL for PostgreSQL with the PGVector extension.
- PGVector Extension: The PGVector extension is added to the Cloud SQL instance, enabling the storage and querying of vector embeddings.
- Ancient Scrolls Table: An `ancient_scrolls` table is created to store both the original battle report text and its corresponding vector embedding.
- Manual Test & Indexing: A manual test verifies the connection and query logic before any pipeline is built. An HNSW index is then created on the `ancient_scrolls` table to significantly accelerate vector search; the execution-time difference with and without the index is demonstrably large.
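On the Cloud SQL side, the setup might look like the following (column names and the embedding dimensionality are assumptions; the dimension must match the embedding model used):

```sql
-- Enable pgvector on the instance's database.
CREATE EXTENSION IF NOT EXISTS vector;

-- Store each report alongside its embedding (768 dims is an assumption).
CREATE TABLE ancient_scrolls (
  id          bigserial PRIMARY KEY,
  scroll_text text NOT NULL,
  embedding   vector(768)
);

-- HNSW index for fast approximate nearest-neighbor search under cosine distance.
CREATE INDEX ON ancient_scrolls USING hnsw (embedding vector_cosine_ops);
```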
IV. Automating the Pipeline with Dataflow
To handle large volumes of data and automate the embedding process, a Dataflow pipeline is constructed.
- Pipeline Components: The pipeline has two main components: `Embed Text Batch` (generates embeddings using Vertex AI's Gemini model) and `Write Essence to Spellbook` (inserts rows into the Cloud SQL database).
- Pipeline Flow: The pipeline reads files from Cloud Storage, batches them, generates embeddings with `Embed Text Batch`, and saves the text and embeddings to the `ancient_scrolls` table with `Write Essence to Spellbook`.
- Scalability & Automation: Dataflow enables scalable, automated processing of both existing and newly arriving battle reports.
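The two transforms can be sketched framework-free as plain functions. In the actual pipeline each would be an `apache_beam` `DoFn`, the embedding call would go to Vertex AI rather than the stand-in used here, and the writes would target Cloud SQL; every name below is illustrative:

```python
# Framework-free sketch of the two Dataflow transforms described above.
# Real pipeline: apache_beam DoFns, Vertex AI embeddings, Cloud SQL writes.

def embed_text_batch(reports, embed_fn):
    """'Embed Text Batch': turn a batch of report texts into (text, vector) pairs."""
    return [(text, embed_fn(text)) for text in reports]

def write_essence_to_spellbook(rows, db):
    """'Write Essence to Spellbook': append rows to the ancient_scrolls store."""
    for text, vector in rows:
        db.append({"scroll_text": text, "embedding": vector})

# Usage with a stand-in embedding function (real pipeline: Vertex AI model).
fake_embed = lambda text: [float(len(text)), 0.0]
db = []
rows = embed_text_batch(["The hydra reared its heads.", "Ara struck true."], fake_embed)
write_essence_to_spellbook(rows, db)
print(len(db))  # 2
```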
V. Crafting the Master RAG Agent
Finally, all the components are integrated to create a powerful RAG agent.
- Agent Architecture: The agent follows a traditional RAG loop: instructions, model, and tools.
- Grimoire Lookup Tool: The `grimoire_lookup` tool retrieves relevant information from the Cloud SQL database by converting the user's question into an embedding, searching the `ancient_scrolls` table, and returning the three most similar scrolls.
- Augmentation & Generation: The retrieved scrolls augment the original question, giving the Gemini model the context to generate a final, grounded answer.
- Example Query: A query about the "hydra of scope creep" demonstrates the agent’s ability to provide context-aware information.
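The retrieval step of the lookup tool reduces to a short pgvector query; as a sketch, assuming `:query_embedding` carries the embedded user question as a bound parameter:

```sql
-- Hypothetical sketch: <=> is pgvector's cosine-distance operator, so
-- ordering ascending returns the most semantically similar scrolls first.
SELECT scroll_text
FROM ancient_scrolls
ORDER BY embedding <=> :query_embedding
LIMIT 3;
```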
Notable Quote:
“As a data engineer, our primary weapon is knowledge. We transform raw chaotic data into actionable intelligence.”
Conclusion:
This mission successfully demonstrates the power of combining BigQuery, Gemini, Cloud SQL, Dataflow, and RAG principles to transform unstructured data into a valuable knowledge engine. The resulting agent is capable of providing nuanced, context-aware answers, empowering users to overcome challenges and achieve victory. The emphasis on automation, scalability, and efficient vector search highlights the critical role of data engineering in building intelligent applications. The lab link provided encourages further exploration and experimentation with these technologies.