How to build a data agent with BigQuery and CloudSQL
By Google Cloud Tech
Transforming Battle Reports into a Powerful Knowledge Engine: A Data Engineer's Mission
Key Concepts:
- RAG (Retrieval-Augmented Generation): A framework for enhancing LLM responses with information retrieved from an external knowledge source.
- Vector Embeddings: Numerical representations of text data capturing semantic meaning, enabling similarity searches.
- BigQuery (BQML): Google Cloud’s serverless, highly scalable, and cost-effective multi-cloud data warehouse with built-in machine learning capabilities.
- Cloud SQL for PostgreSQL with PGVector: A fully-managed relational database service with the PGVector extension for efficient vector storage and search.
- Dataflow: Google Cloud’s managed service for executing Apache Beam pipelines, enabling scalable data processing.
- ELT (Extract, Load, Transform): A data integration process where data is first loaded into the target system and then transformed.
- HNSW (Hierarchical Navigable Small World): An indexing algorithm used for fast approximate nearest neighbor search in vector databases.
- Cosine Distance: A metric used to measure the similarity between two vectors, commonly used in semantic search.
I. Initial Data Wrangling & Structuring with BigQuery
The mission begins with transforming unstructured battle reports stored in Google Cloud Storage into a structured format suitable for analysis. This is achieved using BigQuery and Gemini through an ELT approach.
- External Table Creation: An external table is created in BigQuery, acting as a “magic lens” to directly query the raw text files in Cloud Storage without importing them. This leverages BigQuery’s schema-on-read capability.
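As a minimal sketch, the external table over the raw reports in Cloud Storage might be declared like this (project, dataset, and bucket names are hypothetical, and the exact format options depend on how the report files are stored):

```sql
-- Hypothetical names throughout. The table definition only points at the
-- files in Cloud Storage; no data is imported (schema-on-read).
CREATE EXTERNAL TABLE `my_project.bestiary_data.raw_battle_reports`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-battle-reports-bucket/reports/*.txt']
);
```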
- BQML Remote Model & Gemini Integration: A BQML remote model is created, and the `ML.GENERATE_TEXT` function calls the Gemini model directly from a SQL query. This transforms the unstructured reports into JSON objects describing monsters, battles, and adventurers.
- JSON Parsing & Table Creation: A script parses, cleans, and normalizes the JSON, populating three separate tables: `monsters`, `adventurers`, and `battles`. These hold the structured data extracted from the original reports.
- Strategic Analysis via Joins: The three tables are joined into a `champions_greatest_feat` table, answering complex questions such as which was the most powerful monster each adventurer defeated and how long the battle took. An example query reveals that Ara is particularly effective against perfectionism.
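The remote-model step can be sketched as follows. All names here are illustrative (the Vertex AI connection and endpoint must already exist with the right permissions); `ML.GENERATE_TEXT` is the BQML function that routes the prompt to Gemini:

```sql
-- Hypothetical model, connection, and endpoint names.
CREATE OR REPLACE MODEL `bestiary_data.gemini_model`
  REMOTE WITH CONNECTION `us.vertex-ai-connection`
  OPTIONS (endpoint = 'gemini-2.0-flash');

-- Ask Gemini to structure each raw report as JSON, all from SQL.
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `bestiary_data.gemini_model`,
  (SELECT CONCAT('Extract monster, battle, and adventurer details as JSON: ',
                 raw_report) AS prompt
   FROM `bestiary_data.raw_battle_reports`),
  STRUCT(TRUE AS flatten_json_output)
);
```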
II. Enhancing Semantic Understanding with Vector Embeddings
While structured tables are useful, they lack the nuanced understanding needed for specific queries. The next step involves building a RAG pipeline to unlock deeper semantic meaning through vector embeddings.
- Chunking: The text is split into smaller, meaningful sentences with the `SPLIT` function, creating a `chunked_intel` table. Choosing a sensible chunking strategy is highlighted as important.
- Embedding Generation with BQML: The `ML.GENERATE_EMBEDDING` function generates a vector embedding for each chunk in the `chunked_intel` table. The embeddings are stored in a new `embedded_intel` table, forming the initial knowledge base.
- Semantic Search in BigQuery: A prompt such as "what are the tactics against a foe that causes paralysis?" is converted into an embedding with the same BQML function. BigQuery then performs a vector search, computing the cosine distance between the prompt embedding and each embedding in the `embedded_intel` table, and returns the rows with the smallest cosine distance (highest similarity).
- Cosine Distance Explained: Cosine distance is favored for semantic search because it finds texts with similar meaning even when they use different words. A smaller distance indicates greater semantic alignment.
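A hedged sketch of the embedding and search steps, assuming an embedding remote model (hypothetical names) has already been created the same way as the text model:

```sql
-- Build the knowledge base: one embedding per chunk (names are illustrative).
CREATE OR REPLACE TABLE `bestiary_data.embedded_intel` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `bestiary_data.embedding_model`,
  (SELECT chunk_text AS content FROM `bestiary_data.chunked_intel`)
);

-- Embed the question and return the closest chunks by cosine distance.
SELECT base.content, distance
FROM VECTOR_SEARCH(
  TABLE `bestiary_data.embedded_intel`, 'ml_generate_embedding_result',
  (SELECT ml_generate_embedding_result
   FROM ML.GENERATE_EMBEDDING(
     MODEL `bestiary_data.embedding_model`,
     (SELECT 'what are the tactics against a foe that causes paralysis?' AS content))),
  top_k => 3, distance_type => 'COSINE'
);
```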
III. Transitioning to an Operational Database: Cloud SQL for PostgreSQL
BigQuery is excellent for large-scale analytics, but a faster, more responsive database is needed for real-time agent interactions. The solution is Cloud SQL for PostgreSQL with the PGVector extension.
- PGVector Extension: The PGVector extension is added to the Cloud SQL instance, enabling the storage and querying of vector embeddings.
- Ancient Scrolls Table: An `ancient_scrolls` table is created to store both the original battle report text and its corresponding vector embedding.
- Manual Test & Indexing: A manual test verifies the connection and query logic before any pipeline is built. An HNSW index is then created on the `ancient_scrolls` table to significantly accelerate vector search; the execution-time difference with and without the index is demonstrably large.
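On the Cloud SQL side, the setup might look like the following (column names and the embedding dimensionality are assumptions; the dimension must match the embedding model used):

```sql
-- Enable pgvector on the instance's database.
CREATE EXTENSION IF NOT EXISTS vector;

-- Store each report alongside its embedding (768 dims is an assumption).
CREATE TABLE ancient_scrolls (
  id          bigserial PRIMARY KEY,
  scroll_text text NOT NULL,
  embedding   vector(768)
);

-- HNSW index for fast approximate nearest-neighbor search under cosine distance.
CREATE INDEX ON ancient_scrolls USING hnsw (embedding vector_cosine_ops);
```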
IV. Automating the Pipeline with Dataflow
To handle large volumes of data and automate the embedding process, a Dataflow pipeline is constructed.
- Pipeline Components: The pipeline has two main components: `Embed Text Batch` (generates embeddings using Vertex AI's Gemini model) and `Write Essence to Spellbook` (inserts rows into the Cloud SQL database).
- Pipeline Flow: The pipeline reads files from Cloud Storage, batches them, generates embeddings with `Embed Text Batch`, and saves the text and embeddings to the `ancient_scrolls` table with `Write Essence to Spellbook`.
- Scalability & Automation: Dataflow enables scalable, automated processing of both existing and newly arriving battle reports.
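The two transforms can be sketched framework-free as plain functions. In the actual pipeline each would be an `apache_beam` `DoFn`, the embedding call would go to Vertex AI rather than the stand-in used here, and the writes would target Cloud SQL; every name below is illustrative:

```python
# Framework-free sketch of the two Dataflow transforms described above.
# Real pipeline: apache_beam DoFns, Vertex AI embeddings, Cloud SQL writes.

def embed_text_batch(reports, embed_fn):
    """'Embed Text Batch': turn a batch of report texts into (text, vector) pairs."""
    return [(text, embed_fn(text)) for text in reports]

def write_essence_to_spellbook(rows, db):
    """'Write Essence to Spellbook': append rows to the ancient_scrolls store."""
    for text, vector in rows:
        db.append({"scroll_text": text, "embedding": vector})

# Usage with a stand-in embedding function (real pipeline: Vertex AI model).
fake_embed = lambda text: [float(len(text)), 0.0]
db = []
rows = embed_text_batch(["The hydra reared its heads.", "Ara struck true."], fake_embed)
write_essence_to_spellbook(rows, db)
print(len(db))  # 2
```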
V. Crafting the Master RAG Agent
Finally, all the components are integrated to create a powerful RAG agent.
- Agent Architecture: The agent follows a traditional RAG loop: instructions, model, and tools.
- Grimoire Lookup Tool: The `grimoire_lookup` tool retrieves relevant information from the Cloud SQL database by converting the user's question into an embedding, searching the `ancient_scrolls` table, and returning the three most similar scrolls.
- Augmentation & Generation: The retrieved scrolls augment the original question, giving the Gemini model the context to generate a final, grounded answer.
- Example Query: A query about the "hydra of scope creep" demonstrates the agent’s ability to provide context-aware information.
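The retrieval step of the lookup tool reduces to a short pgvector query; as a sketch, assuming `:query_embedding` carries the embedded user question as a bound parameter:

```sql
-- Hypothetical sketch: <=> is pgvector's cosine-distance operator, so
-- ordering ascending returns the most semantically similar scrolls first.
SELECT scroll_text
FROM ancient_scrolls
ORDER BY embedding <=> :query_embedding
LIMIT 3;
```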
Notable Quote:
“As a data engineer, our primary weapon is knowledge. We transform raw chaotic data into actionable intelligence.”
Conclusion:
This mission successfully demonstrates the power of combining BigQuery, Gemini, Cloud SQL, Dataflow, and RAG principles to transform unstructured data into a valuable knowledge engine. The resulting agent is capable of providing nuanced, context-aware answers, empowering users to overcome challenges and achieve victory. The emphasis on automation, scalability, and efficient vector search highlights the critical role of data engineering in building intelligent applications. The lab link provided encourages further exploration and experimentation with these technologies.