Programmable data quality with Dataplex and generative AI

By Google Cloud Tech


Key Concepts

  • Data Quality Rules
  • Data Profiling
  • Dataplex
  • Gemini CLI
  • Policy as Code
  • Materialized Views (MVs)
  • Nested Data Structures
  • BigQuery
  • CI/CD Pipeline
  • Human Review/Validation

Data Quality Rule Creation Automation Workflow

1. Problem: Manual Data Quality Rule Creation

  • Manual creation of data quality rules is slow and error-prone.

2. Solution: Programmatic Data Quality Workflow on Google Cloud

  • Automate data quality rule generation using Dataplex for data profiling and the Gemini CLI for drafting rules.
  • Adopt a "policy as code" approach: generate and deploy quality rules with human validation.

3. Understanding Raw Data Structure

  • Example: a GA4 transactions table in BigQuery with nested record and array types (e.g., event_params is an array of records with key and value fields, and value is itself a record containing further fields).
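
To make the nested shape concrete, here is a minimal Python sketch of a single row as it might look after being read from such a table. The field names transaction_id, string_value and int_value, and all of the sample values, are assumptions for illustration; only event_params with its key and value fields comes from the summary above.

```python
# One hypothetical row: a scalar column plus an array of key/value records.
row = {
    "transaction_id": "t1",  # assumed column name
    "event_params": [
        {"key": "page_location",
         "value": {"string_value": "/checkout", "int_value": None}},
        {"key": "engagement_time_msec",
         "value": {"string_value": None, "int_value": 1200}},
    ],
}


def param(row, key):
    """Return the first populated sub-field of the parameter named `key`."""
    for entry in row["event_params"]:
        if entry["key"] == key:
            populated = (v for v in entry["value"].values() if v is not None)
            return next(populated, None)
    return None


print(param(row, "engagement_time_msec"))
```

The value of interest sits two levels below the column a profiler sees, which is why a column-level profile of event_params says little about the individual parameters.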

4. Data Profiling with Dataplex

  • Dataplex is used to discover statistical metadata (e.g., null percentages, value distributions).
  • Limitation: Dataplex cannot deeply inspect nested data structures. It identifies complex types but doesn't provide details on the fields within them.

5. Overcoming Dataplex Limitations: Flattening Data with Materialized Views (MVs)

  • Purpose-built Materialized Views: Flatten the data so that the profiler can see every field.
  • Thoughtful Data Modeling: Avoid unnesting multiple arrays in the same query. The result pairs every element of one array with every element of the other, which multiplies rows and corrupts counts and aggregates.
  • Correct Strategy: Create separate, clean materialized views for each analytical need (e.g., sessions, transactions, items).
  • Benefits of MVs:
    • Every field becomes a top-level column that Dataplex can profile individually.
    • Significant advantages in query performance and cost.
  • Implementation: Execute SQL templates with the bq query command, injecting the project and dataset IDs into the SQL.
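
The row-multiplication problem and the template step can both be sketched in a few lines of Python. This is an illustration under stated assumptions, not the video's actual code: the items array, the column names, the view name mv_event_params and the SQL itself are hypothetical, and whether a given query is accepted as a materialized view should be checked against the BigQuery documentation.

```python
from itertools import product
from string import Template

# Hypothetical row with two independent arrays.
row = {
    "transaction_id": "t1",
    "event_params": [{"key": "page"}, {"key": "source"}, {"key": "campaign"}],
    "items": [{"sku": "A"}, {"sku": "B"}],
}

# Unnesting both arrays in one query behaves like a cross join:
# every parameter is paired with every item.
both = list(product(row["event_params"], row["items"]))
print(len(both))  # 3 params x 2 items

# One view per array keeps the natural row counts.
params_view = [{"transaction_id": row["transaction_id"], **p}
               for p in row["event_params"]]
items_view = [{"transaction_id": row["transaction_id"], **i}
              for i in row["items"]]
print(len(params_view), len(items_view))

# Hypothetical SQL template for one flattened view.
MV_TEMPLATE = Template("""\
CREATE MATERIALIZED VIEW `${project_id}.${dataset_id}.mv_event_params` AS
SELECT
  t.transaction_id,
  ep.key AS param_key,
  ep.value.string_value AS param_string_value
FROM `${project_id}.${dataset_id}.transactions` AS t
CROSS JOIN UNNEST(t.event_params) AS ep
""")


def render_sql(project_id, dataset_id):
    """Inject the project and dataset IDs into the SQL template."""
    return MV_TEMPLATE.substitute(project_id=project_id, dataset_id=dataset_id)


print(render_sql("my-project", "my_dataset"))
```

The rendered text could then be passed to bq query (with --use_legacy_sql=false). Template.substitute raises KeyError if a placeholder is left unfilled, so a missing ID fails before any SQL reaches BigQuery.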

6. Programmatically Triggering Dataplex Profile Scans

  • Use a Python script with the Google Cloud Dataplex client library to create and run a scan for each view.
  • This automates the process and integrates it into a larger CI/CD workflow.
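
A minimal sketch of such a script is shown below, assuming the google-cloud-dataplex package is installed and application default credentials are configured. The helper names, the scan ID convention and the choice of a 100% sampling rate are assumptions for illustration; the video's own script may differ.

```python
def bigquery_resource(project_id, dataset_id, table_id):
    """Full resource name of a BigQuery table or view, as Dataplex expects it."""
    return (
        "//bigquery.googleapis.com/"
        f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"
    )


def scan_id_for(view_id):
    """Hypothetical naming convention: lowercase, hyphens instead of underscores."""
    return "profile-" + view_id.lower().replace("_", "-")


def create_and_run_profile_scan(project_id, location, dataset_id, view_id):
    """Create a profile scan for one view, start a run, and return the job name."""
    # Imported here so the helpers above work without the library installed.
    from google.cloud import dataplex_v1

    client = dataplex_v1.DataScanServiceClient()
    scan = dataplex_v1.DataScan(
        data=dataplex_v1.DataSource(
            resource=bigquery_resource(project_id, dataset_id, view_id)
        ),
        # Assumption: profile every row of the view.
        data_profile_spec=dataplex_v1.DataProfileSpec(sampling_percent=100.0),
    )
    operation = client.create_data_scan(
        parent=f"projects/{project_id}/locations/{location}",
        data_scan=scan,
        data_scan_id=scan_id_for(view_id),
    )
    created = operation.result()  # wait for the long-running create to finish
    response = client.run_data_scan(name=created.name)
    return response.job.name


print(scan_id_for("mv_event_params"))
print(bigquery_resource("my-project", "my_dataset", "mv_event_params"))
```

Calling create_and_run_profile_scan once per view in a loop gives the "one scan per view" behaviour described above; in a CI/CD job the returned job name can be logged or polled.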

7. Obtaining Statistical Metadata

  • Completed profile scans are visible in the Google Cloud console under the Dataplex Govern section.
  • Each scan provides a rich, machine-readable statistical profile for the data.

8. Translating Statistical Metadata to YAML Configuration

  • The profiler provides a detailed JSON output with statistics.
  • The goal is to translate this information into a declarative YAML configuration file for data quality scans.
  • Manually writing the YAML is tedious and should be automated.

9. Leveraging Gemini CLI for YAML Generation

  • Use a large language model (LLM) like Gemini for the translation step.
  • Two-Step Process (Recommended):
    1. Plan Generation: Ask Gemini to analyze the DQ profile result.json file and propose a step-by-step plan to create data quality rules, explaining its reasoning based on the statistics. Instruct it not to write the YAML file at this stage.
    2. YAML Generation: After human review and feedback, ask Gemini to generate the final DQ rules.yaml file.
  • Python Script: Pull the latest successful profile result from the Dataplex API and save it to a local JSON file (DQ profile result.json). This file provides context to the Gemini CLI.
  • Example Prompt: "You are an expert Google Cloud Dataplex engineer. Analyze this DQ profile result.json JSON file and propose a step-by-step plan to create data quality rules, explaining the why based on the statistics. Do not write the YAML file yet. Just provide a plan."
  • Model Output: The model returns a structured step-by-step plan (e.g., a set expectation rule for the platform column because the profile shows only one distinct value, "web").
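
The two-step process can be kept honest by building the two prompts separately, so the second one cannot be issued without reviewer feedback. The sketch below only assembles prompt text. The first prompt follows the example quoted above; the wording of the second prompt, the function names and the file paths are assumptions, and how the text is handed to the Gemini CLI is deliberately left out.

```python
PLAN_PROMPT = (
    "You are an expert Google Cloud Dataplex engineer. "
    "Analyze the profile result in {profile_path} and propose a step-by-step "
    "plan to create data quality rules, explaining the why based on the "
    "statistics. Do not write the YAML file yet. Just provide a plan."
)

# Assumed wording for the second step; adapt to your own review process.
YAML_PROMPT = (
    "Using the plan and the reviewer feedback below, write {rules_path}. "
    "Follow the feedback wherever it conflicts with the plan.\n\n"
    "PLAN:\n{plan}\n\nREVIEWER FEEDBACK:\n{feedback}"
)


def plan_prompt(profile_path):
    """Step 1: ask for a plan only."""
    return PLAN_PROMPT.format(profile_path=profile_path)


def yaml_prompt(plan, feedback, rules_path):
    """Step 2: ask for the YAML, but only once a human has left feedback."""
    if not feedback.strip():
        raise ValueError("reviewer feedback is required before generating YAML")
    return YAML_PROMPT.format(plan=plan, feedback=feedback, rules_path=rules_path)


print(plan_prompt("profile_result.json"))
```

Refusing to build the second prompt without feedback is a small guard, but it encodes the review step in the tooling rather than relying on habit.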

10. Importance of Human Review and Business Context

  • The model is a powerful pattern matcher but lacks understanding of the data's purpose or business nuances.
  • Deploying an AI-generated configuration without rigorous human review is a significant risk.
  • Treat the output like a pull request from a new, fast but inexperienced team member.
  • Feedback Loop: Provide feedback to the model to correct or remove rules that don't align with domain knowledge (e.g., adding iOS and Android to the set expectation for the platform column, even if the profile currently only shows "web").
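
The platform example can also be expressed as a small, reviewable transformation on a drafted rule list. This sketch assumes the draft has been loaded into Python dictionaries whose keys (column, dimension, setExpectation, values) mirror the camelCase field names of a Dataplex data quality spec; the helper name is hypothetical, and in the workflow described here the same correction would normally be given to Gemini as feedback instead.

```python
import copy


def widen_set_expectation(rules, column, extra_values):
    """Return a copy of `rules` with extra allowed values for one column."""
    updated = copy.deepcopy(rules)  # leave the machine-generated draft untouched
    for rule in updated:
        if rule.get("column") == column and "setExpectation" in rule:
            values = rule["setExpectation"]["values"]
            for value in extra_values:
                if value not in values:
                    values.append(value)
    return updated


# Draft as the model might propose it from a profile that only saw "web".
draft = [
    {"column": "platform", "dimension": "VALIDITY",
     "setExpectation": {"values": ["web"]}},
]

reviewed = widen_set_expectation(draft, "platform", ["iOS", "Android"])
print(reviewed[0]["setExpectation"]["values"])
```

Keeping the draft and the reviewed version as separate objects makes the human change easy to diff, in the same spirit as reviewing a pull request.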

11. Generating the Final DQ Rules YAML File

  • After human review and feedback, ask Gemini to generate the final DQ rules.yaml file, strictly conforming to the data rule schema.
  • This final configuration is more reliable because it combines machine-generated code with human expertise.

12. Deploying the DQ Rules YAML File

  • Use a standard gcloud command to create the data scan resource from the local YAML configuration file.
  • Use another command to trigger the run.
  • This process can be integrated into any CI/CD pipeline.
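
For a pipeline step written in Python, the two commands can be built as argument lists and executed with subprocess. This is a sketch assuming the gcloud dataplex datascans command group with the --location, --data-source-resource and --data-quality-spec-file flags; the scan ID, location, resource and file name are placeholders, and the dry-run default is a design choice of this example.

```python
import subprocess


def create_scan_command(scan_id, location, data_source_resource, spec_file):
    """Command that creates a data quality scan from a local YAML spec."""
    return [
        "gcloud", "dataplex", "datascans", "create", "data-quality", scan_id,
        f"--location={location}",
        f"--data-source-resource={data_source_resource}",
        f"--data-quality-spec-file={spec_file}",
    ]


def run_scan_command(scan_id, location):
    """Command that triggers one run of an existing scan."""
    return ["gcloud", "dataplex", "datascans", "run", scan_id,
            f"--location={location}"]


def deploy(scan_id, location, data_source_resource, spec_file, dry_run=True):
    """Create the scan, then trigger a run. With dry_run, only print."""
    commands = [
        create_scan_command(scan_id, location, data_source_resource, spec_file),
        run_scan_command(scan_id, location),
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)  # stop the pipeline on failure
    return commands


commands = deploy(
    "dq-mv-event-params",  # placeholder scan ID
    "us-central1",         # placeholder location
    "//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/mv_event_params",
    "rules.yaml",          # placeholder file name
)
```

Passing arguments as a list avoids shell quoting issues, and check=True makes a failed create stop the pipeline before the run command is attempted.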

13. Analyzing Results

  • Once the job completes, the results are written to a BigQuery table for detailed analysis.
  • A summary of passed and failed rules is available in the Dataplex UI.
  • Query the data's quality score over time, build custom dashboards in Looker Studio, or set up automated alerts based on the results.
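
As one possible shape for the "score over time" and alerting ideas, the sketch below aggregates rule results that have already been fetched from the results table into a pass rate per run. The field names job_start_time and rule_passed, the sample rows and the 0.9 threshold are assumptions; check them against the schema of your own results table.

```python
from collections import defaultdict


def pass_rate_by_run(rows):
    """Map each run (keyed by start time) to the fraction of rules that passed."""
    counts = defaultdict(lambda: [0, 0])  # run -> [passed, total]
    for row in rows:
        entry = counts[row["job_start_time"]]
        entry[0] += 1 if row["rule_passed"] else 0
        entry[1] += 1
    return {run: passed / total
            for run, (passed, total) in sorted(counts.items())}


def runs_below(rates, threshold):
    """Runs whose pass rate falls under the alert threshold."""
    return [run for run, rate in rates.items() if rate < threshold]


# Hypothetical rows, one per rule per run.
rows = [
    {"job_start_time": "2024-01-01", "rule_passed": True},
    {"job_start_time": "2024-01-01", "rule_passed": True},
    {"job_start_time": "2024-01-02", "rule_passed": True},
    {"job_start_time": "2024-01-02", "rule_passed": False},
]

rates = pass_rate_by_run(rows)
print(rates)
print(runs_below(rates, 0.9))
```

The same grouping could be done in SQL directly against the results table; doing it in Python is convenient when the alert logic lives in the pipeline itself.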

14. Workflow Recap

  1. Solve profiler challenges with nested data by creating specialized materialized views.
  2. Automate Dataplex profile scans using a Python script.
  3. Leverage the Gemini CLI to generate a draft policy-as-code file.
  4. Perform a thorough human review before deploying the quality scan with the gcloud CLI.

15. Conclusion

  • The entire process automates the manual, time-consuming steps of data quality.
  • This frees up time to focus on high-value tasks: applying critical business context, checking the AI's logic, and making strategic decisions about which rules are truly important.
  • "AI doesn't replace the human, it enhances the human."
