How to assess data lake and data warehouse migrations to BigQuery

By Google Cloud Tech

Key Concepts

  • BigQuery Migration: The process of moving data and workloads from existing data warehouses or data lakes to Google Cloud's BigQuery.
  • Assessment and Planning: The initial phase of migration, involving understanding the current environment, defining scope, and creating a migration strategy.
  • SQL and Pipeline Conversion: The process of translating existing SQL queries and data pipelines to be compatible with BigQuery.
  • Data Transfer: Moving the actual data from the source system to BigQuery.
  • Validation and Optimization: Verifying the migrated data and workloads, and fine-tuning BigQuery for performance and cost-efficiency.
  • Dumper Tool: An open-source tool used to extract metadata and logs from source systems for assessment.
  • Assessment Service: A Google Cloud service that analyzes metadata and logs to estimate the BigQuery footprint and migration complexity.
  • Looker Studio Reports: Visualizations generated by the assessment service to provide insights into the source system and migration projections.
  • Segmentation Report: A report that groups data artifacts (tables, views) into segments for phased migration.
  • Total Cost of Ownership (TCO): An estimate of the overall cost of using BigQuery, including platform and operational expenses.
  • Staged Migration: A migration approach where workloads are moved incrementally, starting with less critical ones.
  • Proof of Concept (PoC): A small-scale test to validate the migration approach and tools.

Migration Journey: Assessment and Planning

The migration journey to BigQuery begins with a comprehensive plan and clearly defined scope. The scope is determined by the size and complexity of the existing data warehouse or data lake. A strategic decision needs to be made on whether to migrate all workloads at once or to adopt an incremental, staged approach. The latter is generally recommended, especially for complex architectures, as it allows teams to build confidence in the new data warehouse or data lake operations while gradually incorporating and transforming components of the current architecture. This phased approach informs the overall migration plan, which is structured into four main phases: assessment and planning, SQL and pipeline conversion, data transfer, and validation and optimization.

Google Cloud offers a suite of services and expert guidance to facilitate migrations from various sources like Snowflake, Teradata, Cloudera, and Databricks. A Google Cloud Customer Engineer can assist with an initial assessment report, which provides a ballpark estimate of time and costs. This report considers the existing system and the projected state in BigQuery, Dataproc, and Google Cloud Storage (GCS). During this initial phase, it's crucial to familiarize yourself with BigQuery terminology and other relevant Google Cloud services. The Google Cloud team can recommend online training through Cloud Skills Boost or in-person workshops.

Following the initial assessment, full migration services can be used to gain a deeper understanding of the compatibility between the current and target environments. The primary goal is to establish the migration's complexity, taking into account current usage, data volume, compute requirements, and cost distribution. Most importantly, this step confirms whether the chosen migration approach is appropriate or requires revision.

Utilizing Migration Services and Tools

The assessment starts with the dumper tool, which extracts metadata and logs from the source system. Technical prerequisites, such as user or network access, vary depending on the source data warehouse or lake. Detailed instructions for each source system can be found via a link in the video description. The dumper tool is open source, so it can be compiled in-house if organizational policies require it.

A sample execution of the dumper tool for systems like Teradata, Snowflake, and Cloudera involves running a command in assessment mode. This generates two ZIP files: one containing metadata and the other containing logs. For Databricks, an assessment notebook can be executed directly within the workspace, with results stored in a storage bucket.
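
As a minimal sketch of a sanity check before handing the dumper output to the assessment service, the snippet below summarizes a ZIP archive using only the Python standard library. The file names in the usage comment are hypothetical; the actual names depend on the source system and on how the dumper was invoked.

```python
import zipfile
from pathlib import Path


def summarize_archive(path):
    """Return (entry_count, total_uncompressed_bytes) for a ZIP archive."""
    path = Path(path)
    # is_zipfile() also returns False when the file does not exist.
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a readable ZIP archive")
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member in {path}: {bad_member}")
        infos = archive.infolist()
        return len(infos), sum(info.file_size for info in infos)


# Hypothetical file names, for illustration only:
# for name in ("metadata.zip", "logs.zip"):
#     print(name, summarize_archive(name))
```

An empty or corrupt archive is cheaper to catch at this point than after the assessment has been submitted.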

The output from these tools serves as input for the assessment services. These services analyze existing artifacts in the source system, including tables and views, along with query logs, to understand the current infrastructure footprint, data processing patterns, data volume, and frequency of operations. The assessment service then provides an estimate of what the data warehouse or data lake would look like in BigQuery.

The assessment results are delivered in a storage bucket and a BigQuery dataset. The files in the GCS bucket are particularly useful for the subsequent SQL translation phase. The BigQuery dataset contains data that populates insightful Looker Studio reports.

Looker Studio Reports: Insights and Examples

For Snowflake, the assessment service generates three key reports: a summary report, a detailed report, and a segmentation report. The segmentation report is instrumental in understanding migration complexity by grouping artifacts (tables and views) into segments that should be migrated together. It also identifies unsegmented tables that can be migrated independently based on query log analysis. These reports also offer a comparison of pricing models, mapping credits to slots across different editions, and provide estimates for various commitment models, along with other valuable data points.
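
To make the idea of segments concrete, the sketch below groups tables that are referenced together in the same query into connected components, and treats tables that never share a query with another table as unsegmented. This is an illustrative assumption about how co-usage could be derived from query logs, not the assessment service's actual algorithm; the `queries` input shape and all table names are hypothetical.

```python
def build_segments(queries):
    """Group tables that appear together in at least one query.

    `queries` is an iterable of collections of table names, one collection
    per logged query (a hypothetical, pre-parsed form of the query logs).
    Returns (segments, unsegmented): a sorted list of multi-table groups
    and a sorted list of tables that never co-occur with another table.
    """
    parent = {}

    def find(table):
        parent.setdefault(table, table)
        while parent[table] != table:
            parent[table] = parent[parent[table]]  # path halving
            table = parent[table]
        return table

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for tables in queries:
        tables = sorted(set(tables))
        for table in tables:
            find(table)  # register tables that only ever appear alone
        for other in tables[1:]:
            union(tables[0], other)

    groups = {}
    for table in parent:
        groups.setdefault(find(table), []).append(table)

    segments = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    unsegmented = sorted(g[0] for g in groups.values() if len(g) == 1)
    return segments, unsegmented
```

For example, queries touching `orders` with `customers` and `customers` with `regions` would place all three tables in one segment, because moving only some of them would break a query that joins across the group.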

As another example, for Databricks, the assessment report includes statistics on jobs, query trends, and their categorization based on the size of the SQL warehouse.

Across all sources, these reports provide critical data points:

  1. Source System Details: Information to identify dependencies and plan their migration.
  2. Workload Projection: A projected view of workloads on BigQuery, BigLake, or Dataproc, along with a plan for migration, including complexity estimates.

With this comprehensive information, you can collaborate with your Google Cloud Customer Engineering team to interpret the report, which includes estimates for platform costs and the Total Cost of Ownership (TCO).

Analyzing Current Data Warehouse/Lake Utilization

How heavily the current data warehouse or data lake is used is a significant factor in migration planning. The assessment report details the queries executed per day, the volume of data scanned by those queries, and the storage consumed. With this information, you can verify the volume and usage of each table and confirm that the sampled logs are representative. Depending on the workloads, it may be necessary to include logs from peak periods, such as accounting period closings or quarterly sales reporting.
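
Per-day figures of this kind can be approximated from raw log records with a short aggregation. The sketch assumes each record has already been parsed into a `(timestamp, bytes_scanned)` pair, which is a hypothetical shape rather than the format of any particular source system. It also flags days whose query count is well above the median, which can help when checking whether a log sample covers peak periods; the `factor` threshold is an arbitrary illustrative choice.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def daily_usage(records):
    """Aggregate (timestamp, bytes_scanned) pairs into per-day totals.

    Returns a dict mapping date -> (query_count, total_bytes_scanned),
    ordered by date.
    """
    counts = defaultdict(int)
    scanned = defaultdict(int)
    for timestamp, bytes_scanned in records:
        day = timestamp.date()
        counts[day] += 1
        scanned[day] += bytes_scanned
    return {day: (counts[day], scanned[day]) for day in sorted(counts)}


def peak_days(usage, factor=2.0):
    """Return days whose query count exceeds `factor` times the median."""
    if not usage:
        return []
    typical = median(count for count, _ in usage.values())
    return [day for day, (count, _) in usage.items() if count > factor * typical]
```

If `peak_days` returns nothing for a sample that should include a period close, the sample probably needs to be extended.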

Migrations present an opportune moment for data cleanup and archiving of unused objects. The report can identify a list of unused tables, as well as tables that are used but have no write operations. The assessment service also generates a recommendation for a Proof of Concept (PoC) based on the analyzed logs.
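
These cleanup candidates can be thought of as set operations over the table inventory and the tables seen in the logs. In the sketch below, `all_tables`, `read_tables`, and `written_tables` are hypothetical inputs that would be derived from the extracted metadata and query logs; the function is an illustration, not how the assessment report computes its lists.

```python
def cleanup_candidates(all_tables, read_tables, written_tables):
    """Split a table inventory into cleanup candidates.

    Returns (unused, read_only):
      unused    - tables with no reads and no writes in the logs
      read_only - tables that are read but never written
    Tables seen in the logs but missing from the inventory are ignored.
    """
    all_tables = set(all_tables)
    read_tables = set(read_tables) & all_tables
    written_tables = set(written_tables) & all_tables

    unused = all_tables - read_tables - written_tables
    read_only = read_tables - written_tables
    return sorted(unused), sorted(read_only)
```

Unused tables are candidates for archiving instead of migration, while read-only tables may only need a one-time transfer with no ongoing load pipeline.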

At this stage, it is advisable to maintain an inventory of all data pipelines and integrations feeding data into the current warehouse, along with their orchestration tools and business applications. These should be factored into the migration timeline estimation. Allocating time for testing or running a PoC is also a good practice.

Refining the Migration Plan

With the gathered information on time, effort, and input from the business, the migration plan can be further refined. This includes defining success criteria, gaining a deeper understanding of the target environment, and incorporating crucial elements like rollback strategies and contingencies.

The complexity of the migration, as outlined by the assessment report, will influence the decision on whether to migrate all at once or to adopt a staged migration. The latter is the more common and recommended approach for complex architectures, especially those involving multiple source systems and iterative onboarding of teams. A staged migration enables gradual maturation of the architecture as you scale up in the new infrastructure and validate your results.
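
One way to turn a staged approach into a concrete schedule is to order segments from least to most critical and pack them into waves of bounded size. The sketch below is a hypothetical planning aid: the criticality scores, table counts, and the `max_tables_per_wave` limit are assumptions a team would set itself, not values produced by the assessment report.

```python
def plan_waves(segments, max_tables_per_wave):
    """Pack segments into migration waves, least critical first.

    `segments` is a list of (name, criticality, table_count) tuples, where
    a lower criticality means the segment is safer to move early.
    A segment larger than the limit still gets a wave of its own.
    """
    waves = []
    current, current_size = [], 0
    ordered = sorted(segments, key=lambda s: (s[1], s[0]))
    for name, _criticality, table_count in ordered:
        # Start a new wave when adding this segment would exceed the limit.
        if current and current_size + table_count > max_tables_per_wave:
            waves.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += table_count
    if current:
        waves.append(current)
    return waves
```

Keeping early waves small and low-risk gives the team room to validate results and adjust the plan before business-critical workloads move.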

The subsequent video will delve into data migration and validation, providing a complete view for your migration plan.
