Large scale data ingestion for MongoDB using Application Integration
By Google Cloud Tech
Key Concepts
- Application Integration: A service designed to connect various applications and data sources, facilitating data flow and process automation.
- Asynchronous Orchestrator-Worker Model: A scalable architectural pattern where an orchestrator initiates and monitors a large task, delegating actual data processing to multiple independent workers that operate asynchronously.
- MongoDB: A popular NoSQL document database used as an example data source for large-scale ingestion.
- Google Cloud Storage (GCS): Google Cloud's highly scalable and durable object storage service, used as the immediate destination for ingested data chunks.
- Integration Connector Service: A component within Application Integration that manages connections to various external systems like databases or APIs.
- Page Size / Page Number / Page Tokens: Mechanisms for breaking down large datasets into smaller, manageable chunks (pages) for efficient transfer. Page size and page number are common for database queries, while page tokens are often used with APIs.
- Initiate Job API Trigger: The specific API endpoint or action that starts a data ingestion job.
- Workload Input: A set of configurable parameters that define the specifics of an ingestion job, such as collection name, page size, and job ID.
- Collection (MongoDB): A grouping of MongoDB documents, analogous to a table in a relational database.
- Job ID: A unique identifier assigned to each ingestion task for tracking, monitoring, and debugging purposes.
- Sanity Check: A basic verification process to ensure the consistency and integrity of the ingested data files.
- BigQuery: Google Cloud's fully managed, serverless enterprise data warehouse, mentioned as a potential final destination for the ingested data.
Large Data Ingestion at Scale with Application Integration
This demonstration outlines a robust method for ingesting large datasets at scale using Application Integration, specifically highlighting an asynchronous orchestrator-worker model. While MongoDB is used as the example data source, the underlying pattern is broadly applicable to any connector or API supported by Application Integration.
Asynchronous Orchestrator-Worker Model for Efficient Data Transfer
The core of this solution is an asynchronous orchestrator-worker model, designed for efficient transfer of large data sets.
- Orchestrator's Role: The orchestrator is responsible for initiating the overall ingestion job, tracking its progress, and triggering individual worker processes.
- Worker's Role: Each worker handles the actual data transfer in manageable chunks. It pulls data from the specified source (e.g., MongoDB) and uploads these chunks to Google Cloud Storage (GCS).
- Recursive and Sequential Process: The process is recursive and sequential, meaning workers call back the orchestrator after processing a chunk, informing it of the "next page" of data to ingest, ensuring continuous data flow.
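To make the pattern concrete, here is a minimal plain-Python sketch of the orchestrator-worker loop. It is not the Application Integration template itself; the function names (fetch_page, upload_chunk_to_gcs) and the chunking logic are illustrative placeholders for the source query and GCS upload steps.

```python
# Minimal sketch of the asynchronous orchestrator-worker pattern, written as
# plain Python for illustration only.

def fetch_page(collection, page_size, page_no):
    """Stand-in for pulling one page of documents from the source (e.g. MongoDB)."""
    start = page_no * page_size
    return [f"{collection}-doc-{i}" for i in range(start, start + page_size)]

def upload_chunk_to_gcs(job_id, page_no, docs):
    """Stand-in for writing one chunk to a GCS object."""
    print(f"job={job_id} page={page_no} wrote {len(docs)} docs")

def worker(collection, page_size, page_no, job_id):
    """Process one page, then report the next page back to the orchestrator."""
    docs = fetch_page(collection, page_size, page_no)
    upload_chunk_to_gcs(job_id, page_no, docs)
    return page_no + 1  # "call back" with the next page to ingest

def orchestrator(collection, total_docs, page_size, job_id):
    """Initiate the job, then trigger workers until every page is processed."""
    pages = -(-total_docs // page_size)  # ceiling division
    page_no = 0
    while page_no < pages:
        page_no = worker(collection, page_size, page_no, job_id)
    print(f"job={job_id} complete: {pages} pages ingested")

orchestrator("accounts", total_docs=100, page_size=25, job_id="demo-job-001")
```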
Setting Up the Data Source and Connection
- MongoDB Cluster Setup: The first step involves creating a cluster on MongoDB containing the data intended for ingestion. For this demo, a MongoDB cluster with 1.7 million Salesforce records (approximately 6 GB of data) in an "accounts" collection is used. The sync specifically targets this "accounts" collection.
- Integration Connector Service Connection: A new connection is established within the Integration Connector Service. Users select MongoDB as the connector type and configure connection settings. The MongoDB host information can be retrieved from the connection string by clicking the "connect" button in MongoDB.
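For reference, a minimal pymongo sketch of connecting with that connection string and counting the documents in the target collection might look like the following; the URI, database, and collection names are placeholders for your own values.

```python
# A minimal pymongo sketch, assuming a MongoDB cluster reachable via the
# connection string shown under the cluster's "Connect" button.
from pymongo import MongoClient

# e.g. "mongodb+srv://<user>:<password>@<cluster-host>/"
client = MongoClient("mongodb+srv://user:password@cluster-host.example.net/")
collection = client["salesforce"]["accounts"]  # placeholder database/collection

# The orchestrator needs the total document count to plan the pages.
total_docs = collection.count_documents({})
print(f"accounts collection contains {total_docs} documents")
```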
Leveraging the Large Data Ingestion Template
A pre-provided template for large data ingestion simplifies implementation. This template encapsulates the orchestrator-worker logic described above. Its versatility allows it to be adapted for other connectors, such as Salesforce or HTTP. In such cases, the mechanism would shift from using page size and page number (as in the MongoDB example) to page tokens, where the next page token is passed to subsequent iterations.
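The two pagination styles can be sketched side by side as below. The MongoDB branch derives an offset from page size and page number; the API branch passes an opaque next-page token between iterations. The list_records callable in the token-based branch is a hypothetical paged API used only for illustration.

```python
# Sketch contrasting the two pagination styles the template can be adapted to.

def mongo_page(collection, page_size, page_no, sort_by="_id"):
    """Page-number style: compute the offset from page size and page number."""
    return list(
        collection.find({})
        .sort(sort_by, 1)
        .skip(page_no * page_size)
        .limit(page_size)
    )

def api_page(list_records, page_size, page_token=None):
    """Page-token style: pass the token returned by the previous call."""
    response = list_records(page_size=page_size, page_token=page_token)
    return response["items"], response.get("nextPageToken")
```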
Initiating the Data Ingestion Job
To begin a full sync of a MongoDB collection, the initiate job API trigger is invoked. This trigger automatically counts all documents in the specified collection and then launches the full sync process to GCS. While the total document count is calculated automatically, other critical parameters must be reviewed and specified via the workload input:
- Collection: The name of the MongoDB collection to be synced (e.g., "accounts").
- Page size: Defines the number of documents to be processed per page/chunk.
- Page no: Specifies the starting page index for the ingestion.
- Sort by: Determines the sorting order of documents during ingestion.
- Job ID: A unique identifier for tracking the specific ingestion operation.
- Notification emails: Email addresses of recipients for the final job completion notification.
In the demo, the job is configured to ingest 1,765,000 accounts, totaling approximately 6 GB of data.
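For illustration, the workload input for such a job might be expressed as the following Python dict. The exact field names expected by the template's API trigger may differ, and the page size shown is an assumed value rather than one from the demo.

```python
# Illustrative workload input mirroring the parameters described above.
workload_input = {
    "collection": "accounts",            # MongoDB collection to sync
    "page_size": 5000,                   # documents per chunk (assumed value)
    "page_no": 0,                        # starting page index
    "sort_by": "_id",                    # sort order during ingestion (assumed)
    "job_id": "accounts-full-sync-001",  # unique ID for tracking/debugging
    "notification_emails": ["data-team@example.com"],
}
```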
Monitoring Progress and Handling Failures
- Progress Monitoring: Ingestion progress can be monitored by checking the system logs.
- Failure Recovery: In the event of a failure, the system supports replaying the sync from the last successful point. This is facilitated by the workload variable, which persistently tracks the last successful position and all parameters needed for the next iteration, ensuring data consistency and minimizing re-processing.
- Output Verification: The results of the ingestion job are stored in the specified GCS folder. For improved debugging and traceability, it is recommended to modify the nomenclature of output files by adding timestamps, a page ID, or the Job ID created for the operation.
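A small sketch of that naming recommendation, assuming chunks are written as JSON objects under a per-job GCS folder; the prefix is a placeholder.

```python
# Build an output object name that embeds job ID, page number, and timestamp.
from datetime import datetime, timezone

def output_object_name(job_id: str, page_no: int) -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"ingestion/{job_id}/page-{page_no:06d}-{stamp}.json"

print(output_object_name("accounts-full-sync-001", 42))
# e.g. ingestion/accounts-full-sync-001/page-000042-20240101T120000Z.json
```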
Post-Ingestion Steps and Further Integration
After the data has been ingested into GCS, two key next steps are recommended:
- Sanity Check: Create a sanity check to verify the consistency and integrity of the output files in GCS. This ensures that the data has been transferred accurately.
- BigQuery Integration: If the ultimate destination for this data is BigQuery, users are encouraged to leverage BigQuery jobs. These jobs provide an easy and efficient way to transfer the raw data from GCS into the final destination tables within BigQuery.
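For the BigQuery step, a minimal sketch using the google-cloud-bigquery client might look like this, assuming the chunks were written as newline-delimited JSON; the project, dataset, table, and bucket names are placeholders.

```python
# Load the raw GCS output into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the JSON documents
)

load_job = client.load_table_from_uri(
    "gs://my-ingestion-bucket/ingestion/accounts-full-sync-001/*.json",
    "my-project.salesforce_raw.accounts",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
print(f"Loaded {load_job.output_rows} rows into BigQuery")
```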
Conclusion
This demonstration provides a comprehensive framework for large-scale data ingestion using an asynchronous orchestrator-worker model within Application Integration. It emphasizes efficiency, scalability, and robust error handling, with MongoDB and GCS serving as practical examples. The use of templates, configurable parameters, and clear post-ingestion steps (such as sanity checks and BigQuery integration) offers a complete and actionable solution for managing significant data volumes.