PDF Document Extraction in n8n - Full Workflow

By NeuralNine

Key Concepts

  • N8N Workflow Automation: Using N8N to build automated processes.
  • Document Extraction: Extracting structured information from unstructured documents.
  • PDF Parsing: Processing PDF files to retrieve data.
  • OCR (Optical Character Recognition): Technology to convert images of text into machine-readable text.
  • IMAP Trigger: An N8N node that triggers a workflow when a new email is received.
  • Unstract: A platform for LLM-powered unstructured data extraction, offering a community node for N8N.
  • Prompt Studio: Unstract's interface for creating and testing extraction prompts.
  • API Deployment: Deploying an Unstract prompt as an API endpoint for integration.
  • Google Sheets API: Integrating with Google Sheets to store extracted data.
  • OAuth 2.0: Authentication protocol used for Google Sheets integration.
  • Community Nodes: Custom nodes developed by the N8N community.

N8N Workflow for Document Extraction

This video demonstrates how to build an N8N workflow to extract information from PDF documents, including those that are handwritten, scanned, or of poor quality. The workflow automates the process of receiving documents via email, extracting key data points, and storing them in a Google Sheet.

1. Workflow Overview and Setup

The workflow consists of three main parts:

  • Trigger: Reacts to incoming emails.
  • Document Extraction: Processes the attached PDF documents.
  • Data Storage: Connects to Google Sheets to save the extracted information.

N8N Installation: The video suggests two methods for setting up N8N:

  • Self-hosted: Either with Docker (create a persistent volume with docker volume create n8n_data, then start the N8N container with docker run) or without Docker via npx n8n. The local instance can be accessed at localhost:5678.
  • Cloud Variant: Using N8N's cloud service.

The presenter uses the self-hosted Docker version. After initial setup and account creation, a new workflow is created.

2. Email Trigger (IMAP Node)

Purpose: To initiate the workflow when a new email with an attachment arrives.

Configuration:

  • Node: IMAP Trigger.
  • Credentials: Requires IMAP access to an email account.
    • Host: e.g., imap.gmail.com for Gmail, mail.yourserver.de for a self-hosted email server.
    • Port: e.g., 993 for SSL.
    • SSL: Enabled.
    • User: Email address.
    • Password: Email account password.
  • Actions:
    • Mark as read: Prevents re-triggering on the same email.
    • Download attachment: Retrieves the attached file.
    • Attachment Name: A placeholder for the downloaded file (e.g., attachment).
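For readers curious what the trigger does on each poll, here is a minimal Python sketch using only the standard library. It is not N8N's implementation. The function names extract_attachments and fetch_unseen_attachments are hypothetical, and the host, user, and password arguments correspond to the credential fields listed above.

```python
import email
import imaplib
from email import policy
from email.message import EmailMessage


def extract_attachments(msg):
    """Return (filename, mime_type, payload_bytes) for each attachment of a parsed message."""
    return [
        (part.get_filename(), part.get_content_type(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]


def fetch_unseen_attachments(host, user, password, port=993):
    """Hypothetical helper: collect attachments from all unread mails in INBOX."""
    with imaplib.IMAP4_SSL(host, port) as conn:
        conn.login(user, password)
        conn.select("INBOX")
        _, data = conn.search(None, "UNSEEN")
        results = []
        for num in data[0].split():
            # Fetching the full message also flags it as seen on the server,
            # which plays the role of the "Mark as read" action.
            _, fetched = conn.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(fetched[0][1], policy=policy.default)
            results.extend(extract_attachments(msg))
        return results
```

Only extract_attachments can be tried without a mail server; fetch_unseen_attachments needs real IMAP credentials.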

Testing: An email with a PDF invoice is sent to the configured email address using Thunderbird. The N8N workflow immediately triggers, downloading the attachment.

3. PDF Validation (If Node)

Purpose: To ensure the attached file is a PDF document before attempting extraction.

Configuration:

  • Node: If Node.
  • Condition: Checks the MIME type of the attachment.
    • Expression: {{ $binary["attachment"][0].mimeType }} === "application/pdf"
      • $binary["attachment"][0]: Refers to the first binary attachment named "attachment" from the previous node.
      • .mimeType: Accesses the MIME type property of the attachment.

Outcome: If the MIME type is application/pdf, the "true" branch is activated; otherwise, the "false" branch is taken.
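The same branching logic can be sketched in Python. The function name is_pdf is hypothetical, and the optional check of the file's leading bytes is an extra safeguard that the video does not use; the If node described above only compares the MIME type.

```python
PDF_MIME_TYPE = "application/pdf"


def is_pdf(mime_type, data=None):
    """Return True when an attachment should take the "true" branch."""
    if (mime_type or "").strip().lower() != PDF_MIME_TYPE:
        return False
    # Optional extra check (not part of the video's workflow):
    # PDF files begin with the marker "%PDF-".
    return data is None or data.startswith(b"%PDF-")
```

Checking the leading bytes guards against a mail client that labels a non-PDF file as application/pdf.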

4. Document Extraction with Unstract (Community Node)

Problem with Default N8N Node: The built-in "Extract from File" node in N8N only returns the raw text already embedded in a PDF. It does not perform OCR and does not produce structured output, so it struggles with scanned or handwritten documents.

Solution: Unstract Community Node: Unstract is a platform for LLM-powered unstructured data extraction. Its N8N community node enables advanced document parsing.

Unstract Setup:

  1. Create an Unstract Account: Sign up for a free account on the Unstract platform.
  2. Create a Prompt Studio Project:
    • Navigate to "Prompt Studio" in Unstract.
    • Create a new project (e.g., "invoice extraction").
    • Define the desired output fields in a JSON object: invoice_total (float), invoice_number, invoice_date, invoice_from, invoice_to.
    • Select an LLM model (e.g., GPT-4).
    • Upload sample invoices to test the prompt.
  3. Deploy as API:
    • Click "Deploy as API" within the Prompt Studio project.
    • Provide an API name and display name.
    • Click "Create deployment."
    • Note down the Organization ID, API Name, and generate an API Key from the "Manage Keys" section.
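To make the output fields from step 2 concrete, here is a hedged Python sketch that validates a JSON result against those five fields. The Invoice class and parse_invoice function are hypothetical names. Only invoice_total is stated to be a float in the video; treating the other fields as strings is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_total: float
    invoice_number: str  # string type is an assumption
    invoice_date: str    # string type is an assumption
    invoice_from: str    # string type is an assumption
    invoice_to: str      # string type is an assumption


def parse_invoice(raw_json):
    """Parse a JSON object and fail loudly if an expected field is absent."""
    data = json.loads(raw_json)
    missing = [name for name in Invoice.__dataclass_fields__ if name not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return Invoice(
        invoice_total=float(data["invoice_total"]),
        invoice_number=str(data["invoice_number"]),
        invoice_date=str(data["invoice_date"]),
        invoice_from=str(data["invoice_from"]),
        invoice_to=str(data["invoice_to"]),
    )
```

Validating the shape early makes it easier to notice when a prompt change drops or renames a field.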

N8N Integration:

  1. Install Unstract Community Node:
    • In N8N (self-hosted), go to Profile > Settings > Community Nodes.
    • Enter the package name: n8n-nodes-unstract (community node packages carry the n8n-nodes- prefix).
    • Accept the risks and install.
  2. Configure Unstract Node in Workflow:
    • Add the "Unstract" node to the "true" branch of the If node.
    • Credentials:
      • Create a new credential.
      • Paste the API Key and Organization ID from Unstract.
      • Paste the API Deployment Name (the last part of the API deployment URL).
    • Data Field: Specify the binary data to be processed. This is the attachment from the IMAP trigger: {{ $binary["attachment"][0].data }}.

Outcome: The Unstract node sends the PDF data to the deployed API, processes it using the defined prompt, and returns a structured JSON object containing the extracted invoice details.
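Since the API Deployment Name is described as the last part of the API deployment URL, it can be read off programmatically. This is a small sketch under that assumption; the function name deployment_name_from_url and the example URL are hypothetical and do not reflect Unstract's actual URL layout.

```python
from urllib.parse import urlparse


def deployment_name_from_url(url):
    """Return the last non-empty path segment of a deployment URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        raise ValueError("URL has no path segments")
    return segments[-1]


# Hypothetical URL, for illustration only.
example_url = "https://unstract.example.invalid/api/my_org/invoice_extraction/"
name = deployment_name_from_url(example_url)
```

The trailing slash is ignored, so the result is the same whether or not the copied URL ends with one.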

5. Data Storage in Google Sheets (Google Sheets Node)

Purpose: To store the extracted invoice data in a Google Sheet.

Google Cloud Console Setup:

  1. Create a New Project: In the Google Cloud Console, create a new project (e.g., "N8N tutorial").
  2. Enable APIs: Enable the "Google Sheets API" and "Google Drive API."
  3. Configure OAuth Consent Screen:
    • Go to "OAuth consent screen."
    • Select "External" user type.
    • Enter an app name (e.g., "N8N Sheet Integration") and your email.
    • Add yourself as a test user.
  4. Create OAuth 2.0 Client:
    • Go to "APIs & Services" > "Credentials."
    • Click "Create Credentials" > "OAuth client ID."
    • Select "Web application."
    • Redirect URI: Copy the redirect URI provided by the N8N Google Sheets node credential setup and paste it here.
    • Create the client and note down the Client ID and Client Secret.
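To show why the redirect URI has to match exactly, here is a rough Python sketch of the authorization URL that an OAuth 2.0 client builds for Google's authorization endpoint. The function name build_authorization_url is hypothetical, the scope shown is the standard Google Sheets scope and may differ from what N8N actually requests, and the redirect URI in the usage note is a placeholder; always copy the real one from the N8N credential dialog.

```python
from urllib.parse import parse_qs, urlencode, urlparse

GOOGLE_AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
SHEETS_SCOPE = "https://www.googleapis.com/auth/spreadsheets"


def build_authorization_url(client_id, redirect_uri, scope=SHEETS_SCOPE):
    """Build the URL the user is sent to when clicking "Sign in with Google"."""
    params = {
        "client_id": client_id,
        # Google rejects the request unless this value is registered
        # on the OAuth client in the Cloud Console.
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": scope,
        "access_type": "offline",  # request a refresh token
    }
    return f"{GOOGLE_AUTH_ENDPOINT}?{urlencode(params)}"
```

For example, build_authorization_url("my-client-id", "http://localhost:5678/callback-placeholder") yields a URL whose redirect_uri parameter Google compares against the URIs registered for the client.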

N8N Google Sheets Node Configuration:

  1. Node: Google Sheets > Append Row.
  2. Credentials:
    • Create a new credential of type "OAuth 2.0."
    • Paste the Client ID and Client Secret from Google Cloud Console.
    • Click "Sign in with Google" and authenticate with your Google account.
  3. Spreadsheet and Sheet: Select the target Google Sheet document and sheet name (e.g., "N8N invoices" and "Sheet1").
  4. Column Mapping: Manually map the extracted fields from the Unstract node output to the corresponding columns in the Google Sheet (e.g., invoice_date to "Invoice Date" column).
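The manual column mapping amounts to a lookup from extracted field names to sheet headers. The following Python sketch illustrates the idea; to_sheet_row and COLUMN_MAP are hypothetical names, and apart from "Invoice Date" the column titles are assumptions chosen for the example.

```python
# Extracted field name -> sheet column header.
# Only "Invoice Date" appears in the video summary; the rest are assumed.
COLUMN_MAP = {
    "invoice_number": "Invoice Number",
    "invoice_date": "Invoice Date",
    "invoice_from": "From",
    "invoice_to": "To",
    "invoice_total": "Total",
}


def to_sheet_row(extracted, header):
    """Order the extracted values to match the sheet's header row."""
    field_for_column = {column: field for field, column in COLUMN_MAP.items()}
    # Columns without a mapped field, or fields missing from the
    # extraction, become empty cells.
    return [extracted.get(field_for_column.get(column), "") for column in header]
```

Building the row from the header order keeps the data aligned even if the sheet's columns are rearranged later.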

Testing: The workflow is executed. An email with an invoice is sent. The IMAP trigger receives it, the If node validates it as a PDF, the Unstract node extracts the data, and the Google Sheets node appends a new row with the extracted information.

6. Demonstrating Robustness with Difficult Documents

The video highlights the limitations of the default N8N "Extract from File" node (used in its PDF mode) by comparing its output with the Unstract node for challenging documents:

  • Scanned Misaligned Invoice:
    • Default Node: Produces no usable text output.
    • Unstract Node: Successfully extracts key information, including the total amount, invoice number, and date, even with misalignment.
  • Handwritten Invoice:
    • Default Node: Fails to extract any meaningful data.
    • Unstract Node: Accurately extracts the invoice total, sender, receiver, invoice number, and even a date that was not immediately obvious to the presenter.

In both tests, Unstract's LLM-powered OCR and extraction handled complex, low-quality documents that the default node could not process.

7. Conclusion and Key Takeaways

The video concludes that the N8N workflow built on the Unstract community node is effective and robust enough for professional use.

Main Takeaways:

  • N8N is a powerful tool for automating document processing workflows.
  • For reliable and structured extraction from challenging PDFs (scanned, handwritten, poor quality), specialized tools like Unstract are essential.
  • The Unstract community node for N8N integrates seamlessly, allowing for advanced LLM-powered extraction.
  • The workflow can be customized to use different triggers (e.g., file uploads) and destinations (e.g., databases, other APIs) beyond email and Google Sheets.
  • The combination of N8N and Unstract provides a robust solution for automating document-intensive business processes.
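As an illustration of swapping the destination, here is a minimal Python sketch that stores an extracted invoice in SQLite instead of Google Sheets. This is not shown in the video; the store_invoice function, the table name, and the column types are assumptions made for the example.

```python
import sqlite3


def store_invoice(conn, invoice):
    """Insert or update one extracted invoice, keyed by its invoice number."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS invoices (
               invoice_number TEXT PRIMARY KEY,
               invoice_date   TEXT,
               invoice_from   TEXT,
               invoice_to     TEXT,
               invoice_total  REAL
           )"""
    )
    # Named placeholders are filled from the dict keys.
    conn.execute(
        """INSERT OR REPLACE INTO invoices
           VALUES (:invoice_number, :invoice_date, :invoice_from,
                   :invoice_to, :invoice_total)""",
        invoice,
    )
    conn.commit()
```

Using the invoice number as the primary key means a re-processed email updates the existing row rather than creating a duplicate.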
