Unlock Multimodal RAG Agents in n8n (Images, Tables & Text)
By The AI Automators
Multimodal RAG Agent for Complex PDFs: Summary
Key Concepts:
- Multimodal RAG (Retrieval-Augmented Generation)
- OCR (Optical Character Recognition)
- AI Vision Model
- Vector Database (Supabase)
- LLM (Large Language Model) - GPT-4.1
- Embedding Model
- Markdown
- Base64 Encoding
- Supabase Storage
- n8n Automation Platform
- Mistral OCR API
- Image Annotation
- Chunking
- Data Ingestion Pipeline
- System Prompt
1. Overview of the Multimodal RAG Agent
The video demonstrates building a multimodal RAG agent capable of indexing and analyzing text, images, and tables from complex PDFs. The agent leverages OCR, AI vision, and vector databases to provide comprehensive answers to user queries, including relevant images and tables directly within the response.
2. Data Ingestion Pipeline
Step 1: Document Processing with Mistral OCR
- Input: Information-dense PDFs (e.g., product manuals).
- OCR API: Mistral OCR is used to extract data and annotate media. It works for both machine-readable and scanned PDFs.
- API Endpoint: Data is sent to Mistral's API as PDF documents.
- Output:
- Markdown format: LLM-friendly text with inline file names.
- Array of elements: Images and charts in base64 format.
- Image Analysis: Mistral uses an AI vision model to analyze images and provide annotations based on a user-defined prompt. This allows for deep contextual understanding of the images.
- Pricing: $1 per 1,000 pages for OCR, $3 per 1,000 pages for annotations.
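The OCR call above can be sketched as a request body built in an n8n Code node or HTTP Request node. The model name and field names below follow Mistral's OCR API documentation at the time of writing, but treat them as assumptions and verify against the current API reference:

```javascript
// Sketch of the OCR request body sent to Mistral's API.
// Model and field names are assumptions — check Mistral's current OCR docs.
function buildOcrRequest(signedUrl) {
  return {
    model: "mistral-ocr-latest",   // OCR model name (assumed)
    document: {
      type: "document_url",        // point Mistral at the signed PDF URL
      document_url: signedUrl,
    },
    include_image_base64: true,    // return extracted images as base64
  };
}

const body = buildOcrRequest("https://example.com/manual.pdf");
console.log(body.document.type); // "document_url"
```

The signed URL here is the one obtained from Mistral's files endpoint after uploading the PDF (see section 4.1).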
Step 2: Data Storage in Supabase
- Image Upload: Images extracted by Mistral are uploaded to Supabase storage.
- Vectorization:
- Markdown text is chunked into manageable pieces.
- An embedding model (e.g., OpenAI's text-embedding-3-small) translates the chunks into vectors.
- Vectors are stored in a Supabase vector database.
3. Querying and Response Generation
Step 1: Query Processing
- User input (question) is passed to the AI agent.
- The query is transformed into a vector using the same embedding model used during ingestion.
- The vector database is searched for the stored chunks most similar to the query vector.
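Under the hood, that retrieval step is a nearest-neighbor search by vector similarity. Supabase (via pgvector) performs this inside the database, but the idea can be illustrated with a minimal cosine-similarity sketch:

```javascript
// Illustration only: Supabase/pgvector does this search in SQL.
// Cosine similarity between a query vector and a stored chunk vector.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored chunks by similarity to the query vector, highest first.
function topK(queryVec, chunks, k) {
  return chunks
    .map(c => ({ ...c, score: cosine(queryVec, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 2-dimensional vectors; real embeddings have ~1,500 dimensions.
const chunks = [
  { text: "drum cleaning", vector: [1, 0] },
  { text: "error codes",   vector: [0, 1] },
];
const results = topK([0.9, 0.1], chunks, 1);
console.log(results[0].text); // "drum cleaning"
```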
Step 2: Response Generation
- The vector database returns the top results (including image URLs).
- A large language model (LLM) like GPT-4.1 generates a response based on the query and the retrieved data.
- The LLM is specifically prompted to render images where available, using the image URLs from the vector database response.
4. Building the Workflow in n8n
4.1. Setting up Mistral OCR Integration
- HTTP Request (GET): Retrieve a PDF from a publicly accessible URL.
- Mistral Account Setup: Create an account on Mistral and obtain an API key.
- HTTP Request (POST - Upload File): Upload the PDF to Mistral using the OCR with uploaded PDF endpoint. The curl command is imported into the HTTP request node.
- Authentication: Use the predefined credential type "Mistral Cloud account" in n8n.
- Map the binary data from the PDF to the request.
- HTTP Request (POST - Get Signed URL): Obtain a signed URL for the uploaded document using the signed URL endpoint.
- Authentication: Use the predefined credential type "Mistral Cloud account" in n8n.
- Map the ID from the previous request to the request body.
- HTTP Request (POST - Get OCR Results): Retrieve the OCR results using the get OCR results endpoint.
- Authentication: Use the predefined credential type "Mistral Cloud account" in n8n.
- Replace the signed URL in the JSON body with the signed URL obtained in the previous step.
- Change the response format from "file" to "JSON".
- Update the request body to include image annotations by providing a specific schema.
4.2. Data Transformation and Preparation
- Split Out Node: Split the array of pages into individual items for processing.
- Code Node (JavaScript): Use JavaScript code to insert image annotations directly into the markdown text. The code iterates through the images and replaces the file names with the full markdown description of each image.
- The code is designed to run "once for each item".
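A Code node doing this might look like the following sketch. The field names (`markdown`, `images`, `id`, `image_annotation`) are assumptions about the shape of Mistral's OCR response, so adjust them to the actual payload:

```javascript
// Hypothetical n8n Code node ("run once for each item"): replace each inline
// image reference in the page markdown with the image's annotation text.
// Field names are assumptions about the OCR response shape.
function annotateMarkdown(page) {
  let md = page.markdown;
  for (const img of page.images || []) {
    // Mistral inlines images as ![id](id); swap the alt text for the annotation.
    const ref = `![${img.id}](${img.id})`;
    md = md.split(ref).join(`![${img.image_annotation}](${img.id})`);
  }
  return md;
}

const page = {
  markdown: "See ![img-0.jpeg](img-0.jpeg) for the control panel.",
  images: [{ id: "img-0.jpeg", image_annotation: "Washing machine control panel diagram" }],
};
console.log(annotateMarkdown(page));
```

This keeps the file name in the link target (so it can later be swapped for a storage URL) while putting the annotation where the LLM will read it.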
- Supabase Vector Store Node:
- Select the Supabase account.
- Choose the "insert documents" operation.
- Select the "documents" table.
- Use the "default data loader" and load specific data (the markdown text).
- Use a "recursive text splitter" with a chunk size of 1000 and an overlap of 200.
- Select an embedding model (e.g., OpenAI's text-embedding-3-small).
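The chunking parameters above (size 1000, overlap 200) can be illustrated with a simplified sliding-window splitter. Note that LangChain's recursive character splitter, which the node actually uses, additionally tries to break on paragraph and sentence boundaries rather than at fixed offsets:

```javascript
// Simplified sliding-window chunker with the workflow's parameters
// (chunk size 1000 characters, overlap 200). Illustration only — the
// recursive splitter prefers natural boundaries over fixed offsets.
function chunkText(text, size = 1000, overlap = 200) {
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

const chunks = chunkText("x".repeat(2500));
console.log(chunks.length);    // 3
console.log(chunks[1].length); // 1000
```

The overlap means each chunk repeats the last 200 characters of its predecessor, so sentences straddling a chunk boundary still appear intact in at least one chunk.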
4.3. Chat Interface with n8n Agent
- Add Chat Message Trigger: Add a "chat message" trigger to initiate the chat flow.
- AI Agent Node: Use an AI agent node for the chat interface.
- Select an OpenAI chat model (e.g., GPT-4.1).
- Set the sampling temperature to a lower value to reduce randomness.
- Simple Memory Node: Add a simple memory node for in-memory conversation history.
- Supabase Vector Store Node (Tool):
- Set the operation mode to "retrieve documents".
- Select the "documents" table.
- Set a limit for the number of documents to retrieve (e.g., 4).
- Select the same embedding model used for embedding the data.
- System Prompt: Provide a system prompt to guide the LLM's behavior.
- Example: "You are a washing machine expert. You are tasked with answering a question using the information retrieved from the attached vector store. Your goal is to provide an accurate answer based on this information only. If you cannot answer the question using the provided information or if no information is returned from the vector store, say 'Sorry, I don't know.'"
- Make the Workflow Publicly Available: Enable the chat URL to make the agent accessible.
4.4. Uploading Images to Supabase Storage
- Supabase Details: Provide the Supabase base URL and storage bucket name.
- Split Out Node (Pages): Split the pages array into individual pages.
- Split Out Node (Images): Split the images array within each page into individual images.
- Set Fields Node:
- Generate a pseudo-random file name for each image using JavaScript.
- Store the original image ID (file name).
- Store the image annotation.
- Prepare B64 String Node: Strip the data-URI prefix (which carries the MIME type) to isolate the raw base64 string.
- Convert to File Node: Convert the base64 string to a binary file.
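The two steps above can be sketched as one Code-node helper: strip the data-URI prefix, then decode the base64 remainder into binary. In the actual workflow, n8n's Convert to File node handles the decoding:

```javascript
// Sketch of "Prepare B64 String" + "Convert to File": split a base64
// data URI into its MIME type and decoded binary payload.
function dataUriToBinary(dataUri) {
  const match = dataUri.match(/^data:(.+?);base64,(.*)$/);
  if (!match) throw new Error("Not a base64 data URI");
  const [, mimeType, b64] = match;
  return { mimeType, buffer: Buffer.from(b64, "base64") };
}

// Demo with placeholder bytes standing in for real image data.
const { mimeType, buffer } = dataUriToBinary(
  "data:image/jpeg;base64," + Buffer.from("fake-image-bytes").toString("base64")
);
console.log(mimeType);          // "image/jpeg"
console.log(buffer.toString()); // "fake-image-bytes"
```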
- HTTP Request (POST - Upload to Supabase): Upload the binary file to Supabase storage.
- Use the Supabase API credential for authentication.
- Send the binary data as the request body.
- Merge Results Node: Merge the results from Supabase with the original stream using the file name as the matching field.
- Aggregate All Items Node: Aggregate all the merged items into an array of uploaded images.
- Code Node (JavaScript): Update the inline markdown to replace the original image file names with the full Superbase URLs.
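That final Code node might look like the sketch below. The URL pattern follows Supabase's public-bucket convention (`/storage/v1/object/public/{bucket}/{path}`); the project URL, bucket name, and field names are placeholders:

```javascript
// Hypothetical final Code node: swap each original image file name in the
// markdown for its public Supabase storage URL. Field names and the
// bucket/project values are placeholders for this sketch.
function rewriteImageUrls(markdown, uploads, baseUrl, bucket) {
  let md = markdown;
  for (const u of uploads) {
    const publicUrl = `${baseUrl}/storage/v1/object/public/${bucket}/${u.fileName}`;
    md = md.split(u.originalId).join(publicUrl);
  }
  return md;
}

const md = rewriteImageUrls(
  "![panel](img-0.jpeg)",
  [{ originalId: "img-0.jpeg", fileName: "a1b2c3.jpeg" }],
  "https://example.supabase.co",
  "rag-images"
);
console.log(md);
```

After this step, the markdown stored in the vector database carries resolvable image URLs, which is what lets the chat agent render images inline in its answers.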
5. Key Arguments and Perspectives
- Importance of Image Analysis: The video emphasizes the importance of using AI vision models to understand the content of images, rather than simply indexing them. This allows for more effective and context-aware responses.
- Customization and Granularity: The video highlights the ability to customize the image annotation process by providing specific prompts to the vision API. This allows for fine-grained control over the level of detail and the type of information extracted from the images.
- Importance of System Prompt: The video stresses the importance of providing a clear and concise system prompt to the LLM. This helps to ensure that the LLM provides accurate and relevant responses based on the retrieved data.
6. Conclusion
The video provides a detailed walkthrough of building a multimodal RAG agent that can effectively process and analyze complex PDFs. By combining OCR, AI vision, vector databases, and LLMs, the agent can provide comprehensive and informative responses to user queries, including relevant images and tables. Using n8n as the automation platform simplifies building and deploying the agent. The video also highlights the importance of careful data preparation, system prompting, and customization to achieve optimal performance.