DEPLOY Fully Private + Local AI RAG Agents (Step by Step)

Local Multimodal RAG Agent with n8n, Ollama, Docling & Docker

Key Concepts:

RAG (Retrieval-Augmented Generation): A technique combining information retrieval with generative AI to improve accuracy and relevance of responses.
Multimodal RAG: RAG utilizing knowledge bases with multiple data types (text, images, audio, video).
Docling: An open-source document processing library for extracting structured data from various file formats.
VLM (Vision Language Model): AI models capable of processing both visual and textual information.
Ollama: A framework for running large language models locally.
Docker: A platform for containerizing applications, ensuring consistent environments.
Air-gapped System: A system isolated from external networks for enhanced security.
Vector Store (Qdrant): A database for storing vector embeddings of documents, enabling semantic search.
Embedding Models (Nomic Embed Text): Models that convert text into numerical vectors representing semantic meaning.
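
To make the last two concepts concrete, here is a minimal sketch of semantic search over embeddings. The three-dimensional vectors and document names are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and a vector store performs this ranking at scale rather than in a Python loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy vectors (hypothetical); each one stands in for an embedded text chunk.
docs = {
    "lease_terms": [0.9, 0.1, 0.0],
    "lab_results": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedded user question
ranking = rank_by_similarity(query, docs)
```

The chunk whose vector points in nearly the same direction as the query ranks first, which is the sense in which a vector store enables semantic search.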

1. Introduction & Motivation

The video focuses on building a fully local, air-gapped AI agent capable of interrogating private documents using Retrieval-Augmented Generation (RAG). The primary motivation is data security and control, particularly for sensitive documents like legal, medical, or financial records. The presenter argues that deploying AI on-premise reduces risk and is becoming increasingly feasible with advancements in local models. The project utilizes n8n, Ollama, Docling, and Docker.

2. Multimodal RAG Explained

Multimodal RAG involves retrieving information from a knowledge base containing diverse data types – text documents, PDFs with embedded images/tables, audio, and video. The key benefit is the ability to retrieve and return embedded images as part of the chat conversation, a capability lacking in many standard AI agents. This allows for a more comprehensive understanding of the document content.
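
One plausible mechanism for returning images in chat, sketched under assumptions: the extracted markdown keeps its image references, and those references are rewritten to URLs a browser can load. The base URL, folder name, and file paths below are hypothetical and not taken from the video.

```python
import re
from pathlib import PurePosixPath

# Hypothetical URL of the folder exposed by a static file server.
IMAGE_BASE_URL = "http://localhost:8080/extracted-images"

# Matches markdown image syntax: ![alt text](path)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")

def rewrite_image_links(markdown, base_url=IMAGE_BASE_URL):
    """Point every markdown image reference at the served folder."""
    def _swap(match):
        alt, path = match.group(1), match.group(2)
        filename = PurePosixPath(path).name  # drop the local directory part
        return f"![{alt}]({base_url}/{filename})"
    return IMAGE_REF.sub(_swap, markdown)

chunk = "Figure 2 shows the wiring.\n\n![wiring diagram](scratch/doc1/picture-3.png)"
served = rewrite_image_links(chunk)
```

If the rewritten chunk is what gets embedded and stored, an agent that retrieves it can pass the image link through to the chat window unchanged.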

3. Document Processing with Dockling

Docling, an open-source document processing library from IBM, is used to convert various file formats (PDFs, Word docs, PowerPoint, images, audio) into clean, structured markdown or JSON. It recognizes headers and tables and extracts diagrams as images, preserving the semantic structure of the original document. Docling offers two processing pipelines:

Standard Pipeline: Uses a series of non-generative models for layout analysis, table extraction, OCR, and assembly. This approach avoids hallucinations as it copies text verbatim. Specialized pipelines exist for different file formats.
VLM Pipeline: Employs a Vision Language Model (VLM) to extract text from documents, breaking them into pages and processing them in batches. While powerful, this pipeline can introduce hallucinations due to the generative nature of VLMs. Options for local VLMs include Granite (IBM), SmolDocling, and Qwen, alongside cloud-based options like Gemini, OpenAI, and Claude (which cannot be run locally). Ollama (https://ollama.com) is recommended for running local VLMs like Ministral and DeepSeek-OCR.
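
The trade-off between the two pipelines can be illustrated with a small check that is not part of the video: if the same pages are extracted both ways, pages where the VLM output diverges from the verbatim extraction are candidates for review. The threshold and sample texts are arbitrary assumptions.

```python
from difflib import SequenceMatcher

def agreement(standard_text, vlm_text):
    """Word-level similarity ratio in [0, 1] between two extractions of a page."""
    return SequenceMatcher(None, standard_text.split(), vlm_text.split()).ratio()

def pages_to_review(standard_pages, vlm_pages, threshold=0.9):
    """Return 1-based page numbers where the two extractions disagree noticeably."""
    flagged = []
    for number, (std, vlm) in enumerate(zip(standard_pages, vlm_pages), start=1):
        if agreement(std, vlm) < threshold:
            flagged.append(number)
    return flagged

# Hypothetical extractions: page 2 of the VLM output contains added text.
standard = ["Total due: 1,200 EUR by 1 March.", "Signed by both parties."]
vlm = ["Total due: 1,200 EUR by 1 March.",
       "Signed by both parties and witnessed by a notary."]
flagged = pages_to_review(standard, vlm)
```

A low score does not prove a hallucination, since the VLM may also have recovered text the OCR step missed; it only marks where the two outputs differ.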

4. Hardware Requirements for Local AI

Running local AI requires significant hardware resources, particularly a graphics card (GPU). LLMs, VLMs, and embedding models rely on neural networks with billions or trillions of parameters. Nvidia GeForce RTX cards are common, but their memory limits the size of model they can hold. Larger models (70 billion parameters) require heavy quantization, potentially sacrificing quality. The presenter emphasizes the upfront investment needed for server infrastructure, scaling with the number of concurrent users and desired response speed. Example GPU costs: RTX 4090 ($1,600), RTX 5090 ($2,000). Cloud-hosted open-source models (Ollama Cloud, OpenRouter) can be used for initial development and testing before investing in local hardware.
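
The quantization point follows from simple arithmetic. The sketch below estimates the memory needed for model weights alone (parameters times bits per weight, divided by 8 bits per byte) and compares it with the 24 GB of an RTX 4090 and the 32 GB of an RTX 5090. It ignores the KV cache and activations, so real requirements are higher.

```python
RTX_4090_VRAM_GB = 24
RTX_5090_VRAM_GB = 32

def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

def fits(params_billions, bits_per_weight, vram_gb):
    """True if the weights alone fit in the given amount of GPU memory."""
    return weight_memory_gb(params_billions, bits_per_weight) <= vram_gb

# A 70B model: 140 GB at 16-bit, 35 GB at 4-bit.
full_precision_gb = weight_memory_gb(70, 16)
four_bit_gb = weight_memory_gb(70, 4)
fits_on_5090 = fits(70, 4, RTX_5090_VRAM_GB)
```

Even at 4 bits per weight, a 70B model's weights exceed a single 32 GB card, which is why such models need heavier quantization, multiple GPUs, or offloading to system memory.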

5. Building the Local RAG Pipeline – Step-by-Step

Setup: The project utilizes a pre-built n8n self-hosted AI starter kit (bundled with n8n, Ollama, Qdrant, and Postgres) with a modified Docker Compose file to include Docling.
Docker Fundamentals: The video explains Docker images (static application code and environment) and containers (running instances of images). Docker volumes/bind mounts are crucial for persistent data storage. Docker Compose orchestrates multiple services and defines volumes, ports, and environment variables.
File Ingestion: A local file trigger in n8n monitors a designated folder ("rag files/pending") for new documents.
Document Processing (Docling): The file is sent to Docling via an HTTP request, using the synchronous processing endpoint. The API extracts structured data and images, saving images to a dedicated folder ("docling scratch").
Image Handling: Images are moved from "docling scratch" to a publicly accessible folder served by an Nginx server ("extracted images").
Vector Embedding (Qdrant): The extracted text is converted into vector embeddings by an embedding model (Nomic Embed Text via Ollama) and stored in Qdrant, a vector store, for semantic search.
AI Agent (Ollama): An AI agent is created in n8n using a local LLM (Llama 3.2). The agent is configured to use Qdrant for information retrieval.
Chat Interface: A simple chat interface is created by embedding the n8n chat widget into an HTML page served by the Nginx server.
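
In the workflow, n8n nodes handle the embedding step. As a rough illustration of the HTTP call underneath, the sketch below builds a request to Ollama's /api/embed endpoint using only the standard library. The localhost address assumes Ollama is reachable on its default port 11434; inside a Docker Compose network the host would be the service name instead, and the sample text is invented.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default Ollama port on this host
EMBED_MODEL = "nomic-embed-text"

def build_embed_request(texts, base_url=OLLAMA_URL, model=EMBED_MODEL):
    """Build (but do not send) a POST request asking Ollama to embed the texts."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts):
    """Send the request; needs a running Ollama instance with the model pulled."""
    with urllib.request.urlopen(build_embed_request(texts)) as response:
        return json.loads(response.read())["embeddings"]

request = build_embed_request(["Clause 4.2 covers early termination."])
```

Calling embed() returns one vector per input text. Those vectors, together with the source text as payload, are what gets written to the vector store.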

6. Key Technologies & Configurations

n8n: Workflow automation platform.
Ollama: Local LLM runner.
Docling: Document processing library.
Docker: Containerization platform.
Qdrant: Vector database.
Nomic Embed Text: Embedding model.
Llama 3.2/GPT-OSS 20B: Large Language Models.
Nginx: Static file server.
Docker Compose: Orchestration tool.

7. Network Access & Deployment

To access the chat interface from other devices on the local network, the presenter explains the need to:

Configure the firewall to allow inbound connections on the necessary ports (8080 for the chat interface, 5678 for n8n).
Potentially set a static IP address for the server.
Consider network complexity and involve the IT team in larger organizations.
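
After changing firewall rules, a quick way to confirm the ports are reachable is to attempt a TCP connection from another machine on the network. This is a minimal sketch; the server address in the usage note is a placeholder.

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def check_services(host, ports=(8080, 5678)):
    """Map each port (chat interface, n8n) to whether it accepts connections."""
    return {port: port_open(host, port) for port in ports}
```

For example, check_services("192.168.1.50") run from a laptop (with the address replaced by the server's actual IP) returns a dictionary such as {8080: True, 5678: False}, which would point to a firewall rule still blocking n8n.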

8. Advanced Techniques & Future Development

The presenter briefly mentions advanced techniques like:

Asynchronous Docling processing using a polling loop.
Picture description API for annotating images.
Handling structured data (Excel, CSV) differently.
Context expansion and document hierarchy extraction.
Contextual vector embeddings.
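
The first item, a polling loop, can be sketched generically: submit the document, then check the task status at intervals until it finishes. The status strings and the get_status callable below are assumptions for illustration; the actual values depend on the API being polled.

```python
import time

def poll_until_done(get_status, interval_s=2.0, max_attempts=30, sleep=time.sleep):
    """Call get_status() until it reports success or failure.

    Returns the number of polls it took. The status strings checked here
    ("success", "failure") are hypothetical placeholders.
    """
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        if status == "success":
            return attempt
        if status == "failure":
            raise RuntimeError("conversion task failed")
        sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"task still running after {max_attempts} polls")

# Stand-in for a real status call, so the loop can be exercised offline.
fake_statuses = iter(["pending", "started", "success"])
attempts_needed = poll_until_done(lambda: next(fake_statuses), sleep=lambda _: None)
```

Polling avoids holding one long HTTP request open while a large document is processed, at the cost of a short delay between completion and pickup.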

9. Conclusion

The video demonstrates a practical approach to building a secure, private, and powerful local RAG agent. The presenter emphasizes the increasing feasibility of on-premise AI deployments and encourages viewers to explore the resources and community (AI Automators: [link in description]) for further development. The project highlights the importance of careful hardware selection, efficient document processing, and a well-designed RAG pipeline. The ability to test with cloud-based models before investing in local infrastructure is also emphasized.
