RAG is Dead? Introducing Agentic File Exploration

Agentic File Search: A Deep Dive into Exploration-Based RAG

Key Concepts:

Retrieval Augmented Generation (RAG): A technique combining information retrieval with LLM generation to improve answer accuracy and context.
Semantic Similarity: Traditional RAG approach relying on finding chunks most similar to the query.
Agentic File Search/Exploration: A novel approach mimicking human information seeking, involving iterative document exploration and cross-referencing.
Chunking: Dividing documents into smaller segments for retrieval, often leading to loss of global context.
Cross-References: Links within documents pointing to related information in other parts of the same or different documents.
LLM (Large Language Model): A deep learning model capable of understanding and generating human-like text.
Olama: A framework for running LLMs locally.
DGX Spark: An NVIDIA server designed for AI and deep learning workloads, featuring high VRAM and unified memory.
Dockling: A tool for converting PDF and text documents into a standard markdown format.
Llama Index: A data framework for LLM applications, used here for event-driven loop orchestration.

The Problem with Current RAG Systems

Current Retrieval Augmented Generation (RAG) systems predominantly rely on semantic similarity to identify relevant document chunks based on a user’s query. The speaker argues this approach suffers from significant limitations. Primarily, chunking inherently leads to a loss of global context and the inability to handle cross-references and document dependencies – common features in real-world documents like legal or insurance contracts. Even if the necessary information exists within the corpus, semantic similarity may fail to retrieve it if it’s spread across multiple chunks or relies on understanding relationships between documents. This results in incomplete or inaccurate answers to complex queries.

Introducing Agentic File Search: Exploration as a Solution

To address these shortcomings, the speaker has developed an open-source project, termed agentic file search or exploration, which aims to replicate how humans investigate information. This system moves beyond simple retrieval and focuses on iteratively exploring documents, following leads, and building a comprehensive understanding. The project currently has over 350 stars on GitHub, indicating growing community interest.

The Three-Phase Exploration Process

The system operates through a three-phase process:

Phase One: Initial Scan: The system scans all available documents based on the user query without relying on pre-built indexes. Documents are pre-processed using Dockling to convert them into a standardized markdown format. An LLM is used to identify potentially relevant documents by analyzing the beginning of each document for keywords related to the query (e.g., identifying financial documents for a finance-related query).
Phase Two: Deep Dive: The LLM reads the full content of the documents identified in Phase One. Crucially, it also identifies potential cross-references that may have been missed during the initial scan. This triggers the agent to backtrack and explore those referenced files.
Phase Three: Context Collection & Answer Generation: Utilizing the backtracking mechanism, the system gathers all necessary context from the explored documents to formulate a detailed and comprehensive answer to the user’s question.

This iterative process allows the system to answer complex questions that would typically be beyond the capabilities of traditional semantic similarity-based RAG.

Technical Architecture & Tools

The system is inspired by coding agents like CloTcode and utilizes six core tools:

Folder Scanning: For identifying files within a directory.
Document Parsing: For extracting text from various document formats.
File Previewing: For quickly inspecting file content.
Reading: For accessing the full content of files.
Regular Expression-Based Search (Reax): For targeted text searches within files.
File Path Pattern Matching: For locating files based on specific naming conventions.

The system’s architecture consists of:

User Interface (UI) & Command Line Interface (CLI): Providing different access methods.
Backend Server Layer: Serving both the UI and CLI.
Orchestration Layer: Managed by Llama Index, enabling event-driven loop orchestration.
Tool Layer: Housing the six tools described above.
Document Processing Layer: Powered by Dockling for standardized formatting.

Initially built with Gemini 3 Flash, the system now supports local models, specifically Quen 3 32B.

Local Model Implementation & Hardware Considerations

Implementing local models required significant adjustments to the system prompt to ensure the LLM accurately follows instructions and utilizes the tools effectively. Smaller Quen models (4B, 8B, 14B) proved insufficient for the complex exploration tasks. Quen 3 32B demonstrated reasonable performance, but required prompt engineering.

The speaker utilizes an Nvidia DGX Spark server to run the 32B model, citing its 128GB of unified memory as crucial for accommodating the model’s size. While inference speed is slower compared to consumer GPUs like the 1490 or 1590, the DGX Spark allows for running significantly larger models and supports longer context windows (64,000 tokens in the demonstrated setup). The speaker also suggests potential benefits from processing requests in batches for serving customers.

Setup & Demonstration

The setup process involves cloning the repository and installing dependencies. Two branches are available: one for Gemini 3 Flash and another for local models with Olama. Running the local version requires setting environment variables for the Olama host and the desired model. The system can be accessed via a UI or CLI.

The speaker demonstrates the system with several queries:

Simple Query: Retrieving the purchase price from a single document (completed quickly, using ~38GB of memory).
Multi-Document Query: Identifying key risks and mitigation measures from multiple files (took ~4 minutes, using ~96GB of memory).
Complex Query: Requiring 14 steps of exploration and resulting in a comprehensive answer (took longer, demonstrating the system’s ability to handle intricate requests).

The speaker emphasizes that the system is designed for document generation and exploration, not real-time chatbot-like responses.

Conclusion & Future Directions

The agentic file search system represents a significant advancement over traditional RAG by mimicking human information-seeking behavior. By prioritizing exploration and cross-referencing, it overcomes the limitations of semantic similarity and chunking, enabling more accurate and comprehensive answers to complex queries. The speaker positions this project as an extension of local GPT and encourages collaboration and contributions. Future experiments will focus on batch processing for customer-facing applications and exploring the capabilities of even larger models (potentially 70B) on the DGX Spark.