Advanced Document Extraction in Python with Gemini

Advanced Document Extraction with Gemini: A Detailed Summary

Key Concepts:

Gemini: Google’s multimodal AI model used for document understanding and information extraction.
Structured Output: Extracting data from documents in a predefined, organized format (e.g., JSON).
Bounding Boxes: Coordinates defining the location of extracted information within a document.
Pydantic: A Python library for data validation and settings management using Python type hints.
PDF Parsing (PyMuPDF, imported as fitz): A library used to open, read, and manipulate PDF documents, including drawing bounding boxes.
API Key: Authentication credential required to access the Gemini API via Google AI Studio or Vertex AI.
Normalization (0-1000 scale): Gemini provides bounding box coordinates normalized between 0 and 1000, requiring rescaling to document dimensions.

1. Introduction & Use Case

The video demonstrates advanced document extraction using Google’s Gemini model. The primary goal isn’t just extracting information from PDF documents (like invoice recipient and total amount) but also identifying where that information is located within the document using bounding boxes. This is particularly useful for multi-page documents and verifying the accuracy of the extracted data. The presenter emphasizes the benefit of a “human-in-the-loop” approach, where bounding boxes allow for quick verification and correction of AI-extracted information.

2. Setup & Environment Configuration

The process begins with setting up a Python environment using uv (a package manager). The following packages are installed:

google-genai: The Gemini API client library.
pydantic: Used to define data models (schemas) for structured output.
python-dotenv: For securely loading the Gemini API key from a .env file.
PyMuPDF (imported as fitz): A library for PDF manipulation, used here for drawing bounding boxes on the document.

An API key is obtained from Google AI Studio (or Vertex AI) and stored in a .env file as Gemini_API_key. The presenter mentions a previous tutorial on the channel detailing Vertex AI setup.

3. Basic Structured Output with Gemini

The initial step involves demonstrating basic structured output without bounding boxes. This is achieved by:

Importing necessary libraries: os, load_dotenv from dotenv, genai from google, and BaseModel and Field from pydantic.
Loading the API key: Using load_dotenv() to access the Gemini_API_key from the .env file.
Creating a Gemini client: client = genai.Client(api_key=os.getenv("Gemini_API_key")).
Defining a Pydantic model (InvoiceModel): This model defines the expected structure of the extracted data, including fields like total (float) and recipient (string), with descriptions for clarity.
Uploading the PDF: pdf_file = client.files.upload(file="invoice.pdf").
Generating content: Using client.models.generate_content() with the uploaded file and a prompt requesting the recipient and total, and a config that sets response_schema to InvoiceModel and response_mime_type to application/json.
Validating and accessing the data: invoice = InvoiceModel.model_validate_json(response.text) and then printing invoice.model_dump() to display the extracted data.
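
The steps above can be sketched roughly as follows. This is a minimal sketch rather than the presenter's exact code: the field descriptions, the prompt wording, the helper name extract_invoice, and the model name "gemini-2.5-flash" are assumptions made for illustration. The SDK imports are placed inside the function so the schema itself can be used without a network connection.

```python
import os

from pydantic import BaseModel, Field


class InvoiceModel(BaseModel):
    """Schema that Gemini is asked to fill in (descriptions are illustrative)."""

    total: float = Field(description="Total amount due on the invoice")
    recipient: str = Field(description="Name of the invoice recipient")


def extract_invoice(pdf_path: str, model_name: str = "gemini-2.5-flash") -> InvoiceModel:
    """Upload a PDF and request structured JSON matching InvoiceModel.

    The default model name is an assumption; use whichever Gemini model
    is available to your account.
    """
    # Imported lazily so the schema above works without the SDK installed.
    from dotenv import load_dotenv
    from google import genai

    load_dotenv()  # reads Gemini_API_key from the .env file
    client = genai.Client(api_key=os.getenv("Gemini_API_key"))

    pdf_file = client.files.upload(file=pdf_path)
    response = client.models.generate_content(
        model=model_name,
        contents=[pdf_file, "Extract the recipient and the total amount of this invoice."],
        config={
            "response_mime_type": "application/json",
            "response_schema": InvoiceModel,
        },
    )
    # Validate the returned JSON text against the schema.
    return InvoiceModel.model_validate_json(response.text)
```

Calling extract_invoice("invoice.pdf").model_dump() would then return a plain dictionary with the total and recipient keys.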

4. Implementing Bounding Boxes

The core of the advanced extraction lies in adding bounding box information. This is achieved by:

Creating a BoundingBoxField base class: This class defines the bounding_box (list of integers representing coordinates: xmin, ymin, xmax, ymax) and page (integer representing the page number) attributes.
Defining field-specific models (TotalField, RecipientField): These models inherit from BoundingBoxField and include the specific data type for the field (e.g., total_value as a float for TotalField).
Modifying the InvoiceModel: The total and recipient fields are now defined as TotalField and RecipientField respectively.
Prompt Modification: The prompt is adjusted to instruct Gemini to include bounding box coordinates for each extracted field. The presenter suggests adding an instruction to set missing fields to null and bounding boxes to 0 0 0 0 to handle cases where information isn't found.
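
One way the nested models described above might look is sketched below. The class names BoundingBoxField, TotalField, RecipientField, and InvoiceModel and the field total_value come from the summary; the name recipient_value, the Optional types (so that null is accepted for missing values), the field descriptions, and the prompt wording are assumptions.

```python
from typing import Optional

from pydantic import BaseModel, Field


class BoundingBoxField(BaseModel):
    """Location information shared by every extracted field."""

    bounding_box: list[int] = Field(
        description="[xmin, ymin, xmax, ymax], normalized to a 0-1000 scale"
    )
    page: int = Field(description="Page number on which the value appears")


class TotalField(BoundingBoxField):
    # Optional so that a missing value can be returned as null (assumption).
    total_value: Optional[float] = Field(description="Total amount of the invoice")


class RecipientField(BoundingBoxField):
    # recipient_value is a hypothetical field name chosen for this sketch.
    recipient_value: Optional[str] = Field(description="Name of the invoice recipient")


class InvoiceModel(BaseModel):
    total: TotalField
    recipient: RecipientField


# Illustrative prompt; the presenter's exact wording is not given in the summary.
PROMPT = (
    "Extract the recipient and the total amount of this invoice. "
    "For each field, include the page number and the bounding box as "
    "[xmin, ymin, xmax, ymax]. If a field cannot be found, set its value "
    "to null and its bounding box to 0 0 0 0."
)
```

Each extracted value then carries its own location, so invoice.total.bounding_box and invoice.total.page can be handed directly to the drawing step.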

5. Drawing Bounding Boxes on the PDF

After extracting the data with bounding box coordinates, the code draws rectangles around the extracted information on the PDF:

Opening the PDF: Using fitz.open("invoice.pdf").
Iterating through extracted fields: A list items_to_draw is created, containing tuples of (label, bounding box, page number) for each extracted field.
Rescaling Coordinates: Gemini provides coordinates normalized between 0 and 1000. These coordinates are rescaled to the actual PDF dimensions using the page rectangle (page.rect) and the formula: coordinate = (coordinate / 1000) * dimension, where dimension is the page width for x coordinates and the page height for y coordinates.
Drawing Rectangles: page.draw_rect(rect, color=(1, 0, 0), width=2) draws a red rectangle around the extracted information.
Adding Text Labels: page.insert_text(position, text, color=(1, 0, 0), fontsize=6) adds a text label above the bounding box.
Saving the Annotated PDF: doc.save("invoice_annotated.pdf") saves the modified PDF with the bounding boxes.
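
The drawing steps above could be put together along these lines. This is a sketch under stated assumptions: the helper names rescale_box and annotate_pdf are hypothetical, page numbers are assumed to be 1-based, boxes are assumed to be ordered xmin, ymin, xmax, ymax as in the models described earlier, and a box of all zeros is treated as "field not found" and skipped.

```python
def rescale_box(box, page_width, page_height):
    """Map [xmin, ymin, xmax, ymax] from the 0-1000 scale to page units."""
    xmin, ymin, xmax, ymax = box
    return (
        xmin / 1000 * page_width,
        ymin / 1000 * page_height,
        xmax / 1000 * page_width,
        ymax / 1000 * page_height,
    )


def annotate_pdf(src_path, dst_path, items_to_draw):
    """Draw a labeled red rectangle for each (label, bounding_box, page_number)."""
    import fitz  # PyMuPDF, imported lazily so rescale_box works without it

    doc = fitz.open(src_path)
    for label, box, page_number in items_to_draw:
        if not any(box):
            continue  # 0 0 0 0 marks a field that was not found
        page = doc[page_number - 1]  # assumes 1-based page numbers
        x0, y0, x1, y1 = rescale_box(box, page.rect.width, page.rect.height)
        page.draw_rect(fitz.Rect(x0, y0, x1, y1), color=(1, 0, 0), width=2)
        # Place the label just above the rectangle, keeping it on the page.
        label_point = fitz.Point(x0, max(y0 - 2, 6))
        page.insert_text(label_point, label, color=(1, 0, 0), fontsize=6)
    doc.save(dst_path)
    doc.close()
```

For example, annotate_pdf("invoice.pdf", "invoice_annotated.pdf", [("total", [700, 820, 900, 850], 1)]) would write an annotated copy and leave the original file untouched.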

Notable Quote:

“The problem with AI is that 99% of the time it might give you correct information, maybe 95% of the time, but the problem is the remaining percentage that you don't get the correct answer… you need to have a human in the loop to confirm the information.”

6. Multi-Page Document Support

The presenter demonstrates that the same approach works seamlessly with multi-page PDF documents. The page attribute in the BoundingBoxField ensures that the bounding boxes are drawn on the correct page.
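
To illustrate how the page attribute routes each box to the right page, the following sketch groups extracted items by zero-based page index, which is how PyMuPDF indexes pages. The helper name group_by_page is hypothetical, and whether Gemini returns 1-based or 0-based page numbers is an assumption exposed through the first_page parameter.

```python
from collections import defaultdict


def group_by_page(items_to_draw, first_page=1):
    """Group (label, bounding_box, page_number) tuples by zero-based page index.

    first_page is the number the model uses for the first page
    (assumed to be 1 here; pass 0 if the pages are already zero-based).
    """
    pages = defaultdict(list)
    for label, box, page_number in items_to_draw:
        pages[page_number - first_page].append((label, box))
    return dict(pages)


items = [
    ("recipient", [100, 120, 400, 160], 1),
    ("total", [700, 820, 900, 850], 2),
]
by_page = group_by_page(items)
```

Each key of by_page can then be used as doc[index] when drawing, so every rectangle lands on the page where the value was found.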

7. Conclusion & Future Considerations

The video concludes by highlighting the power of this technique for verifying AI-extracted data and enabling a human-in-the-loop workflow. The presenter suggests that this approach is particularly valuable for complex documents like annual reports, where verifying the source of information is crucial. They also mention potential improvements, such as exploring different ways to structure the bounding box fields, and note that they offer tutoring and freelancing services related to AI and machine learning. The key takeaway is that combining structured output with bounding boxes makes extracted data easier to verify, which improves the reliability and usability of document extraction systems.