Advanced Document Extraction in Python with Gemini
By NeuralNine
Advanced Document Extraction with Gemini: A Detailed Summary
Key Concepts:
- Gemini: Google’s multimodal AI model used for document understanding and information extraction.
- Structured Output: Extracting data from documents in a predefined, organized format (e.g., JSON).
- Bounding Boxes: Coordinates defining the location of extracted information within a document.
- Pydantic: A Python library for data validation and settings management using Python type hints.
- PDF Parsing (PyMuPDF, imported as fitz): A library used to open, read, and manipulate PDF documents, including drawing bounding boxes.
- API Key: Authentication credential required to access the Gemini API via Google AI Studio or Vertex AI.
- Normalization (0-1000 scale): Gemini provides bounding box coordinates normalized between 0 and 1000, requiring rescaling to document dimensions.
1. Introduction & Use Case
The video demonstrates advanced document extraction using Google’s Gemini model. The primary goal isn’t just extracting information from PDF documents (like invoice recipient and total amount) but also identifying where that information is located within the document using bounding boxes. This is particularly useful for multi-page documents and verifying the accuracy of the extracted data. The presenter emphasizes the benefit of a “human-in-the-loop” approach, where bounding boxes allow for quick verification and correction of AI-extracted information.
2. Setup & Environment Configuration
The process begins with setting up a Python environment using uv (a package manager). The following packages are installed:
- google-genai: The Gemini API client library.
- pydantic: Used to define data models (schemas) for structured output.
- python-dotenv: For loading the Gemini API key from a .env file.
- PyMuPDF (imported as fitz): A library for PDF manipulation, used here for drawing bounding boxes on the document.
An API key is obtained from Google AI Studio (or Vertex AI) and stored in a .env file as Gemini_API_key. The presenter mentions a previous tutorial on the channel detailing Vertex AI setup.
3. Basic Structured Output with Gemini
The first step demonstrates basic structured output without bounding boxes:
- Importing the necessary libraries: os, load_dotenv (from dotenv), genai (from google), and BaseModel and Field from pydantic.
- Loading the API key: load_dotenv() reads the .env file so that Gemini_API_key is available through os.getenv().
- Creating a Gemini client: client = genai.Client(api_key=os.getenv("Gemini_API_key")).
- Defining a Pydantic model (InvoiceModel): This model defines the expected structure of the extracted data, including fields like total (float) and recipient (string), with descriptions for clarity.
- Uploading the PDF: pdf_file = client.files.upload(file="invoice.pdf").
- Generating content: Calling client.models.generate_content() with the uploaded file and a prompt requesting the recipient and total, specifying response_schema as InvoiceModel and response_mime_type as application/json.
- Validating and accessing the data: invoice = InvoiceModel.model_validate_json(response.text), then printing invoice.model_dump() to display the extracted data.
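The steps above can be sketched roughly as follows. This is a minimal sketch, not the presenter's exact code: the model name gemini-2.5-flash, the prompt wording, the field descriptions, and the helper name extract_invoice are assumptions, and the environment variable name simply follows the summary. The SDK imports sit inside the function so that the schema can be exercised offline with a hand-written JSON string.

```python
import os

from pydantic import BaseModel, Field


class InvoiceModel(BaseModel):
    # Field descriptions are passed to the model as part of the schema.
    total: float = Field(description="Total amount of the invoice")
    recipient: str = Field(description="Name of the invoice recipient")


def extract_invoice(pdf_path: str, model_name: str = "gemini-2.5-flash") -> InvoiceModel:
    """Hypothetical helper: upload a PDF and request structured JSON output."""
    # Imported lazily so the schema above works without the SDK installed.
    from dotenv import load_dotenv
    from google import genai

    load_dotenv()  # reads the .env file into the process environment
    client = genai.Client(api_key=os.getenv("Gemini_API_key"))

    pdf_file = client.files.upload(file=pdf_path)
    response = client.models.generate_content(
        model=model_name,  # assumed model name, not stated in the summary
        contents=[pdf_file, "Extract the recipient and the total amount from this invoice."],
        config={
            "response_mime_type": "application/json",
            "response_schema": InvoiceModel,
        },
    )
    return InvoiceModel.model_validate_json(response.text)


# Offline check with a hand-written response string (no API call needed).
sample_response = '{"total": 1234.5, "recipient": "ACME Corp"}'
invoice = InvoiceModel.model_validate_json(sample_response)
print(invoice.model_dump())
```

With a valid key in .env, calling extract_invoice("invoice.pdf") would perform the real request; the offline check only confirms that the schema parses the expected JSON shape.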
4. Implementing Bounding Boxes
The core of the advanced extraction lies in adding bounding box information. This is achieved by:
- Creating a BoundingBoxField base class: This class defines the bounding_box (list of integers representing coordinates: xmin, ymin, xmax, ymax) and page (integer representing the page number) attributes.
- Defining field-specific models (TotalField, RecipientField): These models inherit from BoundingBoxField and include the specific data type for the field (e.g., total_value as a float for TotalField).
- Modifying the InvoiceModel: The total and recipient fields are now defined as TotalField and RecipientField respectively.
- Prompt modification: The prompt is adjusted to instruct Gemini to include bounding box coordinates for each extracted field. The presenter suggests adding an instruction to set missing fields to null and bounding boxes to 0 0 0 0 to handle cases where information isn't found.
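One possible shape for these models is sketched below. The summary names only total_value, so recipient_value is a hypothetical field name. Treating page as 1-based and making the values Optional (so that a missing field can be null) are also assumptions. The sample JSON is hand-written to show the "not found" convention, not real model output.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class BoundingBoxField(BaseModel):
    # Coordinates are normalized to the 0-1000 range, per the summary.
    bounding_box: List[int] = Field(
        description="[xmin, ymin, xmax, ymax], normalized to 0-1000"
    )
    # Assumption: pages are counted from 1.
    page: int = Field(description="Page number where the value was found")


class TotalField(BoundingBoxField):
    total_value: Optional[float] = Field(description="Total amount of the invoice")


class RecipientField(BoundingBoxField):
    # Hypothetical field name, chosen by analogy with total_value.
    recipient_value: Optional[str] = Field(description="Name of the invoice recipient")


class InvoiceModel(BaseModel):
    total: TotalField
    recipient: RecipientField


# Hand-written example: the total was found, the recipient was not.
sample_response = """
{
  "total": {"total_value": 99.5, "bounding_box": [700, 850, 900, 880], "page": 1},
  "recipient": {"recipient_value": null, "bounding_box": [0, 0, 0, 0], "page": 1}
}
"""
invoice = InvoiceModel.model_validate_json(sample_response)
print(invoice.total.total_value, invoice.total.bounding_box)
print(invoice.recipient.recipient_value, invoice.recipient.bounding_box)
```

Inheriting from a shared base keeps the location attributes identical across fields, so downstream drawing code can treat every field the same way.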
5. Drawing Bounding Boxes on the PDF
After extracting the data with bounding box coordinates, the code draws rectangles around the extracted information on the PDF:
- Opening the PDF: Using fitz.open("invoice.pdf").
- Iterating through extracted fields: A list items_to_draw is created, containing tuples of (label, bounding box, page number) for each extracted field.
- Rescaling coordinates: Gemini provides coordinates normalized between 0 and 1000. These are rescaled to the actual PDF dimensions using the page rectangle (page.rect) and the formula coordinate = (coordinate / 1000) * dimension, where the dimension is the page width for x values and the page height for y values.
- Drawing rectangles: page.draw_rect(rect, color=(1, 0, 0), width=2) draws a red rectangle around the extracted information.
- Adding text labels: page.insert_text(position, text, color=(1, 0, 0), fontsize=6) adds a text label above the bounding box.
- Saving the annotated PDF: doc.save("invoice_annotated.pdf") saves the modified PDF with the bounding boxes.
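The rescaling and drawing steps might look like the sketch below. The helper names rescale_box and annotate_pdf are hypothetical. The code assumes the [xmin, ymin, xmax, ymax] order described above, 1-based page numbers, a page origin at (0, 0), and the 0 0 0 0 convention for fields that were not found. PyMuPDF is imported inside annotate_pdf so that the arithmetic can be checked on its own.

```python
from typing import Iterable, Sequence, Tuple


def rescale_box(
    box: Sequence[int], page_width: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Map a 0-1000 normalized [xmin, ymin, xmax, ymax] box to page units."""
    xmin, ymin, xmax, ymax = box
    return (
        xmin / 1000 * page_width,   # x values scale with the page width
        ymin / 1000 * page_height,  # y values scale with the page height
        xmax / 1000 * page_width,
        ymax / 1000 * page_height,
    )


def annotate_pdf(
    src: str, dst: str, items_to_draw: Iterable[Tuple[str, Sequence[int], int]]
) -> None:
    """Draw a labeled red rectangle for each (label, box, page_number) item."""
    import fitz  # PyMuPDF, imported lazily

    doc = fitz.open(src)
    for label, box, page_number in items_to_draw:
        if list(box) == [0, 0, 0, 0]:
            continue  # field was not found, nothing to draw
        page = doc[page_number - 1]  # assumption: page numbers start at 1
        x0, y0, x1, y1 = rescale_box(box, page.rect.width, page.rect.height)
        page.draw_rect(fitz.Rect(x0, y0, x1, y1), color=(1, 0, 0), width=2)
        # Place the label just above the box, but keep it on the page.
        page.insert_text((x0, max(y0 - 2, 6)), label, color=(1, 0, 0), fontsize=6)
    doc.save(dst)
    doc.close()


# Arithmetic check on a 600 x 800 point page.
print(rescale_box([250, 500, 750, 1000], 600, 800))  # (150.0, 400.0, 450.0, 800.0)
```

A call such as annotate_pdf("invoice.pdf", "invoice_annotated.pdf", [("total", [700, 850, 900, 880], 1)]) would then write the annotated copy.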
Notable Quote:
“The problem with AI is that 99% of the time it might give you correct information, maybe 95% of the time, but the problem is the remaining percentage that you don't get the correct answer… you need to have a human in the loop to confirm the information.”
6. Multi-Page Document Support
The presenter demonstrates that the same approach works seamlessly with multi-page PDF documents. The page attribute in the BoundingBoxField ensures that the bounding boxes are drawn on the correct page.
7. Conclusion & Future Considerations
The video concludes by highlighting the power of this technique for verifying AI-extracted data and enabling a human-in-the-loop workflow. The presenter suggests that this approach is particularly valuable for complex documents like annual reports, where verifying the source of information is crucial. They also mention potential improvements, such as exploring different ways to structure the bounding box fields. Separately, the presenter notes that they offer tutoring and freelancing services related to AI and machine learning. The key takeaway is that combining structured output with bounding boxes makes extracted values traceable to their location in the document, which improves the reliability and usability of document extraction systems.