AI Agent PDF to JSON

A Python pipeline that extracts structured information from insurance policy PDFs using local LLM processing via Ollama.

Overview

This project processes PDF documents containing insurance policy information and extracts key details using a local language model. The pipeline:

  1. Iterates through PDF files in a designated folder
  2. Extracts text content from each PDF
  3. Processes the text through an LLM (Ollama) to extract structured information
  4. Saves the results as JSON files
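
For orientation, here is a minimal sketch of the outer loop (steps 1 and 4); the extraction logic of steps 2 and 3 is sketched under "How It Works" below, and the actual structure of main.py may differ:

import json
from pathlib import Path

pdf_dir, output_dir = Path("PDFs"), Path("output")
output_dir.mkdir(exist_ok=True)

for pdf_path in sorted(pdf_dir.glob("*.pdf")):          # 1. iterate over PDFs
    record = {"source_file": pdf_path.name}             # 2-3. extracted fields go here
    out_path = output_dir / f"{pdf_path.stem}.json"
    out_path.write_text(json.dumps(record, indent=2))   # 4. save one JSON per PDF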

Features

  • Automated batch processing of multiple PDF files
  • Text extraction from PDF documents using pypdf
  • LLM-powered information extraction using langchain-ollama
  • Structured JSON output for easy data consumption
  • Local processing (no external API calls required)

Prerequisites

  • Python 3.12+
  • Ollama installed locally with at least one model available
  • Required Python packages (see pyproject.toml)

Installation

  1. Install dependencies using uv (or your preferred package manager):

    uv sync

  2. Ensure Ollama is running and you have a model available:

    ollama list

    If you need to pull a model:

    ollama pull llama3.1:8b

Usage

  1. Place your PDF files in the PDFs/ directory

  2. Run the pipeline:

    python main.py

  3. Find the extracted JSON files in the output/ directory

Configuration

Changing the LLM Model

By default, the pipeline uses llama3.1:8b. To use a different model, edit main.py:

llm = ChatOllama(model="your-model-name", temperature=0)

Changing Directories

Modify the main() function in main.py:

pdf_dir = "PDFs"      # Source directory for PDFs
output_dir = "output"  # Destination directory for JSON files

Output Format

The pipeline extracts the following information from insurance policy documents:

  • Policy Type: The type of insurance policy (e.g., "30-Year Term Life Insurance")
  • Policy Holder: The person or entity covered by the policy
  • Coverage Amount: The coverage value with currency

Example output (life_insurance_fictional.json):

{
  "policy_type": "30-Year Term Life Insurance",
  "policy_holder": "Melissa A. Davenport",
  "coverage_amount": "$750,000",
  "source_file": "life_insurance_fictional.pdf"
}
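
Because each result is plain JSON, downstream consumption is straightforward. A small illustrative example (field names as above):

import json
from pathlib import Path

for json_path in sorted(Path("output").glob("*.json")):
    policy = json.loads(json_path.read_text())
    print(policy["policy_holder"], policy["coverage_amount"])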

Dependencies

  • langchain: Core LangChain framework
  • langchain-ollama: Ollama integration for LangChain
  • langchain-core: Core LangChain components
  • langchain-community: Community LangChain integrations
  • pypdf: PDF text extraction
  • ollama: Ollama Python client

How It Works

  1. PDF Text Extraction: Uses pypdf to extract raw text from each PDF page
  2. LLM Processing: Sends the extracted text to Ollama with a structured prompt requesting specific information
  3. JSON Parsing: Parses the LLM response to extract structured JSON data
  4. File Output: Saves each result as a separate JSON file named after the source PDF
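
The sketch below illustrates steps 1-3, assuming the default llama3.1:8b model; the prompt wording and function names are illustrative rather than copied from main.py:

import json
from pypdf import PdfReader
from langchain_ollama import ChatOllama

def extract_text(pdf_path):
    # Step 1: concatenate the raw text of every page.
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_policy_info(text):
    # Steps 2-3: ask the model for the three fields and parse its reply.
    llm = ChatOllama(model="llama3.1:8b", temperature=0)
    prompt = (
        "Extract policy_type, policy_holder and coverage_amount from the "
        "insurance policy below. Respond with JSON only.\n\n" + text
    )
    return json.loads(llm.invoke(prompt).content)

Setting temperature=0 keeps the extraction deterministic, which matters when the output has to be machine-parseable.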

Troubleshooting

Model Not Found Error

If you see an error such as model 'llama3.2' not found, check which models are available:

ollama list

Then update the model name in main.py to match an available model.

No Text Extracted

If no text can be extracted from a PDF, the file may be:

  • An image-based (scanned) PDF that requires OCR
  • Password-protected
  • Corrupted
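
A quick diagnostic with pypdf can usually tell which case applies (a standalone sketch, not part of the pipeline):

from pypdf import PdfReader

def diagnose(pdf_path):
    try:
        reader = PdfReader(pdf_path)
    except Exception as exc:
        return f"cannot open file: {exc}"          # likely corrupted
    if reader.is_encrypted:
        return "password-protected"
    text = "".join(page.extract_text() or "" for page in reader.pages)
    if not text.strip():
        return "no embedded text (likely a scanned PDF that needs OCR)"
    return "text extraction looks fine"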

JSON Parsing Errors

If the LLM response isn't valid JSON, the pipeline will attempt to extract JSON from the response or return a fallback structure with the raw response.
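
One common way to implement that recovery is to pull the first {...} block out of the reply and otherwise wrap the raw text; this is a sketch of the idea, not necessarily the exact logic in main.py:

import json
import re

def parse_llm_reply(reply: str) -> dict:
    try:
        return json.loads(reply)                        # ideal case: reply is pure JSON
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", reply, re.DOTALL)  # otherwise grab the first {...} span
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return {"raw_response": reply}                      # fallback structure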

License

This project is provided as-is for educational and development purposes.
