AI Agent PDF to JSON

A Python pipeline that extracts structured information from insurance policy PDFs using local LLM processing via Ollama.

Overview

This project processes PDF documents containing insurance policy information and extracts key details using a local language model. The pipeline:

  1. Iterates through PDF files in a designated folder
  2. Extracts text content from each PDF
  3. Processes the text through an LLM (Ollama) to extract structured information
  4. Saves the results as JSON files
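
For orientation, here is a minimal sketch of the outer loop (steps 1 and 4); the extraction logic of steps 2 and 3 is sketched under "How It Works" below, and the actual structure of main.py may differ:

import json
from pathlib import Path

pdf_dir, output_dir = Path("PDFs"), Path("output")
output_dir.mkdir(exist_ok=True)

for pdf_path in sorted(pdf_dir.glob("*.pdf")):          # 1. iterate over PDFs
    record = {"source_file": pdf_path.name}             # 2-3. extracted fields go here
    out_path = output_dir / f"{pdf_path.stem}.json"
    out_path.write_text(json.dumps(record, indent=2))   # 4. save one JSON per PDF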

Features

  • Automated batch processing of multiple PDF files
  • Text extraction from PDF documents using pypdf
  • LLM-powered information extraction using langchain-ollama
  • Structured JSON output for easy data consumption
  • Local processing (no external API calls required)

Prerequisites

  • Python 3.12+
  • Ollama installed locally with at least one model available
  • Required Python packages (see pyproject.toml)

Installation

  1. Install dependencies using uv (or your preferred package manager):

    uv sync

  2. Ensure Ollama is running and you have a model available:

    ollama list

    If you need to pull a model:

    ollama pull llama3.1:8b

Usage

  1. Place your PDF files in the PDFs/ directory

  2. Run the pipeline:

    python main.py

  3. Find the extracted JSON files in the output/ directory

Configuration

Changing the LLM Model

By default, the pipeline uses llama3.1:8b. To use a different model, edit main.py:

llm = ChatOllama(model="your-model-name", temperature=0)

Changing Directories

Modify the main() function in main.py:

pdf_dir = "PDFs"      # Source directory for PDFs
output_dir = "output"  # Destination directory for JSON files

Output Format

The pipeline extracts the following information from insurance policy documents:

  • Policy Type: The type of insurance policy (e.g., "30-Year Term Life Insurance")
  • Policy Holder: The person or entity covered by the policy
  • Coverage Amount: The coverage value with currency

Example output (life_insurance_fictional.json):

{
  "policy_type": "30-Year Term Life Insurance",
  "policy_holder": "Melissa A. Davenport",
  "coverage_amount": "$750,000",
  "source_file": "life_insurance_fictional.pdf"
}
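
Because each result is plain JSON, downstream consumption is straightforward. A small illustrative example (field names as above):

import json
from pathlib import Path

for json_path in sorted(Path("output").glob("*.json")):
    policy = json.loads(json_path.read_text())
    print(policy["policy_holder"], policy["coverage_amount"])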

Dependencies

  • langchain: Core LangChain framework
  • langchain-ollama: Ollama integration for LangChain
  • langchain-core: Core LangChain components
  • langchain-community: Community LangChain integrations
  • pypdf: PDF text extraction
  • ollama: Ollama Python client

How It Works

  1. PDF Text Extraction: Uses pypdf to extract raw text from each PDF page
  2. LLM Processing: Sends the extracted text to Ollama with a structured prompt requesting specific information
  3. JSON Parsing: Parses the LLM response to extract structured JSON data
  4. File Output: Saves each result as a separate JSON file named after the source PDF
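
The sketch below illustrates steps 1-3, assuming the default llama3.1:8b model; the prompt wording and function names are illustrative rather than copied from main.py:

import json
from pypdf import PdfReader
from langchain_ollama import ChatOllama

def extract_text(pdf_path):
    # Step 1: concatenate the raw text of every page.
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_policy_info(text):
    # Steps 2-3: ask the model for the three fields and parse its reply.
    llm = ChatOllama(model="llama3.1:8b", temperature=0)
    prompt = (
        "Extract policy_type, policy_holder and coverage_amount from the "
        "insurance policy below. Respond with JSON only.\n\n" + text
    )
    return json.loads(llm.invoke(prompt).content)

Setting temperature=0 keeps the extraction deterministic, which matters when the output has to be machine-parseable.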

Troubleshooting

Model Not Found Error

If you see an error such as model 'llama3.2' not found, check which models are available:

ollama list

Then update the model name in main.py to match an available model.

No Text Extracted

If no text can be extracted from a PDF, the file may be:

  • An image-based (scanned) PDF that requires OCR
  • Password-protected
  • Corrupted
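
A quick diagnostic with pypdf can usually tell which case applies (a standalone sketch, not part of the pipeline):

from pypdf import PdfReader

def diagnose(pdf_path):
    try:
        reader = PdfReader(pdf_path)
    except Exception as exc:
        return f"cannot open file: {exc}"          # likely corrupted
    if reader.is_encrypted:
        return "password-protected"
    text = "".join(page.extract_text() or "" for page in reader.pages)
    if not text.strip():
        return "no embedded text (likely a scanned PDF that needs OCR)"
    return "text extraction looks fine"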

JSON Parsing Errors

If the LLM response isn't valid JSON, the pipeline will attempt to extract JSON from the response or return a fallback structure with the raw response.
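
One common way to implement that recovery is to pull the first {...} block out of the reply and otherwise wrap the raw text; this is a sketch of the idea, not necessarily the exact logic in main.py:

import json
import re

def parse_llm_reply(reply: str) -> dict:
    try:
        return json.loads(reply)                        # ideal case: reply is pure JSON
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", reply, re.DOTALL)  # otherwise grab the first {...} span
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return {"raw_response": reply}                      # fallback structure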

License

This project is provided as-is for educational and development purposes.
