A Python pipeline that extracts structured information from insurance policy PDFs using local LLM processing via Ollama.
This project processes PDF documents containing insurance policy information and extracts key details using a local language model. The pipeline:
- Iterates through PDF files in a designated folder
- Extracts text content from each PDF
- Processes the text through an LLM (Ollama) to extract structured information
- Saves the results as JSON files
## Features

- Automated batch processing of multiple PDF files
- Text extraction from PDF documents using `pypdf`
- LLM-powered information extraction using `langchain-ollama`
- Structured JSON output for easy data consumption
- Local processing (no external API calls required)
## Prerequisites

- Python 3.12+
- Ollama installed locally with at least one model available
- Required Python packages (see `pyproject.toml`)
## Installation

1. Install dependencies using `uv` (or your preferred package manager):

   ```bash
   uv sync
   ```

2. Ensure Ollama is running and you have a model available:

   ```bash
   ollama list
   ```

   If you need to pull a model:

   ```bash
   ollama pull llama3.1:8b
   ```

## Usage

1. Place your PDF files in the `PDFs/` directory.
2. Run the pipeline:

   ```bash
   python main.py
   ```

3. Find the extracted JSON files in the `output/` directory.
## Configuration

By default, the pipeline uses `llama3.1:8b`. To use a different model, edit `main.py`:

```python
llm = ChatOllama(model="your-model-name", temperature=0)
```

To change the input and output directories, modify the `main()` function in `main.py`:

```python
pdf_dir = "PDFs"        # Source directory for PDFs
output_dir = "output"   # Destination directory for JSON files
```

## Extracted Fields

The pipeline extracts the following information from insurance policy documents:
- Policy Type: The type of insurance policy (e.g., "30-Year Term Life Insurance")
- Policy Holder: The person or entity the policy is for
- Coverage Amount: The coverage value with currency
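The actual prompt lives in `main.py`; the sketch below only illustrates how a prompt requesting these three fields might be phrased. The `build_prompt` helper and its wording are assumptions, not the project's real prompt.

```python
def build_prompt(document_text: str) -> str:
    """Illustrative extraction prompt; the real prompt in main.py may differ."""
    return (
        "Extract the following fields from the insurance policy text below, "
        "and respond with ONLY a JSON object containing these keys:\n"
        '  "policy_type": the type of insurance policy,\n'
        '  "policy_holder": the person or entity the policy is for,\n'
        '  "coverage_amount": the coverage value with currency.\n\n'
        f"Policy text:\n{document_text}"
    )
```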
Example output (`life_insurance_fictional.json`):

```json
{
  "policy_type": "30-Year Term Life Insurance",
  "policy_holder": "Melissa A. Davenport",
  "coverage_amount": "$750,000",
  "source_file": "life_insurance_fictional.pdf"
}
```

## Dependencies

- `langchain`: Core LangChain framework
- `langchain-ollama`: Ollama integration for LangChain
- `langchain-core`: Core LangChain components
- `langchain-community`: Community LangChain integrations
- `pypdf`: PDF text extraction
- `ollama`: Ollama Python client
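Once the pipeline has run, the per-PDF JSON files in `output/` can be loaded for downstream use. A minimal sketch; the `load_results` helper is illustrative and not part of the project:

```python
import json
from pathlib import Path

def load_results(output_dir: str = "output") -> list[dict]:
    """Load every extracted-policy JSON file from the output directory."""
    results = []
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            results.append(json.load(f))
    return results
```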
## How It Works

1. **PDF Text Extraction**: Uses `pypdf` to extract raw text from each PDF page.
2. **LLM Processing**: Sends the extracted text to Ollama with a structured prompt requesting specific information.
3. **JSON Parsing**: Parses the LLM response to extract structured JSON data.
4. **File Output**: Saves each result as a separate JSON file named after the source PDF.
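The four steps above can be sketched end to end. This is a hedged reconstruction, not the project's actual `main.py`: the prompt text, function names, and file layout are assumptions, and the third-party imports (`pypdf`, `langchain-ollama`) are kept local to `run_pipeline` so the naming helper works without them installed.

```python
import json
from pathlib import Path

def output_path(pdf_path: str, output_dir: str = "output") -> Path:
    """JSON output file named after the source PDF (step 4's naming rule)."""
    return Path(output_dir) / (Path(pdf_path).stem + ".json")

def run_pipeline(pdf_dir: str = "PDFs", output_dir: str = "output") -> None:
    from pypdf import PdfReader              # step 1: PDF text extraction
    from langchain_ollama import ChatOllama  # step 2: LLM processing

    llm = ChatOllama(model="llama3.1:8b", temperature=0)
    Path(output_dir).mkdir(exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
        prompt = f"Extract policy_type, policy_holder and coverage_amount as JSON:\n{text}"
        response = llm.invoke(prompt)         # step 2: one call per document
        data = json.loads(response.content)   # step 3 (the real code falls back on parse errors)
        data["source_file"] = pdf.name
        output_path(pdf.name, output_dir).write_text(json.dumps(data, indent=2))

if __name__ == "__main__":
    run_pipeline()
```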
## Troubleshooting

**Model not found.** If you see `model 'llama3.2' not found`, check the available models:

```bash
ollama list
```

Then update the model name in `main.py` to match an available model.
**PDF extraction failures.** If a PDF fails to extract text, it may be:

- An image-based (scanned) PDF requiring OCR
- Password-protected
- A corrupted file

**Invalid JSON responses.** If the LLM response isn't valid JSON, the pipeline attempts to extract JSON from the response, or returns a fallback structure containing the raw response.
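That fallback behavior could look like the following sketch. The `parse_llm_json` helper and the fallback keys are assumptions for illustration, not the project's exact implementation:

```python
import json
import re

def parse_llm_json(response_text: str) -> dict:
    """Parse an LLM reply as JSON, salvaging an embedded object if needed."""
    try:
        return json.loads(response_text)
    except json.JSONDecodeError:
        pass
    # Models often wrap JSON in prose; try the first {...} span in the reply.
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Fallback structure preserving the raw response for inspection.
    return {"error": "invalid JSON", "raw_response": response_text}
```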
## License

This project is provided as-is for educational and development purposes.