behzad-amini/loan-doc-gen

Loan Document Generator (Synthetic)

Generate consistent, per-applicant loan application document packages for agent testing.

Features

  • Realistic layouts: Professional T4, paystub, rental agreement, and government ID that closely resemble real Canadian documents.
  • Perfect consistency: One seeded persona (name, SIN, address, income, employer, etc.) is used across all documents.
  • Fault injection: Deliberately faulty documents (mismatch_sin_on_paystub, missing_paystub_net_pay, incomplete_rental_agreement).
  • Scan degradation: Optional scanned-looking versions of the generated documents (blur, noise, rotation).
  • Ground truth: Every package includes ground_truth.json for easy automated verification.
  • Tooling: pre-commit, black, flake8, mypy, and pytest are configured for the repository.
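
As a rough illustration of the fault injection feature listed above, the sketch below picks one of the three named faults with a given probability and mutates a copy of the document data. Only the fault names come from this README. The function name, the dictionary layout, and the way each fault alters the data are assumptions and may differ from what faults.py actually does.

```python
from __future__ import annotations

import copy
import random

# Fault names are taken from the feature list; their exact effects are assumed.
FAULTS = (
    "mismatch_sin_on_paystub",
    "missing_paystub_net_pay",
    "incomplete_rental_agreement",
)


def maybe_inject_fault(
    docs: dict, fault_rate: float, rng: random.Random
) -> tuple[dict, str | None]:
    """Return a (possibly faulty) deep copy of docs and the fault name, if any."""
    docs = copy.deepcopy(docs)
    if rng.random() >= fault_rate:
        return docs, None  # clean package
    fault = rng.choice(FAULTS)
    if fault == "mismatch_sin_on_paystub":
        # Bump the last digit so the paystub SIN disagrees with the other documents.
        sin = docs["paystub"]["sin"]
        docs["paystub"]["sin"] = sin[:-1] + str((int(sin[-1]) + 1) % 10)
    elif fault == "missing_paystub_net_pay":
        docs["paystub"]["net_pay"] = None
    else:  # incomplete_rental_agreement
        docs["rental_agreement"].pop("monthly_rent", None)
    return docs, fault
```

Returning the fault name alongside the data makes it straightforward to record the injected fault in the ground truth.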

How Artifact Generation Works (Step by Step)

The project is driven by a single CLI command (python -m loan_doc_gen.cli generate). Here is the full pipeline:

Step 1 — Generate Fake Applicant Profiles

The CLI creates one or more synthetic applicant profiles (name, SIN, address, employer, salary, etc.) using Faker and custom number generators. Each applicant gets a unique ID like app_a1b2c3d4e5f6. A seed ensures reproducibility.
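
A minimal, standard-library-only sketch of seeded profile generation follows. It is not the project's persona.py or numbers.py: the real code uses Faker, and the field names here are illustrative. It also assumes that generated SINs are made to pass the Luhn checksum, which the README does not confirm.

```python
import random


def luhn_check_digit(digits: str) -> str:
    """Return the digit that makes digits + check pass the Luhn checksum."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def fake_applicant(rng: random.Random) -> dict:
    """Build a tiny synthetic profile; keys are hypothetical."""
    app_id = "app_" + "".join(rng.choice("0123456789abcdef") for _ in range(12))
    payload = "".join(str(rng.randrange(10)) for _ in range(8))
    return {"applicant_id": app_id, "sin": payload + luhn_check_digit(payload)}


# The same seed always yields the same profile.
assert fake_applicant(random.Random(123)) == fake_applicant(random.Random(123))
```

Passing an explicit random.Random instance, rather than using the module-level generator, keeps each run reproducible from its seed alone.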

Step 2 — Render Documents for Each Applicant

For each applicant, package_builder.py runs a series of renderers in order:

| Renderer | Output file | reportlab backend (default) | html backend |
| --- | --- | --- | --- |
| ID Card | gov_id_card.png | PIL image drawing | PIL image drawing |
| Paystub | paystub.pdf | ReportLab (code-drawn) | Jinja2 + WeasyPrint |
| T4 Slip | t4_slip.pdf | Fills a real PDF form (t4-fill-25e.pdf) via pypdf | Jinja2 + WeasyPrint (CRA-inspired) |
| Rental Agreement | rental_agreement.pdf | ReportLab (generated if applicant rents) | Jinja2 + WeasyPrint |
| Property Deed | property_deed.pdf | ReportLab (generated if applicant owns property) | ReportLab |
| Investment Statement | investment_statement.pdf | ReportLab (generated if applicant has investments) | ReportLab |
| Notice of Assessment | notice_of_assessment.pdf | ReportLab | ReportLab |

Note: The --renderer-backend flag controls which rendering engine is used for Paystub, T4, and Rental Agreement. ID Card, Property Deed, Investment Statement, and NOA always use PIL/ReportLab regardless of the chosen backend.
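
The backend rule in the note above can be restated as a small lookup. The engine names are transcribed from the renderer table, but the dictionary layout, the document keys, and the function name are illustrative. The project's router modules (paystub.py, t4.py, rental_agreement.py) may be organized quite differently.

```python
# Documents whose engine depends on --renderer-backend.
SWITCHABLE = {
    "paystub": {"reportlab": "ReportLab", "html": "Jinja2 + WeasyPrint"},
    "t4_slip": {"reportlab": "pypdf form fill", "html": "Jinja2 + WeasyPrint"},
    "rental_agreement": {"reportlab": "ReportLab", "html": "Jinja2 + WeasyPrint"},
}

# Documents that ignore the flag.
FIXED = {
    "gov_id_card": "PIL",
    "property_deed": "ReportLab",
    "investment_statement": "ReportLab",
    "notice_of_assessment": "ReportLab",
}


def engine_for(document: str, backend: str = "reportlab") -> str:
    """Return the rendering engine used for a document under a backend."""
    if backend not in ("reportlab", "html"):
        raise ValueError(f"unknown backend: {backend}")
    if document in FIXED:
        return FIXED[document]
    return SWITCHABLE[document][backend]
```

For example, engine_for("t4_slip", "html") selects the Jinja2 + WeasyPrint path, while engine_for("gov_id_card", "html") still returns PIL.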

Step 3 — (Optional) Simulate Scanning

When --scan is passed, each PDF/PNG is degraded to look like a scanned document, producing companion *_scanned.png files (blur, noise, rotation).
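
The sketch below covers only the bookkeeping around the scan step: deriving the companion file name and drawing reproducible degradation settings. The image processing itself is omitted. The parameter names and ranges are guesses for illustration, not values taken from scan.py.

```python
import random
from pathlib import Path


def scanned_path(source: Path) -> Path:
    """Map e.g. paystub.pdf to paystub_scanned.png in the same folder."""
    return source.with_name(f"{source.stem}_scanned.png")


def scan_params(rng: random.Random) -> dict:
    """Pick mild degradation settings; the ranges are assumed, not project values."""
    return {
        "blur_radius": rng.uniform(0.3, 1.2),
        "noise_sigma": rng.uniform(2.0, 8.0),
        "rotation_deg": rng.uniform(-2.0, 2.0),
    }
```

Drawing the settings from a seeded generator would keep scanned output reproducible in the same way as the profiles.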

Step 4 — Write Ground Truth & Summary

  • A ground_truth.json is saved per applicant folder (contains all profile data, fault info, and file list).
  • A row is appended to summary.csv at the output root with flattened applicant/employment/rental fields and file count.
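
The two outputs above could be written roughly as follows. The file names ground_truth.json and summary.csv come from this README. The JSON keys, the CSV columns, and the function name are hypothetical, and the real summary contains more flattened fields.

```python
from __future__ import annotations

import csv
import json
from pathlib import Path


def write_outputs(out_root: Path, profile: dict, fault: str | None, files: list[str]) -> None:
    """Write ground_truth.json for one applicant and append a summary row."""
    folder = out_root / profile["applicant_id"]
    folder.mkdir(parents=True, exist_ok=True)
    truth = {"profile": profile, "fault": fault, "files": files}
    (folder / "ground_truth.json").write_text(json.dumps(truth, indent=2), encoding="utf-8")

    summary = out_root / "summary.csv"
    row = {
        "applicant_id": profile["applicant_id"],
        "full_name": profile.get("full_name", ""),
        "fault": fault or "",
        "file_count": len(files),
    }
    is_new = not summary.exists()
    with summary.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        if is_new:
            writer.writeheader()  # header only once, on first write
        writer.writerow(row)
```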

Quickstart

conda activate loan-doc-gen
pip install -e .
pre-commit install

Generate realistic packages:

# Single clean package (default ReportLab backend)
python -m loan_doc_gen.cli generate --out out --count 1 --seed 123

# 25 applicants, 30% faulty, with scan degradation
python -m loan_doc_gen.cli generate --out out --count 25 --seed 123 --fault-rate 0.3 --scan

# High-fidelity CRA-style output using the HTML/WeasyPrint backend
python -m loan_doc_gen.cli generate --out out --count 5 --seed 42 --renderer-backend html
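
For readers who want to see how the flags in these commands fit together, here is an argparse sketch that accepts the same options. It is not the project's cli.py, which may use a different CLI library. All defaults other than the reportlab backend are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser that mirrors the documented generate flags (defaults assumed)."""
    parser = argparse.ArgumentParser(prog="loan_doc_gen.cli")
    sub = parser.add_subparsers(dest="command", required=True)
    gen = sub.add_parser("generate", help="generate applicant packages")
    gen.add_argument("--out", default="out", help="output root directory")
    gen.add_argument("--count", type=int, default=1, help="number of applicants")
    gen.add_argument("--seed", type=int, default=None, help="seed for reproducibility")
    gen.add_argument("--fault-rate", type=float, default=0.0, help="fraction of faulty packages")
    gen.add_argument("--scan", action="store_true", help="also write *_scanned.png files")
    gen.add_argument("--renderer-backend", choices=["reportlab", "html"], default="reportlab")
    return parser


args = build_parser().parse_args(
    ["generate", "--out", "out", "--count", "25", "--seed", "123", "--fault-rate", "0.3", "--scan"]
)
print(args.count, args.fault_rate, args.scan, args.renderer_backend)
```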

Run quality checks:

pre-commit run -a
pytest -q

Input / Output Locations

| Role | Location |
| --- | --- |
| CLI output root | User-chosen via --out (e.g. out/) |
| Per-applicant folder | {out_root}/app_{12-hex}/ |
| Ground truth manifest | {applicant_folder}/ground_truth.json |
| Batch summary | {out_root}/summary.csv |
| T4 PDF template (input) | src/loan_doc_gen/form_templates/t4-fill-25e.pdf |
| Face photo cache | .face_cache/ under CWD (optional, see face_photo.py) |
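
Given the layout above, a test harness could collect every applicant's ground truth with a short helper like the one below. The function name is hypothetical. It relies only on the app_* folder naming and the ground_truth.json file name documented here, and makes no assumption about the JSON schema.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_packages(out_root: Path) -> dict[str, dict]:
    """Map each applicant folder name to its parsed ground_truth.json."""
    packages = {}
    for truth_file in sorted(out_root.glob("app_*/ground_truth.json")):
        packages[truth_file.parent.name] = json.loads(
            truth_file.read_text(encoding="utf-8")
        )
    return packages
```

Calling load_packages(Path("out")) after a generate run would return one entry per applicant folder.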

Adding a New Raw PDF Form

There is no generic "drop any PDF" pipeline — each form type requires a dedicated renderer. Follow these steps:

1. Place the fillable PDF template

Put the fillable PDF in src/loan_doc_gen/form_templates/ (same location as t4-fill-25e.pdf).

2. Discover the form field names

Use a small script to inspect the AcroForm fields in your PDF:

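from pypdf import PdfReader

reader = PdfReader("path/to/your-new-form.pdf")
# get_fields() returns None when the PDF has no AcroForm fields
fields = reader.get_fields() or {}
for name, field in fields.items():
    # /FT is the field type (e.g. /Tx for text), /V the current value
    print(f"{name} ({field.get('/FT', '?')}) -> {field.get('/V', '')}")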

3. Create a new renderer

Write a new class in src/loan_doc_gen/renderers/ that follows the project's DocumentRenderer pattern. The T4 form-fill renderer (renderers/t4_form_fill.py) is the best reference — it maps profile fields to PDF form field names and writes the filled PDF using pypdf.
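
The core of a form-fill renderer is a mapping from profile data to form field names, followed by a pypdf write. The sketch below shows that idea in isolation. It does not reproduce the DocumentRenderer interface, whose methods are not described in this README, and every field name and profile key in FIELD_MAP is a placeholder to be replaced with the names discovered in step 2.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical mapping: AcroForm field name -> profile key.
FIELD_MAP = {
    "applicant_name": "full_name",
    "applicant_sin": "sin",
    "employment_income": "annual_income",
}


def build_field_values(profile: dict) -> dict[str, str]:
    """Translate profile data into {form field name: text}."""
    return {field: str(profile[key]) for field, key in FIELD_MAP.items()}


def fill_form(template: Path, out_path: Path, profile: dict) -> None:
    """Fill the template with pypdf and write the result to out_path."""
    from pypdf import PdfReader, PdfWriter  # lazy import; requires pypdf

    writer = PdfWriter()
    writer.append(PdfReader(str(template)))
    values = build_field_values(profile)
    for page in writer.pages:
        writer.update_page_form_field_values(page, values, auto_regenerate=False)
    with out_path.open("wb") as fh:
        writer.write(fh)
```

Keeping the mapping in one dictionary makes it easy to unit-test the translation without opening a PDF.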

4. Register the renderer

Add your new renderer to the renderers = [...] list inside build_package() in package_builder.py.

5. Add tests

Follow the pattern in tests/test_realistic_renderers.py for unit tests and tests/test_integration_generate_cli.py for integration tests.

6. Run pre-commit checks

pre-commit run -a
pytest -q

Project Structure Highlights

loan-doc-gen/
├── src/loan_doc_gen/
│   ├── cli.py                  # CLI entrypoint (generate subcommand)
│   ├── package_builder.py      # Orchestrates per-applicant packages, faults, ground truth
│   ├── persona.py              # Seeded consistent applicant profiles (Faker)
│   ├── models.py               # Data models (ApplicationProfile, etc.)
│   ├── numbers.py              # SIN, account number generators
│   ├── faults.py               # Fault injection logic
│   ├── face_photo.py           # Optional face photo download/cache
│   ├── scan.py                 # Scan degradation (blur, noise, rotation)
│   ├── logging_utils.py        # Logging configuration
│   ├── form_templates/         # Fillable PDF templates (e.g. t4-fill-25e.pdf)
│   ├── templates/              # Jinja HTML templates (WeasyPrint path)
│   └── renderers/              # All document renderers
│       ├── base.py             # DocumentRenderer base class
│       ├── id_card.py          # Government ID card (PNG via PIL)
│       ├── paystub.py          # Paystub router
│       ├── paystub_realistic.py# Paystub (ReportLab)
│       ├── t4.py               # T4 router
│       ├── t4_form_fill.py     # T4 via AcroForm PDF fill (pypdf)
│       ├── rental_agreement.py # Rental agreement router
│       ├── rental_realistic.py # Rental agreement (ReportLab)
│       ├── asset_documents.py  # Property deed, investment statement, NOA
│       ├── pdf_renderer.py     # Legacy ReportLab T4
│       ├── html_renderer.py    # HTML/WeasyPrint renderer (alternate path)
│       └── *_html.py           # HTML variants (T4, paystub, rental)
├── tests/                      # Unit + integration tests
├── out/                        # Default CLI output directory
├── demo_output/                # Sample output PDFs
├── pyproject.toml
├── requirements.txt
├── setup.cfg                   # flake8 + mypy config
└── .pre-commit-config.yaml

Renderer Backends

The project supports two rendering backends for Paystub, T4, and Rental Agreement documents. All other documents (ID Card, Property Deed, Investment Statement, NOA) always use PIL/ReportLab.

| Backend | Flag | Method | Fidelity |
| --- | --- | --- | --- |
| reportlab (default) | --renderer-backend reportlab | ReportLab / pypdf form-fill | Good |
| html | --renderer-backend html | Jinja2 + WeasyPrint | High (CRA-style) |

# Default (ReportLab)
python -m loan_doc_gen.cli generate --out out --count 5 --seed 42

# High-fidelity HTML/WeasyPrint
python -m loan_doc_gen.cli generate --out out --count 5 --seed 42 --renderer-backend html

Requirement: The html backend requires WeasyPrint and its system dependencies. Install with: conda install -c conda-forge pygobject gtk3
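
Because the html backend depends on an optional package, a caller might want to check for it before choosing a backend. The helper below is a hypothetical sketch. The README does not say whether the project falls back to reportlab or fails when WeasyPrint is missing. Note also that find_spec only confirms the Python package is installed, not that its system libraries load.

```python
import importlib.util


def resolve_backend(requested: str) -> str:
    """Return the requested backend, or reportlab if WeasyPrint is not installed."""
    if requested == "html" and importlib.util.find_spec("weasyprint") is None:
        return "reportlab"
    return requested


print(resolve_backend("html"))
```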
