Generate consistent, per-applicant loan application document packages for agent testing.
- Realistic layouts: Professional T4, paystub, rental agreement, and government ID that closely resemble real Canadian documents.
- Perfect consistency: One seeded persona (name, SIN, address, income, employer, etc.) is used across all documents.
- Fault injection: Deliberately faulty documents (`mismatch_sin_on_paystub`, `missing_paystub_net_pay`, `incomplete_rental_agreement`).
- Scan degradation: Optional realistic scanned versions of ID cards (blur, noise, rotation).
- Ground truth: Every package includes `ground_truth.json` for easy automated verification.
- Professional tooling: pre-commit, mypy, black, pytest following best practices.
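
To make the fault-injection idea concrete, here is a minimal sketch. It is not the project's `faults.py`: the `inject_fault` helper, the dict layout, and the `landlord_signature` key are all hypothetical, and what each fault actually alters in the real generator may differ. Only the three fault names come from the list above.

```python
import copy
import random


def inject_fault(documents: dict, fault: str, rng: random.Random) -> dict:
    """Return a copy of `documents` with one deliberate defect applied."""
    faulty = copy.deepcopy(documents)
    if fault == "mismatch_sin_on_paystub":
        # Replace one digit with a different digit so the paystub SIN
        # no longer matches the persona's SIN.
        sin = faulty["paystub"]["sin"]
        pos = rng.randrange(len(sin))
        new_digit = str((int(sin[pos]) + rng.randint(1, 9)) % 10)
        faulty["paystub"]["sin"] = sin[:pos] + new_digit + sin[pos + 1:]
    elif fault == "missing_paystub_net_pay":
        faulty["paystub"]["net_pay"] = None
    elif fault == "incomplete_rental_agreement":
        # Assumption: "incomplete" is modelled here as a missing signature.
        faulty["rental_agreement"]["landlord_signature"] = None
    else:
        raise ValueError(f"unknown fault: {fault}")
    return faulty


clean = {
    "paystub": {"sin": "046454286", "net_pay": 2150.0},
    "rental_agreement": {"landlord_signature": "J. Doe"},
}
broken = inject_fault(clean, "mismatch_sin_on_paystub", random.Random(123))
```

Working on a deep copy keeps the clean persona intact, so the original values remain available as ground truth for comparison.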
The project is driven by a single CLI command (python -m loan_doc_gen.cli generate). Here is the full pipeline:
The CLI creates one or more synthetic applicant profiles (name, SIN, address, employer, salary, etc.) using Faker and custom number generators. Each applicant gets a unique ID like app_a1b2c3d4e5f6. A seed ensures reproducibility.
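
As a rough illustration of seeded profile generation, the sketch below uses only the standard library instead of Faker. The function names and the returned keys are hypothetical, and it assumes generated SINs should pass the Luhn checksum that real Canadian SINs use; whether the project's `numbers.py` does this is not stated here.

```python
import random


def luhn_check_digit(partial: str) -> int:
    """Return the digit that makes `partial + digit` pass the Luhn checksum."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` sits next to the
    # check digit, so it is the first one to be doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def make_applicant(rng: random.Random) -> dict:
    """Build one synthetic applicant from a seeded random generator."""
    hex_part = "".join(rng.choice("0123456789abcdef") for _ in range(12))
    first8 = "".join(str(rng.randint(0, 9)) for _ in range(8))
    return {
        "applicant_id": f"app_{hex_part}",
        "sin": first8 + str(luhn_check_digit(first8)),
    }


# The same seed always yields the same sequence of applicants.
rng = random.Random(123)
profiles = [make_applicant(rng) for _ in range(3)]
```

Passing one `random.Random` instance through every generator, rather than calling the module-level functions, is what keeps a run reproducible from a single seed.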
For each applicant, package_builder.py runs a series of renderers in order:
| Renderer | Output File | reportlab backend (default) | html backend |
|---|---|---|---|
| ID Card | gov_id_card.png | PIL image drawing | PIL image drawing |
| Paystub | paystub.pdf | ReportLab (code-drawn) | Jinja2 + WeasyPrint |
| T4 Slip | t4_slip.pdf | Fills a real PDF form (t4-fill-25e.pdf) via pypdf | Jinja2 + WeasyPrint (CRA-inspired) |
| Rental Agreement | rental_agreement.pdf | ReportLab (generated if applicant rents) | Jinja2 + WeasyPrint |
| Property Deed | property_deed.pdf | ReportLab (generated if applicant owns property) | ReportLab |
| Investment Statement | investment_statement.pdf | ReportLab (generated if applicant has investments) | ReportLab |
| Notice of Assessment | notice_of_assessment.pdf | ReportLab | ReportLab |

Note: The `--renderer-backend` flag controls which rendering engine is used for Paystub, T4, and Rental Agreement. ID Card, Property Deed, Investment Statement, and NOA always use PIL/ReportLab regardless of the chosen backend.
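
The table and note above can be summarised as a small planning function. This is a sketch only: `plan_documents` and the profile keys `rents`, `owns_property`, and `has_investments` are hypothetical names, and it assumes the "generated if ..." conditions apply under both backends.

```python
# Engines for the three documents that depend on --renderer-backend.
BACKEND_AWARE = {
    "paystub.pdf": {"reportlab": "ReportLab", "html": "Jinja2 + WeasyPrint"},
    "t4_slip.pdf": {"reportlab": "pypdf form fill", "html": "Jinja2 + WeasyPrint"},
    "rental_agreement.pdf": {"reportlab": "ReportLab", "html": "Jinja2 + WeasyPrint"},
}


def plan_documents(profile: dict, backend: str = "reportlab") -> list:
    """Return (output file, engine) pairs for one applicant, in render order."""
    if backend not in ("reportlab", "html"):
        raise ValueError(f"unknown backend: {backend}")

    def engine(name: str) -> str:
        return BACKEND_AWARE[name][backend]

    plan = [
        ("gov_id_card.png", "PIL"),  # never affected by the backend flag
        ("paystub.pdf", engine("paystub.pdf")),
        ("t4_slip.pdf", engine("t4_slip.pdf")),
    ]
    # Conditional documents depend on the applicant's situation.
    if profile.get("rents"):
        plan.append(("rental_agreement.pdf", engine("rental_agreement.pdf")))
    if profile.get("owns_property"):
        plan.append(("property_deed.pdf", "ReportLab"))
    if profile.get("has_investments"):
        plan.append(("investment_statement.pdf", "ReportLab"))
    plan.append(("notice_of_assessment.pdf", "ReportLab"))
    return plan


renter_plan = plan_documents({"rents": True}, backend="html")
```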
When --scan is passed, each PDF/PNG is degraded to look like a scanned document, producing companion *_scanned.png files (blur, noise, rotation).
- A `ground_truth.json` is saved per applicant folder (contains all profile data, fault info, and file list).
- A row is appended to `summary.csv` at the output root with flattened applicant/employment/rental fields and file count.
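
The two outputs above could be written along the following lines. This is a sketch under assumptions, not the code in `package_builder.py`: the helper names, the JSON keys (`profile`, `fault`, `files`), and the underscore-joined column names are hypothetical.

```python
import csv
import json
from pathlib import Path
from typing import Optional


def flatten(nested: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def write_outputs(out_root: Path, profile: dict, fault: Optional[str], files: list) -> None:
    """Write ground_truth.json for one applicant and append a summary.csv row."""
    folder = out_root / profile["applicant_id"]
    folder.mkdir(parents=True, exist_ok=True)
    truth = {"profile": profile, "fault": fault, "files": files}
    (folder / "ground_truth.json").write_text(json.dumps(truth, indent=2), encoding="utf-8")

    row = flatten(profile)
    row["file_count"] = len(files)
    summary = out_root / "summary.csv"
    is_new = not summary.exists()
    with summary.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        if is_new:
            writer.writeheader()  # header only once, on first write
        writer.writerow(row)
```

Note that this simple version assumes every applicant flattens to the same set of columns; a profile with extra or missing keys would produce a misaligned row.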
```bash
conda activate loan-doc-gen
pip install -e .
pre-commit install
```

Generate realistic packages:

```bash
# Single clean package (default ReportLab backend)
python -m loan_doc_gen.cli generate --out out --count 1 --seed 123
# 25 applicants, 30% faulty, with scan degradation
python -m loan_doc_gen.cli generate --out out --count 25 --seed 123 --fault-rate 0.3 --scan
# High-fidelity CRA-style output using the HTML/WeasyPrint backend
python -m loan_doc_gen.cli generate --out out --count 5 --seed 42 --renderer-backend html
```

Run quality checks:

```bash
pre-commit run -a
pytest -q
```

| Role | Location |
|---|---|
| CLI output root | User-chosen via --out (e.g. out/) |
| Per-applicant folder | {out_root}/app_{12-hex}/ |
| Ground truth manifest | {applicant_folder}/ground_truth.json |
| Batch summary | {out_root}/summary.csv |
| T4 PDF template (input) | src/loan_doc_gen/form_templates/t4-fill-25e.pdf |
| Face photo cache | .face_cache/ under CWD (optional, see face_photo.py) |
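
Given the layout in the table above, an agent test harness might walk the output root and confirm that each package is complete. The sketch below assumes the manifest stores its file list under a `files` key as names relative to the applicant folder; that key and the `check_packages` helper are hypothetical, so adjust them to the actual `ground_truth.json` schema.

```python
import json
from pathlib import Path


def check_packages(out_root: Path, files_key: str = "files") -> dict:
    """Map each applicant folder name to the listed files missing on disk."""
    missing = {}
    for folder in sorted(out_root.glob("app_*")):
        manifest = folder / "ground_truth.json"
        if not folder.is_dir() or not manifest.is_file():
            continue  # skip anything that is not a complete applicant folder
        truth = json.loads(manifest.read_text(encoding="utf-8"))
        absent = [
            name
            for name in truth.get(files_key, [])
            if not (folder / name).exists()
        ]
        if absent:
            missing[folder.name] = absent
    return missing
```

Calling `check_packages(Path("out"))` after a run returns an empty dict when every listed file is present.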
There is no generic "drop any PDF" pipeline; each form type requires a dedicated renderer. Follow these steps:
Put the fillable PDF in src/loan_doc_gen/form_templates/ (same location as t4-fill-25e.pdf).
Use a small script to inspect the AcroForm fields in your PDF:
```python
from pypdf import PdfReader

reader = PdfReader("path/to/your-new-form.pdf")
# get_fields() returns None when the PDF has no AcroForm fields
fields = reader.get_fields() or {}
for name, field in fields.items():
    # "/V" holds the field's current value, if any
    print(f"{name} -> {field.get('/V', '')}")
```

Write a new class in `src/loan_doc_gen/renderers/` that follows the project's `DocumentRenderer` pattern. The T4 form-fill renderer (`renderers/t4_form_fill.py`) is the best reference: it maps profile fields to PDF form field names and writes the filled PDF using pypdf.
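
The mapping step can be sketched independently of pypdf. In the example below, `build_field_values`, the dotted-path convention, and every PDF field name are hypothetical; substitute the names printed by the inspection script and the attributes your profile model actually exposes.

```python
def build_field_values(profile: dict, field_map: dict) -> dict:
    """Translate profile data into {pdf_field_name: string_value}."""
    values = {}
    for pdf_field, profile_path in field_map.items():
        node = profile
        # Follow a dotted path such as "employment.annual_income".
        for key in profile_path.split("."):
            node = node[key]
        # Form fields hold text, so convert and blank out missing values.
        values[pdf_field] = "" if node is None else str(node)
    return values


# Hypothetical PDF field names, for illustration only.
FIELD_MAP = {
    "EmployeeLastName": "applicant.last_name",
    "EmploymentIncome": "employment.annual_income",
    "EmployerName": "employment.employer",
}

example_profile = {
    "applicant": {"last_name": "Tremblay"},
    "employment": {"annual_income": 72500, "employer": None},
}
field_values = build_field_values(example_profile, FIELD_MAP)
```

A dict of this shape is what pypdf's `PdfWriter.update_page_form_field_values(page, fields)` expects as its `fields` argument. Keeping the map as data, separate from the fill logic, means a new form usually needs only a new map.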
Add your new renderer to the renderers = [...] list in package_builder.py → build_package().
Follow the pattern in tests/test_realistic_renderers.py for unit tests and tests/test_integration_generate_cli.py for integration tests.
```bash
pre-commit run -a
pytest -q
```

```
loan-doc-gen/
├── src/loan_doc_gen/
│   ├── cli.py                   # CLI entrypoint (generate subcommand)
│   ├── package_builder.py       # Orchestrates per-applicant packages, faults, ground truth
│   ├── persona.py               # Seeded consistent applicant profiles (Faker)
│   ├── models.py                # Data models (ApplicationProfile, etc.)
│   ├── numbers.py               # SIN, account number generators
│   ├── faults.py                # Fault injection logic
│   ├── face_photo.py            # Optional face photo download/cache
│   ├── scan.py                  # Scan degradation (blur, noise, rotation)
│   ├── logging_utils.py         # Logging configuration
│   ├── form_templates/          # Fillable PDF templates (e.g. t4-fill-25e.pdf)
│   ├── templates/               # Jinja HTML templates (WeasyPrint path)
│   └── renderers/               # All document renderers
│       ├── base.py              # DocumentRenderer base class
│       ├── id_card.py           # Government ID card (PNG via PIL)
│       ├── paystub.py           # Paystub router
│       ├── paystub_realistic.py # Paystub (ReportLab)
│       ├── t4.py                # T4 router
│       ├── t4_form_fill.py      # T4 via AcroForm PDF fill (pypdf)
│       ├── rental_agreement.py  # Rental agreement router
│       ├── rental_realistic.py  # Rental agreement (ReportLab)
│       ├── asset_documents.py   # Property deed, investment statement, NOA
│       ├── pdf_renderer.py      # Legacy ReportLab T4
│       ├── html_renderer.py     # HTML/WeasyPrint renderer (alternate path)
│       └── *_html.py            # HTML variants (T4, paystub, rental)
├── tests/                       # Unit + integration tests
├── out/                         # Default CLI output directory
├── demo_output/                 # Sample output PDFs
├── pyproject.toml
├── requirements.txt
├── setup.cfg                    # flake8 + mypy config
└── .pre-commit-config.yaml
```

The project supports two rendering backends for Paystub, T4, and Rental Agreement documents. All other documents (ID Card, Property Deed, Investment Statement, NOA) always use PIL/ReportLab.
| Backend | Flag | Method | Fidelity |
|---|---|---|---|
| reportlab (default) | `--renderer-backend reportlab` | ReportLab / pypdf form-fill | Good |
| html | `--renderer-backend html` | Jinja2 + WeasyPrint | High (CRA-style) |

```bash
# Default (ReportLab)
python -m loan_doc_gen.cli generate --out out --count 5 --seed 42
# High-fidelity HTML/WeasyPrint
python -m loan_doc_gen.cli generate --out out --count 5 --seed 42 --renderer-backend html
```

Requirement: The `html` backend requires WeasyPrint and its system dependencies. Install with:

```bash
conda install -c conda-forge pygobject gtk3
```