This repository contains a reference implementation and deployment scaffold for a privacy-preserving PDF sanitization agent built around Content Disarm & Reconstruction (CDR). The project is intended for defensive use only: it removes active content and sensitive metadata from incoming PDFs before they are distributed or stored.
Goals:
- Provide a local-first, auditable pipeline to sanitize PDF files.
- Minimize retention of PII and full text in logs by design.
- Offer configurable sanitization policies (light ↔ strong).
- Provide easy deployment patterns: CLI, Docker, AWS Lambda / S3 trigger, and email gateway examples.
- Be a developer-friendly open-source project with tests and CI.
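
The goal of keeping PII and full text out of logs can be illustrated with a minimal sketch. It is not taken from this repository's code: the function name `audit_record`, the field names, and the `removed` counter dictionary are all hypothetical. The idea is that an audit line identifies a file by its SHA-256 digest and size only, so the log never carries the filename, the metadata values, or any extracted text.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(pdf_bytes: bytes, policy: str, removed: dict[str, int]) -> str:
    """Build one JSON audit line (hypothetical schema) without file content.

    pdf_bytes: raw bytes of the input PDF; only hashed and measured, never logged.
    policy:    name of the sanitization policy applied, e.g. "light" or "strong".
    removed:   counts of removed items per category, e.g. {"javascript": 1}.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # The digest lets an auditor match a log line to a file they already hold.
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "size_bytes": len(pdf_bytes),
        "policy": policy,
        # Counts only: the removed values themselves may contain PII.
        "removed": removed,
    }
    return json.dumps(record, sort_keys=True)
```

A call such as `audit_record(data, "strong", {"javascript": 1, "metadata_fields": 4})` returns a single JSON line suitable for an append-only log. Logging counts per category rather than the removed values is a deliberate choice here, since fields such as author or title are exactly the data the pipeline is meant to strip.
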
- License: MIT
- Language: Python (>=3.9)
- Tools used: pikepdf, qpdf, exiftool / mat2, ghostscript (optional), pytest