Strip JavaScript, OpenAction, Launch actions and other active content from PDFs. Lightweight Python library on top of pikepdf. MIT licensed.
📚 Full documentation | 📦 PyPI | 🛠️ Built by kovetz.co.il
PDFs can carry executable content: JavaScript that runs when the file opens, auto-actions that fire on every page navigation, "Launch" actions that try to open other programs, embedded files that drop malware. If you process user-uploaded PDFs in your app, you should strip this content before serving them back.
The Python ecosystem has parsers (pikepdf, pypdf, PyMuPDF) and a heavy
container-based tool (Dangerzone), but no clean
drop-in library that says "give me this PDF without active content." This is
that library.
```bash
pip install pdf-defang
```

Requires Python 3.9+ and pikepdf 8+.
```python
from pdf_defang import sanitize, scan

# Clean a file in place
sanitize("uploaded.pdf")

# Get a detailed report of what was removed
report = sanitize("uploaded.pdf", return_report=True)
print(report.javascript_in_names)      # 2
print(report.open_action_removed)      # True
print(report.annotation_action_types)  # ['Launch']
print(report.dangerous_uris_removed)   # 1
print(report.as_dict())                # JSON-serialisable

# Inspect a file WITHOUT modifying it
report = scan("suspicious.pdf")
print(report.risk_level)      # 'high' / 'medium' / 'low' / 'none'
print(report.has_javascript)  # True
```

```python
from pdf_defang import sanitize_async, scan_async

async def handle_upload(path):
    report = await sanitize_async(path, return_report=True)
    return report.as_dict()
```

```python
from pdf_defang import sanitize_bytes

raw_pdf: bytes = ...  # from S3, HTTP, anywhere
cleaned: bytes = sanitize_bytes(raw_pdf)
# No disk involved
```

```python
sanitize("encrypted.pdf", password="hunter2")
# Still encrypted with the same password, JavaScript removed.
```

```python
# Public uploads: kill everything active (safest)
sanitize("untrusted.pdf")  # level="strict"

# Trusted internal forms that need Submit / Calculate buttons:
sanitize("expense_form.pdf", level="balanced")
```

Both levels strip pure attack vectors (`/Launch`, `/GoToR`, document JavaScript, dangerous URI schemes, etc.). `balanced` additionally preserves `/SubmitForm` / `/ResetForm` / form JS actions, annotation `/AA` and `/JS` triggers, the AcroForm `/CO` calculation order, and embedded files (used by PDF portfolios). The default is `strict`.
```bash
# Clean a single file (strict by default)
pdf-defang clean uploaded.pdf

# Clean many at once
pdf-defang clean *.pdf

# Keep form interactivity working
pdf-defang clean --level balanced internal_form.pdf

# Inspect without changes
pdf-defang scan suspicious.pdf

# Get JSON output for piping into your logging stack
pdf-defang scan suspicious.pdf --json | jq .risk_level
pdf-defang clean *.pdf --json > sanitization-log.json
```

Exit codes follow shell conventions:

| Code | `clean` | `scan` |
|---|---|---|
| 0 | All files were already clean | No active content found |
| 1 | At least one file had something stripped | Active content detected |
| 2 | At least one file could not be opened | File could not be scanned |
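
A CI job or upload worker might branch on these codes. The sketch below is one way to do that from Python. `describe_exit_code` and `run_cli` are hypothetical helper names; only the `clean` / `scan` subcommands and the codes in the table above come from this README, and the sketch assumes the `pdf-defang` executable is on `PATH`.

```python
import subprocess

# Meanings copied from the exit-code table; keys are (subcommand, exit code).
EXIT_MEANINGS = {
    ("clean", 0): "all files were already clean",
    ("clean", 1): "at least one file had something stripped",
    ("clean", 2): "at least one file could not be opened",
    ("scan", 0): "no active content found",
    ("scan", 1): "active content detected",
    ("scan", 2): "file could not be scanned",
}


def describe_exit_code(subcommand: str, code: int) -> str:
    """Translate a pdf-defang exit code into a human-readable message."""
    return EXIT_MEANINGS.get((subcommand, code), f"unexpected exit code {code}")


def run_cli(subcommand: str, *paths: str) -> tuple[int, str]:
    """Run `pdf-defang <subcommand> <paths...>` and return (code, meaning).

    Hypothetical wrapper: assumes the pdf-defang executable is on PATH.
    """
    # check=False because exit code 1 is informational, not a crash.
    result = subprocess.run(["pdf-defang", subcommand, *paths], check=False)
    return result.returncode, describe_exit_code(subcommand, result.returncode)
```

Note that exit code 1 from `clean` is not a failure; it only signals that something was removed, which is why the wrapper avoids `check=True`.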
```python
import logging

from pdf_defang import sanitize

logger = logging.getLogger(__name__)


def handle_upload(uploaded_file_path: str) -> str:
    report = sanitize(uploaded_file_path, return_report=True)
    if report.error:
        raise ValueError(f"Could not process PDF: {report.error}")
    # Log what was removed for your audit trail
    logger.info("Sanitized %s: %s", uploaded_file_path, report.as_dict())
    return uploaded_file_path  # safe to serve back to other users now
```

```python
from pdf_defang import scan

report = scan("phishing_attachment.pdf")
# quarantine() and notify_security_team() stand in for your own handlers
if report.risk_level == "high":
    quarantine(report)
elif report.risk_level == "medium":
    notify_security_team(report)
```

```bash
# -print0 / -0 keep filenames with spaces or newlines intact
find /var/incoming -name '*.pdf' -print0 | xargs -0 pdf-defang clean --json >> audit.jsonl
```

The following items are removed at the default `strict` level:

| Item | Where | What it does |
|---|---|---|
| `/JavaScript` in `/Names` | Document root | Document-level JavaScript that runs on open |
| `/EmbeddedFiles` | Document root | Files hidden inside the PDF (potential malware) |
| `/OpenAction` | Document root | Action automatically executed when PDF opens |
| `/AA` | Document root | "Additional Actions" - auto-execute on navigation |
| `/XFA` | `/AcroForm` | Legacy XML forms - well-known attack surface |
| `/CO` | `/AcroForm` | Form field Calculation Order |
| `/AA` | Each page | Page-level auto-execute actions |
| Dangerous `/A` | Each annotation | JavaScript, Launch, ImportData, SubmitForm, ResetForm, Rendition, GoToR, GoToE, Movie, Sound actions |
| `/AA` | Each annotation | Per-annotation auto-actions |
| `/JS` | Each annotation | JavaScript attached directly to an annotation |
| Unsafe `/URI` | Each annotation | URI actions with dangerous schemes (`javascript:`, `file:`, `data:`, `vbscript:`, UNC paths). Standard hyperlinks (`http`, `https`, `mailto`, `tel`, `ftp`, etc.) are preserved. |

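
The URI rule in the last row can be pictured as a scheme check. The following is a standard-library sketch of that idea, not the library's actual code: `is_dangerous_uri` and `DANGEROUS_SCHEMES` are illustrative names, and treating a leading `\\` or `//` as a UNC path is an assumption.

```python
from urllib.parse import urlsplit

# Schemes the table lists as dangerous.
DANGEROUS_SCHEMES = {"javascript", "file", "data", "vbscript"}


def is_dangerous_uri(uri: str) -> bool:
    """Return True if the URI uses a dangerous scheme or looks like a UNC path."""
    candidate = uri.strip()
    # UNC paths such as \\server\share (assumption: //server/share counts too).
    if candidate.startswith("\\\\") or candidate.startswith("//"):
        return True
    # Schemes are case-insensitive, so normalise before comparing.
    scheme = urlsplit(candidate).scheme.lower()
    return scheme in DANGEROUS_SCHEMES
```

A deny-list like this one lets unknown schemes through; a stricter design would allow only an explicit set such as `http`, `https`, `mailto`, and `tel`.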
Sanitization is non-destructive to visible content:
- All text, images and layout
- Standard form fields (filled values stay intact)
- Bookmarks, table of contents, page labels
- Document metadata (Author, Title, Subject, Keywords)
- Standard link annotations to `mailto:` / `http(s):` URLs
- Document structure, page count, page order

| Tool | Why this might not fit you |
|---|---|
| Dangerzone | Excellent for sensitive analyst workflows, but runs a full Docker container per file. Minutes per PDF, not milliseconds. |
| iText / Apryse | Powerful, but commercial licenses start at thousands of USD/year. |
| pikepdf directly | Brilliant library, but it's a parser, not a sanitizer. You'd write the same _strip_document_level() code we wrote here. That's exactly what we extracted. |
pdf-defang is for the case where you want a small, free, drop-in function
to ship in your existing Python app. No subprocesses, no Docker, no per-seat
license.
Measured on a Windows 11 laptop, Python 3.13, on the fixture PDFs:
| Operation | Median time |
|---|---|
| `scan_bytes()` on a clean PDF (in memory) | ~0.3 ms |
| `sanitize_bytes()` on a malicious PDF (in memory) | ~0.6 ms |
| `sanitize()` on a clean PDF (with disk I/O) | ~8 ms |
| `sanitize()` on a kitchen-sink PDF (with disk I/O) | ~8 ms |

These timings are orders of magnitude faster than container-based tools like Dangerzone, which take seconds to minutes per file.
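
To time the calls against your own PDFs rather than the fixtures, a small harness along these lines may be enough. `median_ms` is a hypothetical helper, not part of pdf-defang; it times any zero-argument callable, so you would wrap a call such as `sanitize_bytes(raw_pdf)` in a lambda.

```python
import statistics
import time
from typing import Callable


def median_ms(operation: Callable[[], object], runs: int = 50) -> float:
    """Return the median wall-clock time of `operation` in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; replace the lambda body with sanitize_bytes(raw_pdf)
# once pdf_defang is installed.
elapsed = median_ms(lambda: sum(range(1000)), runs=20)
print(f"median: {elapsed:.3f} ms")
```

The median is used rather than the mean so that a single slow run (cold cache, garbage collection) does not skew the result.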
To run the project's own benchmark on your hardware:

```bash
python -m pytest tests/test_performance.py -v -s
```

- Sanitization modifies the input file in place. If you need the original preserved for audit, copy it first.
- Encrypted PDFs require the `password=` argument. Wrong-password attempts return an error report (not an exception).
- Malformed PDFs may not open at all; we surface the underlying pikepdf error in the report. The original file is not touched on failure.
- This is defense in depth, not a replacement for layered controls. Don't rely on a sanitizer alone for high-risk attachment workflows: also validate uploaders, sandbox processing, and scan with AV.
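
Because sanitization overwrites its input, a caller that needs the original for audit can copy it aside first. A minimal sketch, assuming a hypothetical helper `sanitize_with_backup` that takes the sanitizing function as a parameter (in practice you would pass `pdf_defang.sanitize`):

```python
import shutil
from pathlib import Path
from typing import Callable


def sanitize_with_backup(
    pdf_path: str,
    sanitize_fn: Callable[[str], object],
    backup_dir: str,
) -> Path:
    """Copy the original PDF into backup_dir, then sanitize pdf_path in place.

    Returns the path of the preserved original.
    """
    source = Path(pdf_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    backup_path = target_dir / source.name
    # copy2 also preserves timestamps, which helps audit trails.
    shutil.copy2(source, backup_path)
    # Only sanitize after the copy has succeeded.
    sanitize_fn(str(source))
    return backup_path
```

Keep the backup directory somewhere that is never served to users, since the copies there still contain the active content.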
This library was originally written for kovetz.co.il (Hebrew PDF tools, www.kovetz.co.il) in May 2026, during an APT scanning campaign by an Iranian-attributed threat actor sweeping endpoints for upload vectors. We needed to make sure that any PDF leaving our service was free of executable payloads, even if an attacker successfully uploaded a poisoned file.
We initially wrote 67 lines of pikepdf code, tested it on the kovetz.co.il fleet (thousands of files/day), then realised there's no clean equivalent in the OSS Python ecosystem. So we extracted it here for everyone else who needs the same thing.
Issues and PRs welcome at github.com/kovetz-PDF/pdf-defang.
If you've found a PDF in the wild that contains active content we don't strip, please open an issue with the file (or a minimal reproducer) attached.
```bash
git clone https://github.com/kovetz-PDF/pdf-defang.git
cd pdf-defang
python -m pip install -e ".[test]"
python -m pytest
```

`tests/conftest.py` auto-generates the test fixture PDFs on first run.

MIT - free for any use, including commercial.
Built and maintained by kovetz.co.il. Contact: contact@kovetz.co.il