Helper repo to run the pdftable CLI on PDFs and export results to JSON.
It includes:
- A simple workflow to run pdftable over a PDF into an output folder (per-page PNG/PDF/HTML).
- export_pdftable_to_json.py: merges outputs into a single JSON file per run.
- Optional patch scripts we used to make the upstream code robust on CPU/Apple Silicon:
  - unify_torch_device_patch.py
  - patch_keyerror_tsr.py
  - fix_unboundlocal_tsr.py
This repo wraps the upstream pdftable project. Ensure the pdftable CLI is installed and on PATH.
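One way to confirm the CLI is reachable before running anything is a short Python check. This is a minimal sketch, not part of the repo: the helper name `cli_available` is hypothetical, and it only tests that an executable called pdftable is on PATH, not that it works.

```python
import shutil


def cli_available(name: str = "pdftable") -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if cli_available():
        print("pdftable found at", shutil.which("pdftable"))
    else:
        print("pdftable not found on PATH; install the upstream project first")
```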
Run the extractor (example):
pdftable --file_path_or_url "/path/to/your.pdf" \
--output_dir "./cv_out" \
--pages all \
--lang en
Export to JSON:
python export_pdftable_to_json.py --outdir ./cv_out --outjson ./cv_out/results.json
The ./cv_out folder will contain page-*.{pdf,png,html} and model artifacts; results.json is a list of page objects with text blocks and (when present) table info.
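The two steps above can also be driven from Python. The sketch below is an illustration under stated assumptions rather than a script shipped with this repo: the function names (`build_commands`, `run_pipeline`, `load_pages`, `page_key_summary`) are hypothetical, the flags are copied from the example commands, and `sys.executable` stands in for `python`. Because the exact fields of each page object are not documented here, the summary only lists whatever keys each page happens to have.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_commands(pdf_path, out_dir="./cv_out"):
    """Return the extractor and exporter commands as argument lists."""
    out_json = str(Path(out_dir) / "results.json")
    extract = [
        "pdftable",
        "--file_path_or_url", str(pdf_path),
        "--output_dir", str(out_dir),
        "--pages", "all",
        "--lang", "en",
    ]
    export = [
        sys.executable, "export_pdftable_to_json.py",
        "--outdir", str(out_dir),
        "--outjson", out_json,
    ]
    return extract, export


def run_pipeline(pdf_path, out_dir="./cv_out"):
    """Run the extractor, then the exporter; raise if either step fails."""
    for cmd in build_commands(pdf_path, out_dir):
        subprocess.run(cmd, check=True)
    return Path(out_dir) / "results.json"


def load_pages(json_path):
    """Load results.json, which is expected to hold a list of page objects."""
    with open(json_path, encoding="utf-8") as fh:
        pages = json.load(fh)
    if not isinstance(pages, list):
        raise ValueError("expected a JSON list of page objects")
    return pages


def page_key_summary(pages):
    """Return the sorted keys of each page object (empty list for non-dicts)."""
    return [sorted(p) if isinstance(p, dict) else [] for p in pages]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py /path/to/your.pdf
    result_path = run_pipeline(sys.argv[1])
    for index, keys in enumerate(page_key_summary(load_pages(result_path))):
        print(f"page object {index}: keys = {keys}")
```

Passing the commands as argument lists avoids shell quoting problems with paths that contain spaces, and `check=True` stops the run before the export step if the extractor fails.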
License: Apache License, Version 2.0 (see LICENSE.txt).