pdftable-runner

Helper repo to run the pdftable CLI on PDFs and export results to JSON.

It includes:

- A simple workflow that runs pdftable on a PDF and writes per-page PNG/PDF/HTML files to an output folder.
- export_pdftable_to_json.py – merges the outputs into a single JSON file per run.
- Optional patch scripts we used to make the upstream code robust on CPU/Apple Silicon:
  - unify_torch_device_patch.py
  - patch_keyerror_tsr.py
  - fix_unboundlocal_tsr.py

This repo wraps the upstream pdftable project. Ensure the pdftable CLI is installed and on PATH.
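
Before running anything, it can help to confirm that the CLI is actually reachable. The following is a minimal sketch using only the Python standard library; the helper name find_cli is hypothetical and not part of this repo or of upstream pdftable.

```python
import shutil


def find_cli(name: str = "pdftable") -> str:
    """Return the absolute path of the CLI, or raise if it is not on PATH.

    Hypothetical helper for illustration; not shipped with this repo.
    """
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"{name!r} was not found on PATH; install the upstream project first"
        )
    return path


if __name__ == "__main__":
    print(find_cli())
```

Failing early here gives a clearer message than a "command not found" error partway through a batch run.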

Quick start

Run the extractor (example):

pdftable --file_path_or_url "/path/to/your.pdf" \
  --output_dir "./cv_out" \
  --pages all \
  --lang en
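
If you would rather drive the extractor from Python (for example, to loop over many PDFs), the same command can be assembled and run with subprocess. This is a sketch, not part of the repo: the function names build_pdftable_cmd and run_pdftable are hypothetical, and the only flags used are the four shown in the command above.

```python
import subprocess
from pathlib import Path


def build_pdftable_cmd(pdf, output_dir, pages="all", lang="en"):
    """Build the argument list for the pdftable command shown above.

    Hypothetical helper; it only uses the flags from the example command.
    """
    return [
        "pdftable",
        "--file_path_or_url", str(pdf),
        "--output_dir", str(output_dir),
        "--pages", pages,
        "--lang", lang,
    ]


def run_pdftable(pdf, output_dir, pages="all", lang="en"):
    """Run pdftable and raise CalledProcessError on a non-zero exit code."""
    # Creating the folder up front is a precaution; the CLI may also create it.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdftable_cmd(pdf, output_dir, pages, lang), check=True)


if __name__ == "__main__":
    run_pdftable("/path/to/your.pdf", "./cv_out")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.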

Export to JSON:

python export_pdftable_to_json.py --outdir ./cv_out --outjson ./cv_out/results.json

The ./cv_out folder will contain page-*.{pdf,png,html} and model artifacts; results.json is a list of page objects with text blocks and (when present) table info.
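
To get a quick look at an export, a short script can load results.json and report what it finds. This sketch assumes only what is stated above (a top-level list of page objects). The field names inside each page object are not documented here, so the code discovers them rather than hard-coding any; summarize_results is a hypothetical name.

```python
import json
from pathlib import Path


def summarize_results(json_path):
    """Return (page_count, sorted key names seen across all page objects).

    Assumes the file holds a top-level JSON list; key names are discovered,
    not assumed, because the exact schema is not documented here.
    """
    pages = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not isinstance(pages, list):
        raise ValueError("expected a top-level JSON list of page objects")
    keys = set()
    for page in pages:
        if isinstance(page, dict):
            keys.update(page.keys())
    return len(pages), sorted(keys)


if __name__ == "__main__":
    count, keys = summarize_results("./cv_out/results.json")
    print(f"{count} page(s); keys seen: {', '.join(keys)}")
```

Running it once against sample output is an easy way to learn which keys carry the text blocks and table info before writing downstream code.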

License

Apache License, Version 2.0; see LICENSE.txt.

