SunnkerLocket89/README.md

Idaho4 Exhibits Parser

This repository provides a command-line helper that automates downloading and organising the public exhibits listed in the Idaho4_exhibits_with_full_metadata.xlsx spreadsheet. The script reads the spreadsheet, downloads the referenced PDF files, and optionally extracts the first N pages of each document into a dedicated folder.

Installation

The parser works out of the box using only the Python standard library. Optional third-party packages improve performance and unlock extras; install them individually or via the provided requirements.txt file, when available:

pip install -r requirements.txt

Usage

python run_idaho4_parser.py \
  --in-file Idaho4_exhibits_with_full_metadata.xlsx \
  --sheet Exhibits_With_Metadata \
  --workers 6 \
  --extract-pages 4

By default, the script stores the downloaded PDFs in idaho4_output/downloads and writes a JSON manifest plus a CSV summary to idaho4_output. Each downloaded file is prefixed with the zero-padded Excel row number, which guarantees unique filenames and keeps the on-disk order aligned with the worksheet.

The manifest records whether each row succeeded, was skipped (for example, because it did not contain a URL), or failed, and includes the corresponding Excel row number for quick cross-referencing. Re-run the command with --resume to continue from where a previous session stopped without re-downloading files.
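The naming and manifest conventions described above can be illustrated with a small sketch. This is not the script's actual code: the function names, the padding width of 5, the filename sanitising rule, the manifest field names, and the "pending" status are all assumptions made for illustration.

```python
from pathlib import Path


def build_filename(row_number: int, exhibit_id: str, width: int = 5) -> str:
    """Prefix the exhibit identifier with the zero-padded Excel row number."""
    # Assumption: replace anything that is not alphanumeric, '-' or '_'
    # so the identifier is safe to use in a filename.
    safe_id = "".join(c if c.isalnum() or c in "-_" else "_" for c in exhibit_id)
    return f"{row_number:0{width}d}_{safe_id}.pdf"


def manifest_entry(row_number, url, download_dir, exhibit_id, resume=False):
    """Build one hypothetical manifest record for a worksheet row."""
    if not url:
        # Rows without a URL are recorded as skipped rather than failed.
        return {"excel_row": row_number, "status": "skipped", "reason": "no URL"}
    target = Path(download_dir) / build_filename(row_number, exhibit_id)
    if resume and target.exists() and target.stat().st_size > 0:
        # With resume enabled, a non-empty file on disk counts as done.
        return {"excel_row": row_number, "status": "success", "path": str(target)}
    # "pending" is a placeholder meaning the download still has to run.
    return {"excel_row": row_number, "status": "pending", "path": str(target)}
```

Because the prefix is zero-padded, a plain alphabetical directory listing matches the worksheet order: row 2 sorts before row 10, which would not hold for unpadded numbers.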

Common flags

  • --url-column – Set the spreadsheet column that contains the PDF URL. When omitted the script attempts to infer a sensible column automatically.
  • --id-column – Configure the column that uniquely identifies each exhibit. This identifier is used to name the downloaded files.
  • --out-dir – Choose a different destination directory for all generated artefacts.
  • --manifest / --csv – Override the default output paths of the JSON manifest and the CSV summary, respectively.
  • --verbose – Enable verbose logging for troubleshooting.
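The README does not say how the URL column is inferred when --url-column is omitted. One plausible heuristic, sketched below purely as an assumption, scores each column by its header name and by the share of its non-empty cells that look like HTTP(S) URLs. The function name infer_url_column and the scoring weights are hypothetical.

```python
def infer_url_column(headers, rows):
    """Guess which column holds PDF URLs; return its header, or None."""

    def score(index, header):
        # A header containing 'url', 'link' or 'pdf' is a weak hint.
        name_hit = 2 if any(k in header.lower() for k in ("url", "link", "pdf")) else 0
        values = [str(r[index]).strip() for r in rows if index < len(r) and r[index]]
        if not values:
            return name_hit
        # The fraction of non-empty cells that look like URLs is a stronger hint.
        url_like = sum(v.lower().startswith(("http://", "https://")) for v in values)
        return name_hit + 3 * url_like / len(values)

    scores = [score(i, str(h or "")) for i, h in enumerate(headers)]
    best = max(range(len(headers)), key=scores.__getitem__, default=None)
    if best is None or scores[best] == 0:
        return None  # nothing looked like a URL column
    return headers[best]
```

For example, with headers ["Exhibit ID", "Title", "Document Link"] and rows whose third cell holds an https:// address, the sketch picks "Document Link". When no column scores above zero it returns None, in which case passing --url-column explicitly is the safer option.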

Run python run_idaho4_parser.py --help to see the full list of supported flags.
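For orientation, the flags mentioned in this README could be declared with argparse roughly as follows. This is a sketch, not the script's actual parser: only the flag names and the idaho4_output default come from the text above, while the types, the other defaults, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(prog="run_idaho4_parser.py")
    parser.add_argument("--in-file", required=True, help="spreadsheet to read")
    parser.add_argument("--sheet", default=None, help="worksheet name")
    # Assumed default; the README only shows --workers 6 as an example.
    parser.add_argument("--workers", type=int, default=4, help="parallel downloads")
    # Assumption: 0 means page extraction is disabled.
    parser.add_argument("--extract-pages", type=int, default=0, help="first N pages to extract")
    parser.add_argument("--resume", action="store_true", help="skip files already downloaded")
    parser.add_argument("--url-column", default=None, help="column holding the PDF URL")
    parser.add_argument("--id-column", default=None, help="column identifying each exhibit")
    parser.add_argument("--out-dir", default="idaho4_output", help="destination directory")
    parser.add_argument("--manifest", default=None, help="JSON manifest path")
    parser.add_argument("--csv", default=None, help="CSV summary path")
    parser.add_argument("--verbose", action="store_true", help="verbose logging")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

argparse converts dashes in flag names to underscores, so --extract-pages is read as args.extract_pages and --out-dir as args.out_dir.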
