bleeckerj/pdf-from-pdfs

PDF-from-PDFs: Assemble a Randomized, Deduplicated PDF Sampler

This Python program recursively scans a root directory for PDF files, then assembles a new PDF by randomly sampling pages from the discovered PDFs. It ensures the output PDF contains a specified total number of pages, distributing the selection as evenly as possible across all source PDFs. The program also deduplicates visually similar pages using perceptual hashing, so near-duplicate pages (such as those with only minor differences like dates) are included only once.

Why?

I've been creating these archival books to hold fragments, recollections, and week notes from the 10 years (as of 2025-09-21T08:23:36-07:00) I was working on OMATA, the analog-digital computer. Some of these were basically ‘unprintable’, by which I mean they would have been 6000-page books, such as the output from Instagram Stories. Lots of content goes in there, and in the digital realm it is ephemeral, but in the analog/material realm it could only be some kind of peculiar art exhibition (e.g., from July to November, ‘6000 Pages of Digital Stories: The Predicament of Content Creation in the Age of Influencers’).

So, for some of these, just a shard of a shard of the experience would probably more or less suffice? A random selection, that is, cause I'm not really inclined to go through all 6000 pages in something like Adobe Acrobat, which would be like wrestling with a pride of gorillas for a scrap of delicious fruit. You could write a little preface that explained the dilemma and reflected on the absurdity of it all. For those cases, I can use this little program, which roots around in a hierarchy of PDFs, makes some random selections, and also tries to avoid similar images. (Some might literally be a PDF page of the same image, but the PDF itself ends up being different, so I'm trying this Hamming-distance hash thing to sort out similarities.)

That's it. Another archive utility.

Features

  • Recursive PDF Discovery: Finds all PDF files under a given root directory and its subfolders.
  • Random Page Sampling: Randomly selects pages from each PDF, distributing the total as evenly as possible.
  • Visual Deduplication: Uses perceptual hashing to detect and skip visually identical or near-identical pages, even if they appear in different PDFs or with minor changes.
  • Configurable Similarity Threshold: The deduplication sensitivity can be adjusted with a command-line argument.
  • Command-Line Interface: Easy to use with clear arguments for input, output, and deduplication settings.
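
To make "distributing the total as evenly as possible" concrete, here is a minimal sketch of one way to do it. The function names `allocate_pages` and `sample_pages` are hypothetical and are not taken from the program itself, which may balance pages differently. The sketch hands out one page per PDF in round-robin fashion, so short PDFs are used up and the rest of the quota flows to the longer ones.

```python
import random


def allocate_pages(page_counts, total_pages):
    """Split total_pages across PDFs as evenly as possible.

    page_counts[i] is the number of pages in PDF i (hypothetical input).
    No PDF is asked for more pages than it has.
    """
    quotas = [0] * len(page_counts)
    # Never try to hand out more pages than exist in total.
    remaining = min(total_pages, sum(page_counts))
    while remaining > 0:
        # PDFs that still have unclaimed pages.
        open_slots = [i for i, n in enumerate(page_counts) if quotas[i] < n]
        for i in open_slots:
            if remaining == 0:
                break
            quotas[i] += 1
            remaining -= 1
    return quotas


def sample_pages(page_counts, total_pages, rng=None):
    """Pick random, zero-based page indices per PDF according to the quotas."""
    rng = rng or random.Random()
    quotas = allocate_pages(page_counts, total_pages)
    return [sorted(rng.sample(range(n), q)) for n, q in zip(page_counts, quotas)]


# A 2-page PDF and two 10-page PDFs sharing a 12-page budget.
quotas = allocate_pages([2, 10, 10], 12)  # [2, 5, 5]
picks = sample_pages([2, 10, 10], 12, random.Random(0))
```

Note that this sketch ignores deduplication. If near-duplicate pages are skipped after sampling, some extra draws would be needed to reach the requested total.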

Requirements

Install dependencies with:

pip install PyPDF2 pdf2image Pillow imagehash

You may also need to install poppler for pdf2image to work (see the pdf2image docs for platform-specific instructions).
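
If you are not sure whether poppler is installed, a quick check along these lines can help. This is a sketch, not part of the program: the helper name `poppler_available` is made up, and it assumes the poppler command-line tools `pdftoppm` and `pdfinfo` should be on your `PATH`, which is how pdf2image finds them by default.

```python
import shutil


def poppler_available():
    """Return True if the poppler tools pdf2image relies on are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


if not poppler_available():
    print("poppler tools not found on PATH; pdf2image will likely fail to render pages")
```

On systems where poppler lives outside `PATH`, this check reports a miss even though pdf2image can still be pointed at the binaries explicitly.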

Usage

python assemble_random_pdf.py <root_dir> <total_pages> <output_pdf> [--hash-threshold N]
  • <root_dir>: Root directory to search for PDFs
  • <total_pages>: Total number of pages in the output PDF
  • <output_pdf>: Path to the output PDF file
  • --hash-threshold N: (Optional) Hamming distance threshold for deduplication (default: 8). A lower value treats only near-identical pages as duplicates; a higher value also treats more loosely similar pages as duplicates.

Example

python assemble_random_pdf.py "/path/to/pdfs" 300 output.pdf --hash-threshold 8
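
For readers curious how a command line like this maps onto Python, here is a rough sketch using the standard `argparse` module. It mirrors the arguments documented above, but the function name `build_parser` and the help strings are illustrative guesses, not the script's actual code.

```python
import argparse


def build_parser():
    """Build a parser matching the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="assemble_random_pdf.py",
        description="Assemble a randomized, deduplicated PDF sampler.",
    )
    parser.add_argument("root_dir", help="Root directory to search for PDFs")
    parser.add_argument("total_pages", type=int, help="Total pages in the output PDF")
    parser.add_argument("output_pdf", help="Path to the output PDF file")
    parser.add_argument(
        "--hash-threshold",
        type=int,
        default=8,
        help="Hamming distance threshold for deduplication",
    )
    return parser


# Parsing the example command line from above.
args = build_parser().parse_args(
    ["/path/to/pdfs", "300", "output.pdf", "--hash-threshold", "8"]
)
```

After parsing, `args.total_pages` is the integer 300 and `args.hash_threshold` is 8, since argparse turns the dash in the flag name into an underscore.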

How Visual Deduplication Works

Each page is rendered as an image and converted to a perceptual hash. Pages are considered duplicates if the Hamming distance between their hashes is less than or equal to the threshold. This allows the program to skip pages that are visually very similar, even if they have small differences (like dates or watermarks).
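
To give a feel for what a perceptual hash does, here is a toy "average hash" in pure Python. It is a simplified stand-in: the program uses the imagehash library on rendered page images, and this README does not say which of its hash functions is used. The function `average_hash` and the tiny 4x4 "pages" below are made up for illustration. Each pixel becomes one bit, set when the pixel is brighter than the image's mean.

```python
def average_hash(pixels):
    """Hash a grayscale image given as rows of 0-255 values.

    Each pixel contributes one bit: 1 if brighter than the mean, else 0.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


# A tiny 4x4 "page" with a checkerboard of dark and light blocks.
page_a = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
# Same page with one pixel slightly changed (think: a different date stamp).
page_b = [row[:] for row in page_a]
page_b[0][0] = 30
# Same page with one pixel changed drastically.
page_c = [row[:] for row in page_a]
page_c[0][0] = 220

hash_a = average_hash(page_a)  # 0x33CC
hash_b = average_hash(page_b)  # identical to hash_a: the small change is absorbed
hash_c = average_hash(page_c)  # differs from hash_a in exactly one bit
```

The small change leaves the hash untouched, while the large one flips a single bit, which is why a distance threshold rather than exact equality is used to decide what counts as a duplicate.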

Hamming distance is the number of bits that differ between two hashes. A smaller distance means the pages are more similar; a higher threshold lets pages with more variation count as duplicates.
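
The threshold rule can be sketched in a few lines, treating each hash as a plain integer. The helper names `hamming_distance` and `filter_near_duplicates` are hypothetical, and the real program works with imagehash objects rather than raw integers, so take this as an illustration of the idea only.

```python
def hamming_distance(a, b):
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def filter_near_duplicates(hashes, threshold=8):
    """Return indices of pages to keep, skipping near-duplicates.

    A page is skipped when its hash is within `threshold` bits of a page
    that has already been kept.
    """
    kept_indices = []
    kept_hashes = []
    for index, page_hash in enumerate(hashes):
        if any(hamming_distance(page_hash, seen) <= threshold for seen in kept_hashes):
            continue  # visually too close to a page we already have
        kept_indices.append(index)
        kept_hashes.append(page_hash)
    return kept_indices


# Three pages: the second differs from the first by one bit, the third by 16.
hashes = [0x0000, 0x0001, 0xFFFF]
loose = filter_near_duplicates(hashes, threshold=2)   # [0, 2]
strict = filter_near_duplicates(hashes, threshold=0)  # [0, 1, 2]
```

With a threshold of 2 the second page is dropped as a near-duplicate of the first; with a threshold of 0 only exact hash matches are dropped, so all three pages survive. Comparing every page against every kept page is quadratic in the worst case, which is one reason large collections can be slow.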

Limitations

  • Deduplication is based on visual similarity, not text content.
  • Rendering and hashing pages can be slow for large collections.
  • Requires poppler for PDF rendering (see pdf2image docs).

License

MIT License
