Skip to content

dominik1001/paper-claw

 
 

Repository files navigation

PaperClaw

A CLI tool that turns an inbox of PDFs into an organised, agent-searchable document library. PDFs are OCR'd, classified by Claude, given a deterministic filename, and filed into a flat sidecar library.

Requirements

  • Go 1.21+
  • ocrmypdf and tesseract (for OCR)
  • An Anthropic API key (for document classification)

Installation

make deploy

This builds the binary and installs it to /usr/local/bin/paperclaw (requires sudo). To build without installing:

make build   # output: bin/paperclaw

Configuration

Setting Flag Environment variable Default
Inbox --inbox PAPERCLAW_INBOX ~/paperclaw/inbox
Library --library PAPERCLAW_LIBRARY ~/paperclaw/library

ANTHROPIC_API_KEY must be set in the environment for the process command. CLI flags take precedence over environment variables.

Usage

Process inbox PDFs

paperclaw process [--inbox PATH] [--library PATH]

Walks the inbox, OCRs each PDF, classifies it with Claude, and files it into the library. Duplicates are skipped (detected by SHA-256 hash). Failed documents land in library/_quarantine/ with a processing_error.json explaining the failure.

3 documents processed, 1 skipped (duplicate), 0 quarantined

List documents

paperclaw list [--type TYPE] [--since DATE] [--vendor VENDOR] [--overdue]

Filters are combinable:

Flag Description
--type invoice, utility_bill, bank_statement, insurance_letter, contract, government_letter, other
--since Documents dated on or after YYYY-MM-DD
--vendor Substring match on vendor name (case-insensitive)
--overdue Only documents with a past due_date
paperclaw list --type utility_bill --overdue
paperclaw list --vendor stadtwerke --since 2026-01-01

Show a document

paperclaw show <id-prefix>

Prints full metadata and OCR transcript for one document. The id-prefix is a short prefix of the SHA-256 document ID (8+ characters is typically unambiguous).

Search transcripts

paperclaw search <query>

Full-text search across all OCR transcripts. Returns matching document entries and their IDs; use paperclaw show to fetch the full content.

paperclaw search IBAN
paperclaw search "Rechnungsnummer 2024"

JSON output

All commands print JSON when stdout is not a TTY, so they compose cleanly with jq and agent tooling:

paperclaw list --type invoice | jq '.[].summary'

Library layout

~/paperclaw/library/
  process.log
  2026-04-01_stadtwerke_strom-rechnung/
    document.pdf
    transcript.md
    metadata.json
  _quarantine/
    bad-scan.pdf/
      document.pdf
      processing_error.json

Each metadata.json contains the document type, date, vendor, summary, and optional fields (amount, currency, due date, tags, language).

Development

make check   # format + lint + test
make test    # tests only
make lint    # lint only

Pre-commit hooks (via lefthook) run format, lint, and tests automatically. Run make setup once to install tooling.

Claude Code skill

skills/paperclaw/SKILL.md is a Claude Code project skill that lets an agent drive PaperClaw on your behalf. It is installed as a slash command via .claude/commands/paperclaw.md.

To use it, open this project in Claude Code and type /paperclaw followed by a natural-language question:

/paperclaw Which utility bills are overdue?
/paperclaw Find the invoice for the gadget I bought in March.
/paperclaw Show me the latest electricity bill from Stadtwerke.
/paperclaw Search for my IBAN across all documents.

The agent maps your question to the right paperclaw subcommand, runs it, and returns a plain-language answer. It uses list for filtering, search for keyword lookup, and show when you need the full document content.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Go 98.8%
  • Makefile 1.2%