anaisbetts/kreuzakt

Kreuzakt - a simple replacement for Paperless

Kreuzakt is a project that takes the best parts of Paperless, drastically improves the OCR using VLMs (vision language models), and throws out 99% of the complexity. Take every boring document in your life, make them all instantly easy to find, and (optionally) let AIs search them to answer questions for you.

What's Different:

  • Kreuzakt uses a single Docker container with an SQLite database, so there aren't a ton of moving parts
  • Rather than use Tesseract, Kreuzakt does OCR with LLMs via Kreuzberg (by default through OpenRouter, but Ollama and other local LLMs work as well). This drastically improves OCR accuracy and, by extension, search accuracy.
  • Kreuzakt provides a remote MCP server: connect Claude Desktop, Cursor, or any other MCP client to Kreuzakt and ask questions about your documents
  • Kreuzakt also uses an LLM to derive a title, description, and original date for every document, out of the box. Zero manual curation or toil.
  • Metadata can always be regenerated from the source documents, so the only thing you need to migrate is the originals

What's the Same:

  • Kreuzakt always preserves your original documents; it never edits them directly
  • Ingestion based on file watches works the same: drop documents into the 'ingest' folder and they will automatically be processed

Self-hosting with Docker Compose

services:
  kreuzakt:
    image: ghcr.io/anaisbetts/kreuzakt:latest
    ports:
      - "3000:3000"
    environment:
      OPENROUTER_KEY: ${OPENROUTER_KEY}
      TZ: Europe/Berlin  # Set your local timezone
    volumes:
      - ./docs:/data
    restart: unless-stopped

Drop this in a docker-compose.yml, set OPENROUTER_KEY in your environment or a .env file, and run docker compose up -d. The web UI is at http://localhost:3000.

The ./docs folder on the host is mounted at /data in the container and will be initialized with directories including ./docs/ingest, ./docs/originals, and ./docs/thumbnails.

Ok now what do I do?

  1. docker compose up -d
  2. Drop all of your documents into the ingest folder - they will eventually all move to the originals folder. You can see the progress at /settings - if you have a lot of documents it might take a bit.
  3. If you've got an existing Paperless install, you can run the import
  4. You can also simply drag-drop a bunch of files onto the main page

How much is this gonna cost me?

I'm too lazy to do the math on exactly how much it costs per page, but for perspective, importing 440 documents from Paperless (a few of which were up to 80 pages long) cost me ~$5.
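For a rough sense of scale, here is a back-of-the-envelope sketch using only the two numbers above. It assumes cost grows linearly with document count and that your documents have a similar mix of lengths, neither of which is guaranteed; `estimate_cost` is a made-up helper, not part of Kreuzakt.

```python
# Figures quoted above: roughly $5 to import 440 documents.
TOTAL_COST_USD = 5.0
DOCUMENTS_IMPORTED = 440


def estimate_cost(document_count: int) -> float:
    """Linear estimate in USD (assumption: a similar mix of document lengths)."""
    per_document = TOTAL_COST_USD / DOCUMENTS_IMPORTED
    return document_count * per_document


if __name__ == "__main__":
    print(f"per document: ${estimate_cost(1):.4f}")
    print(f"1000 documents: ${estimate_cost(1000):.2f}")
```

That works out to a bit over one cent per document for this particular import. Actual cost depends on page counts and on the models you configure.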

Volume mounts

Everything lives under /data by default — the SQLite database, originals, thumbnails, and the ingest folder. If you want to split things up, override with individual env vars and mount each path separately:

| Variable | Default | Description |
| INGEST_DIR | /data/ingest | Watched folder for new documents |
| IMPORT_DIR | /data/import | Staging folder for orchestrated imports (e.g. Paperless); not watched |
| ORIGINALS_DIR | /data/originals | Stored original files |
| THUMBNAILS_DIR | /data/thumbnails | Generated thumbnails |
| DB_PATH | /data/docs-ai.db | SQLite database |
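
To see which paths end up where for a given set of overrides, the defaults above can be sketched as a small lookup. This is an illustration only: `resolve_paths` and `PATH_VARS` are hypothetical names, and the rule that an explicit variable wins over the default under /data is an assumption about how the overrides combine, not a copy of Kreuzakt's own logic.

```python
from pathlib import PurePosixPath

# Per-path variables and their default location under the data root (from the table above).
PATH_VARS = {
    "INGEST_DIR": "ingest",
    "IMPORT_DIR": "import",
    "ORIGINALS_DIR": "originals",
    "THUMBNAILS_DIR": "thumbnails",
    "DB_PATH": "docs-ai.db",
}


def resolve_paths(env, data_root="/data"):
    """Return the effective path per variable: explicit override, else default under data_root."""
    root = PurePosixPath(data_root)
    return {name: env.get(name) or str(root / leaf) for name, leaf in PATH_VARS.items()}


if __name__ == "__main__":
    # Example: keep everything under /data except the originals, which live on a NAS mount.
    for name, path in resolve_paths({"ORIGINALS_DIR": "/mnt/nas/originals"}).items():
        print(f"{name}={path}")
```

Every path that ends up outside /data needs its own volume mount in the compose file.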

Optional environment variables

| Variable | Default | Description |
| OPENROUTER_KEY |  | API key for OpenRouter (recommended) |
| OPENAI_API_KEY |  | Alternative: direct OpenAI key |
| OPENAI_BASE_URL | https://openrouter.ai/api/v1 | Base URL for any OpenAI-compatible API (e.g. Ollama at http://host.docker.internal:11434/v1) |
| OCR_VLM_MODEL | openai/gpt-5.4-mini | Model used for OCR |
| METADATA_LLM_MODEL | openai/gpt-5.4 | Model used for title/description extraction |
| PORT | 3000 | Port inside the container |
| TZ | UTC | Timezone for date display (e.g. Europe/Berlin, America/New_York). Use any tz database name. |
| INGEST_WATCH_POLL | false | Poll INGEST_DIR instead of using inotify. Enable when the ingest folder is on NFS, SMB, or a FUSE mount, because inotify does not see changes made on the remote side. |
| INGEST_WATCH_POLL_INTERVAL_MS | 2000 | Poll interval in ms when INGEST_WATCH_POLL is enabled. |
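
If you are wondering what polling means in practice, the idea behind INGEST_WATCH_POLL can be sketched as follows: list the folder on a timer and compare against the previous listing, instead of waiting for kernel notifications. This is a generic illustration, not Kreuzakt's watcher; all function names here are made up.

```python
import os
import time


def snapshot(directory):
    """Map each file name in directory to its (size, mtime) signature."""
    entries = {}
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file():
                stat = entry.stat()
                entries[entry.name] = (stat.st_size, stat.st_mtime_ns)
    return entries


def new_or_changed(previous, current):
    """Names in current that are missing from previous or have a different signature."""
    return sorted(name for name, sig in current.items() if previous.get(name) != sig)


def poll(directory, interval_ms=2000, on_change=print):
    """Poll forever, reporting new or changed files once per interval."""
    previous = snapshot(directory)
    while True:
        time.sleep(interval_ms / 1000)
        current = snapshot(directory)
        for name in new_or_changed(previous, current):
            on_change(name)
        previous = current
```

Because each pass only reads the directory listing, this works on network mounts where change notifications never arrive. The trade-off is a delay of up to one interval, plus some extra I/O.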

MCP setup

Kreuzakt exposes a remote MCP endpoint at /mcp (Streamable HTTP). Replace the hostname in the snippets below with wherever you serve the app — for example https://docs.your-tailnet.ts.net/mcp when using Tailscale Serve. Most clients will not talk to plain http, so terminating TLS (Serve, a reverse proxy, etc.) is the usual approach.

Claude Desktop: npx mcp-remote@latest …

mcp-remote bridges the HTTP MCP endpoint for clients that expect a local process.

{
  "mcpServers": {
    "docs": {
      "command": "npx",
      "args": ["mcp-remote@latest", "https://docs.your-tailnet.ts.net/mcp"]
    }
  }
}

Cursor: type: "http" in MCP config

Add to .cursor/mcp.json or your project’s MCP settings.

{
  "mcpServers": {
    "docs": {
      "type": "http",
      "url": "https://docs.your-tailnet.ts.net/mcp"
    }
  }
}
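
If you would rather generate both snippets than edit hostnames by hand, a small helper along these lines can do it. It is only a sketch: the function names are hypothetical, and rejecting non-https URLs is a choice made here (because most clients will not talk to plain http), not something Kreuzakt enforces.

```python
import json
from urllib.parse import urlparse


def mcp_url(base_url):
    """Append /mcp to the app's base URL, refusing anything that is not https."""
    parsed = urlparse(base_url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("expected an https:// URL, e.g. https://docs.your-tailnet.ts.net")
    return base_url.rstrip("/") + "/mcp"


def claude_desktop_config(base_url, name="docs"):
    """Config for clients that expect a local process, bridged through mcp-remote."""
    return {"mcpServers": {name: {"command": "npx", "args": ["mcp-remote@latest", mcp_url(base_url)]}}}


def cursor_config(base_url, name="docs"):
    """Config for clients that speak HTTP MCP directly."""
    return {"mcpServers": {name: {"type": "http", "url": mcp_url(base_url)}}}


if __name__ == "__main__":
    base = "https://docs.your-tailnet.ts.net"
    print(json.dumps(claude_desktop_config(base), indent=2))
    print(json.dumps(cursor_config(base), indent=2))
```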

Example prompts

  • "Find invoices from Deutsche Telekom."
  • "What was my health insurance number again?"
  • "How much did I pay in taxes last year?"

Local development

Prerequisites: Bun (the project runs Next.js and scripts through Bun; see package.json) and a Rust toolchain for the Kreuzberg extraction CLI.

  1. Install dependencies: bun install
  2. Build the local extraction CLI: cargo build -p kreuzakt-kreuzberg
  3. Copy .env.local.example to .env.local and set at least one way to reach an OpenAI-compatible API. The usual choice is OPENROUTER_KEY. For a local LLM, set OPENAI_DEV_URL, OPENAI_DEV_KEY, and optionally OCR_VLM_DEV_MODEL / METADATA_LLM_DEV_MODEL. See .env.local.example for all variables the app and tooling recognize.
  4. Start the dev server: bun dev. The app listens on port 3000 by default (PORT). Runtime data defaults to ./data (SQLite, ingest, originals, thumbnails) unless you override DATA_DIR or individual path variables.

Other useful commands:

  • bun test — unit tests
  • cargo test — Rust extraction CLI tests
  • bun run test:integration — integration tests (loads .env.local via --env-file; requires Paperless-related vars when those tests run)
  • bun storybook — UI development on port 6006

Text export

POST /api/documents/export-text exports every document whose SQLite content column is non-empty as a ZIP of .txt files. Each file is named {id}-{sanitized-title}.txt and begins with YAML frontmatter (original_filename, document_url, original_url) followed by the extracted document body. The response is an application/zip download named kreuzakt-text-export-YYYYMMDD-HHmmss.zip. Returns 400 if there is no exportable content.

curl -X POST http://localhost:3000/api/documents/export-text -o export.zip
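
To work with the export programmatically, something like the following sketch could fetch and unpack it. It rests on assumptions the description above does not confirm: that the frontmatter is delimited by `---` lines, that its values are simple `key: value` pairs, and that the ZIP has no subfolders. The function names are made up for illustration.

```python
import io
import urllib.request
import zipfile


def fetch_export(base_url="http://localhost:3000"):
    """POST to the export endpoint and return the ZIP bytes (a 400 raises HTTPError)."""
    request = urllib.request.Request(
        f"{base_url}/api/documents/export-text", data=b"", method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def split_frontmatter(text):
    """Split text into (frontmatter dict, body); assumes '---' delimiter lines."""
    if not text.startswith("---\n"):
        return {}, text
    header, sep, body = text[4:].partition("\n---\n")
    if not sep:
        return {}, text
    fields = {}
    for line in header.splitlines():
        key, colon, value = line.partition(":")
        if colon:
            fields[key.strip()] = value.strip().strip('"')
    return fields, body.lstrip("\n")


def read_export(zip_bytes):
    """Yield (document id, frontmatter, body) for every .txt entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            doc_id = name.split("-", 1)[0]  # file names look like {id}-{sanitized-title}.txt
            fields, body = split_frontmatter(archive.read(name).decode("utf-8"))
            yield doc_id, fields, body


if __name__ == "__main__":
    for doc_id, fields, body in read_export(fetch_export()):
        print(doc_id, fields.get("original_filename"), len(body))
```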

So.... why's it called "Kreuzakt"?

It uses the library Kreuzberg, and it is a tool to help you with your "Akte" (files/documents), so the name is a portmanteau of the two. In the same way, "Berghain" is a portmanteau of "Kreuzberg" and "Friedrichshain", the two districts in Berlin that it sits between. (today you learned!)

About

A search engine for humans and computers for your most boring documents
