A comprehensive tool for generating RAG (Retrieval-Augmented Generation) eval datasets from Confluence pages and local files (more sources planned).
This system uses a distributed architecture with Kafka for task queue management, separate scraper and worker processes, and supports both batch and streaming processing.
- Scraper: Discovers and submits Confluence pages to a Kafka topic
- Worker: Processes pages from Kafka, generates QA pairs, and stores results in SQLite
- Kafka/Redpanda: Message queue for task distribution
- SQLite: Storage for processed documents and generated QA pairs
- Distributed Processing: Separate scraper and worker processes for scalability
- Batch Processing: Support for processing documents in batches for cross-document questions
- Hierarchical Crawling: Discovers parent-child relationships between pages
- Date Filtering: Filter pages by last modification date (e.g., only pages updated in last 6 months)
- Intelligent Question Generation: Creates questions of varying difficulty levels
- Multiple Languages: Support for German and English content
- HTML Processing: Robust HTML parsing for clean text extraction
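To make the HTML Processing feature concrete, here is a minimal sketch of text extraction built on Python's standard `html.parser`. It is not Slurp's actual parser: the class `TextExtractor` and the function `html_to_text` are hypothetical names, and skipping `<script>` and `<style>` content is an assumption about what "clean text" means here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `html_to_text(page_html)` on a page body would yield a single whitespace-joined string suitable as LLM input.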
# Install core package (using uv)
uv sync
# Install with standalone script dependencies
uv sync --extra scripts
# Start Redpanda (Kafka-compatible message broker) for distributed mode
docker-compose up -d # If you have a docker-compose.yml for Kafka/Redpanda
Set up your environment variables:
export CONFLUENCE_BASE_URL="https://your-domain.atlassian.net"
export CONFLUENCE_USERNAME="your-email@domain.com"
export CONFLUENCE_API_KEY="your-api-token"
export OPENROUTER_API_KEY="your-openrouter-key"
export KAFKA_BOOTSTRAP_SERVERS="localhost:19092"
export KAFKA_TOPIC="tasks"
export SQLITE_DATABASE="./data.db"
The QA generator talks to any OpenAI-compatible endpoint, selected by
--generator-base-url and an API key. The key is read from LLM_API_KEY
(falling back to OPENROUTER_API_KEY for backwards compatibility).
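The fallback order described above can be sketched in a few lines of Python. The function name `resolve_api_key` is hypothetical, and treating an empty variable as unset is an assumption, not confirmed behavior.

```python
import os


def resolve_api_key(env=None):
    """Return LLM_API_KEY if set, else OPENROUTER_API_KEY, else None.

    Assumption: an empty string counts as "not set".
    """
    env = os.environ if env is None else env
    return env.get("LLM_API_KEY") or env.get("OPENROUTER_API_KEY")
```

Passing a plain dict as `env` makes the lookup easy to test without touching the real environment.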
# Default: OpenRouter
export OPENROUTER_API_KEY="your-openrouter-key"
# Any other OpenAI-compatible endpoint
export LLM_API_KEY="$(your-token-command)" # or a static key
python -m slurp worker \
--generator-base-url https://your-llm-endpoint.example/v1 \
--generator-model your-model
Slurp ingests content through pluggable connectors, selected with
--connector. The default is local.
| Connector | Source | Requires |
|---|---|---|
| local | Files on disk (.md/.html/.txt) | nothing (no Confluence creds) |
| confluence | A Confluence space | CONFLUENCE_* credentials |
Both connectors still flow through Kafka and the LLM generator, so a broker
(infra/docker-compose.yaml) and an LLM API key (LLM_API_KEY or OPENROUTER_API_KEY) are required either way.
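As an illustration of what the local connector's discovery step might look like, the sketch below collects files by extension with `pathlib`. The function `discover_files` is hypothetical. Recursing into subdirectories and matching extensions case-insensitively are assumptions, not documented behavior.

```python
from pathlib import Path

# Extensions listed for the local connector in the table above.
DEFAULT_EXTENSIONS = (".md", ".html", ".txt")


def discover_files(local_path, extensions=DEFAULT_EXTENSIONS):
    """Return matching files under local_path, or the file itself."""
    root = Path(local_path)
    wanted = {ext.lower() for ext in extensions}
    if root.is_file():
        # A single file is accepted only if its extension matches.
        return [root] if root.suffix.lower() in wanted else []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

For example, `discover_files("./docs", [".md"])` would return only the markdown files below `./docs`.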
# Scrape a directory of documents into the queue
python -m slurp scraper --local-path ./docs
# Only markdown files
python -m slurp scraper --local-path ./docs --local-extensions .md
# A single file
python -m slurp scraper --local-path ./docs/intro.md
# Then run the worker (it dispatches on each task's connector automatically)
python -m slurp worker --generator-batch-size 1
# Serve an auto-refreshing HTML view of the generated QA pairs (default :8077)
python -m slurp render --open --sqlite-database ./data.db
The page polls the SQLite generations table, so QA pairs appear as the worker
produces them.
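Polling a table for new rows can be sketched as follows. This is not the renderer's code: `fetch_new_rows` is a hypothetical helper, and it assumes `generations` is an ordinary rowid table, so SQLite's implicit `rowid` can serve as a cursor.

```python
import sqlite3


def fetch_new_rows(conn, last_rowid=0):
    """Return (new_last_rowid, rows) for rows added after last_rowid."""
    cursor = conn.execute(
        "SELECT rowid, * FROM generations WHERE rowid > ? ORDER BY rowid",
        (last_rowid,),
    )
    rows = cursor.fetchall()
    if rows:
        # The first column of each row is the rowid we selected explicitly.
        last_rowid = rows[-1][0]
    return last_rowid, rows
```

A caller would keep the returned `last_rowid` between polls, so each poll only transfers rows it has not seen yet.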
python -m slurp skill # print the bundled SKILL.md
python -m slurp skill --install # write it to ./.claude/skills/slurp/SKILL.md
The scraper discovers Confluence pages and submits them to Kafka
(note the explicit --connector confluence, since local is the default):
# Scrape up to 50 pages from a Confluence space
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-max-pages 50
# Filter by recent pages (last 3 months)
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-months-back 3
# Skip the first 100 pages
python -m slurp scraper --connector confluence --confluence-space RESEARCH --confluence-skip 100
# Run multiple scraper workers
python -m slurp scraper --workers 2 --connector confluence --confluence-space RESEARCH
The worker processes pages from Kafka and generates QA pairs:
# Process pages individually
python -m slurp worker --generator-batch-size 1
# Process pages in batches of 4 for cross-document questions
python -m slurp worker --generator-batch-size 4
# Specify a different model
python -m slurp worker --generator-model "anthropic/claude-3-sonnet"
# Run multiple worker processes
python -m slurp worker --workers 4 --generator-language de
- --confluence-space: Confluence space key to scrape
- --confluence-max-pages: Maximum number of pages to fetch (default: 50)
- --confluence-months-back: Only process pages modified within last N months (0 = no filter, default: 0)
- --confluence-skip: Number of pages to skip (default: 0)
- --confluence-concurrency: Number of concurrent requests (default: 4)
- --confluence-page-batch-size: Number of pages to fetch per batch (default: 50)
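The semantics of --confluence-months-back (0 disables the filter) can be illustrated with a small predicate. The function `modified_within` is hypothetical, and approximating a month as 30 days is an assumption. The real scraper may compute the cutoff differently, or push it into the Confluence query.

```python
from datetime import datetime, timedelta, timezone


def modified_within(last_modified, months_back, now=None):
    """True if the page passes the filter; months_back == 0 disables it."""
    if months_back == 0:
        return True
    now = now or datetime.now(timezone.utc)
    # Assumption: one month is approximated as 30 days.
    cutoff = now - timedelta(days=30 * months_back)
    return last_modified >= cutoff
```

Both `last_modified` and `now` are expected to be timezone-aware datetimes, since Python refuses to compare aware and naive values.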
- --generator-batch-size: Number of documents to process together (default: 1)
- --generator-model: LLM model to use (default: "google/gemini-2.5-flash-preview-05-20")
- --generator-language: Language for generated questions (default: "de")
- --generator-difficulty-ratio: Question difficulty (easy/medium/hard/mixed/balanced)
- --generator-concurrency: Number of concurrent LLM requests (default: 5)
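To show what --generator-batch-size implies for the worker, here is a minimal sketch of grouping documents into batches. The helper `batched` is hypothetical. Emitting a final short batch when the document count is not a multiple of the batch size is an assumption about how leftovers are handled.

```python
from itertools import islice


def batched(documents, batch_size):
    """Yield lists of up to batch_size documents, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With a batch size of 1 each document is processed alone. With a batch size of 4 the generator sees four documents at once, which is what makes cross-document questions possible.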
The system uses SQLite for storing processed documents and generated QA pairs:
- task_results: Stores processed Confluence pages
- generations: Stores generated QA pairs with references to source pages
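The relationship between the two tables might look like the sketch below. Only the table names come from this README. Every column name here is hypothetical, and the single foreign key is a simplification, since a batch-generated QA pair may reference several source pages.

```python
import sqlite3

# Hypothetical schema: only the table names are taken from the README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_results (
    id INTEGER PRIMARY KEY,
    source_id TEXT NOT NULL,   -- hypothetical: page or file identifier
    content TEXT NOT NULL      -- hypothetical: extracted text
);
CREATE TABLE IF NOT EXISTS generations (
    id INTEGER PRIMARY KEY,
    task_result_id INTEGER NOT NULL REFERENCES task_results(id),
    question TEXT NOT NULL,
    answer TEXT NOT NULL
);
"""


def init_db(path=":memory:"):
    """Create both tables if missing and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Inspecting the real schema with `sqlite3 ./data.db .schema` is the reliable way to find the actual column names.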
- Kafka Connection Errors: Ensure Redpanda is running (docker-compose ps)
- Missing Environment Variables: Check that all required environment variables are set
- Database Errors: Verify SQLite database permissions and path
- LLM API Errors: Check your OpenRouter API key and quota
- HTML Parsing Issues: The HTML parser is optimized for Confluence pages, so HTML from other sources may extract less cleanly
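For the missing-environment-variables case, a quick pre-flight check can be sketched as below. The variable names are the ones exported earlier in this README, while `missing_vars` is a hypothetical helper. Treating every listed variable as strictly required is an assumption, since some may have defaults or CLI equivalents.

```python
import os

BASE_VARS = ["KAFKA_BOOTSTRAP_SERVERS", "KAFKA_TOPIC", "SQLITE_DATABASE"]
CONFLUENCE_VARS = [
    "CONFLUENCE_BASE_URL",
    "CONFLUENCE_USERNAME",
    "CONFLUENCE_API_KEY",
]


def missing_vars(connector="local", env=None):
    """Return the names of unset variables for the given connector."""
    env = os.environ if env is None else env
    required = list(BASE_VARS)
    if connector == "confluence":
        required += CONFLUENCE_VARS
    missing = [name for name in required if not env.get(name)]
    # Either key satisfies the generator, per the fallback described earlier.
    if not (env.get("LLM_API_KEY") or env.get("OPENROUTER_API_KEY")):
        missing.append("LLM_API_KEY or OPENROUTER_API_KEY")
    return missing
```

Running `print(missing_vars("confluence"))` before starting the scraper would list anything still unset.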
- Scraper: Discovers and submits Confluence pages to Kafka
- Worker: Processes pages from Kafka and generates QA pairs
- LLMGenerator: Generates questions and answers using LLMs
- HTMLParser: Cleans and processes HTML content
- SqlitePersistence: Stores results in SQLite database
- KafkaQueueSubmitter: Submits tasks to Kafka
- KafkaConsumer: Consumes tasks from Kafka
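How these components fit together can be sketched end to end with stand-ins: an in-memory `queue.Queue` replaces Kafka, and a canned function replaces the LLM. None of the functions below are Slurp's real classes. `scrape`, `fake_generate`, and `work` are hypothetical, as is the two-column `generations` table.

```python
import queue
import sqlite3


def scrape(pages, task_queue):
    """Stand-in for the Scraper: submit each page as a task."""
    for page in pages:
        task_queue.put(page)


def fake_generate(page):
    """Stand-in for the LLM generator; returns one canned QA pair."""
    return {
        "question": f"What is '{page['title']}' about?",
        "answer": page["text"],
    }


def work(task_queue, conn):
    """Stand-in for the Worker: drain the queue and persist QA pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS generations (question TEXT, answer TEXT)"
    )
    processed = 0
    while True:
        try:
            page = task_queue.get_nowait()
        except queue.Empty:
            break
        qa = fake_generate(page)
        conn.execute(
            "INSERT INTO generations VALUES (?, ?)",
            (qa["question"], qa["answer"]),
        )
        processed += 1
    conn.commit()
    return processed
```

The point of the split is that `scrape` and `work` only share the queue, so in the real system each side can run as separate processes and scale independently.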