# High-Throughput RAG Ingestion Engine with Concurrent Worker Pools
Project XLR8 is a high-performance, concurrent ingestion engine designed to process massive datasets for RAG (Retrieval-Augmented Generation) applications.
Unlike simple scripts that process documents sequentially, XLR8 implements a Bounded Worker Pool architecture. It reads raw text, generates vector embeddings using a local LLM (Ollama), and indexes them into a Vector Database (Weaviate)—all while strictly adhering to API rate limits and memory constraints. It features a real-time TUI (Terminal User Interface) to visualize throughput and latency.
I chose a Fan-Out / Fan-In pattern with a Token Bucket Rate Limiter. This ensures the system utilizes available concurrency without overwhelming the downstream Embedding API or the Vector Database.
- Bounded Concurrency: Prevents memory exhaustion and CPU thrashing by capping the number of active goroutines.
- Backpressure Handling: The `Jobs` channel acts as a buffer. If the database slows down, workers naturally slow down, preventing a system crash.
- Decoupled Components: The UI is an observer; it doesn't block the ingestion pipeline.
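This backpressure behaviour falls out of Go's buffered channels for free: a full buffer blocks the producer until a consumer catches up. A minimal stdlib sketch (the `produce`/`consume` helpers are illustrative, not XLR8's actual API):

```go
package main

import "fmt"

// produce pushes documents into the jobs channel. The send blocks as
// soon as the buffer is full, so a slow consumer throttles the producer.
func produce(jobs chan<- string, docs []string) {
	for _, d := range docs {
		jobs <- d // blocks when the buffer is full (backpressure)
	}
	close(jobs)
}

// consume drains the channel; it stands in for the slow
// embed-and-index step in the real pipeline.
func consume(jobs <-chan string) int {
	n := 0
	for range jobs {
		n++
	}
	return n
}

func main() {
	jobs := make(chan string, 4) // small buffer = bounded queue
	docs := []string{"a", "b", "c", "d", "e", "f"}
	go produce(jobs, docs)
	fmt.Println("processed:", consume(jobs))
}
```

Because the buffer is bounded, memory usage stays constant no matter how many documents the producer has queued up.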
Imagine an airport with thousands of passengers (Documents) trying to catch a flight (Get into the Database).
- Bad Approach: Everyone rushes the gate at once. The scanners break, security is overwhelmed, and chaos ensues.
- XLR8 Approach:
- The Line (Channel): Passengers wait in an orderly queue.
- The Security Agents (Workers): 50 agents process passengers in parallel.
- The Officer (Rate Limiter): A guard ensures only 10 people pass per second, even if agents are fast, to comply with regulations.
- The Shuttle Bus (Batcher): Once through security, passengers wait on a bus. The bus only leaves when it's full (e.g., 50 people), reducing the number of trips to the plane.
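The "Officer" in the analogy is a token bucket. XLR8 itself uses `x/time/rate` for this; the sketch below reimplements the idea in plain Go with the clock injected manually so the behaviour is deterministic (all names here are illustrative):

```go
package main

import "fmt"

// bucket is a tiny token-bucket sketch: up to `capacity` tokens, refilled
// at `rate` tokens per second; a request passes only if a token is available.
type bucket struct {
	tokens, capacity float64
	rate             float64 // tokens added per second
}

// allow consumes a token if one is available. `elapsed` is the number of
// seconds since the last call, injected so this example is deterministic.
func (b *bucket) allow(elapsed float64) bool {
	b.tokens += elapsed * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity // never exceed the burst capacity
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// "10 people per second": rate 10, small burst capacity of 2.
	b := &bucket{tokens: 2, capacity: 2, rate: 10}
	fmt.Println(b.allow(0), b.allow(0), b.allow(0)) // burst of 2, then denied
	fmt.Println(b.allow(0.1))                       // 0.1s later: one token refilled
}
```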
XLR8 is a CLI tool written in Go that implements the Producer-Consumer pattern. It spins up a fixed pool of goroutines (workers) that consume jobs from a buffered channel. Each worker acquires a token from a x/time/rate limiter before calling an external Embedding API. Results are aggregated by a Batcher which flushes to Weaviate in optimal chunk sizes to maximize I/O throughput.
Raw Text → Job Queue → Worker Pool (Rate Limited) → Ollama Embedding → Batcher → Weaviate Index
- Producer (Generator): Pushes raw documents into a buffered `Jobs` channel.
- Token Bucket Rate Limiter: Regulates the speed at which workers consume jobs, ensuring that the upstream API quota (500 RPM) is never exceeded.
- Worker Pool: A fixed set of Goroutines (N=50) that:
- Acquire a token from the Limiter.
- Generate Embeddings (Interface-driven).
- Push processed vectors to a `Results` channel.
- Consumer (Aggregator): Batches results for the Vector DB and updates the real-time TUI.
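The fan-out/fan-in core of the stages above can be sketched in a few lines of stdlib Go. This is a simplified model, not XLR8's actual code: `embed` stands in for the Ollama call, and the limiter acquisition is marked by a comment rather than implemented:

```go
package main

import (
	"fmt"
	"sync"
)

// Vector is an illustrative stand-in for an embedding result.
type Vector struct {
	Doc string
	Dim int
}

// embed is a placeholder for the real Embedding API call.
func embed(doc string) Vector { return Vector{Doc: doc, Dim: len(doc)} }

// runPool fans `workers` goroutines out over the jobs channel and
// fans their results back into a single results channel.
func runPool(workers int, docs []string) []Vector {
	jobs := make(chan string, len(docs))
	results := make(chan Vector, len(docs))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for d := range jobs {
				// In XLR8, a token is acquired from the rate
				// limiter here before calling the Embedding API.
				results <- embed(d)
			}
		}()
	}

	for _, d := range docs { // producer
		jobs <- d
	}
	close(jobs) // no more jobs: workers exit when the channel drains

	wg.Wait()
	close(results)

	var out []Vector
	for v := range results {
		out = append(out, v)
	}
	return out
}

func main() {
	vecs := runPool(4, []string{"alpha", "beta", "gamma"})
	fmt.Println("embedded:", len(vecs))
}
```

Closing `jobs` after the producer finishes is what lets every worker's `for range` loop terminate cleanly, and the `WaitGroup` guarantees all results are in before the channel is closed.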
- Race Condition Free: Validated via `go run -race`.
- Graceful Shutdown: Handles `SIGINT` (Ctrl+C) by stopping the producer, waiting for in-flight jobs to drain, and closing connections safely.
- Interface Driven: Modular design allows hot-swapping of LLM providers and Vector Databases.
- True Concurrency: Uses goroutines and channels for non-blocking processing.
- Rate Limiting: Token-bucket algorithm prevents 429 (Too Many Requests) errors from LLM providers.
- Smart Batching: Buffers embeddings in memory and flushes to DB based on size (e.g., 50 docs) or time (e.g., every 1s).
- Real-Time TUI: Interactive dashboard built with Bubble Tea showing Docs/Sec, Progress, and Error rates.
- Interface Driven: `Embedder` and `VectorStore` are interfaces, allowing hot-swapping between Ollama/OpenAI or Weaviate/Pinecone.
- Graceful Shutdown: Handles `SIGINT` (Ctrl+C) by draining in-flight jobs before exiting to prevent data corruption.
Quality assurance is built in to ensure thread safety in the highly concurrent worker pool.
- Race Condition Detection:
  ```bash
  go run -race cmd/xlr8/main.go
  ```
  Ensures no memory is shared unsafely between the 50+ concurrent goroutines.
- Unit Testing:
  ```bash
  go test ./... -v
  ```
| Component | Technology | Why I chose it |
|---|---|---|
| Language | Golang (1.23) | Best-in-class concurrency primitives (Channels/Goroutines) and raw performance. |
| Vector DB | Weaviate | Open-source, Go-native client, and excellent batching capabilities. |
| AI Inference | Ollama | Allows local, offline embedding generation (zero cost for dev/test). |
| CLI UI | Bubble Tea | Follows "The Elm Architecture" for robust, state-driven terminal UIs. |
| Rate Limiting | x/time/rate | Robust token-bucket implementation from the official `golang.org/x` modules. |
- Go 1.23+ installed.
- Docker & Docker Compose (for Weaviate).
- Ollama (installed locally for embeddings).
```bash
git clone https://github.com/AnubhavMadhav/XLR8.git
cd xlr8
go mod tidy
```
- Start Vector DB:
  ```bash
  docker-compose up -d
  ```
- Start AI Model:
  ```bash
  ollama pull nomic-embed-text
  ollama serve
  ```
```bash
go run cmd/xlr8/main.go
```
Once ingestion is complete, query your data:
```bash
go run cmd/search/main.go "fast animal"
# Output: Found 'Peregrine Falcon' (Score: 0.89)
```
Visualizing the concurrent worker pool processing documents in real-time.
| Running (Workers Active) | Completed (100% Processed) |
|---|---|
| *(screenshot)* | *(screenshot)* |
Confirming that data was successfully batched and persisted to the Vector Database.
1. Docker Infrastructure Logs:
Weaviate logs confirming batch ingestion from the XLR8 pipeline.
2. Data Verification:
Go script verifying vector dimensions and document count.
Demonstrating the engine's ability to understand context over keywords.
Test A: Concept "Fast Animals" (Matches Peregrine Falcon)
Test B: Concept "Programming" (Matches Golang, Rust, Kubernetes)
Standard Go Project Layout for maintainability.
```
xlr8/
├── cmd/
│   ├── xlr8/        # Main entry point (Ingestion)
│   └── search/      # CLI tool for querying Weaviate
├── internal/
│   ├── core/        # Domain types and Interfaces (Embedder/Store)
│   ├── pipeline/    # Orchestrator and Batching logic
│   ├── worker/      # Worker Pool implementation
│   └── ui/          # Bubble Tea TUI models
├── docker-compose.yml
└── go.mod
```
- Bounded vs. Unbounded Concurrency:
  - Decision: I chose a fixed pool (e.g., 50 workers).
  - Trade-off: Unbounded goroutines could theoretically process faster but would crash the system by exhausting memory or file descriptors under load.
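One common way to enforce such a bound is a semaphore channel. This sketch (illustrative names, not XLR8's code) caps concurrency at a fixed limit and records the peak actually observed:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// boundedRun launches `tasks` goroutines but lets at most `limit` of
// them run concurrently, using a buffered channel as a semaphore.
// It returns the peak concurrency observed.
func boundedRun(tasks, limit int) int64 {
	sem := make(chan struct{}, limit) // at most `limit` tokens available
	var cur, peak int64
	var wg sync.WaitGroup
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire: blocks once `limit` are running
			n := atomic.AddInt64(&cur, 1)
			for { // record peak concurrency (CAS loop)
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			atomic.AddInt64(&cur, -1)
			<-sem // release the token
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak within limit:", boundedRun(100, 3) <= 3)
}
```

However many tasks arrive, no more than `limit` goroutines ever hold a token at once, which is what keeps memory and file-descriptor usage flat under load.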
- Strict Interface Decoupling:
  - Decision: `Embedder` is an interface.
  - Benefit: Allows us to switch from `Ollama` (local) to `OpenAI` (cloud) by changing just one line of code in `main.go`, without touching the worker logic.
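A minimal sketch of that decoupling, with stub providers standing in for the real Ollama/OpenAI clients (the concrete types and vector dimensions here are illustrative assumptions, not XLR8's implementations):

```go
package main

import "fmt"

// Embedder mirrors the interface described above.
type Embedder interface {
	Embed(text string) ([]float32, error)
}

// OllamaEmbedder would call the local Ollama server in the real system.
type OllamaEmbedder struct{}

func (OllamaEmbedder) Embed(text string) ([]float32, error) {
	return make([]float32, 768), nil // e.g., nomic-embed-text is 768-dim
}

// OpenAIEmbedder would call the OpenAI API in the real system.
type OpenAIEmbedder struct{}

func (OpenAIEmbedder) Embed(text string) ([]float32, error) {
	return make([]float32, 1536), nil
}

// newEmbedder is the single construction site: workers only see the
// interface, so swapping providers is a one-line change here.
func newEmbedder(provider string) Embedder {
	if provider == "openai" {
		return OpenAIEmbedder{}
	}
	return OllamaEmbedder{}
}

func main() {
	vec, _ := newEmbedder("ollama").Embed("hello")
	fmt.Println("dims:", len(vec))
}
```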
- Batching Strategy:
  - Decision: The `Batcher` uses a `select` statement to flush on either a size limit OR a time interval.
  - Reason: Prevents "stranded data" where the last few documents sit in the buffer forever if the queue dries up.
Built with ❤️ and Golang by Anubhav Madhav.