Tip
Come for information; stay for the searing social critique
Exploring how supposedly ethics-neutral data processing decisions are not ethics-neutral, by applying different document chunking strategies to data available through the UK Parliament API
- What We're Building (Non-Technical) - Understanding the project without code: why chunking decisions matter, how they shape political discourse, and what makes this uncomfortable
- Technical Implementation Plan - Complete technical architecture: stack decisions, chunking strategies, Neo4j schema, RAG pipeline design, and phase-by-phase implementation guide
- Roadmap Overview - Current status across all modules, next milestones, and recent wins
- Chunking Strategy Roadmap - Progress on the four chunking pipelines (2/4 complete), MVP milestones, and future features
- Neo4j Vector Storage Roadmap - Vector database implementation (complete), hybrid indexing strategy, and storage architecture
- User Interface Roadmap - SvelteKit frontend plans, comparative visualization, and citation display
- Parliament API Roadmap - API client implementation (complete), data ingestion, and testing
- Agents & Automation Roadmap - Slash commands and documentation workflows (complete)
- Parliament API Developer Guide - Complete reference for all UK Parliament API endpoints: Hansard, Members, Votes, Committees, and Bills
- API Testing Guide - Comprehensive testing workflows using Bruno and HTTPie, validation patterns, and integration strategies
- Roadmaps, Docs & Slash Commands Guide - Complete guide to the three-part documentation system: how roadmaps track progress, how docs explain concepts, and how slash commands automate updates
- Node.js 18+
- OpenAI API key (for text-embedding-3-large embeddings)
- Neo4j 5.11+ (for vector storage, required in later phases)
- Clone the repository and install dependencies:
npm install
- Copy the environment template and configure your API keys:
cp .env.example .env
- Edit .env and add your OpenAI API key:
OPENAI_API_KEY=your_actual_openai_key_here
Test a single semantic chunking strategy (1024 tokens):
npm run test:chunking
Compare both semantic strategies side-by-side (1024 vs 256 tokens):
npm run test:chunking:compare
This will show:
- Chunk count differences between strategies
- Token granularity analysis
- Processing time comparisons
- Speaker and party distribution
- Sample chunks from each strategy
The Neo4j vector database stores chunks from all 4 chunking strategies using a hybrid indexing approach:
- Dual-label system: Each chunk has :Chunk + strategy label (:Semantic1024, :Semantic256, :Late1024, :Late256)
- 5 vector indexes: 4 strategy-specific + 1 unified for cross-strategy similarity
- Different embeddings per strategy: Semantic uses standard embeddings, late chunking uses blended embeddings (70% chunk + 30% debate context)
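The 70/30 blend used for late chunking can be pictured as a weighted sum of two vectors. The sketch below is only an illustration: the function name `blendEmbeddings`, the default weight parameter, and the re-normalisation step are assumptions, not the project's actual code.

```typescript
// Hypothetical helper: the weights mirror the 70% chunk / 30% debate context
// split described above. Re-normalising to unit length is an assumption.
function blendEmbeddings(
  chunk: number[],
  context: number[],
  chunkWeight = 0.7,
): number[] {
  if (chunk.length !== context.length) {
    throw new Error("Embeddings must have the same dimensionality");
  }
  const contextWeight = 1 - chunkWeight;

  // Weighted element-wise sum of the chunk vector and the debate-context vector.
  const blended = chunk.map(
    (value, i) => chunkWeight * value + contextWeight * context[i],
  );

  // Scale back to unit length so cosine similarity is not skewed by magnitude.
  const norm = Math.sqrt(blended.reduce((sum, value) => sum + value * value, 0));
  return norm === 0 ? blended : blended.map((value) => value / norm);
}

// Usage with toy 2-dimensional vectors (real embeddings are far larger).
const lateChunkEmbedding = blendEmbeddings([1, 0], [0, 1]);
console.log(lateChunkEmbedding);
```

Under this sketch, two chunks from the same debate share 30% of their pre-normalisation vector, which pulls them closer together in vector space than their standalone embeddings would be.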
Step 1: Set up Neo4j Aura instance (see Neo4j Setup Guide)
Step 2: Verify connection and initialize schema (creates 5 vector indexes):
npm run test:neo4j:setup
Step 3: Populate with all 4 chunking strategies (~154 chunks from test data):
npm run test:neo4j:populate
Step 4: Test comparative vector search (queries all 4 strategies simultaneously):
npm run test:neo4j:search
This demonstrates the core experiment: different chunking strategies retrieve different results for the same query. Test queries show only 8-25% overlap between strategies, indicating that chunking choices significantly affect retrieval.
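One way to put a number on the overlap between strategies is a pairwise Jaccard score over the retrieved results. This is a minimal sketch under stated assumptions: it assumes each retrieved chunk can be reduced to a comparable key (for example the ID of the Hansard contribution it came from), and the names `jaccardOverlap` and `pairwiseOverlap` are hypothetical. The project may measure overlap differently.

```typescript
// Assumption: results are keyed by strategy name, and each result is a string
// key that can be compared across strategies (e.g. a source contribution ID).
type StrategyResults = Record<string, string[]>;

// Jaccard overlap: shared keys divided by all distinct keys.
function jaccardOverlap(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  const shared = Array.from(setA).filter((key) => setB.has(key)).length;
  const union = new Set(Array.from(setA).concat(Array.from(setB))).size;
  return union === 0 ? 0 : shared / union;
}

// Compare every pair of strategies once.
function pairwiseOverlap(results: StrategyResults): Record<string, number> {
  const names = Object.keys(results);
  const overlaps: Record<string, number> = {};
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      overlaps[`${names[i]} vs ${names[j]}`] = jaccardOverlap(
        results[names[i]],
        results[names[j]],
      );
    }
  }
  return overlaps;
}

// Toy data: 2 shared keys out of 6 distinct keys gives an overlap of about 0.33.
const example: StrategyResults = {
  Semantic1024: ["a", "b", "c", "d"],
  Late1024: ["c", "d", "e", "f"],
};
console.log(pairwiseOverlap(example));
```

Jaccard is used here because it penalises both missing and extra results; a simpler "shared results out of n" count would serve equally well when every strategy returns the same number of chunks.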
The SvelteKit frontend provides an interactive comparative search interface:
Start the development server:
npm run dev
Access the UI: Open http://localhost:5173
Features:
- Comparative search: Query all 4 chunking strategies simultaneously
- Configurable results: Choose how many chunks (n) to retrieve per strategy (1-20)
- Divergence analysis: See mathematical breakdown of overlapping vs unique results
- Collapsible strategy views: Expand/collapse each strategy's results independently
- Rich citations: Party-colored speaker badges, Hansard references, similarity scores with tooltips
- Visual frames: Clear 2×2 grid showing "Early Chunking 1024/256" and "Late Chunking 1024/256"
Example queries:
- "What is the government's position on NHS funding?"
- "How did the opposition respond to the Prime Minister?"
- "What are the concerns about education policy?"