DocuCluster is a fully local semantic document clustering and exploration tool. Upload your PDFs, papers, or notes, and DocuCluster automatically discovers topics, groups related passages, and lets you explore everything on an interactive map. No cloud, no API keys: everything runs on your machine.
Built as a personal learning project while studying text clustering and topic modeling techniques from "Hands-On Large Language Models" by Jay Alammar and Maarten Grootendorst.
Upload documents → chunks are embedded → UMAP reduces dimensions → HDBSCAN finds clusters → BERTopic labels them → interactive map appears.
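The flow above could be sketched roughly as follows. This is an illustrative sketch, not DocuCluster's actual code: the `chunk_text` and `run_pipeline` names, the word-window chunking strategy, and every parameter value (chunk size, overlap, `n_components`, `min_cluster_size`) are assumptions. `run_pipeline` needs sentence-transformers, umap-learn, hdbscan, and bertopic installed.

```python
def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows (assumed chunking strategy)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def run_pipeline(chunks: list[str]):
    """Embed -> reduce -> cluster -> topic-model a list of text chunks."""
    # Heavy third-party imports stay local so chunk_text can be used on its own.
    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN
    from bertopic import BERTopic

    embedder = SentenceTransformer("thenlper/gte-small")
    embeddings = embedder.encode(chunks, show_progress_bar=False)

    # Reduce to a few dimensions for clustering (values are assumptions).
    umap_model = UMAP(n_components=5, min_dist=0.0, metric="cosine", random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                            cluster_selection_method="eom")

    topic_model = BERTopic(embedding_model=embedder,
                           umap_model=umap_model,
                           hdbscan_model=hdbscan_model)
    topics, _ = topic_model.fit_transform(chunks, embeddings)

    # A separate 2D projection supplies coordinates for the scatter view.
    coords_2d = UMAP(n_components=2, min_dist=0.0, metric="cosine",
                     random_state=42).fit_transform(embeddings)
    return topics, coords_2d, topic_model.get_topic_info()
```

Clustering on a 5-dimensional reduction and plotting a separate 2D projection is a common pattern, since 2D alone tends to lose too much structure for HDBSCAN; whether DocuCluster does this is not stated here.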
- Scatter view — UMAP 2D visualization, each dot is a chunk, color = topic
- Graph view — D3 force-directed graph showing semantic connections between chunks
- Cluster Dictionary — searchable panel showing all topics with keywords
- Hybrid Search — BM25 + semantic reranking, results light up on the map
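One way to picture the hybrid search: BM25 picks a shortlist of lexically matching chunks, then cosine similarity between embeddings reorders that shortlist. The sketch below is a dependency-free illustration under those assumptions. The function names, the whitespace tokenizer, the `k1`/`b` defaults, and the shortlist size are hypothetical, and embeddings are passed in as plain lists rather than computed by a model.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query, query_vec, chunks, chunk_vecs, top_k=3, candidates=10):
    """Return chunk indices: BM25 shortlist first, cosine rerank second."""
    corpus_tokens = [chunk.lower().split() for chunk in chunks]
    lexical = bm25_scores(query.lower().split(), corpus_tokens)
    shortlist = sorted(range(len(chunks)), key=lambda i: lexical[i],
                       reverse=True)[:candidates]
    reranked = sorted(shortlist, key=lambda i: cosine(query_vec, chunk_vecs[i]),
                      reverse=True)
    return reranked[:top_k]
```

The returned indices are what a frontend would need in order to highlight matching dots on the map.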
Backend
- FastAPI + WebSockets (real-time pipeline progress)
- sentence-transformers (thenlper/gte-small) for embeddings
- UMAP for dimensionality reduction
- HDBSCAN for clustering
- BERTopic + c-TF-IDF for topic modeling
- Flan-T5 (local) or Ollama for topic labeling
- BM25 + cosine similarity for hybrid search
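To make the c-TF-IDF step concrete: each cluster is treated as a single document, and a word's weight is its normalized frequency in that cluster times `log(1 + A / f)`, where `A` is the average word count per cluster and `f` is the word's frequency across all clusters. The sketch below follows that published BERTopic formula in plain Python. The function names and the whitespace tokenizer are assumptions, and BERTopic's own implementation works on sparse matrices instead.

```python
import math
from collections import Counter


def c_tf_idf(cluster_docs: dict[int, list[str]]) -> dict[int, dict[str, float]]:
    """Compute class-based TF-IDF weights for each cluster's vocabulary."""
    # One bag of words per cluster: all of its chunks are concatenated.
    class_counts = {
        cluster: Counter(w for doc in docs for w in doc.lower().split())
        for cluster, docs in cluster_docs.items()
    }
    # Frequency of each word across all clusters.
    total_freq = Counter()
    for counts in class_counts.values():
        total_freq.update(counts)
    avg_words = sum(total_freq.values()) / len(class_counts)

    weights = {}
    for cluster, counts in class_counts.items():
        n_words = sum(counts.values())
        weights[cluster] = {
            word: (tf / n_words) * math.log(1 + avg_words / total_freq[word])
            for word, tf in counts.items()
        }
    return weights


def top_keywords(weights, cluster_id, k=3):
    """Return the k highest-weighted words for one cluster."""
    ranked = sorted(weights[cluster_id].items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```

Words that are frequent inside one cluster but rare elsewhere rise to the top, which is why they work as topic keywords for the Cluster Dictionary.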
Frontend
- React + TypeScript
- Plotly.js for UMAP scatter plot
- D3.js force simulation for graph view
- Zustand for state management
- Vite
- Python 3.10+
- Node.js 18+
- CUDA GPU recommended (runs on CPU but slower)
- 8GB+ RAM recommended
cd backend
pip install -r requirements.txt
uvicorn main:app --reload --host 0.0.0.0 --port 8000

cd frontend
npm install
npm run dev

Install Ollama, pull a model:
ollama pull llama3.2

Then select Ollama in the UI and enter your model name.
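For a sense of what Ollama-based topic labeling might look like, here is a minimal sketch that sends a cluster's keywords and a few sample chunks to Ollama's local HTTP API and reads back a short label. How DocuCluster actually calls Ollama is not documented here, so the prompt wording and the `build_label_prompt` and `label_topic` names are assumptions. The endpoint is Ollama's default local address, and a running Ollama server is required for `label_topic`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_label_prompt(keywords: list[str], sample_chunks: list[str]) -> str:
    """Assemble a labeling prompt (wording is an assumption, not the project's)."""
    samples = "\n".join(f"- {chunk}" for chunk in sample_chunks)
    return (
        "Give a short topic label (at most five words) for a cluster of passages.\n"
        f"Keywords: {', '.join(keywords)}\n"
        f"Sample passages:\n{samples}\n"
        "Label:"
    )


def label_topic(keywords, sample_chunks, model="llama3.2", timeout=60.0) -> str:
    """Ask a locally running Ollama model for a topic label."""
    payload = json.dumps({
        "model": model,
        "prompt": build_label_prompt(keywords, sample_chunks),
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["response"].strip()
```

With the server running, `label_topic(["umap", "hdbscan"], ["UMAP reduces dimensions."])` would return whatever label the model generates.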
- Drop PDF, DOCX, TXT, or MD files into the upload area
- Adjust Cluster Granularity (lower = fewer, larger clusters)
- Click Run Pipeline and watch it process in real time
- Explore the map — click clusters to see their chunks
- Use the search bar for hybrid (BM25 + semantic) search across all documents
- Toggle between Scatter and Graph view
PDF · DOCX · TXT · Markdown
MIT