Semantic search and RAG-powered Q&A over the classic 20 Newsgroups dataset — combining Cohere embeddings, Fuzzy C-Means clustering, a pure-Python semantic cache, and Groq LLM, exposed via FastAPI and Next.js 14.
- Quick Overview
- Architecture at a Glance
- How It Works
- Tech Stack
- Project Structure
- Setup Guide
- Running the Application
- API Reference
- Design Decisions
- Docker Deployment
- For Reviewers
| What | Description |
|---|---|
| Goal | Semantic search over ~11k newsgroup posts with LLM-generated answers and a cache to avoid repeated LLM calls. |
| Backend | Python FastAPI on port 8000 — 8 REST endpoints for search, clusters, cache stats, and health. |
| Frontend | Next.js 14 (App Router) on port 3000 — Search page, Cluster Explorer, Cache Dashboard. |
| Data | 20 Newsgroups — cleaned, embedded with Cohere, stored in Supabase (pgvector). |
| One-time setup | Run setup_db → ingest → cluster (in order). Then start API + frontend. |
flowchart TB
subgraph Frontend["🖥️ Next.js 14 Frontend (port 3000)"]
Search["Search Page"]
Clusters["Cluster Explorer"]
Cache["Cache Dashboard"]
end
subgraph Backend["⚙️ FastAPI Backend (port 8000)"]
API["/api routes"]
Embedder["Cohere Embedder\n1024-dim, asymmetric"]
CacheModule["Semantic Cache\ncluster-bucketed, 0.88 threshold"]
LLM["Groq LLM\nllama-3.3-70b"]
Retriever["Supabase Retriever\npgvector search"]
end
subgraph Data["☁️ Supabase (PostgreSQL + pgvector)"]
Docs[("documents\n+ embeddings")]
ClustersTable[("document_clusters\n+ cluster_metadata")]
end
Search -->|POST /api/query| API
Clusters -->|GET /api/clusters| API
Cache -->|GET /api/cache/stats| API
API --> Embedder
API --> CacheModule
API --> LLM
API --> Retriever
Retriever --> Docs
Retriever --> ClustersTable
sequenceDiagram
participant User
participant Frontend
participant API
participant Embedder
participant Cache
participant Retriever
participant LLM
participant Supabase
User->>Frontend: Enter query & submit
Frontend->>API: POST /api/query {"query": "..."}
API->>Embedder: embed_query (search_query)
Embedder-->>API: 1024-dim vector
API->>API: Assign to nearest cluster (15 centroids)
API->>Cache: lookup(query_embedding, cluster_id)
alt Cache HIT
Cache-->>API: cached answer
API->>Retriever: semantic_search (for display)
Retriever->>Supabase: match_documents RPC
Supabase-->>API: top docs
API-->>Frontend: result (cache_hit: true)
else Cache MISS
Cache-->>API: null
API->>Retriever: semantic_search
Retriever->>Supabase: match_documents RPC
Supabase-->>API: top 5 docs
API->>LLM: answer_query(query, docs)
LLM-->>API: answer text
API->>Cache: store(query, embedding, answer, cluster_id)
API-->>Frontend: result (cache_hit: false)
end
Frontend-->>User: Show answer, docs, cluster info
flowchart LR
subgraph Step1["1. setup_db"]
A[Supabase] --> B[Create tables + indexes\npgvector, IVFFlat]
end
subgraph Step2["2. ingest"]
C[20 Newsgroups\nsklearn] --> D[Clean & dedupe]
D --> E[Cohere embed\nsearch_document]
E --> F[Supabase documents\n+ data/embeddings.npy]
end
subgraph Step3["3. cluster"]
F --> G[UMAP 1024→50 dim]
G --> H[Fuzzy C-Means\nk=15, m=2]
H --> I[document_clusters\n+ cluster_metadata\n+ centroids]
end
Step1 --> Step2 --> Step3
flowchart TD
A[Query embedding + cluster_id] --> B[Phase 1: Candidate selection]
B --> C[Collect entries from\nbucket[cluster] + bucket[cluster±1]]
C --> D[Phase 2: Cosine similarity]
D --> E{Best sim ≥ 0.88?}
E -->|Yes| F[✅ HIT — return cached answer]
E -->|No| G[❌ MISS — vector search + LLM]
| Layer | Technology | Purpose |
|---|---|---|
| Embeddings | Cohere embed-english-v3.0 | 1024-dim, asymmetric (search_document / search_query) |
| Vector DB | Supabase (PostgreSQL + pgvector) | Store documents + embeddings, cosine search via RPC |
| Clustering | scikit-fuzzy (FCM), UMAP | Soft clusters (k=15, m=2), boundary docs, 50-dim for FCM |
| LLM | Groq — llama-3.3-70b-versatile | RAG answers on cache miss only |
| Cache | In-memory Python (no Redis) | Cluster-bucketed, LRU+LFU eviction, 500 entries max |
| Backend | FastAPI + uvicorn | REST API, Pydantic v2, CORS, request ID middleware |
| Frontend | Next.js 14, Tailwind, Framer Motion | Search UI, cluster explorer, cache dashboard |
Trademarkia-RAG_AI-ML_Task/
├── backend/ # Python FastAPI application
│ ├── app/
│ │ ├── main.py # FastAPI app, lifespan, CORS, middleware
│ │ ├── config.py # pydantic-settings (env vars)
│ │ ├── api/
│ │ │ └── routes.py # All 8 API endpoints
│ │ ├── core/
│ │ │ ├── cache.py # SemanticCache — cluster-bucketed, two-phase lookup
│ │ │ └── clustering.py # ClusterAssigner — runtime centroid assignment
│ │ ├── services/
│ │ │ ├── embedder.py # CohereEmbedder — embed_query / embed_documents
│ │ │ ├── retriever.py # SupabaseRetriever — semantic_search, cluster stats
│ │ │ └── llm.py # GroqLLMService — answer_query
│ │ └── models/
│ │ └── schemas.py # Pydantic request/response models
│ ├── scripts/
│ │ ├── setup_db.py # One-time: create Supabase tables + indexes
│ │ ├── ingest.py # One-time: load 20 Newsgroups, clean, embed, store
│ │ └── cluster.py # One-time: UMAP, FCM, store memberships + metadata
│ ├── data/ # Generated by scripts (gitignored)
│ │ ├── embeddings.npy # (N, 1024) document embeddings
│ │ ├── docs_metadata.json # Cleaned document records
│ │ ├── memberships.npy # FCM membership matrix
│ │ ├── umap_2d.npy # 2D coords for visualization
│ │ └── ...
│ ├── requirements.txt
│ └── .env.example
│
├── frontend/ # Next.js 14 App Router
│ ├── app/
│ │ ├── page.tsx # Main search page
│ │ ├── clusters/page.tsx # Cluster explorer
│ │ ├── cache/page.tsx # Cache dashboard + threshold explorer
│ │ ├── layout.tsx
│ │ └── globals.css
│ ├── components/
│ │ └── Navbar.tsx
│ ├── lib/
│ │ └── api.ts # Typed API client (fetch wrappers)
│ ├── .env.local # NEXT_PUBLIC_API_URL=http://localhost:8000
│ └── package.json
│
├── docker/
│ ├── Dockerfile # Python 3.11-slim, uvicorn
│ ├── docker-compose.yml # Single API service, data volume
│ └── .dockerignore
│
└── README.md
| Requirement | Version / Notes |
|---|---|
| Python | 3.11+ |
| Node.js | 18+ |
| Supabase | Free tier — supabase.com |
| Cohere API key | dashboard.cohere.com |
| Groq API key | console.groq.com |
git clone <repo-url>
cd Trademarkia-RAG_AI-ML_Task
# Backend
cd backend
python -m venv venv
# Windows
.\venv\Scripts\activate
# Linux/macOS
# source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit backend/.env with your keys

| Variable | Description | Example |
|---|---|---|
| COHERE_API_KEY | Cohere API key | From Cohere dashboard |
| GROQ_API_KEY | Groq API key | From Groq console |
| SUPABASE_URL | Supabase project URL | https://xxx.supabase.co |
| SUPABASE_SERVICE_KEY | Service role key (not anon) | eyJ... |
| SUPABASE_DB_URL | PostgreSQL connection string | postgresql://postgres:PASSWORD@... |
| CACHE_SIMILARITY_THRESHOLD | Cache hit threshold (default 0.88) | Optional |
| N_CLUSTERS | Number of FCM clusters (default 15) | Optional |
| CORS_ORIGINS | Allowed frontend origin | http://localhost:3000 |
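As a rough illustration of how these variables and their defaults fit together, here is a stdlib-only sketch. The project itself loads them through pydantic-settings in backend/app/config.py; the `Settings` dataclass, the `load_settings` helper, and the choice of which variables are treated as required are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: every variable not marked "Optional" in the table is required.
REQUIRED = (
    "COHERE_API_KEY",
    "GROQ_API_KEY",
    "SUPABASE_URL",
    "SUPABASE_SERVICE_KEY",
    "SUPABASE_DB_URL",
)


@dataclass(frozen=True)
class Settings:
    cohere_api_key: str
    groq_api_key: str
    supabase_url: str
    supabase_service_key: str
    supabase_db_url: str
    cache_similarity_threshold: float = 0.88
    n_clusters: int = 15
    cors_origins: str = "http://localhost:3000"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return Settings(
        cohere_api_key=env["COHERE_API_KEY"],
        groq_api_key=env["GROQ_API_KEY"],
        supabase_url=env["SUPABASE_URL"],
        supabase_service_key=env["SUPABASE_SERVICE_KEY"],
        supabase_db_url=env["SUPABASE_DB_URL"],
        # Optional values fall back to the defaults listed in the table.
        cache_similarity_threshold=float(env.get("CACHE_SIMILARITY_THRESHOLD", "0.88")),
        n_clusters=int(env.get("N_CLUSTERS", "15")),
        cors_origins=env.get("CORS_ORIGINS", "http://localhost:3000"),
    )
```

Failing fast on a missing key is a design choice for this sketch: it surfaces a bad .env at startup rather than on the first Cohere or Groq call.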
# From backend/
python -m scripts.setup_db
This creates:
- documents — id, text, text_preview, embedding (vector 1024), etc.
- document_clusters — doc_id, cluster_memberships (JSONB), dominant_cluster, is_boundary_doc, entropy
- cluster_metadata — cluster_id, label, top_terms, top_categories, doc_count, centroid (vector 1024)
- cache_analytics — (optional table for logs)
- IVFFlat index on documents.embedding, B-tree indexes as needed
Then: Create the RPC function in Supabase. Open Dashboard → SQL Editor and run the match_documents SQL (see comment block in backend/scripts/setup_db.py, or the API Reference section below).
python -m scripts.ingest
- Fetches 20 Newsgroups (sklearn), cleans and deduplicates.
- Embeds with Cohere (input_type="search_document"), batch size 24, 20s sleep between batches (free-tier friendly).
- Upserts into Supabase documents and saves data/embeddings.npy and data/docs_metadata.json.
- Resume: If interrupted, re-run; it resumes from data/embeddings_checkpoint.npy and data/checkpoint_index.json.
⏱ Rough time: ~15–30+ minutes depending on rate limits.
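The batch-and-resume behaviour described above can be sketched roughly as follows. This is not the code in backend/scripts/ingest.py: `embed_in_batches`, `embed_batch`, and `on_checkpoint` are hypothetical names, and the embedding call is passed in as a plain function so the sketch runs without a Cohere key.

```python
import time
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    done: Optional[List[Vector]] = None,
    batch_size: int = 24,
    pause_s: float = 20.0,
    on_checkpoint: Optional[Callable[[List[Vector]], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Vector]:
    """Embed texts in fixed-size batches, resuming after any vectors in `done`."""
    vectors = list(done or [])
    # Resume: skip the texts that already have an embedding.
    for start in range(len(vectors), len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_batch(batch))
        if on_checkpoint is not None:
            on_checkpoint(vectors)  # e.g. write the checkpoint files to disk
        if start + batch_size < len(texts):
            sleep(pause_s)  # stay under the free-tier rate limit
    return vectors
```

In a real run, `embed_batch` would wrap the Cohere call with input_type="search_document", and `on_checkpoint` would persist the partial array so that a re-run can pass it back in as `done`.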
python -m scripts.cluster
- Loads data/embeddings.npy and data/docs_metadata.json.
- UMAP: 1024 → 50 dim (for FCM) and 1024 → 2 dim (for viz).
- Fuzzy C-Means: k=15, m=2.0 on 50-dim.
- Writes document_clusters and cluster_metadata (including 1024-dim centroids) to Supabase and local files.
⏱ Rough time: ~5–10 minutes.
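To make the per-document fields (dominant_cluster, entropy, is_boundary_doc) and the 1024-dim centroids more concrete, here is a minimal NumPy sketch of how they could be derived from an FCM membership matrix. It is an illustration, not the code in backend/scripts/cluster.py: `summarize_memberships` and the `boundary_gap` cutoff of 0.1 are assumptions, and centroids are taken as the mean embedding of documents whose dominant cluster matches.

```python
import numpy as np


def summarize_memberships(embeddings, memberships, boundary_gap=0.1):
    """Derive per-document stats and full-dimension centroids.

    embeddings:  (N, D) array of document vectors.
    memberships: (N, K) array, one row per document. scikit-fuzzy returns
                 memberships as (K, N), so transpose before calling this.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    memberships = np.asarray(memberships, dtype=float)

    dominant = memberships.argmax(axis=1)

    # Shannon entropy of each membership row: higher means more ambiguous.
    p = np.clip(memberships, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)

    # Boundary docs: small gap between the two largest memberships.
    top2 = np.sort(memberships, axis=1)[:, -2:]
    is_boundary = (top2[:, 1] - top2[:, 0]) < boundary_gap

    # Centroids in the original embedding space, so no UMAP is needed later.
    k = memberships.shape[1]
    centroids = np.zeros((k, embeddings.shape[1]))
    for c in range(k):
        members = embeddings[dominant == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
    return dominant, entropy, is_boundary, centroids
```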
cd ../frontend
cp .env.local.example .env.local
# .env.local should contain: NEXT_PUBLIC_API_URL=http://localhost:8000
npm install
cd backend
# Activate venv if not already
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
- API: http://localhost:8000
- Swagger UI: http://localhost:8000/docs
- Health: curl http://localhost:8000/api/health
cd frontend
npm run dev
- App: http://localhost:3000
- Pages: Search (/) | Clusters (/clusters) | Cache (/cache)
# Health check
curl http://localhost:8000/api/health
# Search (replace with your query)
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-d '{"query": "What are the arguments about gun control?"}'
All endpoints are under /api. Base URL: http://localhost:8000.
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/query | Semantic search + LLM answer (cache-aware). Body: {"query": "string"} |
| GET | /api/cache/stats | Cache stats: total_entries, hit_count, miss_count, hit_rate, entries_by_cluster, threshold_used |
| DELETE | /api/cache | Flush semantic cache |
| GET | /api/clusters | All cluster metadata (labels, doc_count, top_terms, top_categories, avg_entropy) |
| GET | /api/clusters/{id}/documents | Documents in cluster by membership. Query: ?limit=10 (1–100) |
| GET | /api/clusters/boundary | Most uncertain (boundary) documents. Query: ?limit=20 |
| GET | /api/query/threshold-explore | Per-threshold hit analysis. Query: ?query=... |
| GET | /api/health | Status, version, cache_entries, cache_hit_rate, clusters_loaded |
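For scripted access, the POST /api/query endpoint can be called from Python with the standard library alone. The sketch below is an illustration, not part of the repository: `build_query_request` and `ask` are hypothetical helper names, and the base URL assumes the local development server described above.

```python
import json
import urllib.request


def build_query_request(
    query: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) the POST /api/query request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, base_url: str = "http://localhost:8000", timeout: float = 60.0) -> dict:
    """Send the query and return the decoded JSON response."""
    request = build_query_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the API running, `ask("What are the arguments about gun control?")` should return a dictionary with fields such as result and cache_hit; repeating a paraphrase of the same question is a quick way to see whether the cache responds.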
Example response (POST /api/query):
{
"query": "What are the main arguments for and against gun control?",
"cache_hit": false,
"matched_query": null,
"similarity_score": null,
"result": "Based on the newsgroup discussions, the main arguments...",
"dominant_cluster": 5,
"cluster_memberships": {"5": 0.72, "9": 0.11, "3": 0.08},
"retrieved_docs": [
{
"text_preview": "The second amendment clearly states...",
"original_category": "talk.politics.guns",
"similarity": 0.89,
"dominant_cluster": 5
}
],
"processing_time_ms": 1842.33
}
Create match_documents RPC in Supabase (SQL Editor):
CREATE OR REPLACE FUNCTION match_documents(
query_embedding vector(1024),
match_count int DEFAULT 5,
filter_cluster int DEFAULT NULL
)
RETURNS TABLE (
id bigint,
doc_index integer,
text_preview text,
original_category text,
dominant_cluster integer,
cluster_memberships jsonb,
similarity float
)
LANGUAGE plpgsql AS $$
BEGIN
RETURN QUERY
SELECT
d.id, d.doc_index, d.text_preview, d.original_category,
dc.dominant_cluster, dc.cluster_memberships,
1 - (d.embedding <=> query_embedding) AS similarity
FROM documents d
JOIN document_clusters dc ON d.id = dc.doc_id
WHERE (filter_cluster IS NULL OR dc.dominant_cluster = filter_cluster)
ORDER BY d.embedding <=> query_embedding
LIMIT match_count;
END;
$$;
- Asymmetric embeddings: search_document for indexing and search_query for queries — different projections improve retrieval for paraphrased questions.
- 1024 dimensions — used consistently for storage and runtime cluster assignment.
- Soft memberships — each doc has a score per cluster; captures overlap (e.g. politics + religion).
- Boundary documents — low gap between top-2 memberships highlights ambiguous posts.
- k=15 — chosen via k-sweep (FPC + silhouette); balances thematic groups without over-splitting.
- FCM in 1024 dimensions suffers from the curse of dimensionality (distances concentrate).
- UMAP reduces to 50 dimensions (cosine) before FCM; 2D UMAP is used only for visualization.
- Runtime uses 1024-dim centroids (mean of member embeddings) so no UMAP at query time — just 15 dot products.
- ~500 entries × 1024 floats ≈ 2 MB — no need for Redis.
- Serializing/deserializing vectors for every lookup would add latency.
- Cluster-bucketed storage — lookup only in bucket[cluster] and bucket[cluster±1], ~100 comparisons instead of 500.
- LRU+LFU hybrid eviction — score = timestamp - hit_count × 3600 so popular queries stay longer.
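The runtime path in the bullets above (assign the query to its nearest centroid, then compare only against nearby cache buckets) can be sketched as follows. This is a simplified illustration rather than the code in backend/app/core/cache.py or clustering.py: `assign_cluster` and `BucketedCache` are hypothetical names, eviction is omitted, and neighbouring buckets that fall outside the cluster range are skipped, which is an assumption about how cluster±1 is handled at the edges.

```python
import numpy as np


def _unit(v):
    """L2-normalise a vector or each row of a matrix (zero vectors stay zero)."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.where(norm == 0, 1.0, norm)


def assign_cluster(query_vec, centroids) -> int:
    """Nearest centroid by cosine similarity: one dot product per centroid."""
    return int(np.argmax(_unit(centroids) @ _unit(query_vec)))


class BucketedCache:
    def __init__(self, n_clusters: int = 15, threshold: float = 0.88):
        self.n_clusters = n_clusters
        self.threshold = threshold
        self.buckets = {c: [] for c in range(n_clusters)}

    def store(self, embedding, answer: str, cluster_id: int) -> None:
        self.buckets[cluster_id].append((_unit(embedding), answer))

    def lookup(self, embedding, cluster_id: int):
        """Return (answer, similarity) on a hit, or None on a miss."""
        q = _unit(embedding)
        # Phase 1: candidate selection from the query's bucket and its neighbours.
        neighbours = [
            c for c in (cluster_id - 1, cluster_id, cluster_id + 1)
            if 0 <= c < self.n_clusters
        ]
        candidates = [entry for c in neighbours for entry in self.buckets[c]]
        if not candidates:
            return None
        # Phase 2: cosine similarity against the candidates only.
        sims = np.array([vec @ q for vec, _ in candidates])
        best = int(sims.argmax())
        if sims[best] >= self.threshold:
            return candidates[best][1], float(sims[best])
        return None
```

Storing unit vectors means each comparison in phase 2 is a single dot product, which is where the saving over scanning all entries comes from.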
| Threshold | Effect |
|---|---|
| 0.80 | Higher hit rate, more risk of off-topic reuse |
| 0.88 | Balanced: paraphrases hit, sub-topic changes miss |
| 0.95 | Strict: near-identical queries only |
Explore live: GET /api/query/threshold-explore?query=...
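The effect of the threshold can also be checked offline. The sketch below scores one query vector against a set of cached query vectors and reports whether each threshold from the table would count as a hit. It is only an illustration: `threshold_report` and its return shape are hypothetical and are not the response format of the threshold-explore endpoint, and all vectors are assumed to be non-zero.

```python
import numpy as np


def threshold_report(query_vec, cached_vecs, thresholds=(0.80, 0.88, 0.95)):
    """Report the best cosine similarity and hit/miss per threshold."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(cached_vecs, dtype=float)
    # Cosine similarity between the query and every cached vector.
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    best = float(sims.max())
    return {
        "best_similarity": best,
        "hits": {t: best >= t for t in thresholds},
    }
```

A query whose best similarity lands between 0.88 and 0.95 is the interesting case: it hits at the default threshold but would fall through to vector search and the LLM under the strict setting.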
Docker is for deployment only. One-time scripts (setup_db, ingest, cluster) are intended to run locally with your env and keys.
After running those locally:
docker-compose -f docker/docker-compose.yml build
docker-compose -f docker/docker-compose.yml up -d
curl http://localhost:8000/api/health
- No Redis — cache is in-process.
- No DB in compose — Supabase is external.
- Volume — backend/data/ is mounted so embeddings and cluster artifacts persist.
| If you want to… | Look at |
|---|---|
| See the full request/response flow | backend/app/api/routes.py — especially POST /query |
| Understand the cache | backend/app/core/cache.py — lookup, store, get_stats, flush |
| See how embeddings are used | backend/app/services/embedder.py (query vs document input_type) |
| See how clustering is used at runtime | backend/app/core/clustering.py — ClusterAssigner, centroid loading in main.py |
| Trace ingest pipeline | backend/scripts/ingest.py — load → clean → embed → store → checkpoint |
| Trace cluster pipeline | backend/scripts/cluster.py — UMAP → FCM → memberships → Supabase |
| Check frontend API usage | frontend/lib/api.ts — all endpoints; pages use these only (no hardcoded data) |
Run order for a clean run:
setup_db → (create RPC in Supabase) → ingest → cluster → start API → start frontend.
MIT