🔍 20 Newsgroups Semantic Search System

Semantic search and RAG-powered Q&A over the classic 20 Newsgroups dataset, combining Cohere embeddings, Fuzzy C-Means clustering, a pure-Python semantic cache, and a Groq-hosted LLM, exposed through a FastAPI backend and a Next.js 14 frontend.

🎯 Quick Overview

What	Description
Goal	Semantic search over ~11k newsgroup posts with LLM-generated answers and a cache to avoid repeated LLM calls.
Backend	Python FastAPI on port 8000 — 8 REST endpoints for search, clusters, cache stats, and health.
Frontend	Next.js 14 (App Router) on port 3000 — Search page, Cluster Explorer, Cache Dashboard.
Data	20 Newsgroups — cleaned, embedded with Cohere, stored in Supabase (pgvector).
One-time setup	Run `setup_db` → `ingest` → `cluster` (in order). Then start API + frontend.

🏗 Architecture at a Glance

High-Level System Diagram

flowchart TB
    subgraph Frontend["🖥️ Next.js 14 Frontend (port 3000)"]
        Search["Search Page"]
        Clusters["Cluster Explorer"]
        Cache["Cache Dashboard"]
    end

    subgraph Backend["⚙️ FastAPI Backend (port 8000)"]
        API["/api routes"]
        Embedder["Cohere Embedder\n1024-dim, asymmetric"]
        CacheModule["Semantic Cache\ncluster-bucketed, 0.88 threshold"]
        LLM["Groq LLM\nllama-3.3-70b"]
        Retriever["Supabase Retriever\npgvector search"]
    end

    subgraph Data["☁️ Supabase (PostgreSQL + pgvector)"]
        Docs[("documents\n+ embeddings")]
        ClustersTable[("document_clusters\n+ cluster_metadata")]
    end

    Search -->|POST /api/query| API
    Clusters -->|GET /api/clusters| API
    Cache -->|GET /api/cache/stats| API

    API --> Embedder
    API --> CacheModule
    API --> LLM
    API --> Retriever
    Retriever --> Docs
    Retriever --> ClustersTable

Query Flow (What Happens When You Search)

sequenceDiagram
    participant User
    participant Frontend
    participant API
    participant Embedder
    participant Cache
    participant Retriever
    participant LLM
    participant Supabase

    User->>Frontend: Enter query & submit
    Frontend->>API: POST /api/query {"query": "..."}
    API->>Embedder: embed_query (search_query)
    Embedder-->>API: 1024-dim vector
    API->>API: Assign to nearest cluster (15 centroids)
    API->>Cache: lookup(query_embedding, cluster_id)

    alt Cache HIT
        Cache-->>API: cached answer
        API->>Retriever: semantic_search (for display)
        Retriever->>Supabase: match_documents RPC
        Supabase-->>API: top docs
        API-->>Frontend: result (cache_hit: true)
    else Cache MISS
        Cache-->>API: null
        API->>Retriever: semantic_search
        Retriever->>Supabase: match_documents RPC
        Supabase-->>API: top 5 docs
        API->>LLM: answer_query(query, docs)
        LLM-->>API: answer text
        API->>Cache: store(query, embedding, answer, cluster_id)
        API-->>Frontend: result (cache_hit: false)
    end

    Frontend-->>User: Show answer, docs, cluster info
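
The sequence above can be read as a single function. The following is a minimal sketch, not the code in `backend/app/api/routes.py`: `handle_query` and all of its injected callables are hypothetical names, chosen so the control flow can be followed (and tested) without Cohere, Supabase, or Groq.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def handle_query(
    query: str,
    embed_query: Callable[[str], Vector],
    assign_cluster: Callable[[Vector], int],
    cache_lookup: Callable[[Vector, int], Optional[str]],
    cache_store: Callable[[str, Vector, str, int], None],
    semantic_search: Callable[[Vector], list],
    answer_query: Callable[[str, list], str],
) -> dict:
    """Cache-aware query handling; every collaborator is injected (hypothetical wiring)."""
    embedding = embed_query(query)
    cluster_id = assign_cluster(embedding)
    cached = cache_lookup(embedding, cluster_id)

    # Documents are retrieved on both paths: for display on a hit,
    # and as LLM context on a miss.
    docs = semantic_search(embedding)

    if cached is not None:
        answer, cache_hit = cached, True
    else:
        # Only a miss pays for an LLM call, and its answer is cached afterwards.
        answer, cache_hit = answer_query(query, docs), False
        cache_store(query, embedding, answer, cluster_id)

    return {
        "query": query,
        "cache_hit": cache_hit,
        "result": answer,
        "dominant_cluster": cluster_id,
        "retrieved_docs": docs,
    }
```

Calling this twice with the same query and a working cache should invoke `answer_query` only once.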

Data Pipeline (One-Time Setup)

flowchart LR
    subgraph Step1["1. setup_db"]
        A[Supabase] --> B[Create tables + indexes\npgvector, IVFFlat]
    end

    subgraph Step2["2. ingest"]
        C[20 Newsgroups\nsklearn] --> D[Clean & dedupe]
        D --> E[Cohere embed\nsearch_document]
        E --> F[Supabase documents\n+ data/embeddings.npy]
    end

    subgraph Step3["3. cluster"]
        F --> G[UMAP 1024→50 dim]
        G --> H[Fuzzy C-Means\nk=15, m=2]
        H --> I[document_clusters\n+ cluster_metadata\n+ centroids]
    end

    Step1 --> Step2 --> Step3

Semantic Cache Lookup (Two-Phase)

flowchart TD
    A[Query embedding + cluster_id] --> B[Phase 1: Candidate selection]
    B --> C[Collect entries from\nbucket[cluster] + bucket[cluster±1]]
    C --> D[Phase 2: Cosine similarity]
    D --> E{Best sim ≥ 0.88?}
    E -->|Yes| F[✅ HIT — return cached answer]
    E -->|No| G[❌ MISS — vector search + LLM]
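
A rough illustration of the two phases, assuming that "cluster±1" means the numerically adjacent cluster IDs and that entries are compared by cosine similarity. `BucketedCache` and `cosine` are illustrative names, not the `SemanticCache` class in `backend/app/core/cache.py`, and eviction and statistics are left out.

```python
import math
from collections import defaultdict
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class BucketedCache:
    def __init__(self, threshold: float = 0.88) -> None:
        self.threshold = threshold
        # cluster_id -> list of (embedding, answer) pairs
        self.buckets = defaultdict(list)

    def store(self, embedding: Sequence[float], answer: str, cluster_id: int) -> None:
        self.buckets[cluster_id].append((list(embedding), answer))

    def lookup(self, embedding: Sequence[float], cluster_id: int) -> Optional[str]:
        # Phase 1: gather candidates from the query's bucket and its two neighbours.
        candidates = []
        for cid in (cluster_id - 1, cluster_id, cluster_id + 1):
            candidates.extend(self.buckets.get(cid, []))

        # Phase 2: keep the most similar candidate, then apply the threshold.
        best_sim, best_answer = -1.0, None
        for cached_embedding, answer in candidates:
            sim = cosine(embedding, cached_embedding)
            if sim > best_sim:
                best_sim, best_answer = sim, answer
        return best_answer if best_sim >= self.threshold else None
```

An entry stored under cluster 4 is therefore visible to queries assigned to clusters 3, 4, or 5, and invisible to all others regardless of similarity.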

🛠 Tech Stack

Layer	Technology	Purpose
Embeddings	Cohere `embed-english-v3.0`	1024-dim, asymmetric (`search_document` / `search_query`)
Vector DB	Supabase (PostgreSQL + pgvector)	Store documents + embeddings, cosine search via RPC
Clustering	scikit-fuzzy (FCM), UMAP	Soft clusters (k=15, m=2), boundary docs, 50-dim for FCM
LLM	Groq — `llama-3.3-70b-versatile`	RAG answers on cache miss only
Cache	In-memory Python (no Redis)	Cluster-bucketed, LRU+LFU eviction, 500 entries max
Backend	FastAPI + uvicorn	REST API, Pydantic v2, CORS, request ID middleware
Frontend	Next.js 14, Tailwind, Framer Motion	Search UI, cluster explorer, cache dashboard

📁 Project Structure

Trademarkia-RAG_AI-ML_Task/
├── backend/                          # Python FastAPI application
│   ├── app/
│   │   ├── main.py                   # FastAPI app, lifespan, CORS, middleware
│   │   ├── config.py                 # pydantic-settings (env vars)
│   │   ├── api/
│   │   │   └── routes.py             # All 8 API endpoints
│   │   ├── core/
│   │   │   ├── cache.py              # SemanticCache — cluster-bucketed, two-phase lookup
│   │   │   └── clustering.py         # ClusterAssigner — runtime centroid assignment
│   │   ├── services/
│   │   │   ├── embedder.py           # CohereEmbedder — embed_query / embed_documents
│   │   │   ├── retriever.py         # SupabaseRetriever — semantic_search, cluster stats
│   │   │   └── llm.py               # GroqLLMService — answer_query
│   │   └── models/
│   │       └── schemas.py            # Pydantic request/response models
│   ├── scripts/
│   │   ├── setup_db.py               # One-time: create Supabase tables + indexes
│   │   ├── ingest.py                 # One-time: load 20 Newsgroups, clean, embed, store
│   │   └── cluster.py                # One-time: UMAP, FCM, store memberships + metadata
│   ├── data/                         # Generated by scripts (gitignored)
│   │   ├── embeddings.npy            # (N, 1024) document embeddings
│   │   ├── docs_metadata.json       # Cleaned document records
│   │   ├── memberships.npy          # FCM membership matrix
│   │   ├── umap_2d.npy               # 2D coords for visualization
│   │   └── ...
│   ├── requirements.txt
│   └── .env.example
│
├── frontend/                         # Next.js 14 App Router
│   ├── app/
│   │   ├── page.tsx                  # Main search page
│   │   ├── clusters/page.tsx        # Cluster explorer
│   │   ├── cache/page.tsx           # Cache dashboard + threshold explorer
│   │   ├── layout.tsx
│   │   └── globals.css
│   ├── components/
│   │   └── Navbar.tsx
│   ├── lib/
│   │   └── api.ts                   # Typed API client (fetch wrappers)
│   ├── .env.local                    # NEXT_PUBLIC_API_URL=http://localhost:8000
│   └── package.json
│
├── docker/
│   ├── Dockerfile                    # Python 3.11-slim, uvicorn
│   ├── docker-compose.yml            # Single API service, data volume
│   └── .dockerignore
│
└── README.md

📋 Setup Guide

Prerequisites

Requirement	Version / Notes
Python	3.11+
Node.js	18+
Supabase	Free tier — supabase.com
Cohere API key	dashboard.cohere.com
Groq API key	console.groq.com

Step 1: Clone and Backend Environment

git clone <repo-url>
cd Trademarkia-RAG_AI-ML_Task

# Backend
cd backend
python -m venv venv

# Windows
.\venv\Scripts\activate

# Linux/macOS
# source venv/bin/activate

pip install -r requirements.txt

Step 2: Environment Variables

cp .env.example .env
# Edit backend/.env with your keys

Variable	Description	Example
`COHERE_API_KEY`	Cohere API key	From Cohere dashboard
`GROQ_API_KEY`	Groq API key	From Groq console
`SUPABASE_URL`	Supabase project URL	`https://xxx.supabase.co`
`SUPABASE_SERVICE_KEY`	Service role key (not anon)	`eyJ...`
`SUPABASE_DB_URL`	PostgreSQL connection string	`postgresql://postgres:PASSWORD@...`
`CACHE_SIMILARITY_THRESHOLD`	Cache hit threshold (default 0.88)	Optional
`N_CLUSTERS`	Number of FCM clusters (default 15)	Optional
`CORS_ORIGINS`	Allowed frontend origin	`http://localhost:3000`
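
To make the required/optional split concrete, here is a stdlib-only sketch of loading these variables. The project's `config.py` uses pydantic-settings, so this is not its code; `Settings`, `load_settings`, and `REQUIRED` are hypothetical names, and treating `CORS_ORIGINS` as optional with a localhost default is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variables the table does not mark as optional.
REQUIRED = (
    "COHERE_API_KEY",
    "GROQ_API_KEY",
    "SUPABASE_URL",
    "SUPABASE_SERVICE_KEY",
    "SUPABASE_DB_URL",
)


@dataclass(frozen=True)
class Settings:
    secrets: dict
    cache_similarity_threshold: float = 0.88
    n_clusters: int = 15
    cors_origins: str = "http://localhost:3000"  # assumed default


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default) and fail fast on gaps."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return Settings(
        secrets={name: env[name] for name in REQUIRED},
        cache_similarity_threshold=float(env.get("CACHE_SIMILARITY_THRESHOLD", 0.88)),
        n_clusters=int(env.get("N_CLUSTERS", 15)),
        cors_origins=env.get("CORS_ORIGINS", "http://localhost:3000"),
    )
```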

Step 3: Database Setup

# From backend/
python -m scripts.setup_db

This creates:

documents — id, text, text_preview, embedding (vector 1024), etc.
document_clusters — doc_id, cluster_memberships (JSONB), dominant_cluster, is_boundary_doc, entropy
cluster_metadata — cluster_id, label, top_terms, top_categories, doc_count, centroid (vector 1024)
cache_analytics — (optional table for logs)
IVFFlat index on documents.embedding, B-tree indexes as needed

Next, create the match_documents RPC function in Supabase: open Dashboard → SQL Editor and run the match_documents SQL (see the comment block in backend/scripts/setup_db.py, or the API Reference section below).

Step 4: Ingest Data

python -m scripts.ingest

Fetches 20 Newsgroups (sklearn), cleans and deduplicates.
Embeds with Cohere (input_type="search_document"), batch size 24, 20s sleep between batches (free-tier friendly).
Upserts into Supabase documents and saves data/embeddings.npy and data/docs_metadata.json.
Resume: If interrupted, re-run; it resumes from data/embeddings_checkpoint.npy and data/checkpoint_index.json.
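
The batch-sleep-checkpoint loop described above might look roughly like this. It is a sketch under stated assumptions: `embed_with_resume` and `embed_batch` are hypothetical names, and a single JSON file stands in for the `.npy` plus index-file pair that `ingest.py` actually writes.

```python
import json
import time
from pathlib import Path
from typing import Callable, List


def embed_with_resume(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    checkpoint_path: Path,
    batch_size: int = 24,
    sleep_seconds: float = 20.0,
) -> List[List[float]]:
    """Embed texts in batches, persisting progress after every successful batch."""
    done: List[List[float]] = []
    if checkpoint_path.exists():
        # Resume: everything already embedded is reloaded and skipped.
        done = json.loads(checkpoint_path.read_text())["embeddings"]

    for start in range(len(done), len(texts), batch_size):
        batch = texts[start:start + batch_size]
        done.extend(embed_batch(batch))
        checkpoint_path.write_text(
            json.dumps({"next_index": len(done), "embeddings": done})
        )
        # Pause between batches to stay under the rate limit; skip after the last one.
        if start + batch_size < len(texts):
            time.sleep(sleep_seconds)
    return done
```

If `embed_batch` raises partway through, a second call with the same checkpoint path picks up at the first batch that was not saved.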

⏱ Rough time: ~15–30+ minutes depending on rate limits.

Step 5: Clustering

python -m scripts.cluster

Loads data/embeddings.npy and data/docs_metadata.json.
UMAP: 1024 → 50 dim (for FCM) and 1024 → 2 dim (for viz).
Fuzzy C-Means: k=15, m=2.0 on 50-dim.
Writes document_clusters and cluster_metadata (including 1024-dim centroids) to Supabase and local files.
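
For intuition about what the membership matrix contains, the standard Fuzzy C-Means membership update for one point can be written in a few lines. This is the textbook formula, not scikit-fuzzy's implementation, and `fcm_memberships` is an illustrative name; with m=2 the exponent 2/(m-1) is simply 2.

```python
from typing import List, Sequence


def fcm_memberships(distances: Sequence[float], m: float = 2.0) -> List[float]:
    """Memberships of one point, given its distance to each cluster centroid.

    Uses u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)); the result sums to 1.
    """
    zero_flags = [d == 0 for d in distances]
    if any(zero_flags):
        # The point sits exactly on one or more centroids: split membership among them.
        share = 1.0 / sum(zero_flags)
        return [share if is_zero else 0.0 for is_zero in zero_flags]

    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]
```

A point at distance 1 from one centroid and 2 from another gets memberships of 0.8 and 0.2: closer centroids dominate, but the farther one never drops to exactly zero.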

⏱ Rough time: ~5–10 minutes.

Step 6: Frontend Environment (Optional)

cd ../frontend
cp .env.local.example .env.local
# .env.local should contain: NEXT_PUBLIC_API_URL=http://localhost:8000
npm install

▶️ Running the Application

Start Backend

cd backend
# Activate venv if not already
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

API: http://localhost:8000
Swagger UI: http://localhost:8000/docs
Health: curl http://localhost:8000/api/health

Start Frontend

cd frontend
npm run dev

App: http://localhost:3000
Pages: Search (/) | Clusters (/clusters) | Cache (/cache)

Quick Verification

# Health check
curl http://localhost:8000/api/health

# Search (replace with your query)
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the arguments about gun control?"}'
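
The same search request can be issued from Python with only the standard library. This is a small sketch; `build_query_request` and `run_query` are illustrative helpers, not part of the project.

```python
import json
import urllib.request


def build_query_request(query: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /api/query request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, base_url: str = "http://localhost:8000") -> dict:
    """Send the request and decode the JSON response (needs the backend running)."""
    request = build_query_request(query, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend up, `run_query("What are the arguments about gun control?")["cache_hit"]` should be `False` on the first call and, if the cache is working, `True` when the same query is repeated.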

📡 API Reference

All endpoints are under /api. Base URL: http://localhost:8000.

Method	Endpoint	Description
`POST`	`/api/query`	Semantic search + LLM answer (cache-aware). Body: `{"query": "string"}`
`GET`	`/api/cache/stats`	Cache stats: total_entries, hit_count, miss_count, hit_rate, entries_by_cluster, threshold_used
`DELETE`	`/api/cache`	Flush semantic cache
`GET`	`/api/clusters`	All cluster metadata (labels, doc_count, top_terms, top_categories, avg_entropy)
`GET`	`/api/clusters/{id}/documents`	Documents in cluster by membership. Query: `?limit=10` (1–100)
`GET`	`/api/clusters/boundary`	Most uncertain (boundary) documents. Query: `?limit=20`
`GET`	`/api/query/threshold-explore`	Per-threshold hit analysis. Query: `?query=...`
`GET`	`/api/health`	Status, version, cache_entries, cache_hit_rate, clusters_loaded

Example response (POST /api/query):

{
  "query": "What are the main arguments for and against gun control?",
  "cache_hit": false,
  "matched_query": null,
  "similarity_score": null,
  "result": "Based on the newsgroup discussions, the main arguments...",
  "dominant_cluster": 5,
  "cluster_memberships": {"5": 0.72, "9": 0.11, "3": 0.08},
  "retrieved_docs": [
    {
      "text_preview": "The second amendment clearly states...",
      "original_category": "talk.politics.guns",
      "similarity": 0.89,
      "dominant_cluster": 5
    }
  ],
  "processing_time_ms": 1842.33
}

Create match_documents RPC in Supabase (SQL Editor):

CREATE OR REPLACE FUNCTION match_documents(
    query_embedding vector(1024),
    match_count int DEFAULT 5,
    filter_cluster int DEFAULT NULL
)
RETURNS TABLE (
    id bigint,
    doc_index integer,
    text_preview text,
    original_category text,
    dominant_cluster integer,
    cluster_memberships jsonb,
    similarity float
)
LANGUAGE plpgsql AS $$
BEGIN
    RETURN QUERY
    SELECT
        d.id, d.doc_index, d.text_preview, d.original_category,
        dc.dominant_cluster, dc.cluster_memberships,
        1 - (d.embedding <=> query_embedding) AS similarity
    FROM documents d
    JOIN document_clusters dc ON d.id = dc.doc_id
    WHERE (filter_cluster IS NULL OR dc.dominant_cluster = filter_cluster)
    ORDER BY d.embedding <=> query_embedding
    LIMIT match_count;
END;
$$;
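
Once the function exists, it can be called from Python through the Supabase client's RPC interface. The sketch below assumes a supabase-py client (created elsewhere with `create_client`) and its `rpc(...).execute()` call pattern; `semantic_search` here is an illustrative wrapper, not necessarily what `retriever.py` contains.

```python
from typing import Any, Optional, Sequence


def semantic_search(
    client: Any,
    query_embedding: Sequence[float],
    match_count: int = 5,
    filter_cluster: Optional[int] = None,
) -> list:
    """Call the match_documents RPC and return its rows as a list of dicts."""
    params = {
        "query_embedding": list(query_embedding),
        "match_count": match_count,
        # None is sent as NULL, which disables the cluster filter in the SQL above.
        "filter_cluster": filter_cluster,
    }
    response = client.rpc("match_documents", params).execute()
    return response.data
```

The parameter names must match the SQL function's argument names exactly, since the RPC is invoked with named arguments.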

🧠 Design Decisions

Why Cohere embed-english-v3.0?

Asymmetric embeddings: search_document for indexing and search_query for queries — different projections improve retrieval for paraphrased questions.
1024 dimensions — used consistently for storage and runtime cluster assignment.
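
In code, the asymmetry comes down to the `input_type` argument. The sketch below assumes the v1-style Cohere Python client (`cohere.Client(api_key)`), whose `embed` response exposes `.embeddings` as a list of vectors; the newer `ClientV2` returns a different shape. The two helper functions are illustrative, not the `CohereEmbedder` class itself.

```python
from typing import Any, List

MODEL = "embed-english-v3.0"


def embed_documents(co: Any, texts: List[str]) -> List[List[float]]:
    """Embed corpus texts for indexing (ingest time)."""
    response = co.embed(texts=texts, model=MODEL, input_type="search_document")
    return response.embeddings


def embed_query(co: Any, query: str) -> List[float]:
    """Embed a single user query for retrieval (request time)."""
    response = co.embed(texts=[query], model=MODEL, input_type="search_query")
    return response.embeddings[0]
```

Mixing the two up (for example, embedding queries as `search_document`) still returns 1024-dim vectors, so the mistake is silent; keeping the calls in separate functions makes it harder to make.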

Why Fuzzy C-Means (FCM)?

Soft memberships — each doc has a score per cluster; captures overlap (e.g. politics + religion).
Boundary documents — low gap between top-2 memberships highlights ambiguous posts.
k=15 — chosen via k-sweep (FPC + silhouette); balances thematic groups without over-splitting.
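
Two simple ambiguity measures over a membership vector are the top-2 gap mentioned above and Shannon entropy (the `document_clusters` table has an `entropy` column). The exact formulas and cutoffs used by `cluster.py` are not stated here, so the following is only a plausible sketch with illustrative function names.

```python
import math
from typing import Mapping


def top2_gap(memberships: Mapping[int, float]) -> float:
    """Difference between the two largest memberships (needs at least two clusters)."""
    ranked = sorted(memberships.values(), reverse=True)
    return ranked[0] - ranked[1]


def membership_entropy(memberships: Mapping[int, float]) -> float:
    """Shannon entropy in nats; higher means membership is spread more evenly."""
    return -sum(p * math.log(p) for p in memberships.values() if p > 0)


clear_doc = {0: 0.90, 1: 0.05, 2: 0.05}
boundary_doc = {0: 0.40, 1: 0.38, 2: 0.22}
```

For `boundary_doc` the gap is small and the entropy is high relative to `clear_doc`, which is the signature of a post that sits between topics.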

Why UMAP Before FCM?

FCM in 1024 dimensions suffers from the curse of dimensionality (distances concentrate).
UMAP reduces to 50 dimensions (cosine) before FCM; 2D UMAP is used only for visualization.
Runtime uses 1024-dim centroids (mean of member embeddings), so no UMAP transform is needed at query time: just 15 dot products.
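
Runtime assignment then reduces to an argmax over centroid similarities. A minimal sketch, assuming cosine similarity (dot products of unit-normalized vectors); `assign_cluster` is an illustrative function, not the `ClusterAssigner` class.

```python
import math
from typing import List, Sequence


def _unit(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def assign_cluster(embedding: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid most similar to the query embedding."""
    query = _unit(embedding)
    similarities = [
        sum(q * c for q, c in zip(query, _unit(centroid)))
        for centroid in centroids
    ]
    return max(range(len(similarities)), key=similarities.__getitem__)
```

With 15 centroids of 1024 dimensions this is about 15,000 multiply-adds per query, negligible next to the embedding API call.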

Why a Pure-Python Semantic Cache?

~500 entries × 1024 floats at 4 bytes each (float32) ≈ 2 MB, so there is no need for Redis.
Serializing/deserializing vectors for every lookup would add latency.
Cluster-bucketed storage: lookup scans only bucket[cluster] and bucket[cluster±1], i.e. 3 of 15 buckets, so ~100 comparisons instead of 500 when entries are spread evenly across clusters.
LRU+LFU hybrid eviction — score = timestamp - hit_count × 3600 so popular queries stay longer.
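
The eviction formula leaves two things open: what `timestamp` measures and which end of the score range gets evicted. The sketch below picks one reading under which popular queries do stay longer, and it is an assumption rather than a description of `cache.py`: `timestamp` is taken to be the entry's age in seconds since last access, and the entry with the highest score is evicted, so each hit offsets one hour of age.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    query: str
    last_access: float  # epoch seconds of the most recent hit or insert
    hit_count: int = 0


def eviction_score(entry: Entry, now: float) -> float:
    """Higher score means a better eviction candidate (assumed reading)."""
    age_seconds = now - entry.last_access
    return age_seconds - entry.hit_count * 3600


def pick_victim(entries: List[Entry], now: Optional[float] = None) -> Entry:
    """Choose the entry to evict when the cache is full."""
    now = time.time() if now is None else now
    return max(entries, key=lambda entry: eviction_score(entry, now))
```

Under this reading an old entry with five hits outlives a newer entry that was never reused, which is the LFU half of the hybrid; among entries with equal hit counts the least recently used one goes first, which is the LRU half.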

Cache Threshold 0.88

Threshold	Effect
0.80	Higher hit rate, more risk of off-topic reuse
0.88	Balanced: paraphrases hit, sub-topic changes miss
0.95	Strict: near-identical queries only

Explore live: GET /api/query/threshold-explore?query=...
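
The idea behind a per-threshold analysis can be shown in a couple of lines: take the best cached similarity for a query and check it against each candidate threshold. The function name and return shape below are illustrative; the actual endpoint's response format is not documented here.

```python
from typing import Dict, Sequence


def explore_thresholds(
    best_similarity: float,
    thresholds: Sequence[float] = (0.80, 0.88, 0.95),
) -> Dict[float, bool]:
    """Map each threshold to whether the best cached match would count as a hit."""
    return {threshold: best_similarity >= threshold for threshold in thresholds}
```

A paraphrase that scores 0.90 against a cached query would hit at 0.80 and 0.88 but miss at 0.95, matching the table's description of the strict setting.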

🐳 Docker Deployment

Docker is for deployment only. One-time scripts (setup_db, ingest, cluster) are intended to run locally with your env and keys.

After running those locally:

docker-compose -f docker/docker-compose.yml build
docker-compose -f docker/docker-compose.yml up -d
curl http://localhost:8000/api/health

No Redis — cache is in-process.
No DB in compose — Supabase is external.
Volume — backend/data/ is mounted so embeddings and cluster artifacts persist.

👀 For Reviewers

If you want to…	Look at
See the full request/response flow	`backend/app/api/routes.py` — especially `POST /query`
Understand the cache	`backend/app/core/cache.py` — `lookup`, `store`, `get_stats`, `flush`
See how embeddings are used	`backend/app/services/embedder.py` (query vs document input_type)
See how clustering is used at runtime	`backend/app/core/clustering.py` — `ClusterAssigner`, centroid loading in `main.py`
Trace ingest pipeline	`backend/scripts/ingest.py` — load → clean → embed → store → checkpoint
Trace cluster pipeline	`backend/scripts/cluster.py` — UMAP → FCM → memberships → Supabase
Check frontend API usage	`frontend/lib/api.ts` — all endpoints; pages use these only (no hardcoded data)

Run order for a clean run:
setup_db → (create RPC in Supabase) → ingest → cluster → start API → start frontend.

License

MIT
