Pranaykarvi/20_Newsgroups


🔍 20 Newsgroups Semantic Search System

Semantic search and RAG-powered Q&A over the classic 20 Newsgroups dataset — combining Cohere embeddings, Fuzzy C-Means clustering, a pure-Python semantic cache, and Groq LLM, exposed via FastAPI and Next.js 14.


🎯 Quick Overview

| What | Description |
| --- | --- |
| Goal | Semantic search over ~11k newsgroup posts with LLM-generated answers and a cache to avoid repeated LLM calls. |
| Backend | Python FastAPI on port 8000: 8 REST endpoints for search, clusters, cache stats, and health. |
| Frontend | Next.js 14 (App Router) on port 3000: Search page, Cluster Explorer, Cache Dashboard. |
| Data | 20 Newsgroups: cleaned, embedded with Cohere, stored in Supabase (pgvector). |
| One-time setup | Run setup_db → ingest → cluster (in order), then start the API and frontend. |

🏗 Architecture at a Glance

High-Level System Diagram

flowchart TB
    subgraph Frontend["🖥️ Next.js 14 Frontend (port 3000)"]
        Search["Search Page"]
        Clusters["Cluster Explorer"]
        Cache["Cache Dashboard"]
    end

    subgraph Backend["⚙️ FastAPI Backend (port 8000)"]
        API["/api routes"]
        Embedder["Cohere Embedder\n1024-dim, asymmetric"]
        CacheModule["Semantic Cache\ncluster-bucketed, 0.88 threshold"]
        LLM["Groq LLM\nllama-3.3-70b"]
        Retriever["Supabase Retriever\npgvector search"]
    end

    subgraph Data["☁️ Supabase (PostgreSQL + pgvector)"]
        Docs[("documents\n+ embeddings")]
        ClustersTable[("document_clusters\n+ cluster_metadata")]
    end

    Search -->|POST /api/query| API
    Clusters -->|GET /api/clusters| API
    Cache -->|GET /api/cache/stats| API

    API --> Embedder
    API --> CacheModule
    API --> LLM
    API --> Retriever
    Retriever --> Docs
    Retriever --> ClustersTable

Query Flow (What Happens When You Search)

sequenceDiagram
    participant User
    participant Frontend
    participant API
    participant Embedder
    participant Cache
    participant Retriever
    participant LLM
    participant Supabase

    User->>Frontend: Enter query & submit
    Frontend->>API: POST /api/query {"query": "..."}
    API->>Embedder: embed_query (search_query)
    Embedder-->>API: 1024-dim vector
    API->>API: Assign to nearest cluster (15 centroids)
    API->>Cache: lookup(query_embedding, cluster_id)

    alt Cache HIT
        Cache-->>API: cached answer
        API->>Retriever: semantic_search (for display)
        Retriever->>Supabase: match_documents RPC
        Supabase-->>API: top docs
        API-->>Frontend: result (cache_hit: true)
    else Cache MISS
        Cache-->>API: null
        API->>Retriever: semantic_search
        Retriever->>Supabase: match_documents RPC
        Supabase-->>API: top 5 docs
        API->>LLM: answer_query(query, docs)
        LLM-->>API: answer text
        API->>Cache: store(query, embedding, answer, cluster_id)
        API-->>Frontend: result (cache_hit: false)
    end

    Frontend-->>User: Show answer, docs, cluster info
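
The sequence above can be condensed into a single handler function. This is a minimal sketch, not the code in routes.py: the collaborator objects are duck-typed, and while the method names (`embed_query`, `lookup`, `semantic_search`, `answer_query`, `store`) follow the diagram labels, their exact signatures, the `assign` method, and the returned dict shape are assumptions made for illustration.

```python
from typing import Any, Optional, Protocol


class Cache(Protocol):
    def lookup(self, embedding: list[float], cluster_id: int) -> Optional[str]: ...
    def store(self, query: str, embedding: list[float], answer: str, cluster_id: int) -> None: ...


def handle_query(query: str, embedder: Any, assigner: Any, cache: Cache,
                 retriever: Any, llm: Any, top_k: int = 5) -> dict:
    """Cache-aware RAG flow: embed, assign cluster, try cache, else retrieve + LLM."""
    embedding = embedder.embed_query(query)
    cluster_id = assigner.assign(embedding)  # hypothetical method name

    cached = cache.lookup(embedding, cluster_id)
    # Documents are fetched on both paths: on a hit they are only displayed,
    # on a miss they also become the LLM context.
    docs = retriever.semantic_search(embedding, top_k)

    if cached is not None:
        return {"result": cached, "cache_hit": True, "retrieved_docs": docs}

    answer = llm.answer_query(query, docs)
    cache.store(query, embedding, answer, cluster_id)
    return {"result": answer, "cache_hit": False, "retrieved_docs": docs}
```

The point of the structure is that the LLM call sits only on the miss branch, so a repeated or paraphrased query costs one embedding call and one vector search.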

Data Pipeline (One-Time Setup)

flowchart LR
    subgraph Step1["1. setup_db"]
        A[Supabase] --> B[Create tables + indexes\npgvector, IVFFlat]
    end

    subgraph Step2["2. ingest"]
        C[20 Newsgroups\nsklearn] --> D[Clean & dedupe]
        D --> E[Cohere embed\nsearch_document]
        E --> F[Supabase documents\n+ data/embeddings.npy]
    end

    subgraph Step3["3. cluster"]
        F --> G[UMAP 1024→50 dim]
        G --> H[Fuzzy C-Means\nk=15, m=2]
        H --> I[document_clusters\n+ cluster_metadata\n+ centroids]
    end

    Step1 --> Step2 --> Step3

Semantic Cache Lookup (Two-Phase)

flowchart TD
    A[Query embedding + cluster_id] --> B[Phase 1: Candidate selection]
    B --> C["Collect entries from\nbucket[cluster] + bucket[cluster±1]"]
    C --> D[Phase 2: Cosine similarity]
    D --> E{Best sim ≥ 0.88?}
    E -->|Yes| F[✅ HIT — return cached answer]
    E -->|No| G[❌ MISS — vector search + LLM]
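
The two phases in this flowchart can be sketched in plain Python. This is an illustration under assumptions rather than the contents of cache.py: the bucket layout (a dict from cluster id to a list of entries) and the entry fields `embedding` and `answer` are hypothetical, and "cluster±1" is read literally as the numerically adjacent cluster ids.

```python
import math
from typing import Optional

THRESHOLD = 0.88  # default of CACHE_SIMILARITY_THRESHOLD


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(buckets: dict[int, list[dict]], embedding: list[float],
           cluster_id: int, threshold: float = THRESHOLD) -> Optional[dict]:
    """Return the best cached entry at or above the threshold, else None."""
    # Phase 1: candidates come only from the query's bucket and its two
    # numeric neighbours (cluster_id - 1 and cluster_id + 1).
    candidates = []
    for cid in (cluster_id - 1, cluster_id, cluster_id + 1):
        candidates.extend(buckets.get(cid, []))

    # Phase 2: exact cosine similarity against each candidate.
    best, best_sim = None, -1.0
    for entry in candidates:
        sim = cosine(embedding, entry["embedding"])
        if sim > best_sim:
            best, best_sim = entry, sim

    return best if best is not None and best_sim >= threshold else None
```

A `None` result corresponds to the MISS branch, where the caller falls back to vector search plus the LLM.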

🛠 Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Embeddings | Cohere embed-english-v3.0 | 1024-dim, asymmetric (search_document / search_query) |
| Vector DB | Supabase (PostgreSQL + pgvector) | Store documents + embeddings, cosine search via RPC |
| Clustering | scikit-fuzzy (FCM), UMAP | Soft clusters (k=15, m=2), boundary docs, 50-dim for FCM |
| LLM | Groq (llama-3.3-70b-versatile) | RAG answers on cache miss only |
| Cache | In-memory Python (no Redis) | Cluster-bucketed, LRU+LFU eviction, 500 entries max |
| Backend | FastAPI + uvicorn | REST API, Pydantic v2, CORS, request ID middleware |
| Frontend | Next.js 14, Tailwind, Framer Motion | Search UI, cluster explorer, cache dashboard |

📁 Project Structure

Trademarkia-RAG_AI-ML_Task/
├── backend/                          # Python FastAPI application
│   ├── app/
│   │   ├── main.py                   # FastAPI app, lifespan, CORS, middleware
│   │   ├── config.py                 # pydantic-settings (env vars)
│   │   ├── api/
│   │   │   └── routes.py             # All 8 API endpoints
│   │   ├── core/
│   │   │   ├── cache.py              # SemanticCache — cluster-bucketed, two-phase lookup
│   │   │   └── clustering.py         # ClusterAssigner — runtime centroid assignment
│   │   ├── services/
│   │   │   ├── embedder.py           # CohereEmbedder — embed_query / embed_documents
│   │   │   ├── retriever.py         # SupabaseRetriever — semantic_search, cluster stats
│   │   │   └── llm.py               # GroqLLMService — answer_query
│   │   └── models/
│   │       └── schemas.py            # Pydantic request/response models
│   ├── scripts/
│   │   ├── setup_db.py               # One-time: create Supabase tables + indexes
│   │   ├── ingest.py                 # One-time: load 20 Newsgroups, clean, embed, store
│   │   └── cluster.py                # One-time: UMAP, FCM, store memberships + metadata
│   ├── data/                         # Generated by scripts (gitignored)
│   │   ├── embeddings.npy            # (N, 1024) document embeddings
│   │   ├── docs_metadata.json       # Cleaned document records
│   │   ├── memberships.npy          # FCM membership matrix
│   │   ├── umap_2d.npy               # 2D coords for visualization
│   │   └── ...
│   ├── requirements.txt
│   └── .env.example
│
├── frontend/                         # Next.js 14 App Router
│   ├── app/
│   │   ├── page.tsx                  # Main search page
│   │   ├── clusters/page.tsx        # Cluster explorer
│   │   ├── cache/page.tsx           # Cache dashboard + threshold explorer
│   │   ├── layout.tsx
│   │   └── globals.css
│   ├── components/
│   │   └── Navbar.tsx
│   ├── lib/
│   │   └── api.ts                   # Typed API client (fetch wrappers)
│   ├── .env.local                    # NEXT_PUBLIC_API_URL=http://localhost:8000
│   └── package.json
│
├── docker/
│   ├── Dockerfile                    # Python 3.11-slim, uvicorn
│   ├── docker-compose.yml            # Single API service, data volume
│   └── .dockerignore
│
└── README.md

📋 Setup Guide

Prerequisites

| Requirement | Version / Notes |
| --- | --- |
| Python | 3.11+ |
| Node.js | 18+ |
| Supabase | Free tier (supabase.com) |
| Cohere API key | dashboard.cohere.com |
| Groq API key | console.groq.com |

Step 1: Clone and Backend Environment

git clone <repo-url>
cd Trademarkia-RAG_AI-ML_Task

# Backend
cd backend
python -m venv venv

# Windows
.\venv\Scripts\activate

# Linux/macOS
# source venv/bin/activate

pip install -r requirements.txt

Step 2: Environment Variables

cp .env.example .env
# Edit backend/.env with your keys

| Variable | Description | Example |
| --- | --- | --- |
| COHERE_API_KEY | Cohere API key | From Cohere dashboard |
| GROQ_API_KEY | Groq API key | From Groq console |
| SUPABASE_URL | Supabase project URL | https://xxx.supabase.co |
| SUPABASE_SERVICE_KEY | Service role key (not anon) | eyJ... |
| SUPABASE_DB_URL | PostgreSQL connection string | postgresql://postgres:PASSWORD@... |
| CACHE_SIMILARITY_THRESHOLD | Cache hit threshold (default 0.88) | Optional |
| N_CLUSTERS | Number of FCM clusters (default 15) | Optional |
| CORS_ORIGINS | Allowed frontend origin | http://localhost:3000 |
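
To illustrate how these variables might be read and validated at startup, here is a standard-library sketch. The project itself uses pydantic-settings in config.py, so this is a stand-in rather than the actual implementation; the `Settings` field names, the `load_settings` helper, and the fallback value for CORS_ORIGINS are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    cohere_api_key: str
    groq_api_key: str
    supabase_url: str
    supabase_service_key: str
    supabase_db_url: str
    cache_similarity_threshold: float = 0.88
    n_clusters: int = 15
    cors_origins: str = "http://localhost:3000"  # assumed fallback


REQUIRED = ("COHERE_API_KEY", "GROQ_API_KEY", "SUPABASE_URL",
            "SUPABASE_SERVICE_KEY", "SUPABASE_DB_URL")


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail fast on missing keys, apply documented defaults for optional ones."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        cohere_api_key=env["COHERE_API_KEY"],
        groq_api_key=env["GROQ_API_KEY"],
        supabase_url=env["SUPABASE_URL"],
        supabase_service_key=env["SUPABASE_SERVICE_KEY"],
        supabase_db_url=env["SUPABASE_DB_URL"],
        cache_similarity_threshold=float(env.get("CACHE_SIMILARITY_THRESHOLD", 0.88)),
        n_clusters=int(env.get("N_CLUSTERS", 15)),
        cors_origins=env.get("CORS_ORIGINS", "http://localhost:3000"),
    )
```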

Step 3: Database Setup

# From backend/
python -m scripts.setup_db

This creates:

  • documents — id, text, text_preview, embedding (vector 1024), etc.
  • document_clusters — doc_id, cluster_memberships (JSONB), dominant_cluster, is_boundary_doc, entropy
  • cluster_metadata — cluster_id, label, top_terms, top_categories, doc_count, centroid (vector 1024)
  • cache_analytics — (optional table for logs)
  • IVFFlat index on documents.embedding, B-tree indexes as needed

Then: Create the RPC function in Supabase. Open Dashboard → SQL Editor and run the match_documents SQL (see comment block in backend/scripts/setup_db.py, or the API Reference section below).

Step 4: Ingest Data

python -m scripts.ingest
  • Fetches 20 Newsgroups (sklearn), cleans and deduplicates.
  • Embeds with Cohere (input_type="search_document"), batch size 24, 20s sleep between batches (free-tier friendly).
  • Upserts into Supabase documents and saves data/embeddings.npy and data/docs_metadata.json.
  • Resume: If interrupted, re-run; it resumes from data/embeddings_checkpoint.npy and data/checkpoint_index.json.

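Rough time: ~15–30+ minutes depending on rate limits. If the 20 s pause is applied after every batch, ~11k documents at 24 per batch (roughly 460 batches) would take closer to 2.5 hours.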
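
The batching and resume behaviour described above can be sketched as follows. This is a simplified illustration, not the code in ingest.py: it keeps vectors in a single JSON checkpoint file rather than the `.npy` plus index pair the script uses, and `embed_batch` is a hypothetical callable standing in for the Cohere request.

```python
import json
import time
from pathlib import Path
from typing import Callable, Sequence


def embed_with_checkpoint(texts: Sequence[str],
                          embed_batch: Callable[[list[str]], list[list[float]]],
                          checkpoint: Path,
                          batch_size: int = 24,
                          sleep_s: float = 20.0) -> list[list[float]]:
    """Embed texts in batches, persisting progress after every batch."""
    # Resume from a previous run if a checkpoint file exists.
    vectors: list[list[float]] = []
    if checkpoint.exists():
        vectors = json.loads(checkpoint.read_text())["vectors"]

    for start in range(len(vectors), len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_batch(batch))
        checkpoint.write_text(json.dumps({"next_index": len(vectors), "vectors": vectors}))
        if len(vectors) < len(texts):
            time.sleep(sleep_s)  # pause between batches to respect rate limits
    return vectors
```

Because the checkpoint is written after each batch, an interrupted run loses at most one batch of work when re-run.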

Step 5: Clustering

python -m scripts.cluster
  • Loads data/embeddings.npy and data/docs_metadata.json.
  • UMAP: 1024 → 50 dim (for FCM) and 1024 → 2 dim (for viz).
  • Fuzzy C-Means: k=15, m=2.0 on 50-dim.
  • Writes document_clusters and cluster_metadata (including 1024-dim centroids) to Supabase and local files.

Rough time: ~5–10 minutes.
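
To make the soft memberships concrete, the sketch below computes the standard Fuzzy C-Means membership of one point given fixed centroids, u_i = 1 / Σ_k (d_i / d_k)^(2/(m−1)). It is a dependency-free illustration of the formula only; the project delegates the full iterative algorithm to scikit-fuzzy, and the function name here is hypothetical.

```python
import math


def fcm_memberships(point: list[float], centroids: list[list[float]],
                    m: float = 2.0) -> list[float]:
    """Membership of one point in each cluster (values sum to 1)."""
    dists = [math.dist(point, c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
```

With m=2 the exponent is 2, so a point twice as far from centroid B as from centroid A gets memberships of about 0.8 and 0.2 respectively.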

Step 6: Frontend Environment (Optional)

cd ../frontend
cp .env.local.example .env.local
# .env.local should contain: NEXT_PUBLIC_API_URL=http://localhost:8000
npm install

▶️ Running the Application

Start Backend

cd backend
# Activate venv if not already
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Start Frontend

cd frontend
npm run dev

Quick Verification

# Health check
curl http://localhost:8000/api/health

# Search (replace with your query)
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the arguments about gun control?"}'
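
The same search request can be issued from Python using only the standard library. This is a small sketch; the helper names `build_query_request` and `ask` are made up for illustration, and `ask` needs the backend running on port 8000.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_query_request(query: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/query request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (requires a running backend)."""
    with urllib.request.urlopen(build_query_request(query, base_url), timeout=timeout) as resp:
        return json.load(resp)
```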

📡 API Reference

All endpoints are under /api. Base URL: http://localhost:8000.

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/query | Semantic search + LLM answer (cache-aware). Body: {"query": "string"} |
| GET | /api/cache/stats | Cache stats: total_entries, hit_count, miss_count, hit_rate, entries_by_cluster, threshold_used |
| DELETE | /api/cache | Flush semantic cache |
| GET | /api/clusters | All cluster metadata (labels, doc_count, top_terms, top_categories, avg_entropy) |
| GET | /api/clusters/{id}/documents | Documents in cluster by membership. Query: ?limit=10 (1–100) |
| GET | /api/clusters/boundary | Most uncertain (boundary) documents. Query: ?limit=20 |
| GET | /api/query/threshold-explore | Per-threshold hit analysis. Query: ?query=... |
| GET | /api/health | Status, version, cache_entries, cache_hit_rate, clusters_loaded |

Example response (POST /api/query):

{
  "query": "What are the main arguments for and against gun control?",
  "cache_hit": false,
  "matched_query": null,
  "similarity_score": null,
  "result": "Based on the newsgroup discussions, the main arguments...",
  "dominant_cluster": 5,
  "cluster_memberships": {"5": 0.72, "9": 0.11, "3": 0.08},
  "retrieved_docs": [
    {
      "text_preview": "The second amendment clearly states...",
      "original_category": "talk.politics.guns",
      "similarity": 0.89,
      "dominant_cluster": 5
    }
  ],
  "processing_time_ms": 1842.33
}

Create match_documents RPC in Supabase (SQL Editor):

CREATE OR REPLACE FUNCTION match_documents(
    query_embedding vector(1024),
    match_count int DEFAULT 5,
    filter_cluster int DEFAULT NULL
)
RETURNS TABLE (
    id bigint,
    doc_index integer,
    text_preview text,
    original_category text,
    dominant_cluster integer,
    cluster_memberships jsonb,
    similarity float
)
LANGUAGE plpgsql AS $$
BEGIN
    RETURN QUERY
    SELECT
        d.id, d.doc_index, d.text_preview, d.original_category,
        dc.dominant_cluster, dc.cluster_memberships,
        1 - (d.embedding <=> query_embedding) AS similarity
    FROM documents d
    JOIN document_clusters dc ON d.id = dc.doc_id
    WHERE (filter_cluster IS NULL OR dc.dominant_cluster = filter_cluster)
    ORDER BY d.embedding <=> query_embedding
    LIMIT match_count;
END;
$$;
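
Once the function exists, it can be called from Python roughly as shown below. This sketch assumes a supabase-py client (created elsewhere with `create_client(SUPABASE_URL, SUPABASE_SERVICE_KEY)`) whose `rpc(name, params).execute()` call returns an object with a `data` attribute; the wrapper function itself is illustrative and may differ from what retriever.py does.

```python
from typing import Any, Optional


def match_documents(client: Any, query_embedding: list[float],
                    match_count: int = 5,
                    filter_cluster: Optional[int] = None) -> list[dict]:
    """Call the match_documents RPC and return its rows."""
    # Parameter names must match the SQL function's argument names.
    params = {
        "query_embedding": query_embedding,
        "match_count": match_count,
        "filter_cluster": filter_cluster,
    }
    response = client.rpc("match_documents", params).execute()
    return response.data
```

Passing `filter_cluster=None` keeps the search global, which matches the `filter_cluster IS NULL` branch in the SQL.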

🧠 Design Decisions

Why Cohere embed-english-v3.0?

  • Asymmetric embeddings: search_document for indexing and search_query for queries — different projections improve retrieval for paraphrased questions.
  • 1024 dimensions — used consistently for storage and runtime cluster assignment.
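
The asymmetric usage comes down to passing a different `input_type` at index time and at query time. The sketch below assumes a Cohere SDK client object exposing `embed(texts=..., model=..., input_type=...)` with an `embeddings` attribute on the response; the two wrapper functions are illustrative and not necessarily how embedder.py is written.

```python
from typing import Any

MODEL = "embed-english-v3.0"


def embed_documents(client: Any, texts: list[str]) -> list[list[float]]:
    """Embed corpus passages for indexing."""
    return client.embed(texts=texts, model=MODEL, input_type="search_document").embeddings


def embed_query(client: Any, query: str) -> list[float]:
    """Embed a single user query for retrieval."""
    return client.embed(texts=[query], model=MODEL, input_type="search_query").embeddings[0]
```

Mixing the two up (for example, embedding queries as `search_document`) still returns 1024-dim vectors, so the mistake tends to show up as degraded ranking rather than an error.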

Why Fuzzy C-Means (FCM)?

  • Soft memberships — each doc has a score per cluster; captures overlap (e.g. politics + religion).
  • Boundary documents — low gap between top-2 memberships highlights ambiguous posts.
  • k=15 — chosen via k-sweep (FPC + silhouette); balances thematic groups without over-splitting.
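
Boundary detection from a membership row can be sketched as follows. The top-2 gap and entropy correspond to the quantities named above and in the document_clusters table, but the `max_gap` cutoff of 0.1 is purely illustrative and is not the project's value.

```python
import math


def top2_gap(memberships: list[float]) -> float:
    """Difference between the two largest membership scores."""
    first, second = sorted(memberships, reverse=True)[:2]
    return first - second


def entropy(memberships: list[float]) -> float:
    """Shannon entropy (nats) of a membership row; higher means more ambiguous."""
    return -sum(p * math.log(p) for p in memberships if p > 0)


def is_boundary(memberships: list[float], max_gap: float = 0.1) -> bool:
    # max_gap is an illustrative cutoff, not the value used by the project.
    return top2_gap(memberships) < max_gap
```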

Why UMAP Before FCM?

  • FCM in 1024 dimensions suffers from the curse of dimensionality (distances concentrate).
  • UMAP reduces to 50 dimensions (cosine) before FCM; 2D UMAP is used only for visualization.
  • Runtime uses 1024-dim centroids (mean of member embeddings) so no UMAP at query time — just 15 dot products.
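
Runtime assignment then reduces to comparing the query vector with each stored centroid. The sketch below normalizes both sides so the dot product is a cosine similarity; whether ClusterAssigner normalizes or uses raw dot products is an assumption, as is the function name.

```python
import math


def _normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def assign_cluster(embedding: list[float], centroids: list[list[float]]) -> int:
    """Index of the centroid with the highest cosine similarity to the query."""
    q = _normalize(embedding)
    scores = [sum(a * b for a, b in zip(q, _normalize(c))) for c in centroids]
    return max(range(len(scores)), key=scores.__getitem__)
```

With 15 centroids of 1024 dimensions this is about 15k multiply-adds per query, which is negligible next to the embedding API call.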

Why a Pure-Python Semantic Cache?

  • ~500 entries × 1024 floats ≈ 2 MB — no need for Redis.
  • Serializing/deserializing vectors for every lookup would add latency.
  • Cluster-bucketed storage — lookup only in bucket[cluster] and bucket[cluster±1], ~100 comparisons instead of 500.
  • LRU+LFU hybrid eviction: score = timestamp - hit_count × 3600, so popular queries stay longer.
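
One way to read the eviction score is sketched below. It rests on an interpretation that the README does not spell out: "timestamp" is taken to mean the entry's age in seconds since last access, each hit discounts one hour of that age, and the entry with the highest score is evicted. The `Entry` class and function names are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Entry:
    query: str
    last_access: float = field(default_factory=time.time)
    hit_count: int = 0


def eviction_score(entry: Entry, now: float) -> float:
    """Age in seconds, discounted by one hour (3600 s) per cache hit."""
    return (now - entry.last_access) - entry.hit_count * 3600


def pick_victim(entries: list[Entry], now: float) -> Entry:
    """Entry to evict when the cache is full: the one with the highest score."""
    return max(entries, key=lambda e: eviction_score(e, now))
```

Under this reading, an entry with five hits survives as long as an unused entry that is five hours younger, which is the "popular queries stay longer" behaviour.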

Cache Threshold 0.88

| Threshold | Effect |
| --- | --- |
| 0.80 | Higher hit rate, more risk of off-topic reuse |
| 0.88 | Balanced: paraphrases hit, sub-topic changes miss |
| 0.95 | Strict: near-identical queries only |

Explore live: GET /api/query/threshold-explore?query=...
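
The comparison behind the table is simple enough to show directly. This sketch covers only the threshold logic; the function name and return shape are assumptions, and the real endpoint's response format is not documented here.

```python
def explore_thresholds(best_similarity: float,
                       thresholds: tuple[float, ...] = (0.80, 0.88, 0.95)) -> dict[float, bool]:
    """Map each threshold to whether the best cached match would count as a hit."""
    return {t: best_similarity >= t for t in thresholds}
```

For a best cached similarity of 0.90, the two looser thresholds would register a hit and 0.95 would not.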


🐳 Docker Deployment

Docker is for deployment only. One-time scripts (setup_db, ingest, cluster) are intended to run locally with your env and keys.

After running those locally:

docker-compose -f docker/docker-compose.yml build
docker-compose -f docker/docker-compose.yml up -d
curl http://localhost:8000/api/health
  • No Redis — cache is in-process.
  • No DB in compose — Supabase is external.
  • Volume: backend/data/ is mounted so embeddings and cluster artifacts persist.

👀 For Reviewers

| If you want to… | Look at |
| --- | --- |
| See the full request/response flow | backend/app/api/routes.py, especially POST /query |
| Understand the cache | backend/app/core/cache.py: lookup, store, get_stats, flush |
| See how embeddings are used | backend/app/services/embedder.py (query vs document input_type) |
| See how clustering is used at runtime | backend/app/core/clustering.py: ClusterAssigner, centroid loading in main.py |
| Trace ingest pipeline | backend/scripts/ingest.py: load → clean → embed → store → checkpoint |
| Trace cluster pipeline | backend/scripts/cluster.py: UMAP → FCM → memberships → Supabase |
| Check frontend API usage | frontend/lib/api.ts: all endpoints; pages use these only (no hardcoded data) |

Run order for a clean run:
setup_db → (create RPC in Supabase) → ingest → cluster → start API → start frontend.


License

MIT
