A Retrieval-Augmented Generation (RAG) chatbot backend for Sahrdaya College of Engineering & Technology (SCET), Kodakara, Thrissur, Kerala. It answers questions about faculty, departments, admissions, placements, clubs, infrastructure, and more — all grounded in data scraped from the college website. The runtime now supports SQL + Hybrid RAG, plus an additive GraphRAG content-graph route for structure/link relationship queries.
📚 Want to learn how it works? Check WORKING.md for a complete technical breakdown of the pipeline: scraping, preprocessing, hybrid retrieval (BM25 + Vector), additive GraphRAG routing, SQL routing, index caching, and answer generation.
🚀 What's planned next? See FUTURE_ADDITIONS.md for the roadmap: streaming responses, Docker, web frontend, and more.
🤝 Contribution status: See CONTRIBUTING.md. For now, we are not accepting PRs; please open Issues.
- Backend API: FastAPI, Uvicorn, Pydantic
- LLM Orchestration: LangChain, LangChain Core, LangChain Groq
- LLM Provider/Model: Groq (`openai/gpt-oss-120b`)
- Retrieval: Hybrid BM25 + FAISS Vector Search + Cross-Encoder Reranking + GraphRAG (content graph)
- Graph Layer: NetworkX + node-link JSON cache (`.index_cache/content_graph.json`)
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2`
- Reranker: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Data Processing: Playwright, BeautifulSoup, NLTK, custom preprocessing pipeline
- Database: SQLite (faculty, former people, students, interests)
- Frontend: Next.js (App Router), React, TypeScript, Tailwind CSS
- Deployment: Docker, Docker Compose, Nginx (load balancing + edge rate limiting)
- Observability: JSONL logging, health/readiness/load/limits endpoints, CLI session analytics
    git clone <repo-url>
    cd ragx-backend
    pip install -r requirements.txt

Copy the environment template and add your Groq key(s):

    copy .env.example .env

Then start the API server:

    python api_main.py

Or start the load-balanced Docker stack:

    docker compose -f docker-compose.nginx.yml up --build -d

flowchart TD
subgraph DEPLOY["🚢 Deployment Layer"]
FE["🌐 Frontend Client"] --> LB["⚖️ Nginx LB :8080"]
CFG["docker-compose.nginx.yml + .env"] --> API1["🐳 rag-api-1"]
CFG --> API2["🐳 rag-api-2"]
CFG --> API3["🐳 rag-api-3"]
API1 --> LB
API2 --> LB
API3 --> LB
HC["Healthchecks gate traffic"] --> LB
LB --> CHAT["POST /api/chat\nPOST /api/chat/stream"]
LB --> OPS["GET /api/health /ready /load /limits"]
ERR["502/429/503"] --> RETRY["Client retry + backoff"]
end
subgraph API["🧩 FastAPI Runtime Layer"]
CHAT --> SS["In-memory Session Store"]
CHAT --> RL["Local Rate + Load Guardrails"]
CHAT --> KP["Busy-key Failover Pool"]
CHAT --> LOG["JSONL Chat Logger"]
SS --> R
RL --> R
KP --> R
LOG --> EV["logs/events.jsonl"]
LOG --> IPL["logs/<client_ip>.jsonl"]
LOG --> ROT["Rotating files\n5MB x 7 backups"]
OPS --> OBS["Metrics + status endpoints"]
OBS --> ERR
end
subgraph INGESTION["🕷️ Data Ingestion Layer"]
A["🌐 sahrdaya.ac.in"] -->|"sitemap<br/>discovery"| B["🗺️ URL Queue"]
B -->|"4 threads"| C["🎭 Playwright<br/>Renderer"]
C -->|"open modals<br/>click Stats/Download"| C2["📎 Popup Link<br/>Capture"]
C2 -->|"inject discovered<br/>PDF URLs"| D["🧹 BS4<br/>Cleaner"]
C -->|"JS execution<br/>+ DOM parse"| D
D --> E["📄 data/raw/sahrdaya_rag.txt<br/>~2K chunks"]
end
subgraph ETL["⚙️ ETL Pipeline"]
E -->|"NLTK sentence<br/>tokenizer"| F["✂️ Sentence<br/>Splitter"]
F -->|"regex +<br/>pattern match"| G["🏷️ Category<br/>Tagger"]
G -->|"alias<br/>injection"| H["📦 data/processed/data_cleaned.jsonl<br/>18 categories"]
STTXT["📝 data/raw/student_inputs/*.txt"] --> STTXTLOAD["🧩 TXT Block Loader<br/>preprocess_data.py"]
STTXTLOAD -->|"blank-line split<br/>+ chunk merge"| H
E -->|"role-label<br/>parsing"| FP["👥 Former People<br/>Structurer"]
FP -->|"10 role chunks<br/>+ summary"| H
E -->|"legacy + listing<br/>parsing"| I["👤 Faculty<br/>Extractor"]
E -->|"former people<br/>parsing"| IP["👥 Former People<br/>Parser"]
STCSV["🧾 data/students.csv"] --> STING["👩🎓 student_db.py<br/>Normalizer + Loader"]
I -->|"~109 profiles<br/>16 columns"| K["🗃️ SQLite DB<br/>data/sql/college.db"]
IP -->|"52 records<br/>10 roles"| K
STING -->|"students + interests + links<br/>projects + socials"| K
end
subgraph INDEX["📊 Multi-Index Layer"]
H -->|"TF-IDF<br/>tokenization"| L["🔍 BM25<br/>Retriever"]
H -->|"all-MiniLM-L6-v2<br/>384-dim"| M["🧠 FAISS<br/>Vector Store"]
H -->|"chunk/category nodes"| G1["🕸️ Content Graph<br/>NetworkX"]
TJSON["🗂️ data/raw/sahrdaya_tracking.json"] -->|"page/url/chunk links"| G1
L -.->|"pickle"| N[("💾 .index_cache/")]
M -.->|"save_local()"| N
G1 -.->|"node-link JSON"| N
N -.->|"data hash"| O{{"♻️ Cache<br/>Hit?"}}
O -->|"yes"| P["⚡ Fast Load<br/>~0.1s"]
O -->|"no"| Q["🔨 Rebuild<br/>indexes + graph"]
end
subgraph RUNTIME["❓ Query Runtime"]
R(["🧑 User<br/>Query"]) --> QC["✍️ LLM Query<br/>Typo Corrector"]
QC --> QM["🧭 Canonical Query Mapper<br/>deterministic + LLM"]
QM --> QX["🔎 Query Expander<br/>dept + intent terms"]
QX --> S["📝 Chat<br/>History"]
QX --> SF["👤 Student Name<br/>Fast Path"]
SF -->|"match"| V
SF -->|"no match"| T
S --> T{"🧠 LLM<br/>Classifier"}
subgraph SQL["📊 SQL Path"]
T -->|"bulk / aggregate /<br/>faculty + former + students"| U["🔧 Schema-Aware<br/>SQL Generator"]
K --> U
U -->|"SELECT only<br/>safety filter"| V["⚡ SQLite<br/>Executor"]
V --> W["📋 Formatted SQL<br/>Result"]
V -->|"error"| X["🔄 Fallback<br/>to RAG"]
end
subgraph RAG["📚 Retrieval Path (GraphRAG + Hybrid)"]
T -->|"single-person /<br/>general"| GH{"🧭 Graph Intent<br/>Heuristic?"}
X --> GH
GH -->|"yes"| GR["🕸️ GraphRAG<br/>Retriever"]
G1 --> GR
GR --> GE{"strong<br/>graph evidence?"}
GE -->|"yes"| AD["💬 Groq LLM<br/>Generation"]
GH -->|"no"| Y["🔎 Hybrid Retriever"]
GE -->|"no"| Y
Y --> Z["⚡ Ensemble<br/>Retriever"]
L -->|"weight: 0.6"| Z
M -->|"weight: 0.4"| Z
Z -->|"RRF<br/>fusion"| AA["📊 Top-25<br/>Candidates"]
AA --> AB["🤖 Cross-Encoder<br/>ms-marco-MiniLM"]
AB -->|"rerank"| AC["🎯 Top-10<br/>Relevant"]
AC --> AD
end
end
subgraph OUTPUT["✅ Output Layer"]
W --> AE(["📨 Response"])
AD --> AE
AE --> AF["📊 Token<br/>Counter"]
AE --> AG["🔢 Chunk<br/>Tracker"]
AF & AG --> AH["📈 Session<br/>Analytics"]
AH --> AI["📉 /graph<br/>Dashboard"]
AH --> AJ["📋 /stats<br/>Summary"]
end
style INGESTION fill:none,stroke:#4a9eff,stroke-width:2px
style ETL fill:none,stroke:#22c55e,stroke-width:2px
style INDEX fill:none,stroke:#a855f7,stroke-width:2px
style RUNTIME fill:none,stroke:#f97316,stroke-width:2px
style API fill:none,stroke:#ff7f50,stroke-width:2px
style DEPLOY fill:none,stroke:#7b61ff,stroke-width:2px
style SQL fill:none,stroke:#ef4444,stroke-width:1px,stroke-dasharray: 5 5
style RAG fill:none,stroke:#06b6d4,stroke-width:1px,stroke-dasharray: 5 5
style G1 fill:none,stroke:#0ea5e9,stroke-width:1px,stroke-dasharray: 4 4
style GR fill:none,stroke:#0ea5e9,stroke-width:1px,stroke-dasharray: 4 4
style OUTPUT fill:none,stroke:#eab308,stroke-width:2px
flowchart TD
U[User Query] --> N[Normalize + Canonicalize]
N --> S{Single-student fast match?}
S -->|yes| SQ[Direct Student SQL]
S -->|no| B{Bulk SQL intent?}
B -->|yes| SG[Generate SQL]
SG --> SX{SQL success with rows?}
SX -->|yes| A[Answer mode: sql]
SX -->|no| G
B -->|no| G{Graph intent heuristic?}
G -->|yes| GR[GraphRAG content graph retrieval]
GR --> GE{Strong graph evidence?}
GE -->|yes| L[Groq generation]
GE -->|no| H
G -->|no| H[Hybrid BM25 + FAISS + reranker]
H --> L
SQ --> A
L --> AR[Answer mode: graph_rag or rag]
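
The routing order in the diagram above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in rag_setup.py: the function name `route_query` and its five boolean inputs are hypothetical stand-ins for the student fast match, the SQL classifier, the SQL executor result, the graph intent heuristic, and the graph evidence check. It also assumes that weak graph evidence yields the `rag` mode.

```python
def route_query(
    student_match: bool,
    bulk_sql_intent: bool,
    sql_returned_rows: bool,
    graph_intent: bool,
    strong_graph_evidence: bool,
) -> str:
    """Return the answer mode ('sql', 'graph_rag', or 'rag') for one query."""
    # 1. A single-student fast match goes straight to SQL.
    if student_match:
        return "sql"
    # 2. Bulk/aggregate intent tries generated SQL; success with rows ends here.
    if bulk_sql_intent and sql_returned_rows:
        return "sql"
    # 3. Otherwise (including SQL failure) use the content graph when the
    #    heuristic fires and the graph evidence is strong (assumed behaviour).
    if graph_intent and strong_graph_evidence:
        return "graph_rag"
    # 4. Everything else falls back to hybrid BM25 + FAISS + reranker.
    return "rag"
```

Because SQL failure falls through to the later checks instead of returning an error, a query that looks like a bulk lookup can still be answered from retrieved chunks.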
| File | Role |
|---|---|
| `scraper.py` | Multi-threaded web scraper (Playwright + Sitemap, popup-aware PDF link capture, thread-safe, 4 output formats) |
| `data/raw/sahrdaya_rag.txt` | Raw scraped chunks (TSV: `chunk_id\tcontent`) |
| `preprocess_data.py` | Cleans, categorises (18 categories), sentence-splits, injects search aliases, and structures former people data |
| `data/processed/data_cleaned.jsonl` | Optimised chunks ready for indexing |
| `sql_db_setup.py` | Orchestrates shared SQL DB build and runs faculty/former people/student loaders into one SQLite file |
| `faculty_extractor.py` | Parses faculty data from raw chunks (supports both legacy profile blocks and current listing-style chunks) |
| `former_people_extractor.py` | Parses and inserts former office-bearers into `former_people` |
| `student_db.py` | Loads `data/students.csv`, normalizes interests, and populates `students`, `interests`, `student_interests` |
| `data/students.csv` | Student source data (bio/biography, interests, social links, projects links; photo is optional) |
| `data/sql/college.db` | Shared SQLite database for faculty, former people, students, and canonical interests |
| `sql_smoke_test.py` | Quick DB validation (schema + row sanity checks after ingestion/parser changes) |
| `rag_setup.py` | Builds FAISS + BM25 indexes, builds/caches content graph, canonicalizes queries, routes SQL vs GraphRAG vs hybrid RAG, includes single-student fast lookup, and formats SQL output |
| `main.py` | Interactive CLI chatbot with stats, ASCII dashboard, and session analytics |
| `api/` | FastAPI app split into core, routes, and services layers |
| `api/services/chat_logger.py` | JSON Lines chat logging (success + error), per-IP files, rotating handler with retention |
| `api_main.py` | API entrypoint (Uvicorn) |
| `.env` / `.env.example` | Runtime settings (keys, limits, CORS, concurrency) |
| `Dockerfile` | Container image for API service |
| `docker-compose.yml` | Single-container deployment |
| `docker-compose.nginx.yml` | 3 API containers + Nginx load balancing |
| `deploy/nginx-docker.conf` | Nginx upstream/load-balancer config for Docker |
| `logs/` | Runtime JSONL logs: `events.jsonl` (all events) + `<client_ip>.jsonl` (per-IP) |
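
The JSON Lines logging with rotation described for the chat logger can be approximated with the standard library. The sketch below is an assumption about the general shape, not the contents of api/services/chat_logger.py: `make_jsonl_logger` and `log_event` are hypothetical names, and the 5 MB size with 7 backups is taken from the architecture diagram.

```python
import json
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_jsonl_logger(path: Path, name: str = "chat_events") -> logging.Logger:
    """Create a logger that appends one JSON object per line to `path`."""
    path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        path,
        maxBytes=5 * 1024 * 1024,  # rotate at roughly 5 MB
        backupCount=7,             # keep 7 rotated backups
        encoding="utf-8",
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger
    return logger


def log_event(logger: logging.Logger, **fields) -> None:
    """Serialize the given fields as a single JSON line."""
    logger.info(json.dumps(fields, ensure_ascii=False))
```

One record per line keeps the files easy to tail and to parse incrementally; a per-IP file would use a second logger created the same way with a different path and name.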
- Python 3.10+ (tested on 3.14)
- A Groq API key set via `.env` (`GROQ_API_KEY` or `GROQ_API_KEYS`)
- ~500 MB disk space for embeddings model download on first run
- Playwright browsers (only needed for scraping): `playwright install`
    git clone <repo-url>
    cd ragx-backend

    python -m venv venv
    # Windows
    venv\Scripts\activate
    # Linux / macOS
    source venv/bin/activate

    pip install -r requirements.txt

Key packages: langchain, langchain-community, langchain-classic, langchain-groq, langchain-huggingface, faiss-cpu, rank-bm25, nltk, groq, beautifulsoup4, playwright, networkx.

    import nltk
    nltk.download("punkt_tab")

    # Full site crawl
    python scraper.py https://www.sahrdaya.ac.in/ -o sahrdaya --threads 8 --use-playwright
    # Single page append (example: placement page with modal PDF links)
    python scraper.py https://www.sahrdaya.ac.in/traning-and-placement -o sahrdaya --single --use-playwright

This produces data/raw/sahrdaya_rag.txt, which is the default input for preprocessing and DB setup.
For modal-driven pages (Stats -> Download/Open External), the scraper also captures discovered PDF URLs and stores them as Document Links in the raw output.
To add your own raw student text into RAG, place one or more .txt files in data/raw/student_inputs/.
When you run preprocessing, these files are automatically split into chunks and merged into data/processed/data_cleaned.jsonl.
    python preprocess_data.py

Reads data/raw/sahrdaya_rag.txt, cleans text, detects categories, splits into sentence-aware chunks, injects search aliases, structures former people data into per-role chunks, and writes data/processed/data_cleaned.jsonl.
Sample output:
[1/4] Loaded 785 raw chunks from data/raw/sahrdaya_rag.txt
[2/4] Cleaned text — kept 784 chunks, skipped 1 near-empty
[3/4] Categorised & re-chunked — 2198 final chunks (466 large chunks were split)
[4/4] Wrote 2198 chunks to data/processed/data_cleaned.jsonl
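
The re-chunking step (466 large chunks split in the sample above) follows a sentence-aware rule with a 700-character target and a 910-character split threshold. The sketch below illustrates that idea only. It is not the logic in preprocess_data.py: `split_chunk` is a hypothetical name, and a simple regex stands in for the NLTK sentence tokenizer so the example runs without downloading tokenizer data.

```python
import re

TARGET_CHARS = 700     # preferred chunk size
SPLIT_THRESHOLD = 910  # chunks at or below this length are left whole


def split_chunk(text: str) -> list[str]:
    """Split an oversized chunk on sentence boundaries near TARGET_CHARS."""
    text = text.strip()
    if len(text) <= SPLIT_THRESHOLD:
        return [text] if text else []
    # Naive sentence boundary: whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > TARGET_CHARS:
            # Adding this sentence would overshoot the target: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Keeping a gap between the target and the threshold avoids splitting chunks that are only slightly over the target into one full chunk and one tiny fragment.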
    python main.py

On first run, FAISS and BM25 indexes are built from data/processed/data_cleaned.jsonl (~50s). Subsequent runs load from .index_cache/ in ~0.1s. The cache auto-invalidates when the data file changes (MD5 hash check).
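
The MD5-based cache check can be pictured as comparing a stored digest against the current data file. The helper names below (`file_md5`, `cache_is_fresh`, `record_hash`) are hypothetical and the real implementation in rag_setup.py may differ; only the use of an MD5 hash file is taken from this README.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MB blocks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def cache_is_fresh(data_file: Path, hash_file: Path) -> bool:
    """True when the stored hash matches the current data file."""
    if not hash_file.exists():
        return False
    return hash_file.read_text(encoding="utf-8").strip() == file_md5(data_file)


def record_hash(data_file: Path, hash_file: Path) -> None:
    """Store the current data hash after a successful index rebuild."""
    hash_file.parent.mkdir(parents=True, exist_ok=True)
    hash_file.write_text(file_md5(data_file), encoding="utf-8")
```

A startup routine would call `cache_is_fresh` first, load the cached indexes on a match, and otherwise rebuild and then call `record_hash`.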
If data/sql/college.db doesn't exist, it's auto-built from data/raw/sahrdaya_rag.txt on startup.
Student data from data/students.csv is also loaded at startup into students, interests, and student_interests in the same DB.
For a quick integrity check after parser or ingestion updates, run python sql_smoke_test.py.
Student profile SQL output includes: name, graduation year, department, bio, Instagram, GitHub, projects links, LinkedIn, and website (photo is optional when available).
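
Generated SQL passes a SELECT-only safety filter before it reaches SQLite. One conservative way to write such a filter is sketched below. This is an assumption about the approach, not the project's actual check: `is_safe_select` is a hypothetical name and the keyword list is illustrative.

```python
import re

# Keywords that should never appear in a read-only query (illustrative list).
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|detach|pragma|vacuum)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Accept a single SELECT statement and reject everything else."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input or more than one statement
    if not re.match(r"select\b", statement, re.IGNORECASE):
        return False  # must start with SELECT
    return _FORBIDDEN.search(statement) is None
```

The filter errs on the side of rejection: a harmless query whose string literal contains a word such as "update" is refused too, which is acceptable when a rejected query simply falls back to RAG.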
Copy environment template and add your keys:
    copy .env.example .env

Then start the API server:

    python api_main.py

API endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/chat` | POST | Chat request-response API |
| `/api/chat/stream` | POST | SSE streaming events |
| `/api/sessions` | POST | Create new chat session |
| `/api/sessions/{session_id}/history` | GET | Get session chat history |
| `/api/sessions/{session_id}` | DELETE | Delete session |
| `/api/health` | GET | Liveness check |
| `/api/ready` | GET | Readiness check |
| `/api/load` | GET | Current in-flight load |
| `/api/limits` | GET | Local quota usage + key health |
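
The architecture diagram expects clients to retry with backoff when they receive 502, 429, or 503. A minimal client-side sketch follows. The function name `call_with_backoff`, the attempt count, and the delays are assumptions chosen for illustration, and the HTTP call itself is passed in as a callable so the retry logic stays independent of any particular HTTP library.

```python
import time
from typing import Callable

RETRYABLE_STATUSES = {429, 502, 503}


def call_with_backoff(
    send: Callable[[], tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUSES:
            break
        # Exponential backoff: base_delay, 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

Here `send` would wrap a POST to `/api/chat` and return the status code and response body; the last response is returned even if it is still an error, so the caller decides how to surface it.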
The API includes:
- in-memory session isolation
- busy-key failover (switch to next key if one key is rate-limited)
- local quota guardrails for RPM/TPM/RPD/TPD
- full-answer policy (no API-side output truncation; continuation attempts if the model stops due to length)
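
Busy-key failover can be pictured as a small pool that rotates through keys and skips any key that was recently rate-limited. The class below is a hedged sketch, not the code in api/services/key_pool.py: the `KeyPool` name, its methods, and the 60-second cooldown are assumptions.

```python
import time
from typing import Callable, Optional


class KeyPool:
    """Round-robin over API keys, skipping keys that are cooling down."""

    def __init__(
        self,
        keys: list[str],
        cooldown_seconds: float = 60.0,  # hypothetical default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not keys:
            raise ValueError("at least one key is required")
        self._keys = list(keys)
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._busy_until: dict[str, float] = {}
        self._next = 0

    def acquire(self) -> Optional[str]:
        """Return the next available key, or None if all are cooling down."""
        now = self._clock()
        for offset in range(len(self._keys)):
            index = (self._next + offset) % len(self._keys)
            key = self._keys[index]
            if self._busy_until.get(key, 0.0) <= now:
                self._next = (index + 1) % len(self._keys)
                return key
        return None

    def mark_busy(self, key: str) -> None:
        """Put a rate-limited key on cooldown."""
        self._busy_until[key] = self._clock() + self._cooldown
```

A request handler would call `acquire()`, try the upstream call, and call `mark_busy(key)` before retrying with the next key when the provider reports a rate limit.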
    docker compose up --build -d

Access API at:

    http://127.0.0.1:8000

Useful commands:

    docker compose logs -f rag-api
    docker compose down

Or start the load-balanced Docker stack (3 API containers + Nginx):

    docker compose -f docker-compose.nginx.yml up --build -d

Access API through Nginx at:

    http://127.0.0.1:8080

Useful commands:

    docker compose -f docker-compose.nginx.yml logs -f
    docker compose -f docker-compose.nginx.yml down

Docker files added:

- `Dockerfile`
- `.dockerignore`
- `docker-compose.yml`
- `docker-compose.nginx.yml`
- `deploy/nginx-docker.conf`

This repo already includes an end-to-end deployment path from source code to a health-gated, load-balanced runtime.
flowchart LR
A[Code changes] --> B[Build API image from Dockerfile]
B --> C[Start rag-api-1, rag-api-2, rag-api-3]
C --> D[Health checks on /api/health]
D --> E{All healthy?}
E -->|Yes| F[Start Nginx reverse proxy]
E -->|No| G[Keep service out of traffic]
F --> H[Serve traffic on :8080]
- Build
  - Command: `docker compose -f docker-compose.nginx.yml build`
  - Source: `Dockerfile`
- Deploy
  - Command: `docker compose -f docker-compose.nginx.yml up -d`
  - Starts 3 API containers and 1 Nginx container
- Health gate
  - Each API replica must pass `GET /api/health`
  - Nginx waits for healthy replicas via `depends_on: condition: service_healthy`
- Traffic serving
  - Nginx listens on `:8080`
  - Requests are balanced across `rag-api-1`, `rag-api-2`, and `rag-api-3` using least-connections
- Verify
  - `curl http://127.0.0.1:8080/api/health`
  - `docker compose -f docker-compose.nginx.yml ps`
  - `docker compose -f docker-compose.nginx.yml logs -f`

When you push code updates:

    docker compose -f docker-compose.nginx.yml build
    docker compose -f docker-compose.nginx.yml up -d
    docker compose -f docker-compose.nginx.yml ps

This rebuilds images, recreates containers, and only sends traffic to healthy API instances.

If you connect this to GitHub Actions, use this same sequence as your deploy job on main branch pushes:

- Checkout code
- Build images (`docker compose ... build`)
- Deploy (`docker compose ... up -d`)
- Health probe (`/api/health`)
- Fail job if health probe fails
| Command | Description |
|---|---|
| `/help` | Show available commands |
| `/graph` | Session dashboard with ASCII charts (response times, token usage, chunk heatmap) |
| `/chunks` | Show chunks used in last retrieval |
| `/history` | Show conversation history |
| `/stats` | Re-show last query stats box |
| `/clear` | Clear conversation history |
| `/reset` | Reset session stats and history |
| `exit` | Quit the program |

ragx-backend/
├── scraper.py # Web scraper (multi-threaded, Playwright, popup-aware PDF link capture)
├── data/
│ ├── raw/
│ │ ├── sahrdaya_rag.txt
│ │ ├── sahrdaya_raw.txt
│ │ ├── sahrdaya_structured.json
│ │ └── sahrdaya_tracking.json
│ ├── processed/
│ │ └── data_cleaned.jsonl
│ ├── sql/
│ │ └── college.db
│ └── students.csv # Student profile source data
├── preprocess_data.py # Data preprocessing pipeline
├── sql_db_setup.py # Shared SQLite DB setup (faculty + former + students)
├── faculty_extractor.py # Faculty parser/loader (legacy + listing formats)
├── former_people_extractor.py # Former people parser/loader
├── student_db.py # Student CSV loader + interest normalization
├── sql_smoke_test.py # SQL ingestion sanity test
├── rag_setup.py # Retrieval engine (hybrid RAG + GraphRAG + SQL routing)
├── main.py # CLI chatbot with session analytics
├── api/
│ ├── app.py # FastAPI app bootstrap + middleware wiring
│ ├── core/
│ │ ├── models.py # Pydantic request/response schemas
│ │ └── settings.py # Environment-backed configuration
│ ├── routes/
│ │ └── chat.py # API endpoints (/api/chat, sessions, health)
│ └── services/
│ ├── key_pool.py # Busy-key failover state
│ ├── load_control.py # Concurrency and queue controls
│ ├── rate_limit_manager.py # Local RPM/TPM/RPD/TPD budget tracking
│ └── session_store.py # In-memory session memory with TTL
├── api_main.py # Uvicorn run entrypoint
├── .env.example # Environment template
├── Dockerfile # Docker image build
├── docker-compose.yml # Single API deployment
├── docker-compose.nginx.yml # Nginx + 3 API replicas
├── deploy/
│ ├── nginx.conf # Bare-metal/local nginx config
│ └── nginx-docker.conf # Docker nginx config
├── docs/
│ ├── CONTRIBUTING.md # Contribution status and policy
│ ├── FRONTEND_INTEGRATION.md # Frontend integration guide
│ ├── WORKING.md # Technical documentation — how the RAG works
│ └── FUTURE_ADDITIONS.md # Roadmap and planned improvements
├── tests/
│ └── test_api.py # API tests
├── requirements.txt # Python dependencies
├── .index_cache/ # Cached retrieval artifacts (auto-generated)
│ ├── faiss/ # FAISS vector index
│ ├── bm25.pkl # BM25 retriever (k=8)
│ ├── bm25_large.pkl # BM25 retriever (k=50)
│ ├── content_graph.json # GraphRAG content graph cache (node-link JSON)
│ ├── content_graph_hash.txt # Hash for graph cache invalidation
│ └── data_hash.txt # MD5 hash for FAISS/BM25 cache invalidation
└── README.md # Setup and usage guide
| Setting | Default | File | Notes |
|---|---|---|---|
| LLM | Groq `openai/gpt-oss-120b` | `rag_setup.py` | Requires Groq API key |
| Embeddings | `all-MiniLM-L6-v2` (384-dim) | `rag_setup.py` | Runs locally, no API key |
| Chunk size | 700 chars target, 910 split threshold | `preprocess_data.py` | Sentence-aware splitting |
| Former people | 10 role-based chunks + 1 summary | `preprocess_data.py` | Structured per-role parsing for accurate retrieval |
| BM25:Vector weights | 0.6:0.4 | `rag_setup.py` | BM25 weighted higher for keyword queries |
| GraphRAG cache | `.index_cache/content_graph.json` | `rag_setup.py` | Built from processed chunks + tracking graph for structure/link queries |
| Max context | 22,000 chars (~6K tokens) | `rag_setup.py` | Truncates retrieved chunks to fit |
| SQL history limit | 1,500 chars | `rag_setup.py` | Caps history sent to SQL classifier |
| Retrieval modes | `sql` / `rag` / `graph_rag` | `api/routes/chat.py` | Returned in chat metadata for route verification |
| API host/port | `0.0.0.0:8000` | `.env` | Controlled by `API_HOST`, `API_PORT` |
| API concurrency | `4` | `.env` | `MAX_CONCURRENT_REQUESTS` for load control |
| Queue wait timeout | `20s` | `.env` | `QUEUE_WAIT_SECONDS` before busy response |
| Key failover | Enabled | `.env` + `api/services/key_pool.py` | Uses `GROQ_API_KEYS` pool with cooldown |
| Local quota guardrails | RPM/TPM/RPD/TPD | `.env` + `api/services/rate_limit_manager.py` | Protects service before upstream limits |

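
To show what the 0.6:0.4 BM25:Vector weighting means in practice, here is a small weighted Reciprocal Rank Fusion (RRF) sketch. It is a standalone illustration, not the ensemble code used in rag_setup.py: `weighted_rrf` is a hypothetical name, the chunk IDs are made up, and the smoothing constant `k = 60` is a commonly used value assumed here.

```python
from collections import defaultdict


def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[str]:
    """Fuse ranked ID lists; each hit adds weight / (k + rank) to its score."""
    scores: dict[str, float] = defaultdict(float)
    for name, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weights[name] / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fused = weighted_rrf(
    rankings={"bm25": ["c1", "c2", "c3"], "vector": ["c3", "c1", "c4"]},
    weights={"bm25": 0.6, "vector": 0.4},
)
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunk `c1` wins because it ranks high in both lists, and `c3` edges out `c2` because a top vector hit plus a lower BM25 hit outweighs a single second-place BM25 hit. The fused list would then be cut to the top candidates and passed to the cross-encoder reranker.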
This repository is All Rights Reserved.
- You may contribute to this repository through pull requests and approved collaboration workflows.
- You may not copy, reuse, redistribute, relicense, or sell this code outside this repository without prior written permission from the copyright holder.
See LICENSE for full legal terms.