ghostjat/chatbot

🔭 PHAROS — Autonomous Career Counseling Expert System

"Pharos" — the ancient lighthouse of Alexandria. Here, a lighthouse for Indian students navigating the sea of career choices.

A fully self-contained, zero-external-API career counseling chatbot for Indian Class 9–12 students. Built on CodeIgniter 4 + Rubix ML, Pharos is a dynamic Expert System that reasons with psychometric data, machine-learned intent classification, and a database of empathetic response templates (about 15 in the seed data, designed to scale to thousands). It uses no OpenAI, no Anthropic, and no other third-party AI services.


📐 System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         BROWSER (SPA)                               │
│   HTML / Vanilla JS — Claude-style chat UI (dark navy + saffron)   │
└─────────────────────────┬───────────────────────────────────────────┘
                          │ HTTPS JSON
┌─────────────────────────▼───────────────────────────────────────────┐
│                   CODEIGNITER 4 (PHP 8.2+)                          │
│                                                                      │
│  ChatController  ──▶  AutonomousLearningService  (Rubix ML NB)     │
│        │                        │                                    │
│        │               [Confidence Gate ≥ 0.45]                     │
│        │                        │                                    │
│        ▼                        ▼ PASS              FAIL ▼          │
│  DialogueStateManager    GuidanceEngineService   TrainingQueueModel │
│  (CI4 Session FSM)        (MCDA Scoring Engine)   (human review)   │
│        │                        │                                    │
│        │                 ResponseTemplateModel                       │
│        │                 (knowledge base)                            │
│        ▼                        ▼                                    │
│                    StudentProfileModel                               │
│               (RIASEC + academics + session state)                  │
│                                                                      │
│  AIController  ──▶  /api/admin/ai/* (train, heal, queue mgmt)      │
│  QuizController ──▶ /api/quiz/* (RIASEC 30-question assessment)    │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                          ┌────────▼─────────┐
                          │   MySQL / MariaDB │
                          │  5 core tables    │
                          └──────────────────┘

🚀 Quick Start

Prerequisites

| Dependency | Version | Notes |
| --- | --- | --- |
| PHP | ≥ 8.2 | ext-intl, ext-mbstring, ext-json required |
| Composer | ≥ 2.5 | Package manager |
| MySQL / MariaDB | ≥ 8.0 / 10.6 | Main database |
| PHP-CLI | same as above | For Spark commands |

1. Clone & Install

git clone https://github.com/your-org/pharos.git
cd pharos
composer install

2. Environment Setup

cp .env.example .env
# Edit .env — set database credentials and encryption key
php spark key:generate

Minimum .env settings:

CI_ENVIRONMENT = development
database.default.database = pharos_db
database.default.username = pharos_user
database.default.password = your_password

3. Database Setup

# Create the database first
mysql -u root -p -e "CREATE DATABASE pharos_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"

# Run migrations (creates 5 tables + session table)
php spark migrate

# Seed the knowledge base (~15 templates across 8 intents)
php spark db:seed PharosTemplateSeeder

4. Train the ML Model

# Trains NaiveBayes on ~200 seed samples, saves to writable/models/
php spark pharos:train

Expected output:

╔══════════════════════════════════════════════════════╗
║          PHAROS — Intent Classifier Training          ║
╚══════════════════════════════════════════════════════╝
▶ Building pipeline: TextNormalizer → StopWordFilter → WordCountVectorizer(NGram 1-2) → TfIdfTransformer → NaiveBayes
▶ Loading seed corpus (~200+ samples, 11 intent classes)...
╔══════════════════════════════════════════════════════╗
║              ✓  Training Complete                     ║
╚══════════════════════════════════════════════════════╝
  Samples trained:     212
  Intent classes:      11
  Training time:       1.84s
  Model saved to:      /var/www/pharos/writable/models/pharos_intent_classifier.rbx

5. Launch

php spark serve
# Open: http://localhost:8080

📁 Project Structure

pharos/
├── app/
│   ├── Commands/
│   │   ├── PharosTrain.php          # php spark pharos:train
│   │   └── PharosHeal.php           # php spark pharos:heal
│   ├── Config/
│   │   ├── Filters.php              # Registers AdminApiFilter alias
│   │   └── Routes.php               # All API routes + SPA catch-all
│   ├── Controllers/
│   │   ├── HomeController.php       # Serves SPA shell
│   │   ├── ChatController.php       # Core conversation pipeline
│   │   ├── QuizController.php       # RIASEC 30-question assessment
│   │   └── AIController.php         # ML admin API (train/heal/queue)
│   ├── Entities/
│   │   └── StudentProfile.php       # CI4 Entity with computed RIASEC/academic methods
│   ├── Filters/
│   │   └── AdminApiFilter.php       # Bearer token auth for /api/admin/*
│   ├── Models/
│   │   ├── StudentProfileModel.php  # RIASEC + academic score persistence
│   │   ├── ResponseTemplateModel.php# Knowledge base + MCDA query methods
│   │   ├── ChatMessageModel.php     # Conversation history
│   │   └── TrainingQueueModel.php   # Self-healing training pipeline
│   ├── Services/
│   │   ├── AutonomousLearningService.php  # Rubix ML NaiveBayes engine
│   │   ├── DialogueStateManager.php       # Session FSM + slot-filling
│   │   └── GuidanceEngineService.php      # MCDA scoring + template injection
│   └── Views/
│       └── pharos/app.php           # Full SPA (HTML/CSS/JS, no framework)
├── database/
│   └── migrations/
│       ├── 2024-01-01-000001_CreatePharosTables.php
│       ├── 2024-01-01-000002_CreateSessionTable.php
│       └── PharosTemplateSeeder.php
├── writable/
│   └── models/                      # Serialized .rbx model files (gitignored)
├── composer.json
├── .env.example
└── README.md

🧠 Module Deep-Dive

Module 1: Student Profile & Knowledge Base

StudentProfileModel + StudentProfile Entity

The profile is the context window of the Expert System. Every template-selection and scoring decision made after intent classification is conditioned on its values.

Key computed properties on StudentProfile:

| Method | Description |
| --- | --- |
| getRiasecCode() | Returns top-3 RIASEC dimensions by score (e.g., "IRS") |
| getDominantRiasec() | Returns single dominant dimension letter |
| getOverallAcademicScore() | Weighted avg: Math×0.30, Sci×0.30, Eng×0.25, Com×0.15 |
| hasStemLearningGap() | true if STEM-aspired AND (Math<60 OR Science<60) |
| getAcademicLabel() | "High Achiever" / "Average" / "Needs Support" etc. |
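
The computed properties above could be sketched as plain functions. This is an illustration only: the function names, array keys, and tie-breaking rule are assumptions, not the actual StudentProfile entity; only the weights and the 60-mark threshold come from the table.

```php
<?php
// Sketch only: standalone functions, not the real StudentProfile entity.

/** Weighted average: Math 0.30, Science 0.30, English 0.25, Commerce 0.15. */
function overallAcademicScore(array $s): float
{
    return round(
        $s['math'] * 0.30 + $s['science'] * 0.30
        + $s['english'] * 0.25 + $s['commerce'] * 0.15,
        2
    );
}

/** Top-3 RIASEC letters by score; ties keep R-I-A-S-E-C order (assumption). */
function riasecCode(array $riasec): string
{
    $order = ['R', 'I', 'A', 'S', 'E', 'C'];
    usort($order, fn($a, $b) => $riasec[$b] <=> $riasec[$a]); // stable in PHP 8+
    return implode('', array_slice($order, 0, 3));
}

/** STEM aspiration combined with Math or Science below 60. */
function hasStemLearningGap(bool $stemAspired, float $math, float $science): bool
{
    return $stemAspired && ($math < 60 || $science < 60);
}

$overall = overallAcademicScore(['math' => 54, 'science' => 62, 'english' => 70, 'commerce' => 60]);
$code    = riasecCode(['R' => 60, 'I' => 90, 'A' => 20, 'S' => 50, 'E' => 30, 'C' => 10]);
```

Because the four weights sum to 1.0, the overall score stays on the same 0–100 scale as the individual subject scores.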

ResponseTemplateModel — The Knowledge Base

Each row represents one response the system can give. Key columns:

| Column | Type | Example |
| --- | --- | --- |
| intent_id | VARCHAR | exam_stress |
| required_riasec | VARCHAR | I,IA,IR (comma-separated codes) |
| academic_condition | VARCHAR | math < 60 |
| grade_filter | VARCHAR | 11,12 |
| stream_filter | VARCHAR | PCM |
| eq_filter | ENUM | HIGH |
| priority_weight | INT (1-100) | 85 |
| template_text | TEXT | Dear {student_name}, your score in... |
| follow_up_prompts | JSON | ["Tell me more about JEE", "What about NEET?"] |
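
A rule stored in academic_condition, such as math < 60, has to be evaluated against the student's scores at runtime. The sketch below shows one way to do that without eval(). The function name, the accepted grammar (field, comparison operator, number), and the null return for an empty rule are all assumptions; the README does not describe how the real engine parses these strings.

```php
<?php
// Sketch only: evaluates rules such as "math < 60" without eval().
// Returns null for an empty rule, true or false otherwise.
function evaluateAcademicCondition(string $condition, array $scores): ?bool
{
    $condition = trim($condition);
    if ($condition === '') {
        return null; // no rule attached to this template
    }
    if (!preg_match('/^(\w+)\s*(<=|>=|<|>|==)\s*(\d+(?:\.\d+)?)$/', $condition, $m)) {
        throw new InvalidArgumentException("Unsupported condition: {$condition}");
    }
    [, $field, $op, $value] = $m;
    if (!array_key_exists($field, $scores)) {
        throw new InvalidArgumentException("Unknown score field: {$field}");
    }
    $actual = (float) $scores[$field];
    $value  = (float) $value;

    return match ($op) {
        '<'  => $actual <  $value,
        '<=' => $actual <= $value,
        '>'  => $actual >  $value,
        '>=' => $actual >= $value,
        '==' => $actual == $value,
    };
}
```

Rejecting anything outside the small grammar keeps a malformed template row from silently matching every student.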

Module 2: The ML Brain (AutonomousLearningService)

Pipeline Architecture

Raw User Input
     │
     ▼
TextNormalizer          → lowercase, unicode normalization
     │
     ▼
StopWordFilter          → removes 'the', 'is', 'mera', 'kya', etc.
     │
     ▼
WordCountVectorizer     → NGram(1,2) — builds unigram + bigram vocabulary
  (max 10,000 tokens)     e.g., ["jee", "main", "jee main", "eligibility"]
     │
     ▼
TfIdfTransformer        → term frequency × inverse document frequency
                           Reduces weight of high-frequency generic words
     │
     ▼
NaiveBayes(α=1.0)       → Laplace-smoothed multinomial NB classifier
     │
     ▼
proba()                 → Probability distribution over all intent classes
     │
     ├── max(proba) ≥ 0.45 ──▶ Intent classified → GuidanceEngineService
     │
     └── max(proba) < 0.45 ──▶ Saved to training_queue → out_of_scope response
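
To make the NGram(1,2) step concrete, here is a minimal sketch in plain PHP of stop-word removal followed by unigram and bigram extraction. It does not use Rubix ML; the function name and the tokenization rule (split on anything that is not a letter or digit) are assumptions chosen for illustration.

```php
<?php
// Sketch only: mimics stop-word filtering followed by unigram + bigram extraction.
function extractNgrams(string $text, array $stopWords): array
{
    // Lowercase, then split on any run of non-letter, non-digit characters.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array_values(array_diff($tokens, $stopWords));

    $ngrams = $tokens; // unigrams
    for ($i = 0; $i < count($tokens) - 1; $i++) {
        $ngrams[] = $tokens[$i] . ' ' . $tokens[$i + 1]; // bigrams
    }
    return $ngrams;
}

$ngrams = extractNgrams('What is the JEE Main eligibility?', ['what', 'is', 'the']);
// $ngrams: jee, main, eligibility, "jee main", "main eligibility"
```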

Intent Classes (11 total)

| Intent ID | Description | Example Queries |
| --- | --- | --- |
| confusion_pcm_vs_commerce | PCM vs Commerce choice | "Should I take science or commerce?" |
| query_jee_eligibility | JEE eligibility/dates/syllabus | "What is JEE Mains eligibility?" |
| query_neet_eligibility | NEET queries | "Can I give NEET after Class 12?" |
| exam_stress | Anxiety, burnout, pressure | "I can't handle the JEE pressure" |
| search_college | College search/ranking | "Best NIT for computer science?" |
| career_explore_riasec | Career exploration | "What career suits me?" |
| query_scholarship | Scholarship info | "What scholarships are available?" |
| study_tips_request | Study strategies | "How to study Physics effectively?" |
| doubt_resolution | Subject doubts | "I don't understand integration" |
| parent_conflict | Parent pressure | "My parents want me to be a doctor" |
| out_of_scope | Unrecognized/off-topic | Anything below 0.45 confidence |

Confidence Gating

// From AutonomousLearningService::classify()
// Requires: use Rubix\ML\Datasets\Unlabeled;
// proba() takes a Dataset and returns one [class => probability] map per sample.
$proba = $estimator->proba(new Unlabeled([$sample]));
arsort($proba[0]);                              // highest probability first
$topIntent     = array_key_first($proba[0]);
$topConfidence = reset($proba[0]);

if ($topConfidence < self::CONFIDENCE_GATE) {  // 0.45
    // Below the gate: save to training_queue for human review
    $queueModel->insert([
        'raw_query'       => $query,
        'session_context' => json_encode($context),
        'status'          => 'pending',
    ]);
    return ['intent' => 'out_of_scope', 'confidence' => $topConfidence];
}

Self-Healing Loop (partialFit)

THEORY: Naive Bayes maintains per-class word count matrices.
        partialFit() adds NEW counts ON TOP of existing counts.
        It does NOT reset the model — it shifts the distributions.

EXAMPLE:
  Before heal:
    P("jee main" | query_jee_eligibility) = 0.043
    P("jee main" | exam_stress)            = 0.008

  After healing with 50 new resolved "exam_stress" samples
  that contain "jee main":
    P("jee main" | exam_stress) shifts to 0.014
    P("jee main" | query_jee_eligibility) stays ~0.043

  Net effect: classifier better distinguishes stress queries
  about JEE vs factual eligibility queries about JEE.

VOCABULARY NOTE: The NGram vocabulary is FROZEN after initial training.
  Any new n-grams in the healing batch are silently ignored.
  The model does NOT expand its vocabulary mid-life.
  This is intentional — vocabulary drift causes dimension mismatch errors.
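
The count-shifting idea can be illustrated with a few lines of plain PHP. This is a simplified sketch over raw counts with Laplace smoothing (α = 1), not the Rubix ML internals and not the TF-IDF weighted features the pipeline uses; the function names, the four-token vocabulary, and all counts are made up for the example.

```php
<?php
// Sketch only: Laplace-smoothed likelihood P(token | class) from raw counts.
function smoothedLikelihood(array $classCounts, string $token, int $vocabSize, float $alpha = 1.0): float
{
    $tokenCount = $classCounts[$token] ?? 0;
    $total      = array_sum($classCounts);
    return ($tokenCount + $alpha) / ($total + $alpha * $vocabSize);
}

/** Adds a batch of counts on top of the existing ones; unknown tokens are skipped. */
function addCounts(array $classCounts, array $batchCounts, array $vocabulary): array
{
    foreach ($batchCounts as $token => $n) {
        if (!in_array($token, $vocabulary, true)) {
            continue; // frozen vocabulary: new n-grams are ignored
        }
        $classCounts[$token] = ($classCounts[$token] ?? 0) + $n;
    }
    return $classCounts;
}

$vocabulary = ['jee main', 'stress', 'eligibility', 'pressure'];
$examStress = ['jee main' => 1, 'stress' => 10, 'pressure' => 9];

$before     = smoothedLikelihood($examStress, 'jee main', count($vocabulary)); // (1+1)/(20+4)
$examStress = addCounts($examStress, ['jee main' => 8, 'burnout' => 3], $vocabulary);
$after      = smoothedLikelihood($examStress, 'jee main', count($vocabulary)); // (9+1)/(28+4)
```

The existing counts are never reset, so the likelihood for "jee main" under exam_stress rises, while "burnout" is dropped because it is not in the frozen vocabulary.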

Cron Setup (production):

# /etc/cron.d/pharos-heal  (crontab format with a user field; runs at minute 0 of every hour)
0 * * * * www-data /usr/bin/php /var/www/pharos/spark pharos:heal --batch=100 >> /var/log/pharos_heal.log 2>&1

Module 3: Memory & State (DialogueStateManager)

Session State Schema

{
  "dialogue_state": "normal | slot_filling | awaiting_quiz | follow_up",
  "active_entity": "JEE Mains",
  "entity_history": ["JEE Mains", "NIT Trichy"],
  "active_intent": "query_jee_eligibility",
  "slot": {
    "type": "city",
    "pending_intent": "search_college",
    "filled_value": null
  },
  "conversation_turn": 7,
  "topic_stack": ["exam_stress", "query_jee_eligibility"]
}

Entity Memory — Solving Conversational Amnesia

Turn 1: "Tell me about JEE Mains"
        → active_entity = "JEE Mains"
        → Response: "JEE Mains is a national entrance exam..."

Turn 2: "What is the syllabus?"    ← only 4 words, no entity
        → injectEntityContext() detects: query < 6 words AND no new entity
        → Injects: "JEE Mains What is the syllabus?"
        → ML now correctly classifies as query_jee_eligibility
        → Response gives JEE syllabus
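
The rule in Turn 2 (short query, no entity of its own) could look roughly like the sketch below. The function name, the known-entity list, and the substring-based entity detection are assumptions; the real injectEntityContext() may detect entities differently.

```php
<?php
// Sketch only: prepends the remembered entity to short, entity-free queries.
function injectEntityContextSketch(string $query, ?string $activeEntity, array $knownEntities): string
{
    $wordCount = count(preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY));

    // If the query already names an entity, leave it alone.
    foreach ($knownEntities as $entity) {
        if (stripos($query, $entity) !== false) {
            return $query;
        }
    }

    // Short follow-up with no entity: borrow the one from the previous turn.
    if ($activeEntity !== null && $wordCount < 6) {
        return $activeEntity . ' ' . $query;
    }
    return $query;
}

$entities = ['JEE Mains', 'NEET'];
$expanded = injectEntityContextSketch('What is the syllabus?', 'JEE Mains', $entities);
```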

Slot-Filling State Machine

Student: "Which college should I join?"
         → ML: intent = search_college
         → DialogueStateManager: intentRequiresSlot('search_college') = true
         → Shift to slot_filling mode
         → Return: "Sure! Which city are you looking in? 🏙️"

Student: "Mumbai"
         → State is slot_filling
         → resolveSlot("Mumbai") → slot.filled_value = "Mumbai"
         → Restore intent = search_college, state = normal
         → GuidanceEngineService gets slot_value = "Mumbai"
         → Template injection: {slot_value} = "Mumbai"
         → Response: "Here are top colleges in Mumbai for PCM students..."
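
The two-turn exchange above can be modelled as a small state machine. In this sketch a plain array stands in for the CI4 session, and the function names, the intent-to-slot map, and the returned action labels are hypothetical.

```php
<?php
// Sketch only: a plain array replaces the CI4 session.
function requiredSlotType(?string $intent): ?string
{
    $requirements = ['search_college' => 'city']; // hypothetical map
    return $intent === null ? null : ($requirements[$intent] ?? null);
}

/** Returns [newState, decision] for one user turn. */
function handleTurn(array $state, string $message, ?string $classifiedIntent): array
{
    if ($state['dialogue_state'] === 'slot_filling') {
        // The message is the slot answer; restore the pending intent.
        $intent = $state['slot']['pending_intent'];
        $state['slot']['filled_value'] = trim($message);
        $state['dialogue_state'] = 'normal';
        $state['active_intent']  = $intent;
        return [$state, ['action' => 'respond', 'intent' => $intent,
                         'slot_value' => $state['slot']['filled_value']]];
    }

    $slotType = requiredSlotType($classifiedIntent);
    if ($slotType !== null) {
        // The intent needs more information: park it and ask.
        $state['dialogue_state'] = 'slot_filling';
        $state['slot'] = ['type' => $slotType, 'pending_intent' => $classifiedIntent,
                          'filled_value' => null];
        return [$state, ['action' => 'ask_slot', 'slot_type' => $slotType]];
    }

    $state['active_intent'] = $classifiedIntent;
    return [$state, ['action' => 'respond', 'intent' => $classifiedIntent, 'slot_value' => null]];
}

$state = ['dialogue_state' => 'normal', 'active_intent' => null, 'slot' => null];
[$state, $first]  = handleTurn($state, 'Which college should I join?', 'search_college');
[$state, $second] = handleTurn($state, 'Mumbai', null);
```

While the state is slot_filling, the reply is taken as the slot value without consulting the classifier, which is why a one-word answer such as "Mumbai" does not fall through the confidence gate.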

Module 4: MCDA Conflict Resolution (GuidanceEngineService)

Multi-Criteria Decision Analysis Scoring

When the ML returns an intent, the Guidance Engine must select ONE template from potentially dozens of candidates for that intent. It does this using a weighted MCDA scoring model.

Criteria and Weights:

| Criterion | Symbol | Weight | Description |
| --- | --- | --- | --- |
| Template Priority | C₁ | 0.40 (40%) | The hand-tuned quality weight set by content authors |
| RIASEC Match | C₂ | 0.30 (30%) | How well required_riasec matches student's code |
| Academic Condition | C₃ | 0.20 (20%) | Does the student satisfy the template's academic rule? |
| Specificity Bonus | C₄ | 0.10 (10%) | Templates with more constraints are more personalized |

Scoring Formula (per template):

Score(T) = (C₁ × 0.40) + (C₂ × 0.30) + (C₃ × 0.20) + (C₄ × 0.10)

Where:
  C₁ = template.priority_weight / 100

  C₂ = RIASEC positional overlap score:
       Compare required_riasec against student's top-3 code.
       Position weights: [0.5, 0.3, 0.2] (dominant letter worth most)
       Each matching position contributes its weight to C₂.
       Empty required_riasec → 0.5 (neutral — template is universal)

  C₃ = Academic condition evaluation:
       Empty condition → 0.5 (neutral)
       Condition PASSES (e.g., student math=55, condition="math < 60") → 1.0
       Condition FAILS → 0.0

  C₄ = Specificity density:
       has_riasec_filter × 0.4 +
       has_academic_condition × 0.3 +
       has_grade_filter × 0.2 +
       has_stream_filter × 0.1
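
The formula above translates into a short scoring function. The sketch below follows the stated weights; the function names, the array keys, and the choice to take the best match when required_riasec lists several codes are assumptions, not the actual GuidanceEngineService code.

```php
<?php
// Sketch only: array keys and helper names are hypothetical.

/** C2: positional overlap between a required code and the student's top-3 code. */
function riasecMatch(string $required, string $studentCode): float
{
    if ($required === '') {
        return 0.5; // universal template
    }
    $weights = [0.5, 0.3, 0.2];
    $best    = 0.0;
    foreach (explode(',', $required) as $code) {
        $code  = trim($code);
        $score = 0.0;
        foreach ($weights as $i => $w) {
            if (isset($code[$i], $studentCode[$i]) && $code[$i] === $studentCode[$i]) {
                $score += $w;
            }
        }
        $best = max($best, $score); // assumption: best of the listed codes counts
    }
    return $best;
}

/** $conditionPasses: null = no condition, true = passes, false = fails. */
function mcdaScore(array $t, string $studentCode, ?bool $conditionPasses): float
{
    $c1 = $t['priority_weight'] / 100;
    $c2 = riasecMatch($t['required_riasec'], $studentCode);
    $c3 = $conditionPasses === null ? 0.5 : ($conditionPasses ? 1.0 : 0.0);
    $c4 = ($t['required_riasec'] !== '' ? 0.4 : 0.0)
        + ($t['academic_condition'] !== '' ? 0.3 : 0.0)
        + ($t['grade_filter'] !== '' ? 0.2 : 0.0)
        + ($t['stream_filter'] !== '' ? 0.1 : 0.0);

    return $c1 * 0.40 + $c2 * 0.30 + $c3 * 0.20 + $c4 * 0.10;
}

$specific = ['priority_weight' => 85, 'required_riasec' => 'I,IA,IR',
             'academic_condition' => 'math < 60', 'grade_filter' => '11,12', 'stream_filter' => 'PCM'];
$generic  = ['priority_weight' => 85, 'required_riasec' => '',
             'academic_condition' => '', 'grade_filter' => '', 'stream_filter' => ''];

$specificScore = mcdaScore($specific, 'IRS', true); // 0.34 + 0.24 + 0.20 + 0.10
$genericScore  = mcdaScore($generic, 'IRS', null);  // 0.34 + 0.15 + 0.10 + 0.00
```

With equal priority, the fully constrained template scores higher than the universal one for a matching student, which is the intended effect of the specificity bonus.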

Conflict Resolution Override:

IF student.hasStemLearningGap() IS TRUE:

  The system detects a conflict:
    "Student wants Engineering (STEM) but has Math < 60 OR Science < 60"

  Apply override bonuses/penalties:
    Templates tagged 'learning_gap' category: score += LEARNING_GAP_BONUS (0.25)
    Templates tagged 'passion_match' category: score -= PASSION_MATCH_PENALTY (0.15)

  EFFECT: "Learning gap" templates — which say things like:
    "Your ambition for Engineering is real, {student_name}. But Math at {math_score}%
     needs attention first. Here is your 90-day recovery plan..."

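  ...will outrank "passion match" templates that blindly encourage
  the student toward Engineering without addressing the academic gap,
  provided their other criteria are comparable.

  Worked example (C₂, C₃, C₄ contributions equal for both templates):
    A learning_gap template with priority=70 gets:  0.28 + 0.25 = 0.53 + rest
    A passion_match template with priority=80 gets: 0.32 - 0.15 = 0.17 + rest
    Net: learning_gap wins by 0.36 despite its lower base priority.

  The combined swing of 0.40 exceeds the largest possible priority gap
  (0.40 × 99/100 = 0.396), so priority alone can never reverse the override.
  A passion_match template can still win only if its C₂ to C₄ advantage
  exceeds the remaining margin.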
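
A minimal sketch of the override step, using the bonus and penalty values quoted above. The function name and the category strings passed as plain arguments are assumptions; the comparison below holds the C₂ to C₄ contributions equal for both templates (0.30, an arbitrary value).

```php
<?php
// Sketch only: applies the learning-gap bonus and passion-match penalty.
const LEARNING_GAP_BONUS    = 0.25;
const PASSION_MATCH_PENALTY = 0.15;

function applyConflictOverride(float $score, string $category, bool $hasStemLearningGap): float
{
    if (!$hasStemLearningGap) {
        return $score; // no conflict detected: scores are left as they are
    }
    return match ($category) {
        'learning_gap'  => $score + LEARNING_GAP_BONUS,
        'passion_match' => $score - PASSION_MATCH_PENALTY,
        default         => $score,
    };
}

$rest        = 0.30; // identical C2-C4 contribution for both templates
$learningGap = applyConflictOverride(0.70 * 0.40 + $rest, 'learning_gap', true);  // 0.28 + 0.30 + 0.25
$passion     = applyConflictOverride(0.80 * 0.40 + $rest, 'passion_match', true); // 0.32 + 0.30 - 0.15
```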

Dynamic Variable Injection

Once the winner template is selected, all {tag} placeholders are replaced with live student data:

| Tag | Source | Example Value |
| --- | --- | --- |
| {student_name} | StudentProfile.name | "Aryan" |
| {grade} | StudentProfile.grade | "11" |
| {stream} | StudentProfile.stream | "PCM" |
| {top_riasec} | StudentProfile.getDominantRiasec() | "I" |
| {riasec_code} | StudentProfile.getRiasecCode() | "IRS" |
| {aspired_career} | StudentProfile.aspired_career | "Engineer" |
| {math_score} | StudentProfile.math_score | "54" |
| {overall_score} | StudentProfile.getOverallAcademicScore() | "61.5" |
| {academic_label} | StudentProfile.getAcademicLabel() | "Average" |
| {active_entity} | DialogueStateManager.active_entity | "JEE Mains" |
| {slot_value} | DialogueStateManager.slot.filled_value | "Mumbai" |
| {preferred_location} | StudentProfile.preferred_location | "Delhi" |
| {eq_level} | StudentProfile.eq_level | "HIGH" |
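
Placeholder replacement of this kind can be done with a single regular-expression pass. The sketch below is an illustration; the function name and the decision to leave unknown tags untouched are assumptions about behaviour the README does not specify.

```php
<?php
// Sketch only: replaces {tag} placeholders; unknown tags are left untouched.
function injectVariables(string $template, array $values): string
{
    return preg_replace_callback(
        '/\{([a-z_]+)\}/',
        fn(array $m) => array_key_exists($m[1], $values) ? (string) $values[$m[1]] : $m[0],
        $template
    );
}

$text = injectVariables(
    'Dear {student_name}, Math at {math_score}% needs attention before {active_entity}.',
    ['student_name' => 'Aryan', 'math_score' => 54, 'active_entity' => 'JEE Mains']
);
// $text: "Dear Aryan, Math at 54% needs attention before JEE Mains."
```

Leaving unknown tags in place makes a missing value visible during testing instead of producing a sentence with a silent gap.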

🗄️ Database Schema

student_profiles

| Column | Type | Description |
| --- | --- | --- |
| id | BIGINT PK | Auto-increment |
| user_id | BIGINT | Foreign key to users table |
| name | VARCHAR(100) | Student's name |
| grade | TINYINT | 9, 10, 11, or 12 |
| stream | VARCHAR(20) | PCM, PCB, Commerce, Arts, Undecided |
| math_score | DECIMAL(5,2) | 0.00 – 100.00 |
| science_score | DECIMAL(5,2) | |
| english_score | DECIMAL(5,2) | |
| commerce_score | DECIMAL(5,2) | |
| riasec_r through riasec_c | TINYINT | 0 – 100 per dimension |
| eq_level | ENUM | LOW, MEDIUM, HIGH |
| aspired_career | VARCHAR(100) | Extracted from dialogue |
| preferred_location | VARCHAR(100) | |
| conversation_count | INT | For adaptive behavior |

response_templates

| Column | Type | Description |
| --- | --- | --- |
| id | BIGINT PK | |
| intent_id | VARCHAR(100) | Maps to ML intent class |
| template_key | VARCHAR(100) | Unique slug |
| category | VARCHAR(50) | learning_gap, passion_match, etc. |
| required_riasec | VARCHAR(50) | Comma-separated codes or empty |
| academic_condition | VARCHAR(100) | e.g., math < 60 |
| grade_filter | VARCHAR(20) | e.g., 11,12 |
| stream_filter | VARCHAR(20) | e.g., PCM |
| eq_filter | VARCHAR(10) | e.g., HIGH |
| priority_weight | TINYINT | 1 – 100 |
| template_text | TEXT | Response with {injection_tags} |
| follow_up_prompts | JSON | Array of suggested follow-ups |
| usage_count | INT | Tracks which templates are used most |

training_queue

| Column | Type | Description |
| --- | --- | --- |
| id | BIGINT PK | |
| raw_query | TEXT | User's original message |
| assigned_intent | VARCHAR(100) | Set by admin reviewer |
| session_context | JSON | Session state at time of failure |
| status | ENUM | pending → resolved → trained / rejected |

chat_messages

| Column | Type | Description |
| --- | --- | --- |
| id | BIGINT PK | |
| session_id | VARCHAR(128) | CI4 session ID |
| user_id | BIGINT | |
| role | ENUM | user, assistant |
| content | TEXT | Message text |
| intent | VARCHAR(100) | Detected intent (for assistant messages) |
| confidence | DECIMAL(5,4) | ML confidence score |
| template_id | BIGINT | Which template was used |
| metadata | JSON | MCDA scores, entity, slot data |

🔌 API Reference

Chat

POST /api/chat
Content-Type: application/json

{
  "message": "I want to become an engineer but my math is weak",
  "user_id": 42,
  "session_id": "abc123"
}

Response 200:
{
  "success": true,
  "response": "Dear Aryan, your ambition for Engineering is admirable...",
  "intent": "confusion_pcm_vs_commerce",
  "confidence": 0.78,
  "follow_ups": ["Tell me about JEE eligibility", "How to improve Math?"],
  "dialogue_state": "normal",
  "active_entity": null
}

Quiz

GET /api/quiz/questions
→ Returns 30 RIASEC questions

POST /api/quiz/submit
{ "user_id": 42, "responses": {"R1": 4, "R2": 3, ..., "C5": 5} }
→ { "riasec_code": "IRS", "scores": {...}, "suggested_careers": [...] }
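
How the submitted answers turn into a RIASEC code is not spelled out above, so the following is only one plausible scheme. It assumes five questions per dimension keyed R1 to C5, answers on a 1 to 5 scale, and a linear rescale to the 0 to 100 range used by the riasec_* columns; the function name is hypothetical.

```php
<?php
// Sketch only: assumes 1-5 answers and question IDs that start with the dimension letter.
function scoreRiasecQuiz(array $responses): array
{
    $sums   = array_fill_keys(['R', 'I', 'A', 'S', 'E', 'C'], 0);
    $counts = $sums;

    foreach ($responses as $questionId => $answer) {
        $dim = strtoupper(substr((string) $questionId, 0, 1));
        if (!isset($sums[$dim]) || $answer < 1 || $answer > 5) {
            throw new InvalidArgumentException("Bad response: {$questionId}");
        }
        $sums[$dim] += $answer;
        $counts[$dim]++;
    }

    $scores = [];
    foreach ($sums as $dim => $sum) {
        $n = $counts[$dim];
        // Map the possible range [n, 5n] onto [0, 100].
        $scores[$dim] = $n === 0 ? 0 : (int) round(($sum - $n) / (4 * $n) * 100);
    }

    $ranked = $scores;
    arsort($ranked); // stable since PHP 8.0, so ties keep R-I-A-S-E-C order
    return [
        'scores'      => $scores,
        'riasec_code' => implode('', array_slice(array_keys($ranked), 0, 3)),
    ];
}
```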

Admin

# All admin routes require: Authorization: Bearer <token>

GET  /api/admin/ai/status       → Model info, queue counts
POST /api/admin/ai/train        → Full retrain from scratch
POST /api/admin/ai/heal         → partialFit with resolved queue items
GET  /api/admin/ai/queue        → List training queue (filter by status)
PUT  /api/admin/ai/queue/{id}   → Assign intent, set resolved
DELETE /api/admin/ai/queue/{id} → Reject (exclude from training)
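
The bearer-token check behind these routes reduces to a header parse and a constant-time comparison. The sketch below shows that core in plain PHP; the function name is hypothetical, and the actual AdminApiFilter runs as a CodeIgniter filter whose internals are not shown in this README.

```php
<?php
// Sketch only: the comparison at the heart of a bearer-token filter.
function isAuthorizedAdmin(?string $authorizationHeader, string $expectedToken): bool
{
    if ($expectedToken === '' || $authorizationHeader === null) {
        return false; // no token configured, or no header sent
    }
    if (!preg_match('/^Bearer\s+(\S+)$/i', trim($authorizationHeader), $m)) {
        return false; // malformed header
    }
    return hash_equals($expectedToken, $m[1]); // constant-time comparison
}
```

Refusing all requests when no token is configured avoids an accidentally open admin API, which matters here because pharos.adminToken has no default value.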

⚙️ Configuration Reference

| .env Key | Default | Description |
| --- | --- | --- |
| pharos.confidenceGate | 0.45 | ML confidence threshold |
| pharos.modelPath | writable/models/ | Model storage directory |
| pharos.healBatchSize | 100 | Samples per self-healing run |
| pharos.healMinSamples | 20 | Minimum before heal triggers |
| pharos.adminToken | (none) | Bearer token for /api/admin/* |

📈 Scaling the Knowledge Base

The system is designed to scale from ~15 templates to thousands:

  1. Add templates via SQL — each new row in response_templates is immediately available
  2. Tune priority_weight — higher weight = preferred by MCDA
  3. Add intent classes — add seed corpus samples in AutonomousLearningService::getSeedCorpus(), retrain
  4. Run the heal loop — as students use the system, low-confidence queries flow to training_queue, humans review and resolve, and php spark pharos:heal absorbs them

Recommended template distribution:

| Intent | Min templates | Recommended |
| --- | --- | --- |
| exam_stress | 5 | 20+ |
| query_jee_eligibility | 5 | 30+ (grade×stream combos) |
| confusion_pcm_vs_commerce | 5 | 25+ (score range combos) |
| career_explore_riasec | 6 | 36+ (one per RIASEC code) |
| search_college | 5 | 50+ (stream×city combos) |

🛠️ Development Commands

# Full setup (first run)
composer install && php spark key:generate && php spark migrate \
  && php spark db:seed PharosTemplateSeeder && php spark pharos:train

# Development server
php spark serve --port=8080

# Train model from scratch
php spark pharos:train --verbose

# Run self-healing loop
php spark pharos:heal --batch=50

# Dry-run heal (preview without training)
php spark pharos:heal --dry-run

# Database operations
php spark migrate:refresh          # Roll back and re-run all migrations (destroys existing data)
php spark db:seed PharosTemplateSeeder  # Re-seed templates

# Clear caches
php spark cache:clear

📄 License

MIT License — see LICENSE file.


Built with ❤️ for every Indian student standing at the crossroads of Class 10 and Class 11.
