🔭 PHAROS — Autonomous Career Counseling Expert System

"Pharos" — the ancient lighthouse of Alexandria. Here, a lighthouse for Indian students navigating the sea of career choices.

A fully self-contained, zero-external-API career counseling chatbot for Indian Class 9–12 students. Built on CodeIgniter 4 + Rubix ML, Pharos is a dynamic Expert System that combines psychometric (RIASEC) data, machine-learned intent classification, and a database of empathetic response templates (about 15 seeded, designed to scale to thousands). It uses no OpenAI, Anthropic, or other third-party AI services.

📐 System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         BROWSER (SPA)                               │
│   HTML / Vanilla JS — Claude-style chat UI (dark navy + saffron)   │
└─────────────────────────┬───────────────────────────────────────────┘
                          │ HTTPS JSON
┌─────────────────────────▼───────────────────────────────────────────┐
│                   CODEIGNITER 4 (PHP 8.2+)                          │
│                                                                      │
│  ChatController  ──▶  AutonomousLearningService  (Rubix ML NB)     │
│        │                        │                                    │
│        │               [Confidence Gate ≥ 0.45]                     │
│        │                        │                                    │
│        ▼                        ▼ PASS              FAIL ▼          │
│  DialogueStateManager    GuidanceEngineService   TrainingQueueModel │
│  (CI4 Session FSM)        (MCDA Scoring Engine)   (human review)   │
│        │                        │                                    │
│        │                 ResponseTemplateModel                       │
│        │                 (knowledge base)                            │
│        ▼                        ▼                                    │
│                    StudentProfileModel                               │
│               (RIASEC + academics + session state)                  │
│                                                                      │
│  AIController  ──▶  /api/admin/ai/* (train, heal, queue mgmt)      │
│  QuizController ──▶ /api/quiz/* (RIASEC 30-question assessment)    │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                          ┌────────▼─────────┐
                          │   MySQL / MariaDB │
                          │  5 core tables    │
                          └──────────────────┘

🚀 Quick Start

Prerequisites

Dependency	Version	Notes
PHP	≥ 8.2	ext-intl, ext-mbstring, ext-json required
Composer	≥ 2.5	Package manager
MySQL / MariaDB	≥ 8.0 / 10.6	Main database
PHP-CLI	same as above	For Spark commands

1. Clone & Install

git clone https://github.com/your-org/pharos.git
cd pharos
composer install

2. Environment Setup

cp .env.example .env
# Edit .env — set database credentials and encryption key
php spark key:generate

Minimum .env settings:

CI_ENVIRONMENT = development
database.default.database = pharos_db
database.default.username = pharos_user
database.default.password = your_password

3. Database Setup

# Create the database first
mysql -u root -p -e "CREATE DATABASE pharos_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"

# Run migrations (creates 5 tables + session table)
php spark migrate

# Seed the knowledge base (~15 templates across 8 intents)
php spark db:seed PharosTemplateSeeder

4. Train the ML Model

# Trains NaiveBayes on ~200 seed samples, saves to writable/models/
php spark pharos:train

Expected output:

╔══════════════════════════════════════════════════════╗
║          PHAROS — Intent Classifier Training          ║
╚══════════════════════════════════════════════════════╝
▶ Building pipeline: TextNormalizer → StopWordFilter → WordCountVectorizer(NGram 1-2) → TfIdfTransformer → NaiveBayes
▶ Loading seed corpus (~200+ samples, 11 intent classes)...
╔══════════════════════════════════════════════════════╗
║              ✓  Training Complete                     ║
╚══════════════════════════════════════════════════════╝
  Samples trained:     212
  Intent classes:      11
  Training time:       1.84s
  Model saved to:      /var/www/pharos/writable/models/pharos_intent_classifier.rbx

5. Launch

php spark serve
# Open: http://localhost:8080

📁 Project Structure

pharos/
├── app/
│   ├── Commands/
│   │   ├── PharosTrain.php          # php spark pharos:train
│   │   └── PharosHeal.php           # php spark pharos:heal
│   ├── Config/
│   │   ├── Filters.php              # Registers AdminApiFilter alias
│   │   └── Routes.php               # All API routes + SPA catch-all
│   ├── Controllers/
│   │   ├── HomeController.php       # Serves SPA shell
│   │   ├── ChatController.php       # Core conversation pipeline
│   │   ├── QuizController.php       # RIASEC 30-question assessment
│   │   └── AIController.php         # ML admin API (train/heal/queue)
│   ├── Entities/
│   │   └── StudentProfile.php       # CI4 Entity with computed RIASEC/academic methods
│   ├── Filters/
│   │   └── AdminApiFilter.php       # Bearer token auth for /api/admin/*
│   ├── Models/
│   │   ├── StudentProfileModel.php  # RIASEC + academic score persistence
│   │   ├── ResponseTemplateModel.php# Knowledge base + MCDA query methods
│   │   ├── ChatMessageModel.php     # Conversation history
│   │   └── TrainingQueueModel.php   # Self-healing training pipeline
│   ├── Services/
│   │   ├── AutonomousLearningService.php  # Rubix ML NaiveBayes engine
│   │   ├── DialogueStateManager.php       # Session FSM + slot-filling
│   │   └── GuidanceEngineService.php      # MCDA scoring + template injection
│   └── Views/
│       └── pharos/app.php           # Full SPA (HTML/CSS/JS, no framework)
├── database/
│   └── migrations/
│       ├── 2024-01-01-000001_CreatePharosTables.php
│       ├── 2024-01-01-000002_CreateSessionTable.php
│       └── PharosTemplateSeeder.php
├── writable/
│   └── models/                      # Serialized .rbx model files (gitignored)
├── composer.json
├── .env.example
└── README.md

🧠 Module Deep-Dive

Module 1: Student Profile & Knowledge Base

`StudentProfileModel` + `StudentProfile` Entity

The profile is the context window of the Expert System. Intent classification runs on the message text; every template-selection decision that follows is conditioned on the profile's values.

Key computed properties on StudentProfile:

Method	Description
`getRiasecCode()`	Returns top-3 RIASEC dimensions by score (e.g., `"IRS"`)
`getDominantRiasec()`	Returns single dominant dimension letter
`getOverallAcademicScore()`	Weighted avg: Math×0.30, Sci×0.30, Eng×0.25, Com×0.15
`hasStemLearningGap()`	`true` if STEM-aspired AND (Math<60 OR Science<60)
`getAcademicLabel()`	`"High Achiever"` / `"Average"` / `"Needs Support"` etc.
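
As a rough illustration of two of the computed properties above, here is a minimal sketch written as a plain PHP class rather than a CI4 Entity. The class name `StudentProfileSketch`, its property names, and the list of STEM careers are assumptions made for this example; only the weights and the 60-mark threshold come from the table.

```php
<?php
declare(strict_types=1);

// Illustrative stand-in for the StudentProfile entity. Property names and the
// STEM career list are assumptions, not the project's actual code.
final class StudentProfileSketch
{
    private const STEM_CAREERS = ['engineer', 'scientist', 'data scientist'];

    public function __construct(
        public float $math,
        public float $science,
        public float $english,
        public float $commerce,
        public string $aspiredCareer,
    ) {}

    // Weighted average using the weights from the table above.
    public function getOverallAcademicScore(): float
    {
        return round(
            $this->math * 0.30 + $this->science * 0.30
            + $this->english * 0.25 + $this->commerce * 0.15,
            2
        );
    }

    // True if the student aspires to a STEM career and Math or Science is below 60.
    public function hasStemLearningGap(): bool
    {
        $isStem = in_array(strtolower($this->aspiredCareer), self::STEM_CAREERS, true);
        return $isStem && ($this->math < 60 || $this->science < 60);
    }
}

$profile = new StudentProfileSketch(54, 62, 70, 65, 'Engineer');
echo $profile->getOverallAcademicScore(), PHP_EOL;
var_dump($profile->hasStemLearningGap());
```

Because the four weights sum to 1.00, the overall score stays on the same 0 to 100 scale as the subject scores.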

`ResponseTemplateModel` — The Knowledge Base

Each row represents one response the system can give. Key columns:

Column	Type	Example
`intent_id`	VARCHAR	`exam_stress`
`required_riasec`	VARCHAR	`I,IA,IR` (comma-separated codes)
`academic_condition`	VARCHAR	`math < 60`
`grade_filter`	VARCHAR	`11,12`
`stream_filter`	VARCHAR	`PCM`
`eq_filter`	ENUM	`HIGH`
`priority_weight`	INT (1-100)	`85`
`template_text`	TEXT	`Dear {student_name}, your score in...`
`follow_up_prompts`	JSON	`["Tell me more about JEE", "What about NEET?"]`

Module 2: The ML Brain (`AutonomousLearningService`)

Pipeline Architecture

Raw User Input
     │
     ▼
TextNormalizer          → lowercase, unicode normalization
     │
     ▼
StopWordFilter          → removes 'the', 'is', 'mera', 'kya', etc.
     │
     ▼
WordCountVectorizer     → NGram(1,2) — builds unigram + bigram vocabulary
  (max 10,000 tokens)     e.g., ["jee", "main", "jee main", "eligibility"]
     │
     ▼
TfIdfTransformer        → term frequency × inverse document frequency
                           Reduces weight of high-frequency generic words
     │
     ▼
NaiveBayes(α=1.0)       → Laplace-smoothed multinomial NB classifier
     │
     ▼
proba()                 → Probability distribution over all intent classes
     │
     ├── max(proba) ≥ 0.45 ──▶ Intent classified → GuidanceEngineService
     │
     └── max(proba) < 0.45 ──▶ Saved to training_queue → out_of_scope response
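
To make the first three pipeline stages concrete, the sketch below reproduces them in plain PHP: normalization, stop-word removal, then unigram and bigram extraction. It does not use Rubix ML and is not the project's code; the function name `extractNgrams` and the stop-word list are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical stop-word list; the project's real list is not shown in this README.
const STOP_WORDS = ['the', 'is', 'mera', 'kya', 'what', 'for'];

/**
 * Lowercase, strip punctuation, drop stop words, then emit unigrams and bigrams.
 *
 * @return list<string>
 */
function extractNgrams(string $text): array
{
    $normalized = mb_strtolower($text);
    // Keep letters, digits and whitespace only.
    $normalized = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $normalized) ?? '';
    $words = preg_split('/\s+/u', trim($normalized), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $words = array_values(array_diff($words, STOP_WORDS));

    $ngrams = $words; // unigrams first
    for ($i = 0; $i < count($words) - 1; $i++) {
        $ngrams[] = $words[$i] . ' ' . $words[$i + 1]; // then adjacent-word bigrams
    }
    return $ngrams;
}

print_r(extractNgrams('What is the JEE Main eligibility?'));
```

Bigrams are formed after stop words are removed, so "jee main" survives as a single feature even when filler words surround it.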

Intent Classes (11 total)

Intent ID	Description	Example Queries
`confusion_pcm_vs_commerce`	PCM vs Commerce choice	"Should I take science or commerce?"
`query_jee_eligibility`	JEE eligibility/dates/syllabus	"What is JEE Mains eligibility?"
`query_neet_eligibility`	NEET queries	"Can I give NEET after Class 12?"
`exam_stress`	Anxiety, burnout, pressure	"I can't handle the JEE pressure"
`search_college`	College search/ranking	"Best NIT for computer science?"
`career_explore_riasec`	Career exploration	"What career suits me?"
`query_scholarship`	Scholarship info	"What scholarships are available?"
`study_tips_request`	Study strategies	"How to study Physics effectively?"
`doubt_resolution`	Subject doubts	"I don't understand integration"
`parent_conflict`	Parent pressure	"My parents want me to be a doctor"
`out_of_scope`	Unrecognized/off-topic	Anything below 0.45 confidence

Confidence Gating

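// From AutonomousLearningService::classify()
// Requires: use Rubix\ML\Datasets\Unlabeled;
// proba() expects a Dataset, so the single sample is wrapped before predicting.
// Assumes $sample is one feature row, e.g. [$query].
$proba = $estimator->proba(new Unlabeled([$sample]));
arsort($proba[0]);                             // sort class probabilities, highest first
$topIntent     = array_key_first($proba[0]);   // most likely intent
$topConfidence = reset($proba[0]);             // its probability

if ($topConfidence < self::CONFIDENCE_GATE) {  // 0.45
    // Save to training_queue for human review
    $queueModel->insert([
        'raw_query'       => $query,
        'session_context' => json_encode($context),
        'status'          => 'pending',
    ]);
    return ['intent' => 'out_of_scope', 'confidence' => $topConfidence];
}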

Self-Healing Loop (partialFit)

THEORY: Naive Bayes maintains per-class word count matrices.
        partialFit() adds NEW counts ON TOP of existing counts.
        It does NOT reset the model — it shifts the distributions.

EXAMPLE:
  Before heal:
    P("jee main" | query_jee_eligibility) = 0.043
    P("jee main" | exam_stress)            = 0.008

  After healing with 50 new resolved "exam_stress" samples
  that contain "jee main":
    P("jee main" | exam_stress) shifts to 0.014
    P("jee main" | query_jee_eligibility) stays ~0.043

  Net effect: classifier better distinguishes stress queries
  about JEE vs factual eligibility queries about JEE.

VOCABULARY NOTE: The NGram vocabulary is FROZEN after initial training.
  Any new n-grams in the healing batch are silently ignored.
  The model does NOT expand its vocabulary mid-life.
  This is intentional — vocabulary drift causes dimension mismatch errors.
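
The count-accumulation idea and the frozen vocabulary can be illustrated with a toy counter. This is a sketch for intuition only, not Rubix ML's implementation; the class `CountingNaiveBayes` and its methods are hypothetical, and the numbers are far smaller than in a real model.

```php
<?php
declare(strict_types=1);

// Toy multinomial Naive Bayes counter, written for illustration only.
final class CountingNaiveBayes
{
    /** @var array<string, array<string, int>> token counts per class */
    private array $counts = [];

    /** @param list<string> $vocabulary frozen at construction time */
    public function __construct(private array $vocabulary, private float $alpha = 1.0) {}

    /** Add counts on top of existing ones; tokens outside the vocabulary are ignored. */
    public function partialFit(string $class, array $tokens): void
    {
        foreach ($tokens as $token) {
            if (in_array($token, $this->vocabulary, true)) {
                $this->counts[$class][$token] = ($this->counts[$class][$token] ?? 0) + 1;
            }
        }
    }

    /** Laplace-smoothed estimate of P(token | class). */
    public function likelihood(string $token, string $class): float
    {
        $classCounts = $this->counts[$class] ?? [];
        $total = array_sum($classCounts);
        return (($classCounts[$token] ?? 0) + $this->alpha)
            / ($total + $this->alpha * count($this->vocabulary));
    }
}

$nb = new CountingNaiveBayes(['jee main', 'pressure', 'eligibility']);
$nb->partialFit('exam_stress', ['pressure', 'pressure', 'pressure']);
$before = $nb->likelihood('jee main', 'exam_stress');

// Healing batch: resolved stress queries that mention "jee main",
// plus an unseen n-gram ("burnout") that the frozen vocabulary drops.
$nb->partialFit('exam_stress', ['jee main', 'jee main', 'burnout']);
$after = $nb->likelihood('jee main', 'exam_stress');

printf("before=%.3f after=%.3f\n", $before, $after);
```

The second call never resets the earlier counts, so the estimate for "jee main" under exam_stress rises while the unseen token leaves the model's dimensions unchanged.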

Cron Setup (production):

# /etc/cron.d/pharos-heal (system crontab format: schedule, user, command; runs at the top of every hour)
0 * * * * www-data /usr/bin/php /var/www/pharos/spark pharos:heal --batch=100 >> /var/log/pharos_heal.log 2>&1

Module 3: Memory & State (`DialogueStateManager`)

Session State Schema

{
  "dialogue_state": "normal | slot_filling | awaiting_quiz | follow_up",
  "active_entity": "JEE Mains",
  "entity_history": ["JEE Mains", "NIT Trichy"],
  "active_intent": "query_jee_eligibility",
  "slot": {
    "type": "city",
    "pending_intent": "search_college",
    "filled_value": null
  },
  "conversation_turn": 7,
  "topic_stack": ["exam_stress", "query_jee_eligibility"]
}

Entity Memory — Solving Conversational Amnesia

Turn 1: "Tell me about JEE Mains"
        → active_entity = "JEE Mains"
        → Response: "JEE Mains is a national entrance exam..."

Turn 2: "What is the syllabus?"    ← only 4 words, no entity
        → injectEntityContext() detects: query < 6 words AND no new entity
        → Injects: "JEE Mains What is the syllabus?"
        → ML now correctly classifies as query_jee_eligibility
        → Response gives JEE syllabus
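
A minimal sketch of the injection rule in the walkthrough above, assuming the rule is exactly "fewer than 6 words and no new entity". The signature shown here, including the `$knownEntities` lookup list, is an assumption; the real `injectEntityContext()` may detect entities differently.

```php
<?php
declare(strict_types=1);

/**
 * Prefix the remembered entity onto short follow-up queries.
 * $knownEntities is a hypothetical lookup list used to detect a new entity.
 */
function injectEntityContext(string $query, ?string $activeEntity, array $knownEntities): string
{
    if ($activeEntity === null) {
        return $query; // nothing remembered yet
    }
    foreach ($knownEntities as $entity) {
        if (stripos($query, $entity) !== false) {
            return $query; // the query names its own entity, so leave it alone
        }
    }
    $wordCount = count(preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) ?: []);
    return $wordCount < 6 ? $activeEntity . ' ' . $query : $query;
}

echo injectEntityContext('What is the syllabus?', 'JEE Mains', ['JEE Mains', 'NEET']), PHP_EOL;
```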

Slot-Filling State Machine

Student: "Which college should I join?"
         → ML: intent = search_college
         → DialogueStateManager: intentRequiresSlot('search_college') = true
         → Shift to slot_filling mode
         → Return: "Sure! Which city are you looking in? 🏙️"

Student: "Mumbai"
         → State is slot_filling
         → resolveSlot("Mumbai") → slot.filled_value = "Mumbai"
         → Restore intent = search_college, state = normal
         → GuidanceEngineService gets slot_value = "Mumbai"
         → Template injection: {slot_value} = "Mumbai"
         → Response: "Here are top colleges in Mumbai for PCM students..."
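
The two-turn exchange above can be sketched as a small state transition function. The function `handleTurn`, the `SLOT_REQUIREMENTS` map, and the return shape are hypothetical; the state keys follow the session schema shown earlier.

```php
<?php
declare(strict_types=1);

// Hypothetical map of intents that need a slot before they can be answered.
const SLOT_REQUIREMENTS = ['search_college' => 'city'];

/**
 * One dialogue turn. $classifiedIntent is what the classifier returned for
 * $message; it is ignored while a slot is pending.
 *
 * @return array{state: array, intent: ?string, prompt: ?string}
 */
function handleTurn(array $state, string $message, ?string $classifiedIntent): array
{
    if (($state['dialogue_state'] ?? 'normal') === 'slot_filling') {
        // Treat the whole message as the slot answer and resume the parked intent.
        $state['slot']['filled_value'] = trim($message);
        $state['dialogue_state'] = 'normal';
        return ['state' => $state, 'intent' => $state['slot']['pending_intent'] ?? null, 'prompt' => null];
    }
    if ($classifiedIntent !== null && array_key_exists($classifiedIntent, SLOT_REQUIREMENTS)) {
        // Park the intent and ask for the missing slot.
        $state['dialogue_state'] = 'slot_filling';
        $state['slot'] = [
            'type'           => SLOT_REQUIREMENTS[$classifiedIntent],
            'pending_intent' => $classifiedIntent,
            'filled_value'   => null,
        ];
        return ['state' => $state, 'intent' => null, 'prompt' => 'Sure! Which city are you looking in?'];
    }
    return ['state' => $state, 'intent' => $classifiedIntent, 'prompt' => null];
}

$turn1 = handleTurn(['dialogue_state' => 'normal'], 'Which college should I join?', 'search_college');
$turn2 = handleTurn($turn1['state'], 'Mumbai', null);
echo $turn2['intent'], ' / ', $turn2['state']['slot']['filled_value'], PHP_EOL;
```

Skipping classification while a slot is pending keeps a one-word answer such as "Mumbai" from being sent through the confidence gate and rejected as out of scope.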

Module 4: MCDA Conflict Resolution (`GuidanceEngineService`)

Multi-Criteria Decision Analysis Scoring

When the ML returns an intent, the Guidance Engine must select ONE template from potentially dozens of candidates for that intent. It does this using a weighted MCDA scoring model.

Criteria and Weights:

Criterion	Symbol	Weight	Description
Template Priority	C₁	0.40 (40%)	The hand-tuned quality weight set by content authors
RIASEC Match	C₂	0.30 (30%)	How well required_riasec matches student's code
Academic Condition	C₃	0.20 (20%)	Does the student satisfy the template's academic rule?
Specificity Bonus	C₄	0.10 (10%)	Templates with more constraints are more personalized

Scoring Formula (per template):

Score(T) = (C₁ × 0.40) + (C₂ × 0.30) + (C₃ × 0.20) + (C₄ × 0.10)

Where:
  C₁ = template.priority_weight / 100

  C₂ = RIASEC positional overlap score:
       Compare required_riasec against student's top-3 code.
       Position weights: [0.5, 0.3, 0.2] (dominant letter worth most)
       Each matching position contributes its weight to C₂.
       Empty required_riasec → 0.5 (neutral — template is universal)

  C₃ = Academic condition evaluation:
       Empty condition → 0.5 (neutral)
       Condition PASSES (e.g., student math=55, condition="math < 60") → 1.0
       Condition FAILS → 0.0

  C₄ = Specificity density:
       has_riasec_filter × 0.4 +
       has_academic_condition × 0.3 +
       has_grade_filter × 0.2 +
       has_stream_filter × 0.1
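
The formula can be worked through in code. The sketch below is one possible reading, not the actual GuidanceEngineService: array keys follow the column names in this README, and two details are assumptions, namely that a multi-code `required_riasec` takes the best-scoring code, and that an unparseable condition counts as failed.

```php
<?php
declare(strict_types=1);

/** C2: best positional overlap between any required code and the student's top-3 code. */
function riasecMatch(string $requiredRiasec, string $studentCode): float
{
    if (trim($requiredRiasec) === '') {
        return 0.5; // universal template
    }
    $weights = [0.5, 0.3, 0.2];
    $best = 0.0;
    foreach (explode(',', $requiredRiasec) as $code) {
        $code = trim($code);
        $score = 0.0;
        foreach ($weights as $i => $w) {
            if (isset($code[$i], $studentCode[$i]) && $code[$i] === $studentCode[$i]) {
                $score += $w;
            }
        }
        $best = max($best, $score);
    }
    return $best;
}

/** C3: evaluates conditions of the form "<subject> <op> <number>", e.g. "math < 60". */
function academicMatch(string $condition, array $scores): float
{
    if (trim($condition) === '') {
        return 0.5; // no condition: neutral
    }
    if (!preg_match('/^\s*(\w+)\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*$/', $condition, $m)
        || !isset($scores[$m[1]])) {
        return 0.0; // assumption: unparseable or unknown subject counts as failed
    }
    [$value, $limit] = [(float) $scores[$m[1]], (float) $m[3]];
    $pass = match ($m[2]) {
        '<' => $value < $limit, '<=' => $value <= $limit,
        '>' => $value > $limit, '>=' => $value >= $limit,
    };
    return $pass ? 1.0 : 0.0;
}

/** Score(T) = C1*0.40 + C2*0.30 + C3*0.20 + C4*0.10 */
function scoreTemplate(array $t, string $studentCode, array $scores): float
{
    $c1 = $t['priority_weight'] / 100;
    $c2 = riasecMatch($t['required_riasec'] ?? '', $studentCode);
    $c3 = academicMatch($t['academic_condition'] ?? '', $scores);
    $c4 = 0.4 * (int) !empty($t['required_riasec'])
        + 0.3 * (int) !empty($t['academic_condition'])
        + 0.2 * (int) !empty($t['grade_filter'])
        + 0.1 * (int) !empty($t['stream_filter']);
    return $c1 * 0.40 + $c2 * 0.30 + $c3 * 0.20 + $c4 * 0.10;
}

$template = ['priority_weight' => 85, 'required_riasec' => 'I,IA,IR', 'academic_condition' => 'math < 60'];
echo round(scoreTemplate($template, 'IRS', ['math' => 55]), 3), PHP_EOL;
```

For this template and an IRS student with Math at 55: C₁ = 0.85, C₂ = 0.8 (the code IR matches positions 1 and 2), C₃ = 1.0, C₄ = 0.7, giving 0.34 + 0.24 + 0.20 + 0.07 = 0.85.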

Conflict Resolution Override:

IF student.hasStemLearningGap() IS TRUE:

  The system detects a conflict:
    "Student wants Engineering (STEM) but has Math < 60 OR Science < 60"

  Apply override bonuses/penalties:
    Templates tagged 'learning_gap' category: score += LEARNING_GAP_BONUS (0.25)
    Templates tagged 'passion_match' category: score -= PASSION_MATCH_PENALTY (0.15)

  EFFECT: "Learning gap" templates — which say things like:
    "Your ambition for Engineering is real, {student_name}. But Math at {math_score}%
     needs attention first. Here is your 90-day recovery plan..."

  ...will outrank "passion match" templates that blindly encourage
  the student toward Engineering without addressing the academic gap,
  provided the passion_match base score leads by less than 0.40.

  Why 0.40: the bonus and penalty together swing the comparison by
  0.25 + 0.15 = 0.40.
    A learning_gap template with priority=70 gets: base + 0.25
    A passion_match template with priority=80 gets: base - 0.15
    Net: the 10-point priority gap is worth only 0.04 (0.10 × 0.40),
    so learning_gap wins when the other criteria are comparable.
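
A short worked example of the override arithmetic, assuming the bonus and penalty are added to the base MCDA score after it is computed. The function name `applyConflictOverride` is hypothetical, and the C₂, C₃, C₄ values are held at illustrative constants so that only priority differs.

```php
<?php
declare(strict_types=1);

const LEARNING_GAP_BONUS = 0.25;
const PASSION_MATCH_PENALTY = 0.15;

/** Adjust a base MCDA score when the student has a STEM learning gap. */
function applyConflictOverride(float $baseScore, string $category, bool $hasStemLearningGap): float
{
    if (!$hasStemLearningGap) {
        return $baseScore; // no conflict detected, scores stand as computed
    }
    return match ($category) {
        'learning_gap'  => $baseScore + LEARNING_GAP_BONUS,
        'passion_match' => $baseScore - PASSION_MATCH_PENALTY,
        default         => $baseScore,
    };
}

// Two templates that differ only in priority (70 vs 80); the other criteria
// are held at illustrative values C2 = 0.5, C3 = 0.5, C4 = 0.
$gapBase     = 0.70 * 0.40 + 0.5 * 0.30 + 0.5 * 0.20; // 0.53
$passionBase = 0.80 * 0.40 + 0.5 * 0.30 + 0.5 * 0.20; // 0.57

printf(
    "learning_gap=%.2f passion_match=%.2f\n",
    applyConflictOverride($gapBase, 'learning_gap', true),
    applyConflictOverride($passionBase, 'passion_match', true)
);
```

Here the passion_match template starts 0.04 ahead and ends 0.36 behind. The combined swing is 0.25 + 0.15 = 0.40, so a passion_match template would need a base lead larger than that to stay on top.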

Dynamic Variable Injection

Once the winner template is selected, all {tag} placeholders are replaced with live student data:

Tag	Source	Example Value
`{student_name}`	StudentProfile.name	`"Aryan"`
`{grade}`	StudentProfile.grade	`"11"`
`{stream}`	StudentProfile.stream	`"PCM"`
`{top_riasec}`	StudentProfile.getDominantRiasec()	`"I"`
`{riasec_code}`	StudentProfile.getRiasecCode()	`"IRS"`
`{aspired_career}`	StudentProfile.aspired_career	`"Engineer"`
`{math_score}`	StudentProfile.math_score	`"54"`
`{overall_score}`	StudentProfile.getOverallAcademicScore()	`"61.5"`
`{academic_label}`	StudentProfile.getAcademicLabel()	`"Average"`
`{active_entity}`	DialogueStateManager.active_entity	`"JEE Mains"`
`{slot_value}`	DialogueStateManager.slot.filled_value	`"Mumbai"`
`{preferred_location}`	StudentProfile.preferred_location	`"Delhi"`
`{eq_level}`	StudentProfile.eq_level	`"HIGH"`
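
Placeholder replacement of this kind can be done in a single pass with PHP's built-in `strtr()`. The helper below is a sketch under that assumption; `injectVariables` is a hypothetical name and the README does not say how the real engine performs the substitution.

```php
<?php
declare(strict_types=1);

/**
 * Replace {tag} placeholders with values; unknown tags are left untouched.
 *
 * @param array<string, string|int|float> $values keyed by tag name without braces
 */
function injectVariables(string $template, array $values): string
{
    $pairs = [];
    foreach ($values as $tag => $value) {
        $pairs['{' . $tag . '}'] = (string) $value;
    }
    return strtr($template, $pairs);
}

echo injectVariables(
    'Dear {student_name}, your Math score of {math_score}% needs attention.',
    ['student_name' => 'Aryan', 'math_score' => 54]
), PHP_EOL;
```

`strtr()` with an array never rescans text it has already substituted, so a student name that happens to contain braces cannot trigger a second replacement.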

🗄️ Database Schema

`student_profiles`

Column	Type	Description
id	BIGINT PK	Auto-increment
user_id	BIGINT	Foreign key to users table
name	VARCHAR(100)	Student's name
grade	TINYINT	9, 10, 11, or 12
stream	VARCHAR(20)	PCM, PCB, Commerce, Arts, Undecided
math_score	DECIMAL(5,2)	0.00 – 100.00
science_score	DECIMAL(5,2)	0.00 – 100.00
english_score	DECIMAL(5,2)	0.00 – 100.00
commerce_score	DECIMAL(5,2)	0.00 – 100.00
riasec_r through riasec_c	TINYINT	0 – 100 per dimension
eq_level	ENUM	LOW, MEDIUM, HIGH
aspired_career	VARCHAR(100)	Extracted from dialogue
preferred_location	VARCHAR(100)
conversation_count	INT	For adaptive behavior

`response_templates`

Column	Type	Description
id	BIGINT PK
intent_id	VARCHAR(100)	Maps to ML intent class
template_key	VARCHAR(100)	Unique slug
category	VARCHAR(50)	learning_gap, passion_match, etc.
required_riasec	VARCHAR(50)	Comma-separated codes or empty
academic_condition	VARCHAR(100)	e.g., `math < 60`
grade_filter	VARCHAR(20)	e.g., `11,12`
stream_filter	VARCHAR(20)	e.g., `PCM`
eq_filter	VARCHAR(10)	e.g., `HIGH`
priority_weight	TINYINT	1 – 100
template_text	TEXT	Response with {injection_tags}
follow_up_prompts	JSON	Array of suggested follow-ups
usage_count	INT	Tracks which templates are used most

`training_queue`

Column	Type	Description
id	BIGINT PK
raw_query	TEXT	User's original message
assigned_intent	VARCHAR(100)	Set by admin reviewer
session_context	JSON	Session state at time of failure
status	ENUM	pending → resolved → trained / rejected

`chat_messages`

Column	Type	Description
id	BIGINT PK
session_id	VARCHAR(128)	CI4 session ID
user_id	BIGINT
role	ENUM	user, assistant
content	TEXT	Message text
intent	VARCHAR(100)	Detected intent (for assistant messages)
confidence	DECIMAL(5,4)	ML confidence score
template_id	BIGINT	Which template was used
metadata	JSON	MCDA scores, entity, slot data

🔌 API Reference

Chat

POST /api/chat
Content-Type: application/json

{
  "message": "I want to become an engineer but my math is weak",
  "user_id": 42,
  "session_id": "abc123"
}

Response 200:
{
  "success": true,
  "response": "Dear Aryan, your ambition for Engineering is admirable...",
  "intent": "confusion_pcm_vs_commerce",
  "confidence": 0.78,
  "follow_ups": ["Tell me about JEE eligibility", "How to improve Math?"],
  "dialogue_state": "normal",
  "active_entity": null
}

Quiz

GET /api/quiz/questions
→ Returns 30 RIASEC questions

POST /api/quiz/submit
{ "user_id": 42, "responses": {"R1": 4, "R2": 3, ..., "C5": 5} }
→ { "riasec_code": "IRS", "scores": {...}, "suggested_careers": [...] }
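
One way the submit endpoint could turn responses into a RIASEC code is sketched below. It assumes question IDs start with the dimension letter (R1 to C5), answers are on a 1 to 5 scale, and ties are broken in R-I-A-S-E-C order. It returns raw sums; any rescaling to the 0 to 100 range stored in student_profiles is left out because the README does not describe it.

```php
<?php
declare(strict_types=1);

/**
 * Sum answers per RIASEC dimension and return the top-3 code.
 *
 * @param array<string, int> $responses e.g. ['R1' => 4, ..., 'C5' => 5]
 * @return array{scores: array<string, int>, riasec_code: string}
 */
function scoreRiasec(array $responses): array
{
    $scores = array_fill_keys(['R', 'I', 'A', 'S', 'E', 'C'], 0);
    foreach ($responses as $questionId => $answer) {
        $dimension = strtoupper(substr((string) $questionId, 0, 1));
        if (isset($scores[$dimension])) {
            $scores[$dimension] += $answer;
        }
    }
    $ranked = $scores;
    arsort($ranked); // stable since PHP 8.0, so ties keep R-I-A-S-E-C order
    return [
        'scores'      => $scores,
        'riasec_code' => implode('', array_slice(array_keys($ranked), 0, 3)),
    ];
}

// Build 30 sample answers: the same answer for all five questions of a dimension.
$responses = [];
foreach (['R' => 4, 'I' => 5, 'A' => 1, 'S' => 3, 'E' => 2, 'C' => 1] as $dim => $answer) {
    for ($q = 1; $q <= 5; $q++) {
        $responses[$dim . $q] = $answer;
    }
}
echo scoreRiasec($responses)['riasec_code'], PHP_EOL;
```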

Admin

# All admin routes require: Authorization: Bearer <token>

GET  /api/admin/ai/status       → Model info, queue counts
POST /api/admin/ai/train        → Full retrain from scratch
POST /api/admin/ai/heal         → partialFit with resolved queue items
GET  /api/admin/ai/queue        → List training queue (filter by status)
PUT  /api/admin/ai/queue/{id}   → Assign intent, set resolved
DELETE /api/admin/ai/queue/{id} → Reject (exclude from training)
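
The bearer-token check that guards these routes could look roughly like the following. This is a framework-free sketch, not the actual AdminApiFilter; `isAuthorizedAdmin` is a hypothetical helper, and in the real filter the header would come from the CI4 request object and the token from `pharos.adminToken`.

```php
<?php
declare(strict_types=1);

/** Return true when the Authorization header carries the expected bearer token. */
function isAuthorizedAdmin(?string $authorizationHeader, string $expectedToken): bool
{
    if ($expectedToken === '' || $authorizationHeader === null) {
        return false; // no token configured, or no header sent
    }
    if (!preg_match('/^Bearer\s+(\S+)$/i', trim($authorizationHeader), $m)) {
        return false; // wrong scheme or malformed header
    }
    return hash_equals($expectedToken, $m[1]); // constant-time comparison
}

var_dump(isAuthorizedAdmin('Bearer s3cret-token', 's3cret-token'));
```

Refusing all requests when no token is configured avoids an accidentally open admin API on a fresh install, and `hash_equals()` avoids leaking token contents through comparison timing.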

⚙️ Configuration Reference

.env Key	Default	Description
`pharos.confidenceGate`	`0.45`	ML confidence threshold
`pharos.modelPath`	`writable/models/`	Model storage directory
`pharos.healBatchSize`	`100`	Samples per self-healing run
`pharos.healMinSamples`	`20`	Minimum before heal triggers
`pharos.adminToken`	(none)	Bearer token for /api/admin/*

📈 Scaling the Knowledge Base

The system is designed to scale from ~15 templates to thousands:

1. Add templates via SQL: each new row in response_templates is immediately available
2. Tune priority_weight: higher weight = preferred by MCDA
3. Add intent classes: add seed corpus samples in AutonomousLearningService::getSeedCorpus(), then retrain
4. Run the heal loop: as students use the system, low-confidence queries flow to training_queue, humans review and resolve them, and php spark pharos:heal absorbs them

Recommended template distribution:

Intent	Min templates	Recommended
exam_stress	5	20+
query_jee_eligibility	5	30+ (grade×stream combos)
confusion_pcm_vs_commerce	5	25+ (score range combos)
career_explore_riasec	6	36+ (one per RIASEC code)
search_college	5	50+ (stream×city combos)

🛠️ Development Commands

# Full setup (first run)
composer install && php spark key:generate && php spark migrate \
  && php spark db:seed PharosTemplateSeeder && php spark pharos:train

# Development server
php spark serve --port=8080

# Train model from scratch
php spark pharos:train --verbose

# Run self-healing loop
php spark pharos:heal --batch=50

# Dry-run heal (preview without training)
php spark pharos:heal --dry-run

# Database operations
php spark migrate:refresh          # Drop and re-run all migrations
php spark db:seed PharosTemplateSeeder  # Re-seed templates

# Clear caches
php spark cache:clear

📄 License

MIT License — see LICENSE file.

Built with ❤️ for every Indian student standing at the crossroads of Class 10 and Class 11.
