"Pharos" — the ancient lighthouse of Alexandria. Here, a lighthouse for Indian students navigating the sea of career choices.
A fully self-contained, zero-external-API career counseling chatbot for Indian Class 9–12 students. Built on CodeIgniter 4 + Rubix ML, Pharos is a dynamic Expert System that reasons with psychometric data, machine-learned intent classification, and a database of thousands of empathetic response templates — no OpenAI, no Anthropic, no third-party AI services.
┌─────────────────────────────────────────────────────────────────────┐
│ BROWSER (SPA) │
│ HTML / Vanilla JS — Claude-style chat UI (dark navy + saffron) │
└─────────────────────────┬───────────────────────────────────────────┘
│ HTTPS JSON
┌─────────────────────────▼───────────────────────────────────────────┐
│ CODEIGNITER 4 (PHP 8.2+) │
│ │
│ ChatController ──▶ AutonomousLearningService (Rubix ML NB) │
│ │ │ │
│ │ [Confidence Gate ≥ 0.45] │
│ │ │ │
│ ▼ ▼ PASS FAIL ▼ │
│ DialogueStateManager GuidanceEngineService TrainingQueueModel │
│ (CI4 Session FSM) (MCDA Scoring Engine) (human review) │
│ │ │ │
│ │ ResponseTemplateModel │
│ │ (knowledge base) │
│ ▼ ▼ │
│ StudentProfileModel │
│ (RIASEC + academics + session state) │
│ │
│ AIController ──▶ /api/admin/ai/* (train, heal, queue mgmt) │
│ QuizController ──▶ /api/quiz/* (RIASEC 30-question assessment) │
└──────────────────────────────────┬──────────────────────────────────┘
│
┌────────▼─────────┐
│ MySQL / MariaDB │
│ 5 core tables │
└──────────────────┘
| Dependency | Version | Notes |
|---|---|---|
| PHP | ≥ 8.2 | ext-intl, ext-mbstring, ext-json required |
| Composer | ≥ 2.5 | Package manager |
| MySQL / MariaDB | ≥ 8.0 / 10.6 | Main database |
| PHP-CLI | same as above | For Spark commands |
git clone https://github.com/your-org/pharos.git
cd pharos
composer install
cp .env.example .env
# Edit .env — set database credentials and encryption key
php spark key:generate
Minimum .env settings:
CI_ENVIRONMENT = development
database.default.database = pharos_db
database.default.username = pharos_user
database.default.password = your_password
# Create the database first
mysql -u root -p -e "CREATE DATABASE pharos_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
# Run migrations (creates 5 tables + session table)
php spark migrate
# Seed the knowledge base (~15 templates across 8 intents)
php spark db:seed PharosTemplateSeeder
# Trains NaiveBayes on ~200 seed samples, saves to writable/models/
php spark pharos:train
Expected output:
╔══════════════════════════════════════════════════════╗
║ PHAROS — Intent Classifier Training ║
╚══════════════════════════════════════════════════════╝
▶ Building pipeline: TextNormalizer → StopWordFilter → WordCountVectorizer(NGram 1-2) → TfIdfTransformer → NaiveBayes
▶ Loading seed corpus (~200+ samples, 11 intent classes)...
╔══════════════════════════════════════════════════════╗
║ ✓ Training Complete ║
╚══════════════════════════════════════════════════════╝
Samples trained: 212
Intent classes: 11
Training time: 1.84s
Model saved to: /var/www/pharos/writable/models/pharos_intent_classifier.rbx
php spark serve
# Open: http://localhost:8080
pharos/
├── app/
│ ├── Commands/
│ │ ├── PharosTrain.php # php spark pharos:train
│ │ └── PharosHeal.php # php spark pharos:heal
│ ├── Config/
│ │ ├── Filters.php # Registers AdminApiFilter alias
│ │ └── Routes.php # All API routes + SPA catch-all
│ ├── Controllers/
│ │ ├── HomeController.php # Serves SPA shell
│ │ ├── ChatController.php # Core conversation pipeline
│ │ ├── QuizController.php # RIASEC 30-question assessment
│ │ └── AIController.php # ML admin API (train/heal/queue)
│ ├── Entities/
│ │ └── StudentProfile.php # CI4 Entity with computed RIASEC/academic methods
│ ├── Filters/
│ │ └── AdminApiFilter.php # Bearer token auth for /api/admin/*
│ ├── Models/
│ │ ├── StudentProfileModel.php # RIASEC + academic score persistence
│ │ ├── ResponseTemplateModel.php# Knowledge base + MCDA query methods
│ │ ├── ChatMessageModel.php # Conversation history
│ │ └── TrainingQueueModel.php # Self-healing training pipeline
│ ├── Services/
│ │ ├── AutonomousLearningService.php # Rubix ML NaiveBayes engine
│ │ ├── DialogueStateManager.php # Session FSM + slot-filling
│ │ └── GuidanceEngineService.php # MCDA scoring + template injection
│ └── Views/
│ └── pharos/app.php # Full SPA (HTML/CSS/JS, no framework)
├── database/
│ └── migrations/
│ ├── 2024-01-01-000001_CreatePharosTables.php
│ ├── 2024-01-01-000002_CreateSessionTable.php
│ └── PharosTemplateSeeder.php
├── writable/
│ └── models/ # Serialized .rbx model files (gitignored)
├── composer.json
├── .env.example
└── README.md
The profile is the context window of the Expert System. Every ML decision is conditioned on its values.
Key computed properties on StudentProfile:
| Method | Description |
|---|---|
| `getRiasecCode()` | Returns top-3 RIASEC dimensions by score (e.g., "IRS") |
| `getDominantRiasec()` | Returns single dominant dimension letter |
| `getOverallAcademicScore()` | Weighted avg: Math×0.30, Sci×0.30, Eng×0.25, Com×0.15 |
| `hasStemLearningGap()` | true if STEM-aspired AND (Math<60 OR Science<60) |
| `getAcademicLabel()` | "High Achiever" / "Average" / "Needs Support" etc. |
Each row represents one response the system can give. Key columns:
| Column | Type | Example |
|---|---|---|
| `intent_id` | VARCHAR | exam_stress |
| `required_riasec` | VARCHAR | I,IA,IR (comma-separated codes) |
| `academic_condition` | VARCHAR | math < 60 |
| `grade_filter` | VARCHAR | 11,12 |
| `stream_filter` | VARCHAR | PCM |
| `eq_filter` | ENUM | HIGH |
| `priority_weight` | INT (1-100) | 85 |
| `template_text` | TEXT | Dear {student_name}, your score in... |
| `follow_up_prompts` | JSON | ["Tell me more about JEE", "What about NEET?"] |
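A rule such as `math < 60` has to be checked against the student's scores before a template can be considered. The following is a minimal sketch of such a check, assuming the rule grammar is always `<subject> <operator> <number>` (only the `math < 60` form appears in this README). The function name `evaluateAcademicCondition` and the score-map shape are hypothetical, not taken from the Pharos source.

```php
<?php
/**
 * Evaluate a rule such as "math < 60" against a map of subject => score.
 * Returns null for an empty rule so callers can treat it as "no constraint".
 * Assumption: rules follow the "<subject> <operator> <number>" shape.
 */
function evaluateAcademicCondition(string $condition, array $scores): ?bool
{
    $condition = trim($condition);
    if ($condition === '') {
        return null;
    }
    // Two-character operators are listed first so "<=" is not read as "<".
    if (!preg_match('/^(\w+)\s*(<=|>=|<|>|=)\s*(\d+(?:\.\d+)?)$/', $condition, $m)) {
        throw new InvalidArgumentException("Unsupported condition: {$condition}");
    }
    [, $subject, $operator, $threshold] = $m;
    if (!array_key_exists($subject, $scores)) {
        throw new InvalidArgumentException("Unknown subject: {$subject}");
    }
    $value     = (float) $scores[$subject];
    $threshold = (float) $threshold;

    return match ($operator) {
        '<'  => $value < $threshold,
        '<=' => $value <= $threshold,
        '>'  => $value > $threshold,
        '>=' => $value >= $threshold,
        '='  => $value == $threshold,
    };
}

// Usage: a student with math = 55 satisfies "math < 60".
var_dump(evaluateAcademicCondition('math < 60', ['math' => 55]));
```

Returning `null` for an empty rule keeps "no condition" distinct from "condition failed", which matters when the scoring stage treats the two cases differently.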
Raw User Input
│
▼
TextNormalizer → lowercase, unicode normalization
│
▼
StopWordFilter → removes 'the', 'is', 'mera', 'kya', etc.
│
▼
WordCountVectorizer → NGram(1,2) — builds unigram + bigram vocabulary
(max 10,000 tokens) e.g., ["jee", "main", "jee main", "eligibility"]
│
▼
TfIdfTransformer → term frequency × inverse document frequency
Reduces weight of high-frequency generic words
│
▼
NaiveBayes(α=1.0) → Laplace-smoothed multinomial NB classifier
│
▼
proba() → Probability distribution over all intent classes
│
├── max(proba) ≥ 0.45 ──▶ Intent classified → GuidanceEngineService
│
└── max(proba) < 0.45 ──▶ Saved to training_queue → out_of_scope response
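The first three stages of the pipeline can be illustrated without Rubix ML. This is a plain-PHP sketch of the idea, not the library's implementation: `extractNgrams` and its stop-word argument are hypothetical names, and the sketch only lowercases ASCII.

```php
<?php
/**
 * Lowercase, drop stop words, then emit unigrams followed by bigrams.
 * Illustrative only: Pharos delegates these steps to Rubix ML transformers.
 *
 * @param string[] $stopWords
 * @return string[]
 */
function extractNgrams(string $text, array $stopWords = []): array
{
    $normalized = strtolower(trim($text));
    // Split on anything that is not a letter or digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $normalized, -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array_values(array_diff($tokens, $stopWords));

    $ngrams = $tokens; // unigrams
    for ($i = 0; $i < count($tokens) - 1; $i++) {
        $ngrams[] = $tokens[$i] . ' ' . $tokens[$i + 1]; // bigrams
    }
    return $ngrams;
}

// Usage
print_r(extractNgrams('What is the JEE Main eligibility?', ['what', 'is', 'the']));
// ["jee", "main", "eligibility", "jee main", "main eligibility"]
```

Because stop words are removed before bigrams are formed, "jee main" survives as a single feature, which is what lets the classifier separate exam-specific queries from generic ones.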
| Intent ID | Description | Example Queries |
|---|---|---|
| `confusion_pcm_vs_commerce` | PCM vs Commerce choice | "Should I take science or commerce?" |
| `query_jee_eligibility` | JEE eligibility/dates/syllabus | "What is JEE Mains eligibility?" |
| `query_neet_eligibility` | NEET queries | "Can I give NEET after Class 12?" |
| `exam_stress` | Anxiety, burnout, pressure | "I can't handle the JEE pressure" |
| `search_college` | College search/ranking | "Best NIT for computer science?" |
| `career_explore_riasec` | Career exploration | "What career suits me?" |
| `query_scholarship` | Scholarship info | "What scholarships are available?" |
| `study_tips_request` | Study strategies | "How to study Physics effectively?" |
| `doubt_resolution` | Subject doubts | "I don't understand integration" |
| `parent_conflict` | Parent pressure | "My parents want me to be a doctor" |
| `out_of_scope` | Unrecognized/off-topic | Anything below 0.45 confidence |
// From AutonomousLearningService::classify()
$proba = $estimator->proba([$sample]);
arsort($proba[0]);
$topIntent = array_key_first($proba[0]);
$topConfidence = reset($proba[0]);
if ($topConfidence < self::CONFIDENCE_GATE) { // 0.45
// Save to training_queue for human review
$queueModel->insert([
'raw_query' => $query,
'session_context' => json_encode($context),
'status' => 'pending',
]);
return ['intent' => 'out_of_scope', 'confidence' => $topConfidence];
}
THEORY: Naive Bayes maintains per-class word count matrices.
partialFit() adds NEW counts ON TOP of existing counts.
It does NOT reset the model — it shifts the distributions.
EXAMPLE:
Before heal:
P("jee main" | query_jee_eligibility) = 0.043
P("jee main" | exam_stress) = 0.008
After healing with 50 new resolved "exam_stress" samples
that contain "jee main":
P("jee main" | exam_stress) shifts to 0.014
P("jee main" | query_jee_eligibility) stays ~0.043
Net effect: classifier better distinguishes stress queries
about JEE vs factual eligibility queries about JEE.
VOCABULARY NOTE: The NGram vocabulary is FROZEN after initial training.
Any new n-grams in the healing batch are silently ignored.
The model does NOT expand its vocabulary mid-life.
This is intentional — vocabulary drift causes dimension mismatch errors.
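The count-shifting behaviour and the frozen vocabulary can be demonstrated with a toy model. This sketch is an assumption-laden illustration in plain PHP, not Rubix ML code: `ToyCountModel` and its methods are hypothetical, and the probabilities it prints are toy values, not the 0.043 / 0.014 figures quoted above.

```php
<?php
// Toy per-class term counter with Laplace smoothing and a frozen vocabulary.
final class ToyCountModel
{
    /** @var array<string, array<string, int>> label => term => count */
    private array $counts = [];
    /** @var array<string, bool> terms fixed at initial training */
    private array $vocabulary = [];

    public function __construct(private float $alpha = 1.0)
    {
    }

    /** Initial training: fixes the vocabulary, then counts. */
    public function train(array $samples): void
    {
        foreach ($samples as [$terms]) {
            foreach ($terms as $term) {
                $this->vocabulary[$term] = true;
            }
        }
        $this->partial($samples);
    }

    /** Incremental update: adds counts on top of the existing ones. */
    public function partial(array $samples): void
    {
        foreach ($samples as [$terms, $label]) {
            foreach ($terms as $term) {
                if (!isset($this->vocabulary[$term])) {
                    continue; // unseen n-gram: ignored, vocabulary stays frozen
                }
                $this->counts[$label][$term] = ($this->counts[$label][$term] ?? 0) + 1;
            }
        }
    }

    /** Smoothed P(term | label). */
    public function likelihood(string $term, string $label): float
    {
        $classCounts = $this->counts[$label] ?? [];
        $numerator   = ($classCounts[$term] ?? 0) + $this->alpha;
        return $numerator / (array_sum($classCounts) + $this->alpha * count($this->vocabulary));
    }

    public function vocabularySize(): int
    {
        return count($this->vocabulary);
    }
}

$model = new ToyCountModel();
$model->train([
    [['jee main', 'eligibility'], 'query_jee_eligibility'],
    [['pressure', 'stress'], 'exam_stress'],
]);
printf("before: %.4f\n", $model->likelihood('jee main', 'exam_stress'));
$model->partial([[['jee main', 'pressure', 'burnout'], 'exam_stress']]);
printf("after:  %.4f\n", $model->likelihood('jee main', 'exam_stress'));
```

In this toy run the likelihood of "jee main" under `exam_stress` rises after the incremental update, the likelihood under `query_jee_eligibility` is untouched, and "burnout" is dropped because it was not in the original vocabulary.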
Cron Setup (production):
# /etc/cron.d/pharos-heal
0 * * * * www-data /usr/bin/php /var/www/pharos/spark pharos:heal --batch=100 >> /var/log/pharos_heal.log 2>&1
{
"dialogue_state": "normal | slot_filling | awaiting_quiz | follow_up",
"active_entity": "JEE Mains",
"entity_history": ["JEE Mains", "NIT Trichy"],
"active_intent": "query_jee_eligibility",
"slot": {
"type": "city",
"pending_intent": "search_college",
"filled_value": null
},
"conversation_turn": 7,
"topic_stack": ["exam_stress", "query_jee_eligibility"]
}
Turn 1: "Tell me about JEE Mains"
→ active_entity = "JEE Mains"
→ Response: "JEE Mains is a national entrance exam..."
Turn 2: "What is the syllabus?" ← only 4 words, no entity
→ injectEntityContext() detects: query < 6 words AND no new entity
→ Injects: "JEE Mains What is the syllabus?"
→ ML now correctly classifies as query_jee_eligibility
→ Response gives JEE syllabus
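The two-turn example above follows a simple rule: prefix the active entity when the query is short and names no entity of its own. A minimal sketch of that rule follows. The real `injectEntityContext()` signature is not shown in this README, so the parameters here (including the `$knownEntities` list used for entity detection) are assumptions.

```php
<?php
/**
 * Prefix the active entity onto short follow-up queries that mention no entity.
 * Assumption: entity detection is a case-insensitive match against a known list.
 *
 * @param string[] $knownEntities
 */
function injectEntityContext(string $query, ?string $activeEntity, array $knownEntities): string
{
    if ($activeEntity === null) {
        return $query; // nothing to carry over
    }
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) >= 6) {
        return $query; // long queries are assumed to carry their own context
    }
    foreach ($knownEntities as $entity) {
        if (stripos($query, $entity) !== false) {
            return $query; // the student named a new entity
        }
    }
    return $activeEntity . ' ' . $query;
}

// Usage
echo injectEntityContext('What is the syllabus?', 'JEE Mains', ['JEE Mains', 'NEET']), PHP_EOL;
// JEE Mains What is the syllabus?
```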
Student: "Which college should I join?"
→ ML: intent = search_college
→ DialogueStateManager: intentRequiresSlot('search_college') = true
→ Shift to slot_filling mode
→ Return: "Sure! Which city are you looking in? 🏙️"
Student: "Mumbai"
→ State is slot_filling
→ resolveSlot("Mumbai") → slot.filled_value = "Mumbai"
→ Restore intent = search_college, state = normal
→ GuidanceEngineService gets slot_value = "Mumbai"
→ Template injection: {slot_value} = "Mumbai"
→ Response: "Here are top colleges in Mumbai for PCM students..."
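The slot-filling exchange above is a two-state machine. The sketch below models only that transition, under stated assumptions: the class name, the `handle()` method, and the intent-to-slot map are hypothetical, and the real `DialogueStateManager` persists its state in the CI4 session rather than in an object property.

```php
<?php
// Minimal two-state slot-filling machine: normal <-> slot_filling.
final class SlotFillingSketch
{
    private const SLOT_FOR_INTENT = ['search_college' => 'city'];
    private const SLOT_PROMPTS    = ['city' => 'Sure! Which city are you looking in?'];

    private string $dialogueState = 'normal';
    private ?array $slot = null;

    /** $intent is the classifier output; it is ignored while a slot is pending. */
    public function handle(string $message, string $intent = ''): array
    {
        if ($this->dialogueState === 'slot_filling') {
            // The whole message is taken as the slot value; the pending intent is restored.
            $pending             = $this->slot['pending_intent'];
            $this->dialogueState = 'normal';
            $this->slot          = null;
            return ['action' => 'respond', 'intent' => $pending, 'slot_value' => trim($message)];
        }

        $slotType = self::SLOT_FOR_INTENT[$intent] ?? null;
        if ($slotType !== null) {
            $this->dialogueState = 'slot_filling';
            $this->slot = ['type' => $slotType, 'pending_intent' => $intent, 'filled_value' => null];
            return ['action' => 'ask_slot', 'prompt' => self::SLOT_PROMPTS[$slotType]];
        }

        return ['action' => 'respond', 'intent' => $intent, 'slot_value' => null];
    }

    public function getDialogueState(): string
    {
        return $this->dialogueState;
    }
}

// Usage
$dialogue = new SlotFillingSketch();
print_r($dialogue->handle('Which college should I join?', 'search_college'));
print_r($dialogue->handle('Mumbai'));
```

Skipping classification while a slot is pending is the key design choice: a bare answer like "Mumbai" would otherwise fall below the confidence gate and be queued as out of scope.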
When the ML returns an intent, the Guidance Engine must select ONE template from potentially dozens of candidates for that intent. It does this using a weighted MCDA scoring model.
Criteria and Weights:
| Criterion | Symbol | Weight | Description |
|---|---|---|---|
| Template Priority | C₁ | 0.40 (40%) | The hand-tuned quality weight set by content authors |
| RIASEC Match | C₂ | 0.30 (30%) | How well required_riasec matches student's code |
| Academic Condition | C₃ | 0.20 (20%) | Does the student satisfy the template's academic rule? |
| Specificity Bonus | C₄ | 0.10 (10%) | Templates with more constraints are more personalized |
Scoring Formula (per template):
Score(T) = (C₁ × 0.40) + (C₂ × 0.30) + (C₃ × 0.20) + (C₄ × 0.10)
Where:
C₁ = template.priority_weight / 100
C₂ = RIASEC positional overlap score:
Compare required_riasec against student's top-3 code.
Position weights: [0.5, 0.3, 0.2] (dominant letter worth most)
Each matching position contributes its weight to C₂.
Empty required_riasec → 0.5 (neutral — template is universal)
C₃ = Academic condition evaluation:
Empty condition → 0.5 (neutral)
Condition PASSES (e.g., student math=55, condition="math < 60") → 1.0
Condition FAILS → 0.0
C₄ = Specificity density:
has_riasec_filter × 0.4 +
has_academic_condition × 0.3 +
has_grade_filter × 0.2 +
has_stream_filter × 0.1
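The formula and criteria above can be traced with a short sketch. It is an illustration, not the `GuidanceEngineService` code: the function names are hypothetical, and two details are assumptions because the README leaves them open. First, a letter "matches" when it sits at the same position in both codes. Second, when `required_riasec` lists several codes, the best-scoring one is used.

```php
<?php
const CRITERION_WEIGHTS = ['priority' => 0.40, 'riasec' => 0.30, 'academic' => 0.20, 'specificity' => 0.10];
const POSITION_WEIGHTS  = [0.5, 0.3, 0.2];

/** C2: best positional overlap between any required code and the student's top-3 code. */
function riasecMatch(string $requiredRiasec, string $studentCode): float
{
    $codes = array_filter(array_map('trim', explode(',', $requiredRiasec)), fn (string $c): bool => $c !== '');
    if ($codes === []) {
        return 0.5; // empty filter: template is universal
    }
    $best = 0.0;
    foreach ($codes as $code) {
        $score = 0.0;
        foreach (POSITION_WEIGHTS as $i => $weight) {
            if (isset($code[$i], $studentCode[$i]) && $code[$i] === $studentCode[$i]) {
                $score += $weight;
            }
        }
        $best = max($best, $score);
    }
    return $best;
}

/** $conditionPasses is null when the template has no academic_condition. */
function scoreTemplate(array $template, string $studentCode, ?bool $conditionPasses): float
{
    $c1 = $template['priority_weight'] / 100;
    $c2 = riasecMatch($template['required_riasec'] ?? '', $studentCode);
    $c3 = $conditionPasses === null ? 0.5 : ($conditionPasses ? 1.0 : 0.0);
    $c4 = (($template['required_riasec'] ?? '') !== '' ? 0.4 : 0.0)
        + (($template['academic_condition'] ?? '') !== '' ? 0.3 : 0.0)
        + (($template['grade_filter'] ?? '') !== '' ? 0.2 : 0.0)
        + (($template['stream_filter'] ?? '') !== '' ? 0.1 : 0.0);

    return $c1 * CRITERION_WEIGHTS['priority'] + $c2 * CRITERION_WEIGHTS['riasec']
         + $c3 * CRITERION_WEIGHTS['academic'] + $c4 * CRITERION_WEIGHTS['specificity'];
}

// Usage: a fully constrained template vs. a universal one, for a student coded "IRS".
$specific  = ['priority_weight' => 85, 'required_riasec' => 'I,IA,IR', 'academic_condition' => 'math < 60',
              'grade_filter' => '11,12', 'stream_filter' => 'PCM'];
$universal = ['priority_weight' => 85];
printf("%.2f\n", scoreTemplate($specific, 'IRS', true));  // 0.88
printf("%.2f\n", scoreTemplate($universal, 'IRS', null)); // 0.59
```

With equal priority, the constrained template scores 0.34 + 0.24 + 0.20 + 0.10 = 0.88 against 0.34 + 0.15 + 0.10 + 0.00 = 0.59 for the universal one, which is how the weighting favours personalized responses.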
Conflict Resolution Override:
IF student.hasStemLearningGap() IS TRUE:
The system detects a conflict:
"Student wants Engineering (STEM) but has Math < 60 OR Science < 60"
Apply override bonuses/penalties:
Templates tagged 'learning_gap' category: score += LEARNING_GAP_BONUS (0.25)
Templates tagged 'passion_match' category: score -= PASSION_MATCH_PENALTY (0.15)
EFFECT: "Learning gap" templates — which say things like:
"Your ambition for Engineering is real, {student_name}. But Math at {math_score}%
needs attention first. Here is your 90-day recovery plan..."
...will outrank "passion match" templates that blindly encourage
the student toward Engineering without addressing the academic gap,
unless the passion_match template leads by more than 0.40 on base score.
The bonus and penalty together create a 0.40 swing:
A learning_gap template with priority=70 gets: ... + 0.25 = boosted
A passion_match template with priority=80 gets: ... - 0.15 = penalized
Net: with the other criteria equal, learning_gap wins even with lower base priority.
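The override arithmetic can be checked with a small sketch. The constants are taken from the text above; the function name `applyConflictOverride` and the worked base scores are illustrative assumptions (both templates are taken to be neutral on RIASEC and academic condition, with no filters, so each base score is C₁ × 0.40 + 0.25).

```php
<?php
const LEARNING_GAP_BONUS    = 0.25;
const PASSION_MATCH_PENALTY = 0.15;

/** Adjust a template's base MCDA score when the student has a STEM learning gap. */
function applyConflictOverride(float $baseScore, string $category, bool $hasStemLearningGap): float
{
    if (!$hasStemLearningGap) {
        return $baseScore;
    }
    return match ($category) {
        'learning_gap'  => $baseScore + LEARNING_GAP_BONUS,
        'passion_match' => $baseScore - PASSION_MATCH_PENALTY,
        default         => $baseScore,
    };
}

// priority 70 -> 0.28 + 0.25 = 0.53; priority 80 -> 0.32 + 0.25 = 0.57
$learningGap  = applyConflictOverride(0.53, 'learning_gap', true);  // 0.78
$passionMatch = applyConflictOverride(0.57, 'passion_match', true); // 0.42
printf("learning_gap %.2f vs passion_match %.2f\n", $learningGap, $passionMatch);
```

The two adjustments add up to a 0.40 swing, so a passion_match template would need a base-score lead of more than 0.40 to stay ahead of a learning_gap template for a student with a gap.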
Once the winner template is selected, all {tag} placeholders are replaced with live student data:
| Tag | Source | Example Value |
|---|---|---|
| `{student_name}` | StudentProfile.name | "Aryan" |
| `{grade}` | StudentProfile.grade | "11" |
| `{stream}` | StudentProfile.stream | "PCM" |
| `{top_riasec}` | StudentProfile.getDominantRiasec() | "I" |
| `{riasec_code}` | StudentProfile.getRiasecCode() | "IRS" |
| `{aspired_career}` | StudentProfile.aspired_career | "Engineer" |
| `{math_score}` | StudentProfile.math_score | "54" |
| `{overall_score}` | StudentProfile.getOverallAcademicScore() | "61.5" |
| `{academic_label}` | StudentProfile.getAcademicLabel() | "Average" |
| `{active_entity}` | DialogueStateManager.active_entity | "JEE Mains" |
| `{slot_value}` | DialogueStateManager.slot.filled_value | "Mumbai" |
| `{preferred_location}` | StudentProfile.preferred_location | "Delhi" |
| `{eq_level}` | StudentProfile.eq_level | "HIGH" |
| Column | Type | Description |
|---|---|---|
| id | BIGINT PK | Auto-increment |
| user_id | BIGINT | Foreign key to users table |
| name | VARCHAR(100) | Student's name |
| grade | TINYINT | 9, 10, 11, or 12 |
| stream | VARCHAR(20) | PCM, PCB, Commerce, Arts, Undecided |
| math_score | DECIMAL(5,2) | 0.00 – 100.00 |
| science_score | DECIMAL(5,2) | |
| english_score | DECIMAL(5,2) | |
| commerce_score | DECIMAL(5,2) | |
| riasec_r through riasec_c | TINYINT | 0 – 100 per dimension |
| eq_level | ENUM | LOW, MEDIUM, HIGH |
| aspired_career | VARCHAR(100) | Extracted from dialogue |
| preferred_location | VARCHAR(100) | |
| conversation_count | INT | For adaptive behavior |
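The computed properties that `StudentProfile` derives from these columns (top-3 RIASEC code, weighted academic score, STEM learning gap) can be sketched as a plain class. This is a stand-alone illustration, not the CI4 Entity itself: the class name is hypothetical, and STEM aspiration is assumed to be already resolved to a boolean, since the README does not say how it is derived from `aspired_career`.

```php
<?php
// Hypothetical stand-alone sketch; the real StudentProfile is a CI4 Entity.
final class StudentProfileSketch
{
    /** @param array<string, int> $riasec Scores keyed by dimension letter (R, I, A, S, E, C). */
    public function __construct(
        private array $riasec,
        private float $math,
        private float $science,
        private float $english,
        private float $commerce,
        private bool $aspiresToStem, // assumption: resolved elsewhere
    ) {
    }

    public function getRiasecCode(): string
    {
        $scores = $this->riasec;
        arsort($scores); // highest score first, keys preserved
        return implode('', array_slice(array_keys($scores), 0, 3));
    }

    public function getOverallAcademicScore(): float
    {
        $weighted = $this->math * 0.30 + $this->science * 0.30
                  + $this->english * 0.25 + $this->commerce * 0.15;
        return round($weighted, 1);
    }

    public function hasStemLearningGap(): bool
    {
        return $this->aspiresToStem && ($this->math < 60 || $this->science < 60);
    }
}

// Usage
$profile = new StudentProfileSketch(
    ['R' => 70, 'I' => 90, 'A' => 20, 'S' => 55, 'E' => 10, 'C' => 30],
    54.0, 60.0, 72.0, 62.0, true
);
echo $profile->getRiasecCode(), PHP_EOL;           // IRS
echo $profile->getOverallAcademicScore(), PHP_EOL; // 61.5
var_dump($profile->hasStemLearningGap());          // bool(true)
```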
| Column | Type | Description |
|---|---|---|
| id | BIGINT PK | |
| intent_id | VARCHAR(100) | Maps to ML intent class |
| template_key | VARCHAR(100) | Unique slug |
| category | VARCHAR(50) | learning_gap, passion_match, etc. |
| required_riasec | VARCHAR(50) | Comma-separated codes or empty |
| academic_condition | VARCHAR(100) | e.g., math < 60 |
| grade_filter | VARCHAR(20) | e.g., 11,12 |
| stream_filter | VARCHAR(20) | e.g., PCM |
| eq_filter | VARCHAR(10) | e.g., HIGH |
| priority_weight | TINYINT | 1 – 100 |
| template_text | TEXT | Response with {injection_tags} |
| follow_up_prompts | JSON | Array of suggested follow-ups |
| usage_count | INT | Tracks which templates are used most |
| Column | Type | Description |
|---|---|---|
| id | BIGINT PK | |
| raw_query | TEXT | User's original message |
| assigned_intent | VARCHAR(100) | Set by admin reviewer |
| session_context | JSON | Session state at time of failure |
| status | ENUM | pending → resolved → trained / rejected |
| Column | Type | Description |
|---|---|---|
| id | BIGINT PK | |
| session_id | VARCHAR(128) | CI4 session ID |
| user_id | BIGINT | |
| role | ENUM | user, assistant |
| content | TEXT | Message text |
| intent | VARCHAR(100) | Detected intent (for assistant messages) |
| confidence | DECIMAL(5,4) | ML confidence score |
| template_id | BIGINT | Which template was used |
| metadata | JSON | MCDA scores, entity, slot data |
POST /api/chat
Content-Type: application/json
{
"message": "I want to become an engineer but my math is weak",
"user_id": 42,
"session_id": "abc123"
}
Response 200:
{
"success": true,
"response": "Dear Aryan, your ambition for Engineering is admirable...",
"intent": "confusion_pcm_vs_commerce",
"confidence": 0.78,
"follow_ups": ["Tell me about JEE eligibility", "How to improve Math?"],
"dialogue_state": "normal",
"active_entity": null
}
GET /api/quiz/questions
→ Returns 30 RIASEC questions
POST /api/quiz/submit
{ "user_id": 42, "responses": {"R1": 4, "R2": 3, ..., "C5": 5} }
→ { "riasec_code": "IRS", "scores": {...}, "suggested_careers": [...] }
# All admin routes require: Authorization: Bearer <token>
GET /api/admin/ai/status → Model info, queue counts
POST /api/admin/ai/train → Full retrain from scratch
POST /api/admin/ai/heal → partialFit with resolved queue items
GET /api/admin/ai/queue → List training queue (filter by status)
PUT /api/admin/ai/queue/{id} → Assign intent, set resolved
DELETE /api/admin/ai/queue/{id} → Reject (exclude from training)
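The bearer-token requirement comes down to one comparison. The sketch below shows that comparison in isolation, assuming the expected token comes from `pharos.adminToken`. The function name is hypothetical and the actual `AdminApiFilter` may differ.

```php
<?php
/**
 * Check an Authorization header against the configured admin token.
 * An empty configured token denies every request instead of allowing all.
 */
function isAuthorizedAdminRequest(?string $authorizationHeader, string $expectedToken): bool
{
    if ($expectedToken === '' || $authorizationHeader === null) {
        return false;
    }
    if (!preg_match('/^Bearer\s+(\S+)$/i', trim($authorizationHeader), $m)) {
        return false;
    }
    // hash_equals() compares in constant time, which avoids leaking the token via timing.
    return hash_equals($expectedToken, $m[1]);
}

// Usage
var_dump(isAuthorizedAdminRequest('Bearer s3cret-token', 's3cret-token')); // bool(true)
var_dump(isAuthorizedAdminRequest('Bearer wrong', 's3cret-token'));        // bool(false)
```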
| .env Key | Default | Description |
|---|---|---|
| `pharos.confidenceGate` | 0.45 | ML confidence threshold |
| `pharos.modelPath` | writable/models/ | Model storage directory |
| `pharos.healBatchSize` | 100 | Samples per self-healing run |
| `pharos.healMinSamples` | 20 | Minimum before heal triggers |
| `pharos.adminToken` | (none) | Bearer token for /api/admin/* |
The system is designed to scale from ~15 templates to thousands:
- Add templates via SQL: each new row in `response_templates` is immediately available
- Tune `priority_weight`: higher weight = preferred by MCDA
- Add intent classes: add seed corpus samples in `AutonomousLearningService::getSeedCorpus()`, then retrain
- Run the heal loop: as students use the system, low-confidence queries flow to `training_queue`, humans review and resolve them, and `php spark pharos:heal` absorbs them
Recommended template distribution:
| Intent | Min templates | Recommended |
|---|---|---|
| exam_stress | 5 | 20+ |
| query_jee_eligibility | 5 | 30+ (grade×stream combos) |
| confusion_pcm_vs_commerce | 5 | 25+ (score range combos) |
| career_explore_riasec | 6 | 36+ (one per RIASEC code) |
| search_college | 5 | 50+ (stream×city combos) |
# Full setup (first run)
composer install && php spark key:generate && php spark migrate \
&& php spark db:seed PharosTemplateSeeder && php spark pharos:train
# Development server
php spark serve --port 8080
# Train model from scratch
php spark pharos:train --verbose
# Run self-healing loop
php spark pharos:heal --batch=50
# Dry-run heal (preview without training)
php spark pharos:heal --dry-run
# Database operations
php spark migrate:refresh # Drop and re-run all migrations
php spark db:seed PharosTemplateSeeder # Re-seed templates
# Clear caches
php spark cache:clear
MIT License: see LICENSE file.
Built with ❤️ for every Indian student standing at the crossroads of Class 10 and Class 11.