[Infra] Add PoC of overfitting judging #140
Do you have a design doc or can you share more info on what this does?
Pull request overview
This pull request adds a proof-of-concept LLM-based overfitting detection system to the skill-validator infrastructure. The feature analyzes whether evaluation tests genuinely measure skill quality or merely test for memorization of specific skill vocabulary and syntax patterns.
Changes:
- Introduces an `OverfittingJudge` service that uses LLM prompting to classify rubric items and assertions into categories (outcome/technique/vocabulary for rubrics, broad/narrow for assertions) and computes an overfitting score
- Integrates overfitting checks into the validation pipeline as a parallel, non-blocking operation with configurable CLI flags (`--no-overfitting-check`, `--overfitting-fix`)
- Extends reporting to display overfitting scores in console output, markdown tables, and dashboard visualizations with severity-based icons and colors
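
To make the classify-then-score idea concrete, here is a minimal, language-neutral sketch in Python (the PR itself is C#). The category names come from the overview above; the weights, the thresholds, the `low` severity label, and the function names are all hypothetical, since the PR's actual scoring formula is not shown in this thread.

```python
# Hypothetical weights: how strongly each category suggests overfitting.
# The real OverfittingJudge may weight or combine categories differently.
RUBRIC_WEIGHTS = {"outcome": 0.0, "technique": 0.5, "vocabulary": 1.0}
ASSERTION_WEIGHTS = {"broad": 0.0, "narrow": 1.0}


def overfitting_score(rubric_labels, assertion_labels):
    """Average the per-item weights into a score in [0, 1]."""
    weights = [RUBRIC_WEIGHTS[label] for label in rubric_labels]
    weights += [ASSERTION_WEIGHTS[label] for label in assertion_labels]
    if not weights:
        return 0.0  # nothing to judge, so nothing is flagged
    return sum(weights) / len(weights)


def severity(score):
    """Map a score to a severity label (thresholds are assumptions)."""
    if score >= 0.6:
        return "high"
    if score >= 0.3:
        return "moderate"
    return "low"


score = overfitting_score(["outcome", "vocabulary"], ["narrow", "broad"])
print(score, severity(score))
```

In this sketch an eval whose rubric items are mostly labeled `vocabulary` and whose assertions are mostly `narrow` scores close to 1, which matches the intent described above: such tests reward repeating the skill's wording rather than producing a good outcome.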
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| eng/skill-validator/src/Models/Models.cs | Adds OverfittingSeverity enum, assessment record types (RubricOverfitAssessment, AssertionOverfitAssessment), OverfittingResult, and configuration options to SkillVerdict and ValidatorConfig |
| eng/skill-validator/src/Services/OverfittingJudge.cs | New service implementing LLM-based overfitting analysis with prompt engineering, score computation, retry logic, and optional fix generation |
| eng/skill-validator/src/Commands/ValidateCommand.cs | Adds CLI options and integrates overfitting check as a parallel task that doesn't fail the verdict on error |
| eng/skill-validator/src/Services/Reporter.cs | Extends console and markdown output to display overfitting scores with severity-based formatting; adds FormatOverfitCell helper |
| eng/skill-validator/tests/OverfittingJudgeTests.cs | Comprehensive test suite covering score computation, JSON parsing, severity mapping, prompt building, and reporter integration |
| eng/dashboard/generate-benchmark-data.ps1 | Adds overfitting severity and score fields to benchmark entries when moderate or high overfitting is detected |
| eng/dashboard/dashboard.js | Adds overfitting summary cards, chart markers (star symbols), tooltips, and legend notes using shared helper functions |
| eng/skill-validator/OverfittingDetection.md | Comprehensive implementation plan documenting prompt design, architecture, scoring methodology, and integration approach |
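
The "parallel task that doesn't fail the verdict on error" behavior described for ValidateCommand.cs can be illustrated with a short Python sketch. This is not the PR's C# code; `run_validation`, the dictionary-shaped verdict, and the `overfitting_error` key are hypothetical names chosen for the example.

```python
import asyncio


async def run_validation(evaluate, check_overfitting):
    """Run the main evaluation and the overfitting check concurrently.

    Both arguments are hypothetical coroutine functions. A failure in
    the overfitting check is recorded but never fails the verdict.
    """
    eval_task = asyncio.create_task(evaluate())
    overfit_task = asyncio.create_task(check_overfitting())

    verdict = await eval_task  # errors here still propagate as usual
    try:
        verdict["overfitting"] = await overfit_task
    except Exception as exc:  # advisory check: swallow and report
        verdict["overfitting"] = None
        verdict["overfitting_error"] = str(exc)
    return verdict
```

The design choice sketched here is that the overfitting result is advisory: if the judge call fails, the verdict is still returned with the failure recorded alongside it.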
I generally try to avoid checking in docs, as they can go stale, but here it is a good idea, so I added it to the changeset.
Skill Validation Results
Model: claude-opus-4.6 | Judge: claude-opus-4.6
GH shows a few possible null-reference warnings in the build on the pull request review page. New feature? :)
BTW I didn't understand the intent of this score until I finally read the prompt in the judge. I think the eng/skill-validator/README.md describes the different judges. It would be good to provide a simple and understandable description of this judge there.
Good point!
Motivation
To have visibility into whether the eval tests produce results that reliably judge skill quality or merely overfit to specific skill features.
Sample