[Infra] Add PoC of overfitting judging #140

Merged
JanKrivanek merged 4 commits into main from dev/jankrivanek/overfit-eval on Feb 27, 2026


@JanKrivanek

Member

Motivation

To gain visibility into whether the eval tests reliably judge skill quality or merely overfit to specific skill features.

Sample

[two images, not preserved]

@ViktorHofer

ViktorHofer commented Feb 27, 2026

Member

Do you have a design doc or can you share more info on what this does?

Copilot AI review requested due to automatic review settings February 27, 2026 14:50

Copilot AI left a comment

Contributor


Pull request overview

This pull request adds a proof-of-concept LLM-based overfitting detection system to the skill-validator infrastructure. The feature analyzes whether evaluation tests genuinely measure skill quality or merely test for memorization of specific skill vocabulary and syntax patterns.

Changes:

  • Introduces an OverfittingJudge service that uses LLM prompting to classify rubric items and assertions into categories (outcome/technique/vocabulary for rubrics, broad/narrow for assertions) and computes an overfitting score
  • Integrates overfitting checks into the validation pipeline as a parallel, non-blocking operation with configurable CLI flags (--no-overfitting-check, --overfitting-fix)
  • Extends reporting to display overfitting scores in console output, markdown tables, and dashboard visualizations with severity-based icons and colors
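As a rough illustration of how category labels could be turned into a single score, here is a minimal sketch. The actual formula used by OverfittingJudge is not stated in this conversation, so the weights and the averaging below are assumptions, and the function name is hypothetical.

```python
# Hypothetical weights: how strongly each label hints at overfitting.
# These values are assumptions for illustration, not the PR's real numbers.
RUBRIC_WEIGHTS = {"outcome": 0.0, "technique": 0.5, "vocabulary": 1.0}
ASSERTION_WEIGHTS = {"broad": 0.0, "narrow": 1.0}


def overfitting_score(rubric_labels, assertion_labels):
    """Average the per-item weights into a score between 0 and 1."""
    weights = [RUBRIC_WEIGHTS[label] for label in rubric_labels]
    weights += [ASSERTION_WEIGHTS[label] for label in assertion_labels]
    if not weights:
        # Nothing to classify, so there is no evidence of overfitting.
        return 0.0
    return round(sum(weights) / len(weights), 2)
```

Under these assumed weights, an eval whose rubric is mostly outcome-based and whose assertions are broad scores near 0, while one dominated by vocabulary checks and narrow assertions scores near 1.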

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| eng/skill-validator/src/Models/Models.cs | Adds OverfittingSeverity enum, assessment record types (RubricOverfitAssessment, AssertionOverfitAssessment), OverfittingResult, and configuration options to SkillVerdict and ValidatorConfig |
| eng/skill-validator/src/Services/OverfittingJudge.cs | New service implementing LLM-based overfitting analysis with prompt engineering, score computation, retry logic, and optional fix generation |
| eng/skill-validator/src/Commands/ValidateCommand.cs | Adds CLI options and integrates overfitting check as a parallel task that doesn't fail the verdict on error |
| eng/skill-validator/src/Services/Reporter.cs | Extends console and markdown output to display overfitting scores with severity-based formatting; adds FormatOverfitCell helper |
| eng/skill-validator/tests/OverfittingJudgeTests.cs | Comprehensive test suite covering score computation, JSON parsing, severity mapping, prompt building, and reporter integration |
| eng/dashboard/generate-benchmark-data.ps1 | Adds overfitting severity and score fields to benchmark entries when moderate or high overfitting is detected |
| eng/dashboard/dashboard.js | Adds overfitting summary cards, chart markers (star symbols), tooltips, and legend notes using shared helper functions |
| eng/skill-validator/OverfittingDetection.md | Comprehensive implementation plan documenting prompt design, architecture, scoring methodology, and integration approach |
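The "parallel task that doesn't fail the verdict on error" pattern, combined with retry logic, might look roughly like the sketch below. It is written in Python with asyncio purely for illustration (the PR itself is C#); `run_with_retry`, `validate`, and the dictionary keys are hypothetical names, not the PR's API.

```python
import asyncio


async def run_with_retry(check, attempts=3):
    """Call an async check up to `attempts` times (must be >= 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return await check()
        except Exception as exc:  # retry on any failure in this sketch
            last_error = exc
    raise last_error


async def validate(run_eval, run_overfit_check):
    """Run the main eval and the overfitting check concurrently."""
    # Start the overfitting check in the background so it overlaps the eval.
    overfit_task = asyncio.create_task(run_with_retry(run_overfit_check))
    verdict = await run_eval()
    try:
        verdict["overfitting"] = await overfit_task
    except Exception as exc:
        # A failed check is recorded but never changes the verdict itself.
        verdict["overfitting"] = None
        verdict["overfitting_error"] = str(exc)
    return verdict
```

The design point is that the overfitting result is advisory: if the judge call keeps failing, the verdict is still returned, with the error noted alongside it.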


Comment thread eng/skill-validator/src/Services/OverfittingJudge.cs Outdated
@JanKrivanek

Member Author

> Do you have a design doc or can you share more info on what this does?

I generally try to avoid checking in docs as they can go stale ... but here it is a good idea, so I added it to the changeset

@github-actions

github-actions Bot commented Feb 27, 2026

Contributor

Skill Validation Results

| Skill | Scenario | Baseline | With Skill | Δ | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- | --- | --- |
| csharp-scripts | Test a C# language feature with a script | 3.0/5 | 4.5/5 | +1.5 | ✅ csharp-scripts; tools: skill, create | 🟡 0.30 | |
| analyzing-dotnet-performance | Detects compiled regex startup budget and regex chain allocations | 1.0/5 | 4.0/5 ⏰ timeout | +3.0 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.17 | |
| analyzing-dotnet-performance | Detects CurrentCulture comparer and compiled regex budget in inflection rules | 1.0/5 | 5.0/5 | +4.0 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.17 | |
| analyzing-dotnet-performance | Finds per-call Dictionary allocation not hoisted to static | 1.0/5 | 5.0/5 | +4.0 | ✅ analyzing-dotnet-performance; tools: skill, task, glob | ✅ 0.17 | |
| analyzing-dotnet-performance | Catches compound allocations in recursive number converter with ToLower | 1.0/5 | 4.5/5 | +3.5 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.17 | |
| analyzing-dotnet-performance | Finds StringComparison.Ordinal missing and FrozenDictionary opportunities | 1.0/5 | 5.0/5 | +4.0 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.17 | |
| analyzing-dotnet-performance | Detects Aggregate+Replace chain and struct missing IEquatable | 1.0/5 | 4.0/5 ⏰ timeout | +3.0 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.17 | |
| analyzing-dotnet-performance | Finds branched Replace chain in format string manipulation | 1.0/5 | 3.5/5 | +2.5 | ✅ analyzing-dotnet-performance; tools: skill, task, glob, read_agent | ✅ 0.17 | |
| analyzing-dotnet-performance | Catches LINQ on hot-path string processing and All(char.IsUpper) | 1.0/5 | 5.0/5 | +4.0 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.17 | |
| analyzing-dotnet-performance | Detects LINQ pipeline in TimeSpan formatting and collection processing | 1.0/5 | 4.5/5 | +3.5 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.17 | |
| analyzing-dotnet-performance | Flags Span inconsistencies and compound method chains in truncation library | 1.0/5 | 5.0/5 | +4.0 | ✅ analyzing-dotnet-performance; tools: skill | ✅ 0.17 | |
| analyzing-dotnet-performance | Identifies unsealed leaf classes and locale hierarchy patterns | 1.0/5 | 5.0/5 | +4.0 | ✅ analyzing-dotnet-performance; tools: skill, task, glob | ✅ 0.17 | |
| dotnet-pinvoke | Generate LibraryImport declaration from C header (.NET 8+) | 4.5/5 | 5.0/5 | +0.5 | ✅ dotnet-pinvoke; tools: skill | ✅ 0.08 | |
| dotnet-pinvoke | Generate LibraryImport declaration from C header (.NET Framework) | 4.0/5 | 5.0/5 | +1.0 | ✅ dotnet-pinvoke; tools: report_intent, skill | ✅ 0.08 | |
| dump-collect | Configure automatic crash dumps for CoreCLR app on Linux | 5.0/5 | 5.0/5 | 0.0 | ✅ dump-collect; tools: report_intent, skill, view, glob, bash | ✅ 0.11 | |
| dump-collect | Set up NativeAOT crash dumps with createdump in Kubernetes | 2.0/5 | 4.0/5 | +2.0 | ✅ dump-collect; tools: skill | ✅ 0.11 | |
| dump-collect | Recover crash dump from macOS NativeAOT without createdump | 4.0/5 | 3.5/5 | -0.5 | ✅ dump-collect; tools: skill, view, glob, bash, report_intent | ✅ 0.11 | |
| dump-collect | Configure CoreCLR dump collection in Alpine Docker as non-root | 3.0/5 | 4.0/5 | +1.0 | ✅ dump-collect; tools: skill, bash | ✅ 0.11 | |
| dump-collect | Advisory: macOS NativeAOT crash dump recovery steps | 4.0/5 | 4.0/5 | 0.0 | ✅ dump-collect; tools: skill, glob | ✅ 0.11 | |
| dump-collect | Advisory: CoreCLR Alpine Docker non-root configuration | 3.0/5 | 5.0/5 | +2.0 | ✅ dump-collect; tools: skill | ✅ 0.11 | |
| dump-collect | Advisory: NativeAOT Kubernetes dump collection setup | 2.0/5 | 2.0/5 | 0.0 | ✅ dump-collect; tools: skill, bash | ✅ 0.11 | |
| dump-collect | Detect runtime and configure crash dumps for unknown .NET app on Linux | 3.5/5 | 4.0/5 | +0.5 | ✅ dump-collect; tools: skill, glob | ✅ 0.11 | |
| dump-collect | Decline dump analysis request | 2.5/5 | 4.5/5 | +2.0 | ℹ️ not activated (expected) | ✅ 0.11 | |
| build-parallelism | Analyze build parallelism bottlenecks | 1.5/5 ⏰ timeout | 4.0/5 | +2.5 | ✅ build-parallelism; binlog-generation; binlog-failure-analysis; tools: skill, task, glob | ✅ 0.14 | |
| including-generated-files | Diagnose generated file inclusion failure | 3.0/5 | 5.0/5 | +2.0 | ✅ including-generated-files; tools: skill | ✅ 0.20 | |
| msbuild-antipatterns | Review MSBuild files for anti-patterns and style issues | 5.0/5 | 5.0/5 | 0.0 | ✅ msbuild-antipatterns; tools: skill, glob, edit | ✅ 0.12 | |
| build-perf-baseline | Establish build performance baseline and recommend optimizations | 3.0/5 | 4.0/5 | +1.0 | ✅ build-perf-baseline; tools: skill, binlog-mcp-load_binlog, binlog-mcp-list_projects, binlog-mcp-get_expensive_projects, binlog-mcp-get_expensive_tasks, binlog-mcp-get_expensive_targets, binlog-mcp-get_node_timeline, binlog-mcp-search_targets_by_name, binlog-mcp-search_tasks_by_name, binlog-mcp-get_expensive_analyzers, binlog-mcp-list_evaluations | 🟡 0.21 | |
| msbuild-modernization | Modernize legacy project to SDK-style | 5.0/5 | 5.0/5 | 0.0 | ✅ msbuild-modernization; tools: skill | ✅ 0.06 | |
| directory-build-organization | Organize build infrastructure for a multi-project repo | 3.0/5 | 5.0/5 | +2.0 | ✅ directory-build-organization; msbuild-antipatterns; tools: skill | ✅ 0.13 | |
| check-bin-obj-clash | Diagnose bin/obj output path clashes | 4.0/5 | 5.0/5 | +1.0 | ✅ check-bin-obj-clash; binlog-generation; tools: skill, glob, binlog-mcp-load_binlog, binlog-mcp-list_projects, binlog-mcp-list_evaluations, binlog-mcp-get_evaluation_properties_by_name, binlog-mcp-get_evaluation_global_properties, edit | ✅ 0.17 | |
| incremental-build | Analyze incremental build issues | 3.0/5 | 3.5/5 | +0.5 | ✅ incremental-build; tools: skill, edit | ✅ 0.14 | |
| eval-performance | Analyze MSBuild evaluation performance issues | 4.0/5 | 5.0/5 | +1.0 | ✅ eval-performance; tools: skill | ✅ 0.11 | |
| build-perf-diagnostics | Analyze analyzer performance impact on builds | 4.0/5 ⏰ timeout | 4.0/5 ⏰ timeout | 0.0 | ✅ binlog-generation; tools: skill, binlog-mcp-load_binlog, binlog-mcp-list_projects, binlog-mcp-get_expensive_tasks, binlog-mcp-get_expensive_targets, binlog-mcp-get_expensive_analyzers, binlog-mcp-search_tasks_by_name, binlog-mcp-get_node_timeline, binlog-mcp-get_project_target_list, binlog-mcp-search_binlog, binlog-mcp-get_target_info_by_name, binlog-mcp-list_tasks_in_target, binlog-mcp-get_task_analyzers, binlog-mcp-get_project_build_time, binlog-mcp-get_evaluation_properties_by_name, binlog-mcp-list_evaluations | 🟡 0.29 | |
| binlog-generation | Build project with /bl flag | 1.0/5 | 5.0/5 | +4.0 | ✅ binlog-generation; tools: skill | 🟡 0.44 | |
| binlog-generation | Build with /bl in PowerShell | 2.0/5 | 5.0/5 | +3.0 | ✅ binlog-generation; tools: skill | 🟡 0.44 | |
| binlog-generation | Build multiple configurations with unique binlogs | 2.0/5 | 5.0/5 | +3.0 | ✅ binlog-generation; tools: skill | 🟡 0.44 | |
| binlog-failure-analysis | Diagnose build failures from binlog only (no source files) | 4.0/5 | 5.0/5 | +1.0 | ✅ binlog-failure-analysis; tools: skill, binlog-mcp-load_binlog, binlog-mcp-get_diagnostics, binlog-mcp-list_projects, binlog-mcp-get_file_from_binlog, binlog-mcp-search_binlog, binlog-mcp-list_files_from_binlog | ✅ 0.13 | |

timeout — run hit the scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6
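For readers wondering how the Overfit column pairs an icon with a score, here is a minimal sketch of such a formatter. The function name echoes the FormatOverfitCell helper mentioned in the review summary, but this is not its implementation. The moderate cutoff is an assumption chosen only because it agrees with the results in this comment (0.20 shows ✅, 0.21 shows 🟡); the high cutoff and the 🔴 icon are pure assumptions, since no high-severity row appears here.

```python
def format_overfit_cell(score, moderate_above=0.20, high_above=0.50):
    """Render an overfitting score as '<icon> <score>' for a table cell.

    Both cutoffs are assumed values, not taken from the PR's source code.
    """
    if score > high_above:
        icon = "🔴"  # assumed icon for high severity
    elif score > moderate_above:
        icon = "🟡"
    else:
        icon = "✅"
    return f"{icon} {score:.2f}"
```

With these assumed cutoffs, `format_overfit_cell(0.44)` gives a 🟡 cell and `format_overfit_cell(0.08)` gives a ✅ cell, consistent with the rows reported above.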

Full results

@ViktorHofer

Member

GH shows a few possible null-reference warnings in the build in the pull request review page. New feature? :)

@ViktorHofer

Member

BTW I didn't understand the intent of this score until I finally read the prompt in the judge. I think the eng/skill-validator/README.md describes the different judges. It would be good to provide a simple and understandable description of this judge there.

@JanKrivanek

Member Author

> BTW I didn't understand the intent of this score until I finally read the prompt in the judge. I think the eng/skill-validator/README.md describes the different judges. It would be good to provide a simple and understandable description of this judge there.

Good point!
Added brief info there

@JanKrivanek JanKrivanek merged commit 00b9fb1 into main Feb 27, 2026
5 checks passed
@JanKrivanek JanKrivanek deleted the dev/jankrivanek/overfit-eval branch February 27, 2026 17:13
@JanKrivanek JanKrivanek restored the dev/jankrivanek/overfit-eval branch March 1, 2026 18:20
@ViktorHofer ViktorHofer deleted the dev/jankrivanek/overfit-eval branch March 4, 2026 12:21