[Infra] Add PoC of overfitting judging by JanKrivanek · Pull Request #140 · dotnet/skills

JanKrivanek · 2026-02-27T14:40:21Z

Motivation

To gain visibility into whether the eval tests reliably judge skill quality or merely overfit to specific skill features.

Sample

ViktorHofer · 2026-02-27T14:48:02Z

Do you have a design doc or can you share more info on what this does?

Copilot

Pull request overview

This pull request adds a proof-of-concept LLM-based overfitting detection system to the skill-validator infrastructure. The feature analyzes whether evaluation tests genuinely measure skill quality or merely test for memorization of specific skill vocabulary and syntax patterns.

Changes:

- Introduces an OverfittingJudge service that uses LLM prompting to classify rubric items and assertions into categories (outcome/technique/vocabulary for rubrics, broad/narrow for assertions) and computes an overfitting score
- Integrates overfitting checks into the validation pipeline as a parallel, non-blocking operation with configurable CLI flags (--no-overfitting-check, --overfitting-fix)
- Extends reporting to display overfitting scores in console output, markdown tables, and dashboard visualizations with severity-based icons and colors
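To make the classification-to-score step above concrete, here is a minimal sketch in Python (the PR itself is C#). Everything in it is an assumption for illustration: the per-category weights, the plain averaging, the function names, and both thresholds are hypothetical and are not taken from OverfittingJudge.cs. The moderate threshold was chosen only so that it agrees with the posted results table, where 0.20 is shown with ✅ and 0.21 with 🟡; the high threshold is a pure guess.

```python
# Hypothetical weights: how strongly each category hints at overfitting.
# These values are assumptions, not the ones used by the PR.
RUBRIC_WEIGHTS = {"outcome": 0.0, "technique": 0.5, "vocabulary": 1.0}
ASSERTION_WEIGHTS = {"broad": 0.0, "narrow": 1.0}

# Hypothetical severity thresholds (see the lead-in for how they were picked).
MODERATE_ABOVE = 0.20
HIGH_ABOVE = 0.50


def overfitting_score(rubric_labels, assertion_labels):
    """Average the category weights of all classified items into a 0..1 score."""
    weights = [RUBRIC_WEIGHTS[label] for label in rubric_labels]
    weights += [ASSERTION_WEIGHTS[label] for label in assertion_labels]
    if not weights:
        return 0.0  # nothing to classify, so nothing to flag
    return sum(weights) / len(weights)


def severity(score):
    """Map a score to a severity bucket using the assumed thresholds."""
    if score > HIGH_ABOVE:
        return "high"
    if score > MODERATE_ABOVE:
        return "moderate"
    return "low"


if __name__ == "__main__":
    score = overfitting_score(
        ["outcome", "outcome", "technique"],  # rubric item labels from the judge
        ["broad", "narrow"],                  # assertion labels from the judge
    )
    print(f"{score:.2f} {severity(score)}")
```

Under these assumed weights, an eval whose rubric mostly rewards outcomes scores low, while one dominated by skill-specific vocabulary and narrow assertions scores high.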

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

File	Description
eng/skill-validator/src/Models/Models.cs	Adds OverfittingSeverity enum, assessment record types (RubricOverfitAssessment, AssertionOverfitAssessment), OverfittingResult, and configuration options to SkillVerdict and ValidatorConfig
eng/skill-validator/src/Services/OverfittingJudge.cs	New service implementing LLM-based overfitting analysis with prompt engineering, score computation, retry logic, and optional fix generation
eng/skill-validator/src/Commands/ValidateCommand.cs	Adds CLI options and integrates overfitting check as a parallel task that doesn't fail the verdict on error
eng/skill-validator/src/Services/Reporter.cs	Extends console and markdown output to display overfitting scores with severity-based formatting; adds FormatOverfitCell helper
eng/skill-validator/tests/OverfittingJudgeTests.cs	Comprehensive test suite covering score computation, JSON parsing, severity mapping, prompt building, and reporter integration
eng/dashboard/generate-benchmark-data.ps1	Adds overfitting severity and score fields to benchmark entries when moderate or high overfitting is detected
eng/dashboard/dashboard.js	Adds overfitting summary cards, chart markers (star symbols), tooltips, and legend notes using shared helper functions
eng/skill-validator/OverfittingDetection.md	Comprehensive implementation plan documenting prompt design, architecture, scoring methodology, and integration approach
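
The ValidateCommand.cs row describes the overfitting check as a parallel task that does not fail the verdict on error. A minimal sketch of that pattern, written in Python with asyncio rather than the PR's C#, might look like the following. The functions `run_scenarios`, `check_overfitting`, and `validate`, as well as the shape of the verdict dictionary, are hypothetical stand-ins and not the PR's actual API.

```python
import asyncio


async def run_scenarios(skill):
    """Hypothetical stand-in for the main evaluation run."""
    await asyncio.sleep(0)
    return {"skill": skill, "passed": True}


async def check_overfitting(skill, fail=False):
    """Hypothetical stand-in for the LLM-based overfitting judge."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError("judge unavailable")
    return 0.17


async def validate(skill, judge_fails=False):
    # Start the overfitting check first so it runs alongside the scenarios.
    overfit_task = asyncio.create_task(check_overfitting(skill, judge_fails))
    verdict = await run_scenarios(skill)
    try:
        verdict["overfit"] = await overfit_task
    except Exception as exc:
        # A judge failure is recorded but never changes the pass/fail verdict.
        verdict["overfit"] = None
        verdict["overfit_error"] = str(exc)
    return verdict


if __name__ == "__main__":
    print(asyncio.run(validate("binlog-generation")))
    print(asyncio.run(validate("binlog-generation", judge_fails=True)))
```

The point of the design is that the overfitting score is advisory: if the judge call throws, the report simply has no score for that skill and the verdict stands.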

JanKrivanek · 2026-02-27T14:56:44Z

> Do you have a design doc or can you share more info on what this does?

I generally try to avoid checking in docs as they can go stale ... but here it is a good idea, so I added it to the changeset.

github-actions · 2026-02-27T15:13:53Z

Skill Validation Results

Skill	Scenario	Baseline	With Skill	Δ	Skills Loaded	Overfit	Verdict
csharp-scripts	Test a C# language feature with a script	3.0/5	4.5/5	+1.5	✅ csharp-scripts; tools: skill, create	🟡 0.30	✅
analyzing-dotnet-performance	Detects compiled regex startup budget and regex chain allocations	1.0/5	4.0/5 ⏰ timeout	+3.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.17	✅
analyzing-dotnet-performance	Detects CurrentCulture comparer and compiled regex budget in inflection rules	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.17	✅
analyzing-dotnet-performance	Finds per-call Dictionary allocation not hoisted to static	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill, task, glob	✅ 0.17	✅
analyzing-dotnet-performance	Catches compound allocations in recursive number converter with ToLower	1.0/5	4.5/5	+3.5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.17	✅
analyzing-dotnet-performance	Finds StringComparison.Ordinal missing and FrozenDictionary opportunities	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.17	✅
analyzing-dotnet-performance	Detects Aggregate+Replace chain and struct missing IEquatable	1.0/5	4.0/5 ⏰ timeout	+3.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.17	✅
analyzing-dotnet-performance	Finds branched Replace chain in format string manipulation	1.0/5	3.5/5	+2.5	✅ analyzing-dotnet-performance; tools: skill, task, glob, read_agent	✅ 0.17	✅
analyzing-dotnet-performance	Catches LINQ on hot-path string processing and All(char.IsUpper)	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.17	✅
analyzing-dotnet-performance	Detects LINQ pipeline in TimeSpan formatting and collection processing	1.0/5	4.5/5	+3.5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.17	✅
analyzing-dotnet-performance	Flags Span inconsistencies and compound method chains in truncation library	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.17	✅
analyzing-dotnet-performance	Identifies unsealed leaf classes and locale hierarchy patterns	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill, task, glob	✅ 0.17	✅
dotnet-pinvoke	Generate LibraryImport declaration from C header (.NET 8+)	4.5/5	5.0/5	+0.5	✅ dotnet-pinvoke; tools: skill	✅ 0.08	✅
dotnet-pinvoke	Generate LibraryImport declaration from C header (.NET Framework)	4.0/5	5.0/5	+1.0	✅ dotnet-pinvoke; tools: report_intent, skill	✅ 0.08	✅
dump-collect	Configure automatic crash dumps for CoreCLR app on Linux	5.0/5	5.0/5	0.0	✅ dump-collect; tools: report_intent, skill, view, glob, bash	✅ 0.11	❌
dump-collect	Set up NativeAOT crash dumps with createdump in Kubernetes	2.0/5	4.0/5	+2.0	✅ dump-collect; tools: skill	✅ 0.11	✅
dump-collect	Recover crash dump from macOS NativeAOT without createdump	4.0/5	3.5/5	-0.5	✅ dump-collect; tools: skill, view, glob, bash, report_intent	✅ 0.11	✅
dump-collect	Configure CoreCLR dump collection in Alpine Docker as non-root	3.0/5	4.0/5	+1.0	✅ dump-collect; tools: skill, bash	✅ 0.11	✅
dump-collect	Advisory: macOS NativeAOT crash dump recovery steps	4.0/5	4.0/5	0.0	✅ dump-collect; tools: skill, glob	✅ 0.11	✅
dump-collect	Advisory: CoreCLR Alpine Docker non-root configuration	3.0/5	5.0/5	+2.0	✅ dump-collect; tools: skill	✅ 0.11	✅
dump-collect	Advisory: NativeAOT Kubernetes dump collection setup	2.0/5	2.0/5	0.0	✅ dump-collect; tools: skill, bash	✅ 0.11	✅
dump-collect	Detect runtime and configure crash dumps for unknown .NET app on Linux	3.5/5	4.0/5	+0.5	✅ dump-collect; tools: skill, glob	✅ 0.11	❌
dump-collect	Decline dump analysis request	2.5/5	4.5/5	+2.0	ℹ️ not activated (expected)	✅ 0.11	✅
build-parallelism	Analyze build parallelism bottlenecks	1.5/5 ⏰ timeout	4.0/5	+2.5	✅ build-parallelism; binlog-generation; binlog-failure-analysis; tools: skill, task, glob	✅ 0.14	✅
including-generated-files	Diagnose generated file inclusion failure	3.0/5	5.0/5	+2.0	✅ including-generated-files; tools: skill	✅ 0.20	✅
msbuild-antipatterns	Review MSBuild files for anti-patterns and style issues	5.0/5	5.0/5	0.0	✅ msbuild-antipatterns; tools: skill, glob, edit	✅ 0.12	✅
build-perf-baseline	Establish build performance baseline and recommend optimizations	3.0/5	4.0/5	+1.0	✅ build-perf-baseline; tools: skill, binlog-mcp-load_binlog, binlog-mcp-list_projects, binlog-mcp-get_expensive_projects, binlog-mcp-get_expensive_tasks, binlog-mcp-get_expensive_targets, binlog-mcp-get_node_timeline, binlog-mcp-search_targets_by_name, binlog-mcp-search_tasks_by_name, binlog-mcp-get_expensive_analyzers, binlog-mcp-list_evaluations	🟡 0.21	✅
msbuild-modernization	Modernize legacy project to SDK-style	5.0/5	5.0/5	0.0	✅ msbuild-modernization; tools: skill	✅ 0.06	✅
directory-build-organization	Organize build infrastructure for a multi-project repo	3.0/5	5.0/5	+2.0	✅ directory-build-organization; msbuild-antipatterns; tools: skill	✅ 0.13	✅
check-bin-obj-clash	Diagnose bin/obj output path clashes	4.0/5	5.0/5	+1.0	✅ check-bin-obj-clash; binlog-generation; tools: skill, glob, binlog-mcp-load_binlog, binlog-mcp-list_projects, binlog-mcp-list_evaluations, binlog-mcp-get_evaluation_properties_by_name, binlog-mcp-get_evaluation_global_properties, edit	✅ 0.17	✅
incremental-build	Analyze incremental build issues	3.0/5	3.5/5	+0.5	✅ incremental-build; tools: skill, edit	✅ 0.14	✅
eval-performance	Analyze MSBuild evaluation performance issues	4.0/5	5.0/5	+1.0	✅ eval-performance; tools: skill	✅ 0.11	✅
build-perf-diagnostics	Analyze analyzer performance impact on builds	4.0/5 ⏰ timeout	4.0/5 ⏰ timeout	0.0	✅ binlog-generation; tools: skill, binlog-mcp-load_binlog, binlog-mcp-list_projects, binlog-mcp-get_expensive_tasks, binlog-mcp-get_expensive_targets, binlog-mcp-get_expensive_analyzers, binlog-mcp-search_tasks_by_name, binlog-mcp-get_node_timeline, binlog-mcp-get_project_target_list, binlog-mcp-search_binlog, binlog-mcp-get_target_info_by_name, binlog-mcp-list_tasks_in_target, binlog-mcp-get_task_analyzers, binlog-mcp-get_project_build_time, binlog-mcp-get_evaluation_properties_by_name, binlog-mcp-list_evaluations	🟡 0.29	❌
binlog-generation	Build project with /bl flag	1.0/5	5.0/5	+4.0	✅ binlog-generation; tools: skill	🟡 0.44	✅
binlog-generation	Build with /bl in PowerShell	2.0/5	5.0/5	+3.0	✅ binlog-generation; tools: skill	🟡 0.44	✅
binlog-generation	Build multiple configurations with unique binlogs	2.0/5	5.0/5	+3.0	✅ binlog-generation; tools: skill	🟡 0.44	✅
binlog-failure-analysis	Diagnose build failures from binlog only (no source files)	4.0/5	5.0/5	+1.0	✅ binlog-failure-analysis; tools: skill, binlog-mcp-load_binlog, binlog-mcp-get_diagnostics, binlog-mcp-list_projects, binlog-mcp-get_file_from_binlog, binlog-mcp-search_binlog, binlog-mcp-list_files_from_binlog	✅ 0.13	✅

⏰ timeout — run hit the scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

ViktorHofer · 2026-02-27T16:08:34Z

GH shows a few possible null-reference warnings from the build on the pull request review page. New feature? :)

ViktorHofer · 2026-02-27T16:23:18Z

BTW I didn't understand the intent of this score until I finally read the prompt in the judge. I think the eng/skill-validator/README.md describes the different judges. It would be good to provide a simple and understandable description of this judge there.

JanKrivanek · 2026-02-27T17:13:21Z

> BTW I didn't understand the intent of this score until I finally read the prompt in the judge. I think the eng/skill-validator/README.md describes the different judges. It would be good to provide a simple and understandable description of this judge there.

Good point!
Added brief info there.

Add overfitting judging

2ef7f70

JanKrivanek requested review from ViktorHofer and Copilot and removed request for Copilot February 27, 2026 14:40

Copilot started reviewing on behalf of JanKrivanek February 27, 2026 14:40

Add the design doc

1b0cf47

Copilot AI review requested due to automatic review settings February 27, 2026 14:50

Copilot started reviewing on behalf of JanKrivanek February 27, 2026 14:51

Copilot AI reviewed Feb 27, 2026

Comment thread eng/skill-validator/src/Services/OverfittingJudge.cs Outdated

Add async I/O

51cd4de

ViktorHofer approved these changes Feb 27, 2026

Add overfit detection description

dd889e1

JanKrivanek merged commit 00b9fb1 into main Feb 27, 2026
5 checks passed

JanKrivanek deleted the dev/jankrivanek/overfit-eval branch February 27, 2026 17:13

JanKrivanek restored the dev/jankrivanek/overfit-eval branch March 1, 2026 18:20

ViktorHofer deleted the dev/jankrivanek/overfit-eval branch March 4, 2026 12:21

JanKrivanek merged 4 commits into main from dev/jankrivanek/overfit-eval

3 participants
