Copilot Chat Prompt Evaluator

A VSCode extension for evaluating GitHub Copilot Chat prompts against test datasets.

This repository accompanies the blog post Building an Evaluation Harness for VSCode Copilot Chat and provides a proof-of-concept implementation that developers can use and extend. The extension is not currently published to the marketplace; it may be published later if it proves useful and stable. In the meantime, you can run it through the extension development host in VSCode or package it as a VSIX file for local installation.

Why?

When building custom prompts for VSCode Copilot Chat, you need systematic testing to:

Measure performance across different inputs
Catch regressions when modifying prompts
Make data-driven improvements

Without evaluation, you're stuck with "it seems to work fine" - and changing prompts becomes a game of whack-a-mole.

How it works

This extension automates VSCode's chat interface to:

Load your prompt file (*.prompt.md)
Run it against each test case in your dataset
Export results for analysis

Installation

Clone this repository
Run npm install
Run npm run compile
Open the project in VSCode
Press F5 (or Run > Start Debugging) with extension.ts open to launch a new VSCode window with the extension loaded. Alternatively, follow the packaging instructions to create a VSIX file for local installation.

Quick Start

In the extension development host window, create a prompt file (e.g., capital.prompt.md):

---
mode: agent
tools: []
---
The user provides a country and you should answer with only the capital of that country.

Create a test dataset (dataset.json). The extension requires only input and waitMs in each record; any other fields (such as capital below) are ignored. Example:

[
    {"input": "France", "waitMs": 4000, "capital": "Paris"},
    {"input": "Japan", "waitMs": 4000, "capital": "Tokyo"},
    {"input": "Spain", "waitMs": 4000, "capital": "Madrid"}
]

Open the prompt file and run "Evaluate Active Prompt" from the Command Palette (Cmd/Ctrl+Shift+P)
Select your dataset file
Press Enter when the save dialog appears (for each test case)
Find results in .github/evals/<prompt>/<timestamp>.json
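Once a few runs have accumulated, picking the right results file by hand gets tedious. The sketch below returns the most recently modified JSON file under .github/evals/<prompt>/. The function name latest_results and the prompt_dir parameter are hypothetical, and prompt_dir must be whatever directory name the extension actually created for your prompt. Files are ordered by modification time because the exact timestamp format in the file names is not assumed here.

```python
from pathlib import Path
from typing import Optional


def latest_results(workspace: Path, prompt_dir: str) -> Optional[Path]:
    """Return the newest results file for a prompt, or None if there is none.

    `prompt_dir` is the directory name under .github/evals (an assumption:
    check what the extension created in your workspace).
    """
    eval_dir = workspace / ".github" / "evals" / prompt_dir
    if not eval_dir.is_dir():
        return None
    # Order by modification time rather than by file name, so the result
    # does not depend on how the timestamp in the name is formatted.
    candidates = sorted(eval_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None
```

For example, latest_results(Path("."), "capital") would look in ./.github/evals/capital/, assuming that is where the results for capital.prompt.md were written.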

Limitations

Sequential only - No parallel execution
Fixed wait times - Must guess completion time
Manual saves - Press Enter for each export
Read-only prompts - No file modifications or API calls
Single-turn only - One input, one output
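Because test cases run one after another and each waits a fixed time, the sum of the waitMs values gives a rough lower bound on how long a run will take (the manual save step adds more). The sketch below checks the two required fields and returns that lower bound in seconds. The function name minimum_run_seconds is hypothetical, and the rules that input is a string and waitMs is a positive number are assumptions based on the example dataset, not documented behavior of the extension.

```python
def minimum_run_seconds(records: list) -> float:
    """Validate required fields and return the summed wait time in seconds."""
    total_ms = 0
    for index, record in enumerate(records):
        # Assumption: `input` is the text sent to the chat, so it must be a string.
        if not isinstance(record.get("input"), str):
            raise ValueError(f"record {index}: 'input' must be a string")
        wait = record.get("waitMs")
        # Assumption: `waitMs` is a positive number of milliseconds.
        # bool is excluded explicitly because it is a subclass of int.
        if isinstance(wait, bool) or not isinstance(wait, (int, float)) or wait <= 0:
            raise ValueError(f"record {index}: 'waitMs' must be a positive number")
        total_ms += wait
    return total_ms / 1000
```

For the three-country dataset from the Quick Start, this returns 12.0, so a full run takes at least 12 seconds before counting the time spent confirming each save dialog.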

Example Evaluation

The following script was used in the demo and is included for illustrative purposes. It is not part of the extension itself, but it shows how to analyze results. It takes the dataset path and the results path as arguments:

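import json
import sys
from pathlib import Path

# Arguments: path to the dataset file, then path to the exported results file.
dataset = json.loads(Path(sys.argv[1]).read_text())
results = json.loads(Path(sys.argv[2]).read_text())

# zip() silently drops unmatched items, so fail early on a length mismatch.
if len(dataset) != len(results):
    sys.exit(f"Expected {len(dataset)} results, found {len(results)}")

correct = 0
for record, result_chat in zip(dataset, results):
    # Text of the first response part of the first request in each exported chat.
    answer = result_chat["requests"][0]["response"][0]["value"]
    # Count the case as correct when the expected capital appears in the answer.
    correct += record["capital"] in answer

accuracy = correct / len(dataset) * 100
print(f"Accuracy: {accuracy:.2f}% ({correct}/{len(dataset)})")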
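An accuracy figure alone does not say which inputs went wrong. As a possible extension, the sketch below collects the failing cases so they can be inspected. It assumes the same result layout the script above reads (requests[0].response[0].value) and that results appear in the same order as the dataset. The function name failed_cases is hypothetical.

```python
def failed_cases(dataset: list, results: list) -> list:
    """Return (input, expected, answer) for each case missing the expected capital."""
    if len(dataset) != len(results):
        raise ValueError(
            f"dataset has {len(dataset)} records but results has {len(results)}"
        )
    failures = []
    for record, result_chat in zip(dataset, results):
        # Same access path as the accuracy script: first response part
        # of the first request in the exported chat.
        answer = result_chat["requests"][0]["response"][0]["value"]
        if record["capital"] not in answer:
            failures.append((record["input"], record["capital"], answer))
    return failures
```

Printing each tuple from failed_cases(dataset, results) after the accuracy line gives a quick view of which countries the prompt got wrong and what it answered instead.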

When to Use

This extension is best suited to early experimentation with prompt ideas. Despite its limitations, imperfect evaluation beats none: you get concrete signals about what works instead of guessing.
