A VSCode extension for evaluating GitHub Copilot Chat prompts against test datasets.
This repository accompanies the blog post "Building an Evaluation Harness for VSCode Copilot Chat" and provides a proof-of-concept implementation that developers can use and extend. The extension is not currently published to the marketplace; it may be published in the future if it proves useful and stable. In the meantime, you can run it through the extension development host in VSCode or package it as a VSIX file for local installation.
When building custom prompts for VSCode Copilot Chat, you need systematic testing to:
- Measure performance across different inputs
- Catch regressions when modifying prompts
- Make data-driven improvements
Without evaluation, you're stuck with "it seems to work fine" - and changing prompts becomes a game of whack-a-mole.
This extension automates VSCode's chat interface to:
- Load your prompt file (`*.prompt.md`)
- Run it against each test case in your dataset
- Export results for analysis
- Clone this repository
- Run `npm install`
- Run `npm run compile`
- Open the project in VSCode
- Press F5 (or Debug > Start Debugging) on `extension.ts` to launch a new VSCode window with the extension loaded, or follow the packaging instructions to create a VSIX file for local installation.
- In the extension development host window, create a prompt file (e.g., `capital.prompt.md`):

  ```
  ---
  mode: agent
  tools: []
  ---
  The user provides a country and you should answer with only the capital of that country.
  ```

- Create a test dataset (`dataset.json`). Only `input` and `waitMs` are required by the extension; all other fields are ignored. Example:

  ```json
  [
    {"input": "France", "waitMs": 4000, "capital": "Paris"},
    {"input": "Japan", "waitMs": 4000, "capital": "Tokyo"},
    {"input": "Spain", "waitMs": 4000, "capital": "Madrid"}
  ]
  ```

- Open the prompt file and run: "Evaluate Active Prompt" (Cmd/Ctrl+Shift+P)

- Select your dataset file

- Press Enter when the save dialog appears (for each test case)

- Find results in `.github/evals/<prompt>/<timestamp>.json`
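Since each test case takes several seconds to run, it can help to check the dataset before starting. The following is a minimal sketch, not part of the extension: `validate_dataset` is a hypothetical helper, and the type rules it enforces (`input` is a string, `waitMs` is a positive number) are assumptions based on the example above rather than documented behavior.

```python
import json
from pathlib import Path


def validate_dataset(records):
    """Return a list of problems; an empty list means the dataset looks usable.

    Assumption: 'input' is a string and 'waitMs' is a positive number.
    Extra fields such as 'capital' are allowed and left untouched.
    """
    if not isinstance(records, list):
        return ["top-level JSON value must be an array of test cases"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"case {index}: expected an object")
            continue
        if not isinstance(record.get("input"), str):
            problems.append(f"case {index}: 'input' must be a string")
        wait_ms = record.get("waitMs")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if (
            isinstance(wait_ms, bool)
            or not isinstance(wait_ms, (int, float))
            or wait_ms <= 0
        ):
            problems.append(f"case {index}: 'waitMs' must be a positive number")
    return problems


def validate_dataset_file(path):
    """Load a dataset file (e.g. dataset.json) and validate its contents."""
    return validate_dataset(json.loads(Path(path).read_text()))
```

Calling `validate_dataset_file("dataset.json")` on the example dataset should return an empty list; any other result names the offending case by its zero-based position.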
- Sequential only - No parallel execution
- Fixed wait times - Must guess completion time
- Manual saves - Press Enter for each export
- Read-only prompts - No file modifications or API calls
- Single-turn only - One input, one output
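Because execution is sequential and wait times are fixed, the waits add up. A rough sketch for estimating how long a run will take, assuming the extension waits `waitMs` milliseconds per test case (the function name is hypothetical, and the estimate is a lower bound because it ignores the manual save step and any other overhead):

```python
def estimate_min_runtime_seconds(records):
    """Sum the per-case waits; cases run one at a time, so the waits add up."""
    return sum(record["waitMs"] for record in records) / 1000


# The three-case example dataset from this README.
dataset = [
    {"input": "France", "waitMs": 4000, "capital": "Paris"},
    {"input": "Japan", "waitMs": 4000, "capital": "Tokyo"},
    {"input": "Spain", "waitMs": 4000, "capital": "Madrid"},
]

print(f"At least {estimate_min_runtime_seconds(dataset):.0f} s")  # At least 12 s
```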
The following script was used in the demo for illustrative purposes. It is not part of the extension itself, but shows how to analyze results:

```python
import json
import sys
from pathlib import Path

# Arguments: path to the dataset, then path to the exported results
dataset = json.loads(Path(sys.argv[1]).read_text())
results = json.loads(Path(sys.argv[2]).read_text())

correct = 0
for record, result_chat in zip(dataset, results):
    # First response part of the first request in each exported chat
    answer = result_chat["requests"][0]["response"][0]["value"]
    # Count the case as correct if the expected capital appears in the answer
    correct += int(record["capital"] in answer)

accuracy = correct / len(dataset) * 100
print(f"Accuracy: {accuracy:.2f}% ({correct}/{len(dataset)})")
```

This extension is well suited to early experimentation with prompt ideas. Despite its limitations, imperfect evaluation beats none: you get concrete signals about what works instead of guessing.
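To turn an accuracy number into something actionable, it can help to list the cases that failed. The sketch below reads the same result structure as the analysis script above; `find_misses` and its `expected_key` parameter are hypothetical names, and the case-insensitive comparison is an assumption that differs from the exact substring match used in that script.

```python
def find_misses(dataset, results, expected_key="capital"):
    """Return (input, expected, answer) for each case whose answer lacks the expected text."""
    misses = []
    for record, result_chat in zip(dataset, results):
        answer = result_chat["requests"][0]["response"][0]["value"]
        expected = record[expected_key]
        # Assumption: compare case-insensitively so "paris" still counts as "Paris".
        if expected.casefold() not in answer.casefold():
            misses.append((record["input"], expected, answer))
    return misses


def print_misses(misses):
    """Print one line per failed case for manual inspection."""
    for case_input, expected, answer in misses:
        print(f"{case_input}: expected {expected!r}, got {answer!r}")
```

Loading `dataset` and `results` with `json.loads` as in the analysis script and then calling `print_misses(find_misses(dataset, results))` shows which inputs to look at before changing the prompt.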