glue-review-eval

Benchmark + harness for the glue-review code-review agent.

Why

glue-review is being polished into a single-purpose tool: produce one GitHub comment per PR, optimized for AI coding agents to consume — short human-readable summary on top, copy-pasteable markdown fix-instruction block on the bottom. This repo is the test bed we measure against while we iterate the prompt and template.

Layout

hosts/                      # three realistic host projects, baseline-green
  go-glog/                  # Go: structured-log tail/filter CLI
  python-linkr/             # Python: FastAPI URL shortener
  ts-notepad/               # TS: Vite + React markdown notepad

cases/                      # per-PR scenarios applied against the hosts
  <lang>/<case-id>/
    case.yaml               # ground-truth sidecar (planted bug, expected fix, acceptance test)
    patch.diff              # the PR's diff against the host's baseline
    README.md               # what this case is testing and why

harness/
  runner/                   # invokes glue-review against each case, captures output
  judge/                    # LLM-as-judge scorer using ground truth from case.yaml
  scorer/                   # Layer 1 structural recall + aggregation

results/                    # per-iteration scorecards + raw outputs (gitignored except for summaries)
prompts-snapshots/          # frozen copies of agent prompts at each iteration
docs/                       # design docs: SIDECAR.md, OUTPUT_FORMAT.md, METRICS.md

The three-layer eval stack

Layer	What it measures	Cost	Cadence
1	Structural recall: did the comment mention the planted file/line?	~free	every iteration
2	LLM-as-judge rubric (0–10 on identify / specificity / severity / conciseness / false-positives) using planted-bug ground truth	~1¢ / case	every iteration
3	Downstream fix-success: hand the fix block to a real coding agent (codex / opencode / claude headless), check if the planted acceptance test goes red→green	~5–10¢ / case	every 5 iterations on a 5-case subset

Status

See PROGRESS.md for the current iteration log.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github/workflows		.github/workflows
cases		cases
docs		docs
harness		harness
hosts		hosts
prompts-snapshots		prompts-snapshots
results		results
smoke-demo		smoke-demo
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
PROGRESS.md		PROGRESS.md
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

glue-review-eval

Why

Layout

The three-layer eval stack

Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

glue-review-eval

Why

Layout

The three-layer eval stack

Status

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages