fix(docs-eval): align scenario 02 with TS SDK publish() and surface failing checks #950
…ailing checks

Trigger: PR #945 hit a heuristic FAIL on scenario 02 because the TS SDK flattened `outpost.publish.event(...)` to `outpost.publish(...)` in v1.3.0 (commit d875c66), but the eval check + scenario criterion + prompt were never updated. Prior runs masked it: the agent often echoed the literal "publish.event" from the prompt in comments, accidentally satisfying the string-presence check. This run stuck to the SDK README wording and the check exit-1'd with no clue in the GH Actions log, only in the artifact.

Impact: PR #945 (docs only) was blocked by an unrelated stale check. After this fix the heuristic matches the current SDK shape, and any future heuristic/LLM failure prints the failing check id and detail directly in the main CI log.

Changes:
- scoreScenario02 regex now matches `.publish(` and keeps `publish.event` as a fallback for older transcripts.
- run-agent-eval logs each failing heuristic check + LLM criterion on pass=false (was: silent exit 1).
- Scenario 02 success criterion lists `outpost.publish` not `publish.event`.
- Prompt's TS counter-example uses `outpost.publish({ ... })`.
- Trajectory SDK hint pattern renamed `ts_publish` with `/\.publish\s*\(/`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates the docs-eval harness for Scenario 02 (TypeScript) to align heuristic + prompt wording with the current TS SDK `outpost.publish(...)` API shape, and improves CI debuggability by emitting failing check details to stderr when scoring fails.
Changes:
- Update Scenario 02 scoring to recognize `outpost.publish(...)` (while retaining `publish.event` backward compatibility).
- Emit per-check / per-criterion failure details to stderr when heuristic or LLM scoring fails.
- Align Scenario 02 docs + agent prompt examples with `outpost.publish({ ... })` wording.
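
For readers who have not opened `run-agent-eval.ts`, the stderr reporting summarized above might look roughly like the sketch below. The `HeuristicCheck` and `LlmCriterion` shapes and both function names are assumptions made for illustration, not the harness's actual types. Only the `[FAIL] <id>: <detail>` line format comes from this PR; reusing it for LLM criteria is also an assumption.

```typescript
// Hypothetical result shapes; the real types in run-agent-eval.ts may differ.
interface HeuristicCheck {
  id: string;
  pass: boolean;
  detail: string;
}

interface LlmCriterion {
  criterion: string;
  pass: boolean;
  evidence: string;
}

// Build one "[FAIL] ..." line per failing heuristic check or LLM criterion.
function formatFailures(
  checks: readonly HeuristicCheck[],
  criteria: readonly LlmCriterion[],
): string[] {
  const lines: string[] = [];
  for (const c of checks) {
    if (!c.pass) lines.push(`[FAIL] ${c.id}: ${c.detail}`);
  }
  for (const c of criteria) {
    // Assumed format: criterion text followed by the evidence string.
    if (!c.pass) lines.push(`[FAIL] ${c.criterion}: ${c.evidence}`);
  }
  return lines;
}

// Write the lines to stderr so they land in the main CI log.
function reportFailures(
  checks: readonly HeuristicCheck[],
  criteria: readonly LlmCriterion[],
): void {
  for (const line of formatFailures(checks, criteria)) {
    console.error(line);
  }
}
```

Keeping the formatting separate from the `console.error` call makes the output easy to assert on in a unit test without capturing stderr.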
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/agent-evaluation/src/transcript-trajectory.ts | Updates the TS publish SDK hint regex/id used in trajectory tagging. |
| docs/agent-evaluation/src/score-transcript.ts | Adjusts Scenario 02 heuristic to match .publish( (and still allow legacy publish.event). |
| docs/agent-evaluation/src/run-agent-eval.ts | Logs failing heuristic checks and failing LLM criteria to stderr for easier CI triage. |
| docs/agent-evaluation/scenarios/02-basics-typescript.md | Updates success criteria bullet to reflect outpost.publish. |
| docs/agent-evaluation/hookdeck-outpost-agent-prompt.md | Updates TS counter-example to use outpost.publish({ ... }). |
      // TS SDK >=1.3.0 exposes `outpost.publish(...)` directly. Keep `publish.event`
      // as a fallback so older transcripts (and stray references in comments) still match.
      const pub = /\.publish\s*\(|publish\.event|publish\?\.event/.test(t);
      checks.push({
        id: "publish_event",
        pass: pub,
    -   detail: pub ? "Calls publish.event" : "Expected publish.event",
    +   detail: pub ? "Calls outpost.publish(…) or publish.event" : "Expected outpost.publish(…) or publish.event",
      });
Thanks for the flag. Pushing back here after looking at the surrounding checks:
By the time we hit publish_event, four earlier checks have already established the corpus is an Outpost SDK script — ts_sdk_dependency (@hookdeck/outpost-sdk in package.json), outpost_client (new Outpost( / Outpost({), tenants_upsert, and destinations_create. A corpus that passes those four isn't realistically going to also contain an unrelated .publish( from another vendor.
On the specific examples: sqs.publish( doesn't exist (AWS SQS uses sendMessage); npm publish doesn't match the regex (no . before publish). The plausible alternatives are sns.publish(, MQTT client.publish(, RabbitMQ channel.publish(, GCP pubsub.publish( — but if any of those landed in a 30-line Outpost quickstart corpus, the agent would have already failed the four prior checks and we'd have a much bigger silent multi-vendor hallucination problem that no regex tightening would help with.
Leaving as-is. Happy to revisit if scenario 02 ever broadens beyond the quickstart shape.
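
To make the examples in this thread concrete, here is a small probe of the regex against the strings mentioned above. The regex is copied from the diff; `matchesPublish` and the sample strings are illustrative only and are not part of the harness.

```typescript
// Regex copied verbatim from the scoreScenario02 diff in this thread.
const PUBLISH_RE = /\.publish\s*\(|publish\.event|publish\?\.event/;

// Hypothetical helper: true when the text would satisfy the publish_event check.
function matchesPublish(text: string): boolean {
  return PUBLISH_RE.test(text);
}

// Illustrative inputs, not taken from a real transcript.
const samples: string[] = [
  "await outpost.publish({ topic: 'user.created' })", // current SDK shape
  "await outpost.publish.event({ topic: 'user.created' })", // pre-1.3.0 shape
  "// call publish.event here", // stray reference in a comment
  "npm publish", // no '.' before publish
  "sns.publish(params)", // another vendor's publish call
  "sqs.sendMessage(params)", // SQS does not use publish
];

for (const s of samples) {
  console.log(`${matchesPublish(s) ? "match   " : "no match"}  ${s}`);
}
```

The first three samples and `sns.publish(params)` match; `npm publish` and `sqs.sendMessage(params)` do not. This is the behavior the thread relies on: the regex alone cannot tell vendors apart, so the four earlier checks carry that responsibility.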
      const SDK_HINT_PATTERNS: ReadonlyArray<{ id: string; re: RegExp }> = [
    -   { id: "ts_publish.event", re: /\bpublish\.event\b/i },
    +   { id: "ts_publish", re: /\.publish\s*\(/ },
        { id: "ts_tenants.upsert", re: /\btenants\.upsert\b/i },
Same reasoning as the scoreScenario02 thread — these SDK hints are descriptive trajectory tags, not gating checks, scoped per-tool-input slice within scenario 02's Outpost SDK corpus. The constellation of other hints (ts_tenants.upsert, ts_destinations.create) plus the scenario context makes a stray non-Outpost .publish( unrealistic; if one did appear, that would be a much bigger silent-hallucination problem than the tag accuracy. Leaving as-is.
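
As a rough illustration of how these descriptive tags behave on a single tool-input slice, the sketch below applies the patterns from the diff. The `ts_destinations.create` regex is assumed by analogy with `ts_tenants.upsert`, and `sdkHintsFor` is a hypothetical name; the real tagging code in `transcript-trajectory.ts` may be structured differently.

```typescript
const SDK_HINT_PATTERNS: ReadonlyArray<{ id: string; re: RegExp }> = [
  { id: "ts_publish", re: /\.publish\s*\(/ },
  { id: "ts_tenants.upsert", re: /\btenants\.upsert\b/i },
  // Assumed pattern: only the id is mentioned in this thread.
  { id: "ts_destinations.create", re: /\bdestinations\.create\b/i },
];

// Hypothetical helper: return the ids of every hint whose pattern occurs in the slice.
function sdkHintsFor(slice: string): string[] {
  return SDK_HINT_PATTERNS.filter((p) => p.re.test(slice)).map((p) => p.id);
}

// Illustrative tool input, not taken from a real transcript.
const slice = [
  "await outpost.tenants.upsert(tenantId);",
  "await outpost.publish({ tenantId, topic: 'user.created' });",
].join("\n");

console.log(sdkHintsFor(slice));
```

Because the result is a list of tags rather than a pass/fail value, a stray `.publish(` from another library would add a `ts_publish` tag but would not gate the run.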
Summary
- `publish_event` in `scoreScenario02` was looking for the literal string `publish.event`, which matched the pre-1.3.0 TS SDK shape (`outpost.publish.event(...)`). The SDK regen in "chore: 🐝 Update SDK - Generate OUTPOST-TS 1.3.0" #905 (commit d875c66) flattened it to `outpost.publish(...)`, but the eval check, scenario 02 success criterion, and the agent prompt were never updated.
- Prior runs masked the mismatch: the agent often echoed the literal `publish.event` from the prompt's prose in a comment, accidentally satisfying a string-presence check. "docs: Clarify signature behavior" #945 stuck to the current SDK README's wording, so the rescue stopped working and CI exited 1 with no detail in the GH Actions main log.

Changes

- `src/score-transcript.ts`: `scoreScenario02` regex now matches `.publish(` (current SDK) and still accepts `publish.event` as a fallback for older transcripts.
- `src/run-agent-eval.ts`: on heuristic/LLM `pass=false`, log each failing check id+detail / criterion+evidence to stderr so the CI log surfaces the cause without needing to download the artifact.
- `scenarios/02-basics-typescript.md`: success criterion lists `outpost.publish` instead of `publish.event`.
- `hookdeck-outpost-agent-prompt.md`: TS counter-example uses `outpost.publish({ ... })`.
- `src/transcript-trajectory.ts`: TS SDK hint pattern renamed to `ts_publish` with regex `/\.publish\s*\(/`.

Verified the new regex matches `.publish(` in the failing PR #945 transcript; `npm run typecheck` and `npm run test:trajectory` pass.

Test plan

- … `.publish(`, so the heuristic now passes).
- `[FAIL] <id>: <detail>` appears in the GH Actions main log if a future heuristic check fails (cosmetic; only triggers on failure).

🤖 Generated with Claude Code