fix(docs-eval): align scenario 02 with TS SDK publish() and surface failing checks #950

Merged
leggetter merged 1 commit into main from fix/docs-agent-eval-ts-publish-check on Jun 12, 2026

Conversation

@leggetter

Collaborator

Summary

  • The heuristic check publish_event in scoreScenario02 looked for the literal string publish.event, which matched the pre-1.3.0 TS SDK shape (outpost.publish.event(...)). The SDK regen in #905 ("chore: 🐝 Update SDK - Generate OUTPOST-TS 1.3.0", commit d875c66) flattened it to outpost.publish(...), but the eval check, the scenario 02 success criterion, and the agent prompt were never updated.
  • Prior runs passed only because the agent often echoed publish.event from the prompt's prose in a comment, which accidentally satisfied a string-presence check. The agent run for #945 ("docs: Clarify signature behavior") stuck to the current SDK README's wording, so the rescue stopped working and CI exited 1 with no detail in the GH Actions main log.
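The accidental pass is easy to reproduce in isolation. The following is a minimal sketch, assuming the old heuristic was a bare presence test for publish.event; legacyCheck and both transcript strings are hypothetical stand-ins, not the real scorer or real agent output.

```typescript
// Assumed shape of the pre-fix heuristic: a bare string-presence test.
const legacyCheck = (transcript: string): boolean =>
  /publish\.event/.test(transcript);

// Hypothetical agent output: uses the flattened call but echoes the
// old name from the prompt inside a comment.
const withEcho: string = [
  "// publish.event: send a test event to the destination",
  "await outpost.publish({ ... });",
].join("\n");

// Hypothetical agent output: follows the SDK README wording only.
const withoutEcho: string = "await outpost.publish({ ... });";

// true: the match comes from the comment, not from the call
const echoPasses: boolean = legacyCheck(withEcho);
// false: the code is correct for SDK >=1.3.0, yet the check fails
const readmePasses: boolean = legacyCheck(withoutEcho);
```

Both transcripts call the same API; only the stray comment differs, which is why the check result depended on the agent's phrasing rather than on its code.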

Changes

  • src/score-transcript.ts: scoreScenario02 regex now matches .publish( (current SDK) and still accepts publish.event as a fallback for older transcripts.
  • src/run-agent-eval.ts — on heuristic/LLM pass=false, log each failing check id+detail / criterion+evidence to stderr so the CI log surfaces the cause without needing to download the artifact.
  • scenarios/02-basics-typescript.md — success criterion lists outpost.publish instead of publish.event.
  • hookdeck-outpost-agent-prompt.md — TS counter-example uses outpost.publish({ ... }).
  • src/transcript-trajectory.ts — TS SDK hint pattern renamed to ts_publish with regex /\.publish\s*\(/.

Verified the new regex matches .publish( in the failing PR #945 transcript; npm run typecheck and npm run test:trajectory pass.
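For reference, the updated check can be exercised against the three call shapes it is meant to accept. This is a sketch only: publishCheck is a hypothetical wrapper around the pattern and detail strings from this PR, and the sample transcripts are invented for illustration.

```typescript
interface HeuristicCheck {
  id: string;
  pass: boolean;
  detail: string;
}

// Hypothetical wrapper around the updated pattern from score-transcript.ts.
function publishCheck(transcript: string): HeuristicCheck {
  const pub = /\.publish\s*\(|publish\.event|publish\?\.event/.test(transcript);
  return {
    id: "publish_event",
    pass: pub,
    detail: pub
      ? "Calls outpost.publish(…) or publish.event"
      : "Expected outpost.publish(…) or publish.event",
  };
}

// Current SDK shape (>=1.3.0): matched by the `.publish(` alternative.
const currentShape = publishCheck("await outpost.publish({ ... });");
// Legacy shape: matched by the `publish.event` fallback.
const legacyShape = publishCheck("await outpost.publish.event({ ... });");
// No publish call at all: the check fails and reports what was expected.
const noPublish = publishCheck("const outpost = new Outpost({ ... });");
```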

Test plan

  • CI eval slice runs green on this branch.
  • Sanity: re-run the failing eval from PR #945 ("docs: Clarify signature behavior") against this branch's scorer (the agent transcript on file already contains .publish(, so the heuristic now passes).
  • Confirm the log line [FAIL] <id>: <detail> appears in the GH Actions main log if a future heuristic check fails (cosmetic — only triggers on failure).
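The failure logging can be sketched as a pure formatter whose output is then written to stderr. This is an illustration, not the code in run-agent-eval.ts: failureLines is a hypothetical name, the HeuristicCheck fields follow the check objects in score-transcript.ts, and the LlmCriterion shape and its line format are assumptions.

```typescript
interface HeuristicCheck {
  id: string;
  pass: boolean;
  detail: string;
}

// Assumed shape for an LLM-judged criterion; the real type may differ.
interface LlmCriterion {
  criterion: string;
  pass: boolean;
  evidence: string;
}

// Build one "[FAIL] <id>: <detail>" line per failing item, so the CI log
// shows the cause without downloading the artifact.
function failureLines(
  checks: ReadonlyArray<HeuristicCheck>,
  criteria: ReadonlyArray<LlmCriterion>,
): string[] {
  const lines: string[] = [];
  for (const c of checks) {
    if (!c.pass) lines.push(`[FAIL] ${c.id}: ${c.detail}`);
  }
  for (const c of criteria) {
    if (!c.pass) lines.push(`[FAIL] ${c.criterion}: ${c.evidence}`);
  }
  return lines;
}
```

A caller would emit each returned line with console.error before exiting non-zero, which keeps the formatter itself free of side effects and easy to test.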

🤖 Generated with Claude Code

…ailing checks

Trigger: PR #945 hit a heuristic FAIL on scenario 02 because the TS SDK
flattened `outpost.publish.event(...)` to `outpost.publish(...)` in v1.3.0
(commit d875c66), but the eval check + scenario criterion + prompt were
never updated. Prior runs masked it: the agent often echoed the literal
"publish.event" from the prompt in comments, accidentally satisfying the
string-presence check. This run stuck to the SDK README wording and the
check exit-1'd with no clue in the GH Actions log — only in the artifact.

Impact: PR #945 (docs only) was blocked by an unrelated stale check. After
this fix the heuristic matches the current SDK shape, and any future
heuristic/LLM failure prints the failing check id and detail directly in
the main CI log.

Changes:
- scoreScenario02 regex now matches `.publish(` and keeps `publish.event`
  as a fallback for older transcripts.
- run-agent-eval logs each failing heuristic check + LLM criterion on
  pass=false (was: silent exit 1).
- Scenario 02 success criterion lists `outpost.publish` not `publish.event`.
- Prompt's TS counter-example uses `outpost.publish({ ... })`.
- Trajectory SDK hint pattern renamed `ts_publish` with `/\.publish\s*\(/`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 12, 2026 07:50

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the docs-eval harness for Scenario 02 (TypeScript) to align heuristic + prompt wording with the current TS SDK outpost.publish(...) API shape, and improves CI debuggability by emitting failing check details to stderr when scoring fails.

Changes:

  • Update Scenario 02 scoring to recognize outpost.publish(...) (while retaining publish.event backward compatibility).
  • Emit per-check / per-criterion failure details to stderr when heuristic or LLM scoring fails.
  • Align Scenario 02 docs + agent prompt examples with outpost.publish({ ... }) wording.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| docs/agent-evaluation/src/transcript-trajectory.ts | Updates the TS publish SDK hint regex/id used in trajectory tagging. |
| docs/agent-evaluation/src/score-transcript.ts | Adjusts Scenario 02 heuristic to match .publish( (and still allow legacy publish.event). |
| docs/agent-evaluation/src/run-agent-eval.ts | Logs failing heuristic checks and failing LLM criteria to stderr for easier CI triage. |
| docs/agent-evaluation/scenarios/02-basics-typescript.md | Updates success criteria bullet to reflect outpost.publish. |
| docs/agent-evaluation/hookdeck-outpost-agent-prompt.md | Updates TS counter-example to use outpost.publish({ ... }). |

Comment on lines +326 to 333
   // TS SDK >=1.3.0 exposes `outpost.publish(...)` directly. Keep `publish.event`
   // as a fallback so older transcripts (and stray references in comments) still match.
   const pub = /\.publish\s*\(|publish\.event|publish\?\.event/.test(t);
   checks.push({
     id: "publish_event",
     pass: pub,
-    detail: pub ? "Calls publish.event" : "Expected publish.event",
+    detail: pub ? "Calls outpost.publish(…) or publish.event" : "Expected outpost.publish(…) or publish.event",
   });

Collaborator Author

Thanks for the flag. Pushing back here after looking at the surrounding checks:

By the time we hit publish_event, four earlier checks have already established the corpus is an Outpost SDK script — ts_sdk_dependency (@hookdeck/outpost-sdk in package.json), outpost_client (new Outpost( / Outpost({), tenants_upsert, and destinations_create. A corpus that passes those four isn't realistically going to also contain an unrelated .publish( from another vendor.

On the specific examples: sqs.publish( doesn't exist (AWS SQS uses sendMessage); npm publish doesn't match the regex (no . before publish). The plausible alternatives are sns.publish(, MQTT client.publish(, RabbitMQ channel.publish(, GCP pubsub.publish( — but if any of those landed in a 30-line Outpost quickstart corpus, the agent would have already failed the four prior checks and we'd have a much bigger silent multi-vendor hallucination problem that no regex tightening would help with.

Leaving as-is. Happy to revisit if scenario 02 ever broadens beyond the quickstart shape.
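The matching behaviour discussed in this thread can be checked directly. The sketch below runs the merged pattern over a few invented call strings for the vendors named above; the argument names inside each string are placeholders, not real API signatures.

```typescript
const PUBLISH_RE = /\.publish\s*\(|publish\.event|publish\?\.event/;

// Illustrative snippets only; arguments are placeholders.
const samples: ReadonlyArray<string> = [
  "sns.publish(params)",
  "client.publish(topic, message)",
  "channel.publish(exchange, key, content)",
  "npm publish",
  "sqs.sendMessage(params)",
];

// Pair each sample with whether the heuristic pattern would match it.
const results = samples.map((sample) => ({
  sample,
  matches: PUBLISH_RE.test(sample),
}));

// Samples that would be false positives if they appeared in a transcript.
const falsePositives: string[] = results
  .filter((r) => r.matches)
  .map((r) => r.sample);
```

Here falsePositives contains the three `.publish(` calls, while `npm publish` (no leading dot) and `sqs.sendMessage(` do not match. The regex alone does not rule out other vendors, so the argument for leaving it as-is rests on the four earlier checks.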

Comment on lines 243 to 245
   const SDK_HINT_PATTERNS: ReadonlyArray<{ id: string; re: RegExp }> = [
-    { id: "ts_publish.event", re: /\bpublish\.event\b/i },
+    { id: "ts_publish", re: /\.publish\s*\(/ },
     { id: "ts_tenants.upsert", re: /\btenants\.upsert\b/i },

Collaborator Author

Same reasoning as the scoreScenario02 thread — these SDK hints are descriptive trajectory tags, not gating checks, scoped per-tool-input slice within scenario 02's Outpost SDK corpus. The constellation of other hints (ts_tenants.upsert, ts_destinations.create) plus the scenario context makes a stray non-Outpost .publish( unrealistic; if one did appear, that would be a much bigger silent-hallucination problem than the tag accuracy. Leaving as-is.
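To make the "descriptive tag, not gating check" distinction concrete, here is a minimal sketch of tagging a tool-input slice. It assumes tagging simply collects the ids of matching patterns; tagSlice is a hypothetical helper, only two of the patterns are reproduced, and the sample slices are invented.

```typescript
// Subset of the hint patterns shown in the diff above.
const SDK_HINT_PATTERNS: ReadonlyArray<{ id: string; re: RegExp }> = [
  { id: "ts_publish", re: /\.publish\s*\(/ },
  { id: "ts_tenants.upsert", re: /\btenants\.upsert\b/i },
];

// Hypothetical helper: return the id of every hint whose pattern matches.
// The patterns carry no `g` flag, so RegExp.test keeps no state between calls.
function tagSlice(slice: string): string[] {
  return SDK_HINT_PATTERNS
    .filter(({ re }) => re.test(slice))
    .map(({ id }) => id);
}

// An Outpost-style slice collects both tags, in pattern order.
const outpostTags = tagSlice(
  "await outpost.tenants.upsert(...); await outpost.publish({ ... });",
);

// A stray non-Outpost call is still tagged ts_publish, but nothing fails:
// the tag only describes the trajectory.
const strayTags = tagSlice("mqttClient.publish(topic, message)");
```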

@leggetter leggetter merged commit 6a3eb14 into main Jun 12, 2026
8 checks passed
@leggetter leggetter deleted the fix/docs-agent-eval-ts-publish-check branch June 12, 2026 08:20

2 participants