docs: Clarify signature behavior#945

Merged
alexluong merged 1 commit into main from docs/clarify-signature on Jun 12, 2026

@alexbouchardd

Contributor

No description provided.

@alexbouchardd alexbouchardd requested a review from alexluong June 11, 2026 18:23
@alexbouchardd

Contributor Author

@leggetter The agent eval is failing. Is it expected to pass?

@alexluong

Collaborator

The agent eval is pretty flaky from what I've seen.

@alexluong alexluong merged commit 45a4130 into main Jun 12, 2026
2 of 3 checks passed
@alexluong alexluong deleted the docs/clarify-signature branch June 12, 2026 06:46
@leggetter

Collaborator

> The agent eval is pretty flaky from what I've seen.

The deterministic tests shouldn't be flaky. "LLM as a Judge" tests can be, because they depend on the agent's judgment.

However, in this case the problem was with a deterministic/heuristic test caused by a change in the public signature of the TypeScript SDK. Basically a regex check.

So, a bug in the test that wasn't kept up to date with the SDK changes. #950 fixes this.

@alexluong, if you see these failing 🐛, please do ping me. We either need to make them more reliable, and therefore useful, or remove them.

leggetter added a commit that referenced this pull request Jun 12, 2026
…ailing checks (#950)

Trigger: PR #945 hit a heuristic FAIL on scenario 02 because the TS SDK
flattened `outpost.publish.event(...)` to `outpost.publish(...)` in v1.3.0
(commit d875c66), but the eval check + scenario criterion + prompt were
never updated. Prior runs masked it: the agent often echoed the literal
"publish.event" from the prompt in comments, accidentally satisfying the
string-presence check. This run stuck to the SDK README wording and the
check exit-1'd with no clue in the GH Actions log — only in the artifact.
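
To illustrate the masking described above, here is a minimal sketch of how a plain string-presence check can pass by accident. The function name `staleCheck` and both sample transcripts are hypothetical; the actual eval code is not shown in this thread.

```typescript
// Hypothetical stand-in for the stale heuristic: it passes whenever the
// literal "publish.event" appears anywhere in the transcript, including
// inside a comment.
function staleCheck(transcript: string): boolean {
  return transcript.includes("publish.event");
}

// Assumed transcript where the agent echoes the prompt wording in a comment
// but actually calls the flattened SDK method.
const echoedInComment = [
  "// the prompt mentions publish.event, the SDK now exposes publish",
  "await outpost.publish({ /* ... */ });",
].join("\n");

// Assumed transcript where the agent sticks to the SDK README wording.
const readmeWording = "await outpost.publish({ /* ... */ });";

console.log(staleCheck(echoedInComment)); // true: satisfied by the comment alone
console.log(staleCheck(readmeWording)); // false: correct code, failed check
```

Both transcripts make the same SDK call; only the incidental comment changes the outcome, which is why earlier runs could pass.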

Impact: PR #945 (docs only) was blocked by an unrelated stale check. After
this fix the heuristic matches the current SDK shape, and any future
heuristic/LLM failure prints the failing check id and detail directly in
the main CI log.

Changes:
- scoreScenario02 regex now matches `.publish(` and keeps `publish.event`
  as a fallback for older transcripts.
- run-agent-eval logs each failing heuristic check + LLM criterion on
  pass=false (was: silent exit 1).
- Scenario 02 success criterion lists `outpost.publish` not `publish.event`.
- Prompt's TS counter-example uses `outpost.publish({ ... })`.
- Trajectory SDK hint pattern renamed `ts_publish` with `/\.publish\s*\(/`.
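
The first two changes can be sketched roughly as follows. Only the `/\.publish\s*\(/` pattern comes from this commit message; the `CheckResult` shape, the check id `ts_publish_call`, the legacy pattern, and both function names are assumptions made for illustration, not the repository's actual code.

```typescript
// Pattern quoted in the commit message for the current SDK shape.
const TS_PUBLISH = /\.publish\s*\(/;
// Assumed fallback pattern for transcripts written against the older
// `outpost.publish.event(...)` shape.
const LEGACY_PUBLISH_EVENT = /publish\.event/;

// Hypothetical result shape for a single heuristic check.
interface CheckResult {
  id: string;
  pass: boolean;
  detail: string;
}

// Passes on the current call shape, or on the legacy one as a fallback.
function checkPublishCall(transcript: string): CheckResult {
  const current = TS_PUBLISH.test(transcript);
  const legacy = LEGACY_PUBLISH_EVENT.test(transcript);
  let detail = "no publish call found";
  if (current) detail = "matched .publish(";
  else if (legacy) detail = "matched legacy publish.event";
  return { id: "ts_publish_call", pass: current || legacy, detail };
}

// Builds one log line per failing check, so a failure is visible in the
// main log instead of surfacing only as a silent exit 1.
function reportFailures(results: CheckResult[]): string[] {
  return results
    .filter((r) => !r.pass)
    .map((r) => `FAIL ${r.id}: ${r.detail}`);
}

const results = [
  checkPublishCall("await outpost.publish({ /* ... */ });"),
  checkPublishCall("console.log('no SDK call here');"),
];
for (const line of reportFailures(results)) {
  console.log(line);
}
```

Note that `/\.publish\s*\(/` alone does not match `outpost.publish.event(`, because `.publish` is followed there by `.event` rather than an opening parenthesis; that is why a separate fallback is needed for older transcripts.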

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>