Tags · METR/hawk

v2026.06.08

fix(api): make health checks independent of the database (#544)

The Hawk API's /health ran DB + S3 checks and was wired to both the ALB
target-group check and the ECS container check, so a database slowdown
returned 503 — pulling the task from the load balancer and, on a sustained
outage, letting ECS recycle it (the dev-faber 502 incident; a latent
crash-loop risk in prod).

Add a shallow, dependency-free /health/live and point both the ECS and ALB
checks at it; keep the deep /health (DB + S3 + migrations) as an unprobed
monitoring endpoint. A DB outage can no longer fail the health check, so the
task is never de-registered or restarted because the database is down.

On ECS+ALB the ALB health check governs task replacement (not just routing),
so a DB-gated readiness check can't shed traffic without also restarting the
task — hence a shallow check, not a liveness/readiness split. Verified on
dev-faber1.
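
A minimal sketch of the shallow/deep split described above, written without any web framework. The names `HealthResult`, `liveness`, and `deep_health` are illustrative assumptions, not the Hawk API's actual handlers:

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class HealthResult:
    status_code: int
    body: dict


def liveness() -> HealthResult:
    """Shallow check for /health/live: touches no dependency, so it can
    only fail if the process itself cannot serve requests."""
    return HealthResult(200, {"status": "ok"})


def deep_health(checks: Dict[str, Callable[[], bool]]) -> HealthResult:
    """Deep check for /health: runs every dependency probe and reports
    503 if any of them fails or raises. Intended for monitoring only."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    ok = all(results.values())
    return HealthResult(
        200 if ok else 503,
        {"status": "ok" if ok else "degraded", "checks": results},
    )
```

With a failing database probe, `deep_health` reports 503 while `liveness` still returns 200, so a probe wired to the shallow check would not de-register the task.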

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

Jun 8, 2026
dd1e801

v2026.06.01

fix(infra/datadog): qualify K8s resource keys with API groups (#468)

The DatadogAgent CRD's kubernetesResourcesAnnotationsAsTags and
kubernetesResourcesLabelsAsTags use schema.ParseGroupResource() to derive
the API group from each key. Unqualified keys (e.g. 'statefulsets') parse
as core API group ('' / v1), which is wrong for non-core resources.

The Datadog operator then tries to create the
'datadog-cluster-agent-annotations-and-labels-as-tags' ClusterRole with
rules like {APIGroups: [""], Resources: ["statefulsets"], Verbs: [list, watch]},
and K8s RBAC escalation prevention blocks it because the operator only
holds those permissions in the proper groups (apps/, batch/, cilium.io/).

Net effect: the ClusterRole is never created, and the cluster agent
cannot tag Job or CiliumNetworkPolicy resources with their inspect-ai.metr.org/*
annotations and labels. The operator retries every ~17 minutes,
producing ~86 denied actions/day per environment in the EKS audit log
(visible in Datadog at: service:operator "Dependencies apply error").
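
To illustrate why unqualified keys land in the core group, here is a small Python imitation of the split rule that `schema.ParseGroupResource()` applies (everything before the first dot is the resource, everything after is the group). This is a sketch of the parsing rule only, not the operator's code, and the qualified key spellings in the usage note are assumptions:

```python
from typing import Tuple


def parse_group_resource(key: str) -> Tuple[str, str]:
    """Return (group, resource) for a 'resource.group' key.

    A key without a dot yields an empty group, which Kubernetes
    interprets as the core API group.
    """
    resource, _, group = key.partition(".")
    return group, resource
```

For example, `parse_group_resource("statefulsets")` gives `("", "statefulsets")`, while `parse_group_resource("statefulsets.apps")` gives `("apps", "statefulsets")`, which matches the group in which the operator actually holds permissions.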

Co-authored-by: Rafael Carvalho <rafaelcarvalho@metr.org>

Jun 1, 2026
86bb7d7

v2026.05.25

Route per-model Anthropic auth via Workload Identity Federation (#459)

Add AnthropicCredentialBroker (middleman/src/middleman/anthropic_wif.py)
that resolves an Anthropic credential per profile: either a static API
key env var or a WIF lane. WIF lanes mint short-lived sk-ant-oat01-...
tokens by exchanging an Okta client-credentials JWT at
api.anthropic.com/v1/oauth/token. Refresh follows the WIF SDK contract
(advisory exp-120s, mandatory exp-30s) with a single-flight asyncio
lock per profile and a shared aiohttp session.
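
The refresh policy above can be sketched roughly as follows. This is not the code in anthropic_wif.py: the class name, the `mint` callable, and the reading of "advisory" (try to refresh, but keep the old token if the refresh fails) versus "mandatory" (the old token may no longer be used) are assumptions for illustration:

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, Tuple

ADVISORY_SKEW_S = 120   # start trying to refresh this long before exp
MANDATORY_SKEW_S = 30   # stop serving the old token this long before exp


class SingleFlightToken:
    """Caches one short-lived token; concurrent callers share one mint."""

    def __init__(
        self,
        mint: Callable[[], Awaitable[Tuple[str, float]]],
        now: Callable[[], float] = time.time,
    ) -> None:
        self._mint = mint          # returns (token, exp as epoch seconds)
        self._now = now
        self._lock = asyncio.Lock()
        self._token: Optional[str] = None
        self._exp = 0.0

    def _usable(self, skew: float) -> bool:
        return self._token is not None and self._now() < self._exp - skew

    async def get(self) -> str:
        if self._usable(ADVISORY_SKEW_S):
            return self._token
        async with self._lock:
            # Another caller may have refreshed while we waited for the lock.
            if self._usable(ADVISORY_SKEW_S):
                return self._token
            try:
                self._token, self._exp = await self._mint()
            except Exception:
                # Advisory window: a failed refresh is tolerable as long as
                # the old token is still outside the mandatory window.
                if not self._usable(MANDATORY_SKEW_S):
                    raise
            return self._token
```

One instance per profile would give the per-profile single-flight behaviour: callers that arrive during an exchange wait on the lock and then reuse the freshly minted token.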

ModelInfo gains anthropic_account; passthrough.py picks a profile via
that field and swaps x-api-key for Authorization: Bearer when the
credential is a bearer token. Unknown profile = fail-loud (500) so a
typo doesn't silently route to the default Anthropic org. Models
without anthropic_account keep the env-var API key path; no behavior
change for them.
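
A rough sketch of the header selection described above. The `Credential` dataclass, the `is_bearer` flag, `UnknownProfileError`, and `resolve_headers` are hypothetical names; passthrough.py may structure this differently:

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Credential:
    secret: str
    is_bearer: bool  # True for WIF-minted tokens, False for static API keys


class UnknownProfileError(LookupError):
    """Raised when a model names a profile that is not configured."""


def resolve_headers(
    anthropic_account: Optional[str],
    profiles: Mapping[str, Credential],
    default_api_key: str,
) -> Dict[str, str]:
    # Models without anthropic_account keep the env-var API key path.
    if anthropic_account is None:
        return {"x-api-key": default_api_key}
    try:
        cred = profiles[anthropic_account]
    except KeyError:
        # Fail loudly instead of silently falling back to the default org.
        raise UnknownProfileError(anthropic_account) from None
    if cred.is_bearer:
        return {"Authorization": f"Bearer {cred.secret}"}
    return {"x-api-key": cred.secret}
```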

Error mapping:
  - CredentialNotConfiguredError -> HTTP 500 (local misconfig:
    unknown profile name, missing Okta client_secret in SM).
  - CredentialExchangeError -> HTTP 502 (upstream IdP / Anthropic
    exchange failed).
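
The mapping above could be expressed as a small lookup. The two exception names come from the commit message; the stand-in class definitions and the `status_for` helper are assumptions made so the sketch is self-contained:

```python
from typing import Optional


class CredentialNotConfiguredError(Exception):
    """Local misconfiguration (unknown profile, missing client_secret)."""


class CredentialExchangeError(Exception):
    """The upstream IdP or Anthropic token exchange failed."""


_STATUS_BY_ERROR = {
    CredentialNotConfiguredError: 500,
    CredentialExchangeError: 502,
}


def status_for(exc: Exception) -> Optional[int]:
    """Return the HTTP status for a broker error, or None if unmapped."""
    for exc_type, status in _STATUS_BY_ERROR.items():
        if isinstance(exc, exc_type):
            return status
    return None
```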

server.py loads profiles at boot (fails fast on bad config) and only
invalidates the broker token cache from the periodic key refresh when
SM contents actually changed (returned by provider_key_store.reload),
so rotating an Okta client_secret takes effect within one cycle
without forcing re-mint every 5 minutes.

Pulumi wires MIDDLEMAN_ANTHROPIC_PROFILES from a new
middlemanAnthropicProfiles config key (both from_pulumi_config and
from_dev_env). Admin schema exposes anthropic_account on POST/PATCH
/admin/models/ so the field is settable without direct DB writes.

New smoke test tests/smoke/scenarios/test_middleman_anthropic_wif.py
creates a temp WIF-routed model via the admin API, hits /v1/messages,
asserts Anthropic answers, then cleans up. Skipped unless
HAWK_SMOKE_WIF_PROFILE is set.

Tests: 802 middleman tests pass (33 new broker tests, 2 new
passthrough tests). End-to-end validated against a dev stack: broker
logs anthropic_wif.exchange.ok, response routed via Authorization:
Bearer.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

May 22, 2026
242d455

v2026.05.18

Refresh tailscale subnet router on user-data changes (#397)

* fix: refresh tailscale subnet router on user-data changes

The ASG referenced its launch template as `version="$Latest"` (a constant
string from Pulumi's perspective), so changes to user-data produced new LT
versions that Pulumi never saw as a diff on the ASG. The configured
`instance_refresh` block therefore never fired, and the running EC2 instance
kept its first-boot configuration indefinitely — prd has been running LT v6
from 2026-04-13 while the template is now at v11.

Pin the ASG to `lt.latest_version` so Pulumi sees concrete version diffs and
triggers a rolling refresh on each user-data change.
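
A toy model of why the constant string hid the change. This is plain dictionary comparison, not Pulumi's diff engine, and the property names and launch template ID are made up for illustration:

```python
def changed_keys(old: dict, new: dict) -> set:
    """Return the property names whose values differ between two states."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


# With version="$Latest", the ASG's inputs are identical before and after
# a user-data change, even though the template itself gained new versions.
latest_before = {"launch_template_id": "lt-example", "version": "$Latest"}
latest_after = {"launch_template_id": "lt-example", "version": "$Latest"}

# Pinning to the concrete latest version makes the change visible.
pinned_before = {"launch_template_id": "lt-example", "version": "6"}
pinned_after = {"launch_template_id": "lt-example", "version": "11"}
```

Here `changed_keys(latest_before, latest_after)` is empty, so nothing would trigger a refresh, whereas `changed_keys(pinned_before, pinned_after)` contains `"version"`.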

Also fix a regression introduced in #192 that broke multi-CIDR VPCs: the `head -1` on
`vpc-ipv4-cidr-blocks` IMDS output kept only the primary CIDR, dropping
return-traffic routing for any secondary CIDRs (10.51/10.52 in prd,
10.111/10.112 in stg). Loop over all CIDRs instead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Revert unneeded user-data loop; subnet router SNATs forwarded traffic

The head -1 OS route in PR #192 is intentional: Tailscale subnet routers
SNAT subnet-routed traffic by default, so forwarded packets carry the
router primary ENI IP as source. That source IP passes src/dst check on
the primary ENI, and AWS handles VPC-internal routing to secondary CIDRs.
No explicit per-CIDR OS routes are needed.

Verified empirically on stg LT v15: head -1 only routes 10.110/16 via the
persistent ENI, yet tailscale ping to 10.110.x.x works correctly and the
device advertises + has approved all three stg CIDRs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

May 16, 2026
f61ce0a

v2026.05.11

fix: update inspect_k8s_sandbox to valid commit (#382)

* fix: update inspect_k8s_sandbox to valid commit

The previous pin (725637fa) was force-pushed away from the upstream
repo, causing all runner pods to fail on startup with:
  "failed to find branch, tag, or commit 725637fa..."

Updates to 3419f2a3 (v0.5.0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update stale comment for inspect_k8s_sandbox pin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

May 8, 2026
a7ff09e

v2026.05.04

fix(www): restore markdown list bullets/numbers in embedded viewers (#330)

Tailwind preflight's `ol, ul, menu { list-style: none }` reset cascades into
the embedded Inspect/Scout viewers and strips markers from markdown-rendered
lists. Re-apply disc/decimal list styles within `.markdown-content`.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

May 1, 2026
94396c8

v2026.04.27

Avoid hacking the function object for cache_clear (#291)

Apr 25, 2026
bc10809

v2026.04.20

Upgrade inspect-ai to 0.3.207 and inspect-scout to 0.4.26 (#222)

* chore: prepare release release/20260416144609

* feat: update prepare-release.py for ts-mono and npm web auth

- Update viewer paths for ts-mono submodule structure
- Add submodule init and pnpm monorepo install steps
- Use pnpm pack + npm publish to resolve workspace:* deps
- Strip @tsmono/* internal deps from package.json before publish
- Add PTY-based npm web auth with webbrowser.open() for security key 2FA
- Process packages sequentially to avoid overlapping auth prompts
- Add --otp CLI flag, --force for git fetch tags
- Remove private field from package.json before publish

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add explicit types in useInspectApi for new viewer library

The updated inspect-log-viewer changes the return type of get_logs,
causing implicit any errors in the flatMap/map callbacks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: add comments documenting cherry-picked PRs in uv sources

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* uv.lock

* Include sandbox tools

* chore: remove design doc from release branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix lint errors in prepare-release.py

- StrEnum instead of (str, Enum)
- Remove walrus operators from assert statements
- Remove unnecessary variable before return
- Remove stale noqa directive

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply prettier formatting to useInspectApi.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: trigger fresh run after LFS fix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(ci): skip LFS smudge during python-lint install

Release branches pin inspect-scout to a git commit on the METR fork,
which may not have LFS objects available. The dist files aren't needed
for linting or type-checking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(ci): unset checkout credentials before uv sync for LFS access

actions/checkout sets http.https://github.com/.extraheader with a
GITHUB_TOKEN scoped to METR/hawk. This causes git-lfs downloads to
fail for other repos (e.g. inspect_scout) because the token isn't
authorized for their LFS storage. Unsetting the header lets LFS
fall back to unauthenticated access, which works for public repos.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(ci): skip LFS smudge during uv sync and bust stale cache

The uv cache persists broken LFS checkouts between runs. Using
GIT_LFS_SKIP_SMUDGE=1 avoids the issue entirely — the viewer dist
files in inspect-scout aren't needed for linting, type-checking, or
tests. Added cache-suffix to invalidate the stale cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: pin inspect-scout to hotfix branch with de-LFS'd dist files

The previous pin (75f3837e) had LFS-tracked dist files that weren't
available on the METR fork, breaking uv install in CI and Docker
builds. The hotfix branch has the same files committed as regular
git objects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: update inspect-scout pin to de-LFS'd hotfix (e63c1154)

The METR fork's dist files are now stored as regular git objects
instead of LFS pointers, fixing CI and deploy failures caused by
GitHub forks not sharing LFS storage with upstream repos.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: update inspect-scout-viewer to rebuilt npm package

The previous beta package was built from the wrong hotfix branch.
Rebuild from the correct v0.4.26 + PR #367 source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* revert: remove LFS workarounds from hawk-ci.yml

No longer needed now that METR/inspect_scout hotfix branch
stores dist files as regular git objects instead of LFS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix ruff formatting in test_sanitization.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove unnecessary pyright ignore comments in converter.py

The match is no longer exhaustive with the updated inspect-ai types,
so the suppress comments are now flagged as unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update eval log 404 test for new inspect-ai behavior

inspect-ai now catches FileNotFoundError internally in api_log()
and returns a plain 404 Response (no JSON body), so the exception
never reaches hawk's exception handler. Update test to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* last bump

* chore: upgrade inspect-ai from v0.3.206 to v0.3.207 (with same cherry-picks)

Rebase METR fork hotfix branch onto 0.3.207 tag, keeping PRs #3376,
add automatic npm login when unauthenticated, and fix stale terraform/modules
glob path to services/modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: spec for retry error grouping in inspect log viewer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: prepare release release/20260420083319

* chore: update all uv.lock files and document ts-mono PRs

Run uv-lock-all.sh to catch lock files missed by prepare-release.py.
Add ts-mono PR references to pyproject.toml comments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: smoke test ECR sync and TUI exit behavior

- Look up inspect_tasks_ecr_url instead of renamed docker_image_repo from Pulumi
- Add staging fallback in _apply_env_overrides for cached envs missing source_image_repo
- Handle Pulumi "does not have output property" error as missing key in get_stack_output
- Exit TUI on full success instead of staying open
- Remove unused _OUTPUT_KEYS constant

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: shield smoke test job submissions from cancellation

When the smoke test TUI is quit mid-run, CancelledError could interrupt
the API call after the server created the eval-set/scan but before the
client registered it for cleanup, leaving orphaned resources.

Wrap the submission in asyncio.shield so cancellation doesn't interrupt
the HTTP call. If cancelled, await the task to get the job ID and
register it for cleanup before re-raising.
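
The pattern described above might look roughly like this. The function and parameter names are hypothetical, and the sketch does not handle a second cancellation arriving while the submission is still being awaited:

```python
import asyncio
from typing import Awaitable, Callable


async def submit_shielded(
    submit: Callable[[], Awaitable[str]],
    register_for_cleanup: Callable[[str], None],
) -> str:
    """Submit a job so that cancelling the caller cannot orphan it."""
    task = asyncio.ensure_future(submit())
    try:
        # shield() lets the caller be cancelled without cancelling `task`.
        job_id = await asyncio.shield(task)
    except asyncio.CancelledError:
        # The HTTP call keeps running; wait for it so the job ID is known,
        # record it for cleanup, then propagate the cancellation.
        job_id = await task
        register_for_cleanup(job_id)
        raise
    register_for_cleanup(job_id)
    return job_id
```

The key point is that the job ID is registered on both paths, so a job created server-side is always known to the cleanup step.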

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Apr 20, 2026
8f4f698

v2026.04.15

Fix weekly release: set GH_TOKEN for gh CLI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Apr 15, 2026
d6520b7

v2026.03.30

perf: lazy-load scan_events and optimize S3 parquet reads (#1000)

## Overview

Dramatically improves scan viewer load times by lazy-loading
`scan_events` and optimizing S3 parquet reads.

**Issue:** Scan pages taking 2-5+ minutes to load due to large
`scan_events` columns being included in the initial Arrow stream.

## Approach and Alternatives

- **Lazy-load `scan_events`**: Excluded from the initial Arrow IPC
stream. Fetched on-demand via a new `/fields` endpoint when viewing
individual result details.
- **Optimized S3 reads**: Uses PyArrow's native S3FileSystem with
`pre_buffer` for efficient range requests instead of reading entire
parquet files into memory.
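
A hedged sketch of the two ideas above, assuming PyArrow is installed. The helper names, the `LAZY_COLUMNS` set, and the `bucket/key` path form are illustrative; the actual inspect_scout implementation may differ:

```python
from typing import Iterable, List, Optional

# Columns left out of the initial read and fetched on demand instead.
LAZY_COLUMNS = frozenset({"scan_events"})


def eager_columns(all_columns: Iterable[str],
                  lazy: frozenset = LAZY_COLUMNS) -> List[str]:
    """Return the columns to include in the initial stream, in order."""
    return [c for c in all_columns if c not in lazy]


def read_eager_table(path: str, region: Optional[str] = None):
    """Read only the eager columns of a parquet file at 'bucket/key'."""
    # Imported lazily so the pure helper above works without PyArrow.
    import pyarrow.fs as pafs
    import pyarrow.parquet as pq

    fs = pafs.S3FileSystem(region=region)
    schema = pq.read_schema(path, filesystem=fs)
    # Column projection plus pre_buffer lets the reader issue coalesced
    # range requests instead of downloading the whole file.
    return pq.read_table(
        path,
        columns=eager_columns(schema.names),
        filesystem=fs,
        pre_buffer=True,
    )
```

A detail endpoint could then read just the lazy columns for a single result with the same `columns=` projection.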

## Testing & Validation

- [x] Covered by automated tests (inspect_scout test suite: 33/33
passing)
- [x] Manual testing: Deployed to dev1, verified scan page loads in ~10s
vs 2-5min previously
- [ ] Manual testing instructions:
  1. Open a scan page with large scan results
  2. Verify initial load is fast (scan_events not in initial stream)
  3. Click on individual results to verify scan_events loads via /fields
     endpoint

## Checklist
- [x] Code follows the project's style guidelines
- [x] Self-review completed
- [x] Documentation updated (if applicable)
- [x] Tests added or updated (if applicable)

## Additional Context

meridianlabs-ai/inspect_scout#367

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Mar 27, 2026
fb1a4bc